Genome Analysis of 10K SARS-COV-2 Sequences to Identify the Presence of Single-Nucleotide Polymorphisms

A new type of coronavirus was identified in Wuhan, China, in December 2019 and named SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus-2). The high mutation rate of SARS-CoV-2 makes it challenging to develop vaccines that are effective against all variants. Substitution is the most common type of mutation in SARS-CoV-2. This research was conducted to identify the genetic variability of SARS-CoV-2 in the form of single-nucleotide polymorphisms (SNPs) and to analyse its impact. About 15,000 sequences of SARS-CoV-2, isolated in 33 different countries around the world from February 2020 to July 2021, were downloaded from GISAID. Sequence analysis was done using MAFFT and Nextclade. The results of this study are expected to help identify conserved regions in SARS-CoV-2 that can be used as probes for virus identification and as target areas in vaccine development. The results showed that the most common variants were 20B, 20A, and 20I (Alpha), representing 32.12%, 23.95%, and 17.39% of the total population, respectively. SNPs were called in the samples using SNP-sites and extracted using Excel. Of the 10,107 sequences of SARS-CoV-2 studied, 154 SNPs were found, with the highest number of SNPs in the spike, nsp3, and nucleocapsid genes. The highest ratios of the number of mutations to sequence length were in the ORF8, ORF7a, and ORF7b genes, with respective values of 0.537, 0.474, and 0.419.


Introduction
A new type of coronavirus was identified in Wuhan, China, in December 2019. The virus caused the COVID-19 outbreak, which was declared a pandemic by the WHO on March 11, 2020. The name SARS-CoV-2 reflects the taxonomic relationship of the virus with the coronavirus that caused the SARS outbreak [1]. Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) causes severe acute respiratory syndrome [2]. Symptoms of COVID-19 include fever, cough, and shortness of breath [3]. In more severe cases, the infection can cause pneumonia, kidney failure, and eventually death [4]. Thousands of new cases are reported every day. As of July 16, 2021, GISAID recorded that more than 189 million people worldwide had been affected by this deadly virus, with more than 4 million deaths [5].
Coronaviruses belong to the order Nidovirales and the family Coronaviridae. They are classified into four genera, namely Alphacoronavirus, Betacoronavirus, Deltacoronavirus, and Gammacoronavirus. SARS-CoV-2 belongs to the genus Betacoronavirus and the subgenus Sarbecovirus [6,1]. Based on genome analysis, SARS-CoV-2 is an RNA virus. RNA viruses have a higher mutation rate than their hosts [7]. Mutations occur when an error during replication of the viral genetic material changes one or more nucleotide bases. Mutations are common, and in some cases they do not cause significant changes in the organism. However, the high mutation rate of RNA viruses makes it more likely that new variants will arise quickly. Genome analysis of SARS-CoV-2 over the past year shows a nucleotide substitution rate of ~1 × 10⁻³ substitutions per site per year [8].
*Corresponding author: nugrahapraja@sith.itb.ac.id
This number is comparable to the substitution rate in Ebola virus, which is 1.42 × 10⁻³ [9]. Even so, it is still slightly lower than the mutation rate in SARS-CoV, which is 0.80-2.38 × 10⁻³ [10]. Currently, most COVID-19 therapies and diagnostics are still based on the genome sequence isolated in Wuhan at the start of the pandemic. With the many mutations that have occurred so far, the detection of SARS-CoV-2 and the efficiency of antiviral drugs can be affected by variations and changes in viral phenotypes [11,12]. A single-nucleotide polymorphism, or SNP (pronounced "snip"), is a type of point mutation. A point mutation is categorized as an SNP when it is found in more than 1% of the population [13]. Often, SNPs do not have a significant impact on gene expression, but some SNPs in both coding and non-coding regions can cause changes in protein structure, protein function, and the regulation of protein expression [14]. In addition, SNPs can change the virulence, pathogenicity, and immunogenicity of viruses [15].
BIO Web of Conferences 75, 01005 (2023), BioMIC 2023, https://doi.org/10.1051/bioconf/20237501005
Several studies have shown that each new variant tends to be more transmissible than the variants that preceded it. The Alpha variant is estimated to be 43-90% more transmissible than the SARS-CoV-2 that appeared early in the pandemic [16]. The Beta variant is estimated to have a transmission rate 1.5 times higher than that of the previous variant [17]. Research conducted in Brazil showed that the Gamma variant has a transmissibility 1.7-2.4 times that of the previous variant [18]. Other studies have shown that the transmissibility of the Delta variant is 1.64 times that of the Alpha variant [19]. With the continued increase in the transmission rate of COVID-19, further studies are needed to monitor the evolution of this virus. Studies show that, in a relatively short time, this virus is capable of mutating into various variants [11]. One study showed that mutations in the E (C26340T) and N (C29200T) genes affected the detection of target genes by two assay methods in eight patients and one patient, respectively [20]. Both mutations are C>T transitions, a common type of SNP that is related to a host mRNA-editing mechanism involving the apolipoprotein B mRNA-editing enzyme (APOBEC) [21,22]. Another study found that the G>U substitution at position 29140 affects the sensitivity of N gene-based detection [23].
Several countries, including Indonesia, have started vaccination programmes for their citizens. Various types of vaccines are currently available. However, the emergence of large numbers of SNPs in SARS-CoV-2 can reduce the efficiency of vaccines and allow the virus to escape current COVID-19 detection methods [24]. Research conducted by Nasreen et al. (2021) shows that the Pfizer, Moderna, and AstraZeneca vaccines differ in their effectiveness against the Alpha, Beta, Gamma, and Delta variants. However, further research is still needed to find the causes of these differences in effectiveness [25].
A considerable amount of research on SNPs has already been done. However, SARS-CoV-2 still exists and continues to evolve, so studies on SNPs need to continue in order to monitor the development of this virus. Yuan et al. (2020) found 119 SNPs among 11,183 sequences of SARS-CoV-2. Among them were 74 non-synonymous mutations, 43 synonymous mutations, and 2 mutations in the intergenic region. That study also showed a high frequency of mutations in the nsp2, nsp3, and spike protein genes [26]. Another study, with a sample of 714 SARS-CoV-2 sequences, found a total of 108 SNPs. It also showed that the highest number of non-synonymous mutations occurred in the nsp3, nucleocapsid, and spike protein genes. Of the 108 SNPs, 100 were found in the coding region, and 35 of them were synonymous mutations [27]. Another study, conducted on 10,664 SARS-CoV-2 sequences from 73 countries, found 107 SNPs. Based on that study, 5 SNPs are predicted to have a harmful impact, namely the mutations T85I, Q57H, R203M, F506L, and S507C [28]. In addition, virus detection and epitope-based synthetic vaccine design require large-scale genome analysis to find the most conserved regions, so that global RT-PCR methods can be developed and used worldwide [29].
Identification of SARS-CoV-2 genome characteristics, such as selection patterns, is necessary because the virus continues to evolve. This knowledge is important for the diagnosis and control of the disease [30,31]. This research was conducted to identify the genetic variability of the virus in the form of single-nucleotide polymorphisms (SNPs). The results of this study are expected to help deepen knowledge about the characteristics of SARS-CoV-2. In this study, SNP analysis was carried out with the fast and accurate multiple sequence alignment method implemented in MAFFT [32], and SNP extraction was then carried out using the SNP-sites program, which performs rapid SNP calling from multi-FASTA alignments [33].

Sample Data-Mining
The study was conducted in silico on a web server and a local computer using SARS-CoV-2 sequence data downloaded from GISAID, with the reference sequence taken from NCBI. The MAFFT web server (version 7.481) was used with the --addfragments option [34]. SNP-sites was used to perform SNP calling [33]. To identify variants, the Nextclade tool on the Nextstrain web server was used [35]. Data processing to extract SNPs was carried out in Microsoft Excel, which was also used to visualize the data as graphs, tables, and diagrams.
A total of 15,000 sequences of SARS-CoV-2 from 35 different countries around the world were downloaded from GISAID [5] on 12 July 2021. The sequences were isolated between February 2020 and 12 July 2021. The selected sequences are complete genomes with high coverage and an average sequence length of 29.8 kb. Sequences with low coverage were excluded. As a reference, the complete sequence with accession number NC_045512.2, which was isolated in Wuhan, China, was taken from NCBI (National Center for Biotechnology Information). The reference sequence was also used to map the samples. The selected samples were isolated in the following countries: Indonesia, India, Singapore, Japan, Saudi Arabia, China, Russia, Malaysia, United States, Canada, Costa Rica, Mexico, Belgium, Spain, France, Italy, Turkey, Germany, Australia, Colombia, Peru, Brazil, Kenya, Reunion, South Africa, Ghana, Madagascar, Malawi, Mali, Mauritius, Mayotte, Morocco, and Mozambique.

Sample Processing and Alignment
Next, the samples were aggregated and sorted with BioEdit. Samples containing the fragment 'NNNNNN' were deleted. In addition, sequences containing characters other than A, C, T, and G were deleted due to low quality. In this process, more than 5,000 sequences were deleted due to poor quality, leaving 10,107 sequences for the next stage.
The processed samples then needed to be aligned and mapped to the reference sequence. Multiple sequence alignment (MSA) was performed to align the samples, using the MAFFT web server (version 7.481) with default parameters [32]. The method used was progressive alignment with the --addfragments option described in [34]. This method was chosen because the samples come from the same species, so to speed up the alignment each sequence was aligned only with the reference genome to form the entire MSA [34]. This process simultaneously maps the samples against the reference sequence. After that, a consensus sequence was obtained so that point mutations in the samples could be extracted. The output of the MSA is in FASTA format.
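The quality filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the BioEdit procedure used in the study: the function names `parse_fasta`, `is_clean`, and `filter_clean` are hypothetical, the toy records are made up, and the rule assumes that a sequence is kept only if it consists entirely of A, C, G, and T (which also excludes any run of N).

```python
from typing import Dict, Iterable, Iterator, Tuple

VALID_BASES = set("ACGT")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_clean(sequence: str) -> bool:
    """True if the sequence is non-empty and contains only A, C, G and T."""
    return len(sequence) > 0 and set(sequence.upper()) <= VALID_BASES


def filter_clean(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only the records whose sequence passes is_clean."""
    return {name: seq for name, seq in records if is_clean(seq)}


# Made-up toy records: one clean, one with an N run, one with ambiguity codes.
example = """>sample_1
ACGTACGT
>sample_2
ACGNNNNNNT
>sample_3
ACGRYACG
""".splitlines()

kept = filter_clean(parse_fasta(example))
print(sorted(kept))  # ['sample_1']
```

Applying the check per sequence, rather than per alignment column, matches the description in the text, where whole sequences are discarded.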

SNP-sites Calling
To find the mutation points, single-nucleotide polymorphisms were called on the samples using SNP-sites with default parameters [33]. SNP-sites is a program that calls SNPs from samples in the form of a multi-FASTA alignment. The resulting output can be a file in VCF or PHYLIP format. In this study, the VCF format was selected so that genome analysis could be carried out.
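The idea behind calling substitutions from a multi-FASTA alignment can be illustrated with a small Python sketch. It is a conceptual example only and does not reproduce the SNP-sites implementation or its VCF output. The function name `call_substitutions` is hypothetical, the toy sequences are made up, and the sketch assumes that every aligned sequence has the same length as the reference, that alignment columns correspond to reference coordinates, and that gaps or ambiguous characters are ignored.

```python
from collections import Counter
from typing import List

BASES = set("ACGT")


def call_substitutions(reference: str, samples: List[str]) -> Counter:
    """Count substitutions labelled as ref base, 1-based position, alt base."""
    counts: Counter = Counter()
    ref_upper = reference.upper()
    for seq in samples:
        if len(seq) != len(reference):
            raise ValueError("aligned sequences must share the reference length")
        for pos, (ref, alt) in enumerate(zip(ref_upper, seq.upper()), start=1):
            # Only unambiguous base-to-base differences count as substitutions.
            if ref in BASES and alt in BASES and ref != alt:
                counts[f"{ref}{pos}{alt}"] += 1
    return counts


# Made-up toy alignment: the last sample carries a gap, which is ignored.
reference = "ACCGT"
samples = ["ACTGT", "ACTGT", "ACCGA", "AC-GT"]
counts = call_substitutions(reference, samples)
print(counts.most_common())  # [('C3T', 2), ('T5A', 1)]
```

The resulting counts per substitution are the quantity that is later compared with the 1% frequency cutoff.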

SARS-CoV-2 Variant Identification
The sequences were then processed with the Nextclade tool (version 1.5.4) on the Nextstrain web server, with default parameters [35], to identify the variant of each sequence. Nextclade is a sequence analysis tool that identifies variants of SARS-CoV-2, based on the Nextstrain classification, in sequences uploaded by users.
After the positions of the substitution mutations were obtained, the VCF file was imported into Microsoft Excel for further processing. Microsoft Excel was used to filter the mutations to determine SNPs, as well as to visualize the research results. In this study, substitutions that occur in more than 1% of the samples are considered single-nucleotide polymorphisms, so these substitutions must occur in at least 101 samples.
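The frequency cutoff can be made explicit with a short Python sketch. The function names and the substitution counts below are hypothetical and serve only to show the arithmetic. Note that 1% of 10,107 is 101.07, so a count strictly above 1% would start at 102, while the cutoff of 101 stated in the text corresponds to 1% rounded down; the sketch therefore takes the minimum count as a parameter so that either convention can be applied.

```python
from typing import Dict


def min_count_above_percent(percent: int, n_samples: int) -> int:
    """Smallest integer count strictly greater than percent% of n_samples."""
    return (percent * n_samples) // 100 + 1


def filter_snps(counts: Dict[str, int], n_samples: int, min_count: int) -> Dict[str, float]:
    """Keep substitutions seen in at least min_count samples; return their frequencies."""
    return {name: c / n_samples for name, c in counts.items() if c >= min_count}


N_SAMPLES = 10107  # number of sequences analysed in the study

# Hypothetical counts, not taken from the study's data.
toy_counts = {"mut_a": 8000, "mut_b": 101, "mut_c": 50}

print(min_count_above_percent(1, N_SAMPLES))               # 102
print(sorted(filter_snps(toy_counts, N_SAMPLES, 101)))     # ['mut_a', 'mut_b']
print(sorted(filter_snps(toy_counts, N_SAMPLES, 102)))     # ['mut_a']
```

Integer arithmetic is used for the threshold to avoid floating-point rounding at the boundary.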

Datamine Collection
This study was conducted to find SNPs in 10,107 SARS-CoV-2 genome sequences. The samples taken from GISAID were isolated from February 2020 to July 2021. The selected samples are complete genomes with high coverage. A total of 15,911 samples were downloaded. The samples were isolated in 35 different countries.
The samples were then sorted using BioEdit to remove sequences containing letters other than A, C, T, and G. Sequences containing the 'NNNN' fragment were also deleted. In this process, more than 5,000 sequences were deleted, leaving 10,107 sequences. A list of all the countries represented in the 10,107 sequences can be found in Table 1. The table also shows that the samples are unevenly distributed. Several countries, especially in Africa, such as Madagascar and Malawi, contributed very few samples because they have not uploaded many SARS-CoV-2 sequences. Most of the samples representing the African continent were taken from South Africa. There is also only a small number of sequences from Oceania, where samples could be taken from just 3 countries, namely Australia, New Zealand, and Guam. However, due to poor quality, the samples from New Zealand were removed during processing with BioEdit. Few samples were taken from Oceania because there were relatively few cases in Australia and New Zealand, and other Oceania countries did not upload many samples to GISAID.

The Results of the Identification of Sample Variants
Sequence analysis using Nextclade was performed to determine the variant of each sequence. Figure 1 shows the result of the data processing obtained from Nextclade. Based on the figure, the most common variant is 20B, with 32.12% of the total population, followed by 20A with 23.95% and 20I (Alpha) with 17.39%. Together, these three variants account for 73.46% of the sample. The variants with the lowest prevalence include 21G (Lambda), 21H, and 21D (Eta), with respective percentages of 0.07%, 0.03%, and 0.02%. Note that the data used in this study are SARS-CoV-2 sequences isolated from February 2020 to July 2021. According to data from Nextstrain in August 2021, the most common variant at that time was 20I (Alpha) [35]. It is therefore possible that the most common variant has shifted from 20B to 20I (Alpha). This would indicate that SARS-CoV-2 has adapted and that variant 20I (Alpha) carries beneficial mutations, so that its share of the population has increased.
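The percentage calculation behind a chart of this kind can be sketched in Python. The function name `clade_percentages` and the toy clade assignments are hypothetical; only the three percentages in `reported` are taken from the text, and the last lines simply add them up.

```python
from collections import Counter
from typing import Dict, List


def clade_percentages(clades: List[str]) -> Dict[str, float]:
    """Return the percentage of sequences assigned to each clade (2 decimals)."""
    counts = Counter(clades)
    total = len(clades)
    return {clade: round(100 * n / total, 2) for clade, n in counts.items()}


# Made-up assignments, only to show the calculation.
toy_assignments = ["20B", "20B", "20A", "20I (Alpha)"]
print(clade_percentages(toy_assignments))
# {'20B': 50.0, '20A': 25.0, '20I (Alpha)': 25.0}

# Percentages reported in the text for the three most common clades.
reported = {"20B": 32.12, "20A": 23.95, "20I (Alpha)": 17.39}
combined = round(sum(reported.values()), 2)
print(combined)  # 73.46
```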

Single-Nucleotide Polymorphism in the SARS-CoV-2 genome sequence
A total of 10,107 samples were processed with the SNP-sites program to obtain their SNP positions. Only the translated part of the genome was analyzed, namely nucleotide positions 266-29676; the remaining parts are the 5'-UTR and 3'-UTR (untranslated regions). From the results of data processing, a total of 154 SNPs were obtained. Figure 2 shows that the mutations C14408T, C3037T, and A23403G each occurred in more than 8,000 samples. In addition, mutations at three consecutive genome positions, namely G28881A, G28882A, and G28883C, each occurred in more than 5,000 samples.
The C14408T mutation occurs in the nsp12 gene, which is also called RNA-dependent RNA polymerase (RdRp). Based on the results in Figure 2, this mutation has the highest prevalence of all mutations. RdRp plays an important role in the replication and transcription of SARS-CoV-2. The RdRp gene of SARS-CoV-2 has a high level of homology with that of SARS-CoV and is a conserved region, which suggests that the two may share the same mechanism [36,37]. RdRp works in a complex with nsp7 and nsp8 to carry out replication and transcription. RdRp also has a proofreading function, so mutations in RdRp can result in the emergence of new mutations or even an increase in the mutation rate [37]. Several studies have demonstrated high mutation rates following the C14408T mutation. This needs to be considered because various antiviral drugs currently target RdRp. The C14408T mutation is located in a conserved region. Therefore, further studies are needed to see whether this mutation can affect the efficiency of these drugs [37]. According to Wang et al. (2021), the C14408T mutation is related to the A23403G mutation, which also has a high prevalence. The increasing number of C14408T mutations is also in line with the increasing number of COVID-19 patients, which indicates a link between the C14408T mutation and the transmission of SARS-CoV-2 [38].
The A23403G mutation occurs in the spike protein gene. This substitution results in the D614G amino acid change, which is located in a conserved part of this species [39]. A23403G is also one of the key mutations in variants 20A and 20I (Alpha), which are two of the three most common variants in the sample population. The high frequency of the A23403G mutation indicates that it has a beneficial effect on the SARS-CoV-2 virus. This is in line with research by Wang et al. (2021), which reports that the number of A23403G mutations in SARS-CoV-2 isolated in the United States increased over time. This increase coincided with a sharp increase in COVID-19 cases in the United States [38]. Several studies have also carried out molecular docking for this mutation and found that A23403G changes the structure of the resulting protein, producing a more infectious variant of SARS-CoV-2 [12,40]. The next most frequent mutations were 3 consecutive mutations in the gene encoding the nucleocapsid (N) protein, namely G28881A, G28882A, and G28883C. The nucleocapsid is a structural protein of SARS-CoV-2 that plays an important role in RNA packaging, the release of viral particles, and the formation of the ribonucleoprotein core [41]. If these mutations occur simultaneously, the nucleotide triplet GGG changes to AAC. This nucleotide change is quite significant. However, further research is needed to prove that it can cause structural changes in the resulting protein.
The C3037T mutation is located in the nsp3 gene. The high proportion of variants 20A and 20I (Alpha) contributed to the large number of C3037T mutations, because C3037T is one of the key mutations in both variants. According to the molecular analysis conducted by Yuan et al. (2020), this mutation is synonymous and does not change the resulting amino acid sequence [26].
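Mutation labels such as C14408T encode the reference base, the genome position, and the alternative base. The following Python sketch parses such labels and checks whether a position falls inside the analysed region (266-29676, as stated above). The names `Substitution`, `parse_label`, and `in_analysed_region` are hypothetical, and the label `C100T` is a made-up example of a position outside the region, not a result of the study.

```python
import re
from typing import NamedTuple

# Boundaries of the translated region as given in the text.
CODING_START, CODING_END = 266, 29676
_LABEL = re.compile(r"^([ACGT])(\d+)([ACGT])$")


class Substitution(NamedTuple):
    ref: str
    position: int
    alt: str


def parse_label(label: str) -> Substitution:
    """Split a label like 'C14408T' into reference base, position and alt base."""
    match = _LABEL.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    ref, position, alt = match.groups()
    return Substitution(ref, int(position), alt)


def in_analysed_region(sub: Substitution) -> bool:
    """True if the position lies within the translated region (inclusive)."""
    return CODING_START <= sub.position <= CODING_END


# The first six labels appear in the text; 'C100T' is a made-up counterexample.
labels = ["C14408T", "C3037T", "A23403G", "G28881A", "G28882A", "G28883C", "C100T"]
kept = [label for label in labels if in_analysed_region(parse_label(label))]
print(kept)
# ['C14408T', 'C3037T', 'A23403G', 'G28881A', 'G28882A', 'G28883C']
```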
Figure 3 shows the number of SNPs in each gene. Based on the figure, the spike, nsp3, and nucleocapsid genes have the highest number of SNPs compared to the other genes.

Mutations in SARS-CoV-2
The highest numbers of SNPs were in the spike, nucleocapsid, and nsp3 genes. However, these results alone do not show whether these genes truly have high variability or whether the counts are influenced by other factors.

Fig. 4. Graph of the number of mutations in each gene
Figure 4 shows the number of mutations in each gene. In the figure, the nsp3, spike, and nsp2 genes have the highest number of mutations. This is in accordance with previous studies, which also found that the highest number of mutations occurred in these three genes [26]. However, these genes have different sizes, and longer genes tend to have a higher number of mutations than short genes. Thus, these results need to be normalized by the sequence length of each gene.
Based on Table 2, the highest ratios of the number of mutations to sequence length occurred in the ORF8, ORF7a, and ORF7b genes, with respective values of 0.537, 0.474, and 0.419. These values are higher than the ratios in the nsp3 and nsp2 genes, which were only 0.197 and 0.252. An evolutionary study conducted by Pereira (2020) shows that, among the accessory genes, ORF8 has the highest level of variation. However, the function of ORF8 is still not fully known, so it cannot be concluded whether this variation has an effect on SARS-CoV-2 [42]. The genes with the lowest ratios were nsp8 and RdRp, with ratios of the number of mutations to sequence length of 0.141 and 0.143. This is related to the function of the RdRp gene itself, which is important for viral replication. Thus, the stability of these genes is necessary for the virus [37]. In addition, among the structural proteins, the M (membrane) and E (envelope) genes have relatively low ratios of the number of mutations to sequence length, 0.178 and 0.184 respectively. The low variability of the M and E genes indicates that these two genes tend to be more stable than the other genes and reflects their link with housekeeping functions [43]. Table 2 shows the ratio of mutations to sequence length for each gene.
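The normalization discussed above divides the number of mutations observed in a gene by the length of that gene. A minimal Python sketch is given below; the function names and all counts and lengths are hypothetical placeholders chosen to show why a long gene can have more mutations but a lower ratio, and they are not values from Table 2.

```python
from typing import Dict, List, Tuple


def mutation_ratio(n_mutations: int, gene_length: int) -> float:
    """Number of mutations divided by gene length in nucleotides."""
    if gene_length <= 0:
        raise ValueError("gene length must be positive")
    return n_mutations / gene_length


def rank_by_ratio(genes: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Sort genes by mutation-to-length ratio, highest first."""
    ratios = {name: mutation_ratio(n, length) for name, (n, length) in genes.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (number of mutations, length) pairs, not data from the study.
toy_genes = {"long_gene": (300, 3000), "short_gene": (50, 100)}

print(rank_by_ratio(toy_genes))
# [('short_gene', 0.5), ('long_gene', 0.1)]
```

In this toy case the long gene has six times as many mutations as the short one, yet its ratio is five times lower, which is the same pattern reported for nsp3 and spike versus the ORF8, ORF7a, and ORF7b genes.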

Substitutions in the SARS-CoV-2 genome
Of the 154 SNPs obtained in this study, 80 were C>T substitutions, which represented most of the transitions (Figure 5). Table 3 shows the C>T substitutions that occur in SARS-CoV-2. This substitution can be explained by two factors, namely RNA deamination and the cost of biosynthesis of the nucleotide bases themselves. For synonymous mutations, the C>T substitution may be caused by RNA deamination by host enzymes, which results in C>T or A>G substitutions. This is because humans and various animal and plant species have adenosine-to-inosine and cytidine-to-uridine deamination mechanisms, which diversify RNA in cells and cause mismatches during viral replication [44,45]. Several studies have also demonstrated high levels of deaminated RNA in SARS-CoV-2 sequences [46]. This is one of the strongest factors behind the high number of C>T mutations. It is also supported by the number of A>G mutations, which are the second most common after C>T mutations. Even so, C>T mutations are still far more numerous than A>G mutations.
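Counting substitutions by type, and separating transitions from transversions, can be sketched in Python as follows. The function names are hypothetical, and the six labels are the mutations named earlier in the text, used here only as example input; the tally is not the full distribution shown in Figure 5. A transition is a change between two purines (A, G) or between two pyrimidines (C, T); any other change is a transversion.

```python
from collections import Counter
from typing import Iterable

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(label: str) -> str:
    """Turn a label like 'C14408T' into its substitution type, e.g. 'C>T'."""
    return f"{label[0].upper()}>{label[-1].upper()}"


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def tally_types(labels: Iterable[str]) -> Counter:
    """Count how many labels fall into each substitution type."""
    return Counter(substitution_type(label) for label in labels)


# Mutations named in the text, used only as example input.
labels = ["C14408T", "C3037T", "A23403G", "G28881A", "G28882A", "G28883C"]
tally = tally_types(labels)
print(dict(tally))  # {'C>T': 2, 'A>G': 1, 'G>A': 2, 'G>C': 1}

n_transitions = sum(is_transition(label[0], label[-1]) for label in labels)
print(n_transitions)  # 5 (only G28883C is a transversion)
```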

Fig. 5. Graph of the number of mutations in each type of substitution
Apart from RNA deamination, another factor is the cost of nucleotide base biosynthesis. Thymine biosynthesis has a lower cost than cytosine biosynthesis because thymine requires less ATP. The biosynthesis of nucleotide bases requires a number of ATP molecules that varies between bases in the following order: A > G > C > T. The cost of thymine biosynthesis is therefore the lowest of all the nucleotide bases [47]. Biosynthesis that requires less ATP is favoured by natural selection [45]. Cost reduction in the biosynthetic process can thus be one of the causes of the high number of C>T mutations.
RNA deamination occurs not only in SARS-CoV-2 but also in various other viruses that infect animals and humans. This mutation pattern was also observed in other viruses, namely bat RaTG13 and other coronaviruses. Among the betacoronaviruses (SARS and MERS), SARS-CoV-2 experienced the most extreme RNA deamination [46,48]. Additional details on the C>T mutations can be found in Table 3.

Conclusions
Out of 10,107 samples of SARS-CoV-2 studied, 154 SNPs were found. The genes with the highest number of SNPs were the spike, nsp3, and nucleocapsid genes. To determine the variability of each gene, the ratio of the number of mutations to sequence length was used. The highest ratios were in the ORF8, ORF7a, and ORF7b genes, with respective values of 0.537, 0.474, and 0.419. Based on these results, a high number of mutations and SNPs in a gene does not necessarily reflect the level of variability of that gene. This can be seen from how the spike, nsp3, and nsp2 genes have a high number of mutations, but when normalized by sequence length, their ratios are relatively low compared to those of the accessory proteins. Therefore, these results indicate that a high number of SNPs in a gene cannot be used as a benchmark for whether the mutation rate of that gene is high.

Fig. 1. Pie chart of the percentage of each variant in the sample population according to the Nextstrain classification.

Table 1. Profile of sample countries before and after sorting using BioEdit

Table 2. The ratio of the number of mutations to the length of the sequence of each gene

Table 3. Mutations with substitution type C>T