GEMS

Genetic variation refers to heritable DNA-level differences between individuals in the same gene pool. It is also called "molecular variation" and can be described qualitatively or quantitatively as the genetic differences between individuals of the same species.

A single nucleotide polymorphism (SNP) is a DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. SNPs are the most common type of heritable variation in humans, accounting for more than 90% of all known polymorphisms. They occur throughout the human genome, on average once every 300 base pairs, and their total number is estimated at 3 million or more. A SNP is typically a biallelic (two-state) marker. It arises from a single-base transition or transversion, or from the insertion or deletion of a single base. SNPs may lie within gene sequences or in noncoding sequence outside genes.
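
The single-base substitutions described above are conventionally split into two classes: transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse). The following is a minimal sketch of that classification; the function name `classify_substitution` is an illustrative choice, not part of any particular library.

```python
# Purines and pyrimidines in DNA.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected single bases from A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```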

Copy number variation (CNV) results from genomic rearrangement. It generally refers to an increase or decrease in the copy number of large genomic fragments longer than 1 kb, and it appears mainly as submicroscopic deletions and duplications. CNV is an important component of genomic structural variation (SV). The mutation rate at CNV loci is much higher than that of SNPs, and CNV is an important pathogenic factor in human disease.

Secondary structure of RNA: nucleic acid secondary structure is discussed mostly for single-stranded RNA molecules. RNA secondary structure can be decomposed into helices (stems of consecutive base pairs) and various types of loops (unpaired nucleotides enclosed by helices). The stem-loop (hairpin) structure is a base-paired helix that ends in a short unpaired loop. It is very common and serves as the building block for larger structural elements such as the cloverleaf structure (the four-helix junction found in transfer RNA). Internal loops (a short run of unpaired bases within a longer base-paired helix) and bulges (extra bases inserted in one strand of a helix with no counterpart in the opposite strand) are also common. Finally, pseudoknots and base triples also occur in RNA.
Because RNA secondary structure is mediated almost entirely by base pairing, determining it amounts to determining which bases are paired in a molecule or complex. However, canonical Watson-Crick base pairs are not the only type of pairing in RNA; Hoogsteen pairs are also common.
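
Since a secondary structure is essentially a list of paired positions, it can be written compactly. The sketch below assumes the widely used dot-bracket notation (not mentioned in the text above), in which matching parentheses mark paired bases and dots mark unpaired ones. It recovers the list of pairs and flags those that are not canonical Watson-Crick pairs. The function names and the example sequence are illustrative, and plain dot-bracket strings cannot represent pseudoknots or base triples.

```python
from typing import List, Tuple

# Canonical Watson-Crick pairs in RNA.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (i, j) pairs encoded by a dot-bracket string, sorted by i."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def non_watson_crick(sequence: str, structure: str) -> List[Tuple[int, int]]:
    """List the pairs in the structure that are not A-U or G-C."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper()
    return [(i, j) for i, j in base_pairs(structure)
            if (seq[i], seq[j]) not in WATSON_CRICK]


# A toy hairpin: three stacked pairs closing a three-base loop.
print(base_pairs("(((...)))"))                     # [(0, 8), (1, 7), (2, 6)]
print(non_watson_crick("GGGAAAUCC", "(((...)))"))  # [(2, 6)], a G-U pair
```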

RNA secondary structure prediction: RNA folding is hierarchical, so the secondary structure can be studied even when the higher-order structure is unknown, and it is often sufficient for inferring function and for other practical applications. Computational prediction is the mainstream approach to identifying RNA secondary structure, and many prediction methods have been developed since the 1970s. Most of them search for the structure with minimum free energy (MFE), on the hypothesis that RNA molecules, like proteins, tend to adopt their MFE state. Many well-known software tools combine these methods. Over the past 10 years, however, neither prediction accuracy nor computation speed has improved significantly. Methods based on machine learning (ML) were proposed as an alternative, but the early ML methods received little attention because of their limited accuracy, which was mainly a consequence of small training datasets and simple models. With the explosive growth of RNA sequence data in recent years and continued advances in machine learning, especially deep learning, the latest ML-based methods have surpassed the mainstream methods in accuracy and applicability. We believe that these ML-based methods will soon inspire the next generation of prediction tools.
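
To give a feel for how dynamic-programming predictors work, here is a minimal sketch of a Nussinov-style algorithm. It is a deliberate simplification: it maximizes the number of base pairs rather than minimizing free energy, so it is not the thermodynamic model used by MFE software. The allowed pairs (including G-U), the minimum hairpin loop length of 3, and the function name `nussinov` are assumptions made for illustration.

```python
from typing import Tuple

# Assumption: allow Watson-Crick pairs plus G-U pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> Tuple[int, str]:
    """Return (max number of nested pairs, one optimal dot-bracket structure)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] holds the maximum number of pairs within seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back through the table to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return best[0][n - 1], "".join(structure)


print(nussinov("GGGAAAUCC"))  # (3, '(((...)))')
```

The table fill takes cubic time in the sequence length, which is one reason speed becomes a concern for long RNAs. Energy-based methods follow a similar recursion but score loops and stacked pairs with measured thermodynamic parameters instead of counting pairs.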

The secondary structure of a protein consists of local interactions between residues mediated by hydrogen bonds. The most common secondary structures are the α-helix and the β-sheet, in addition to the β-turn and the random coil. Other helices, such as the 3₁₀ helix and the π helix, are calculated to have energetically favorable hydrogen-bonding patterns, but they are rare in natural proteins and are found only at the ends of α-helices, because backbone packing in the center of such a helix is unfavorable. Tight turns and loose, flexible loops connect the more "regular" secondary structure elements. The random coil is not a true secondary structure; it denotes the absence of regular secondary structure.

Protein secondary structure prediction: early methods were based on the propensity of individual amino acids to form helices or sheets, sometimes combined with rules for estimating the free energy of forming secondary structure elements. These methods predict which of three states (helix, sheet, or coil) a residue adopts with about 60% accuracy. Using multiple sequence alignment raises the accuracy substantially, to nearly 80%. An alignment reveals the full distribution of amino acids that occur at a given position and in its vicinity (typically 7 residues on each side) over the course of evolution, which gives a much clearer picture of the structural tendencies near that position. For example, a protein may have a glycine at a certain position, which by itself would suggest a random coil. The alignment, however, may show that helix-favoring amino acids occupy that position in 95% of homologous proteins spanning nearly one billion years of evolution. In addition, the average hydrophobicity at that position and its neighbors may show a pattern of residue solvent accessibility consistent with an α-helix. Taken together, these factors suggest that the glycine in the original protein adopts an α-helical structure rather than a random coil. Several types of methods are used to combine the available data into a three-state prediction, including neural networks, hidden Markov models, and support vector machines. Modern prediction methods also provide a confidence score for the prediction at each position. Accurate secondary structure prediction is an important element of tertiary structure prediction.
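
The alignment-based reasoning above starts from the distribution of residues in each alignment column. The sketch below computes that distribution for one column and for a window around it. The toy alignment is invented purely for illustration, and the function names `column_frequencies` and `window_profile` are hypothetical; real predictors feed profiles of this kind into a trained model rather than inspecting them by hand.

```python
from collections import Counter
from typing import Dict, List


def column_frequencies(alignment: List[str], position: int) -> Dict[str, float]:
    """Fraction of each residue in one alignment column, ignoring gaps ('-')."""
    if not alignment:
        raise ValueError("alignment is empty")
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    residues = [row[position].upper() for row in alignment if row[position] != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def window_profile(alignment: List[str], center: int, flank: int = 7) -> List[Dict[str, float]]:
    """Column frequencies for the window center-flank .. center+flank, clipped to the alignment."""
    width = len(alignment[0])
    start = max(0, center - flank)
    stop = min(width, center + flank + 1)
    return [column_frequencies(alignment, p) for p in range(start, stop)]


# Made-up toy alignment: the first row is the query, with a glycine at column 2.
toy_alignment = [
    "MKGLE",
    "MKALE",
    "MKALE",
    "MRLLE",
    "MKELD",
]
print(column_frequencies(toy_alignment, 2))   # G: 0.2, A: 0.4, L: 0.2, E: 0.2
print(len(window_profile(toy_alignment, 2, flank=1)))  # 3 columns
```

In this toy column the query's glycine accounts for only one fifth of the residues, which is the kind of evidence an alignment contributes beyond the single sequence.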

Machine learning refers to algorithms that allow a computer to automatically analyze data, derive rules from it, and use those rules to make predictions on unseen data. Many such algorithms have been developed and applied across a wide range of fields. In bioinformatics, data-driven ML methods can implicitly learn the underlying folding mechanism from known data, using RNA sequences or sequence-derived features as input. Deep learning, random forests, support vector machines, and other algorithms are widely used in protein structure prediction, protein functional site prediction, and subcellular localization prediction.
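
Before any of these algorithms can learn from sequences, the sequences have to be turned into numbers. One common choice, assumed here for illustration and not prescribed by the text, is to one-hot encode a sliding window of bases around each position. The names `one_hot` and `window_features`, and the default window size, are hypothetical.

```python
from typing import List

ALPHABET = "ACGU"


def one_hot(base: str) -> List[int]:
    """One-hot vector over A, C, G, U; any other symbol (e.g. padding) is all zeros."""
    return [1 if base == symbol else 0 for symbol in ALPHABET]


def window_features(sequence: str, center: int, flank: int = 2) -> List[int]:
    """Concatenate one-hot vectors for positions center-flank .. center+flank.

    Positions that fall outside the sequence are encoded as all-zero padding.
    """
    seq = sequence.upper()
    if not 0 <= center < len(seq):
        raise IndexError("center is outside the sequence")
    features: List[int] = []
    for pos in range(center - flank, center + flank + 1):
        base = seq[pos] if 0 <= pos < len(seq) else "-"
        features.extend(one_hot(base))
    return features


# Window around the first base of "ACGU": padding, A, C.
print(window_features("ACGU", 0, flank=1))
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```

Each position then yields a fixed-length vector of 4 × (2 × flank + 1) values, which can be passed to a classifier such as a random forest, a support vector machine, or a neural network.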