A large-scale prediction of protein-protein interactions based on random forest and matrix of sequence

Protein-protein interactions (PPIs) are an important part of many life activities in organisms, and their prediction is closely related to protein function, disease occurrence, and disease treatment. To improve the prediction performance for protein interactions, an RT-MOS model was constructed based on Random Forest (RF) and Matrix of Sequence (MOS). First, MOS is used to encode each protein sequence into a 29-dimensional feature vector. Then, the prediction model RT-MOS is built on random forest, and it is optimized and evaluated using the test set. Finally, the optimized RT-MOS model is used for prediction. The experimental results show that the accuracies of the RT-MOS model on the benchmark data set and the non-redundant data set are 97.18% and 91.34%, respectively, and the accuracies on four external datasets (C. elegans, Drosophila, E. coli and H. sapiens) are 96.21%, 97.86%, 97.54% and 97.75%, respectively. RT-MOS outperforms the existing methods it was compared with. The results indicate that RT-MOS saves time, resists overfitting and achieves high accuracy, and is suitable for large-scale PPI prediction.


Introduction
Protein-protein interactions (PPIs) are an important part of many life activities in organisms. Almost all life processes are related to protein interactions, such as metabolism, signal transduction, cell cycle regulation, apoptosis and immune response. The study of protein interactions helps to fundamentally understand the mechanisms of disease, and thereby to prolong the lives of patients with genetic diseases and improve their quality of life. Many high-throughput experimental methods for PPI detection have emerged in recent years [1][2][3]. However, these are chemical experimental methods, which are time-consuming and laborious, so large-scale protein interaction prediction is difficult to achieve with them. Machine learning makes large-scale prediction possible. So far, many machine learning models, including support vector machines (SVM), neural networks (NN), naive Bayes, K-nearest neighbor and decision trees, have been used to predict protein interactions [4][5][6]. Although computational methods for protein interaction prediction have advanced, some limitations remain. For example, general machine learning models may not handle noise in protein sequences well [7]. Random Forest (RF) has high prediction accuracy, tolerates outliers and noise in the data well, and is not prone to overfitting. It can overcome the shortcomings of traditional machine learning methods such as decision trees [7]. RF has also been applied to protein interaction prediction. For example, Qi et al. [8] proposed a random-forest-based method that predicts protein interactions by calculating similarity, and achieved a positive prediction rate of 70.45%. In 2014, Bhowmick et al. [9] built a protein interaction prediction model based on random forest and obtained 89% accuracy, which confirmed the effectiveness of the RF algorithm for protein interaction prediction. PPI prediction requires converting heterogeneous amino acid sequences into homogeneous feature vectors (i.e. protein coding). In 2019, Gui et al. [10] proposed the MOS protein coding method based on deep learning. This method considers the frequency information of the entire amino acid sequence, and its coding is simple and time-saving. In view of the strengths of random forest in handling noise and overfitting, and the simplicity and speed of sequence matrix coding, we build a protein interaction prediction model based on random forest and sequence matrix to improve the prediction performance for protein interactions.

Amino acid classification
First, the 20 conventional amino acids are divided into 7 groups according to the dipole and volume of their side chains (see Table 1). Then each residue in the amino acid sequence is replaced with its group from Table 1, which reduces the dimension of the sequence matrix from 20×20 to 7×7.
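The reduction step can be sketched in Python as follows. Table 1 is not reproduced in this text, so the grouping used here is an assumption: it is the seven-class grouping by side-chain dipole and volume that is widely used in sequence-based PPI work, and it may differ from the authors' table. The names `GROUPS`, `AA_TO_CLASS` and `reduce_sequence` are hypothetical.

```python
# Assumed grouping (Table 1 is not reproduced here): seven classes by
# side-chain dipole and volume, as commonly used in sequence-based PPI work.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each amino acid letter to its class index 0..6.
AA_TO_CLASS = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def reduce_sequence(sequence):
    """Replace each residue with its class index; reject unknown letters."""
    try:
        return [AA_TO_CLASS[aa] for aa in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


print(reduce_sequence("MKVLAC"))  # [2, 4, 0, 1, 0, 6]
```

Rejecting letters outside the 20 standard residues mirrors the data cleaning described for the benchmark data set, where sequences containing B, J, O, U, X or Z are removed.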

Algorithm of MOS
Let $\Omega = \{w_1, \ldots, w_N\}$ be a non-empty finite set, where $N$ is the number of categories of the sequence. Given a sequence $S = S_1, S_2, \ldots, S_L$, where $L$ is the length of $S$ and $S_i \in \Omega$ for $1 \le i \le L$, the sequence matrix of $S$ is computed by an iterative procedure that takes $S$ as input; in its last step (Step 5), if $i \ge 1$, the procedure returns to Step 2. To reduce the computational cost, we first classify the 20 amino acids into 7 classes according to the amino acid classification method in Table 1. Thus, a protein sequence can be represented by a 7×7 matrix. The next step is to standardize each matrix element $m_{ij}$ to the range from 0 to 1. Length information is also retained to distinguish proteins of different lengths. Finally, a 29-dimensional vector is built to represent each protein sequence.
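The matrix definition and the intermediate steps are not fully given in this text, so the following Python sketch is only one plausible reading and not the authors' algorithm. It assumes that the matrix counts adjacent class pairs, that the pairs (i, j) and (j, i) are pooled so that 28 distinct entries remain, that the entries are min-max scaled to [0, 1], and that the 29th feature is the sequence length scaled by a hypothetical `max_len`. The input is a sequence already reduced to class indices 0..6, and the function name `mos_features` is hypothetical.

```python
def mos_features(classes, n_classes=7, max_len=1000):
    """Encode a class-index sequence as 28 pair frequencies plus 1 length term.

    Assumptions (not confirmed by the text): adjacent pairs are counted,
    (i, j) and (j, i) are pooled, and length is scaled by max_len.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(classes, classes[1:]):
        i, j = min(a, b), max(a, b)  # pool symmetric pairs into the upper triangle
        counts[i][j] += 1

    # Flatten the upper triangle (including the diagonal): 7 * 8 / 2 = 28 values.
    upper = [counts[i][j] for i in range(n_classes) for j in range(i, n_classes)]

    # Min-max scaling to [0, 1]; a flat matrix keeps all zeros.
    lo, hi = min(upper), max(upper)
    span = (hi - lo) or 1
    scaled = [(v - lo) / span for v in upper]

    length_term = min(len(classes) / max_len, 1.0)
    return scaled + [length_term]


vec = mos_features([2, 4, 0, 1, 0, 6])
print(len(vec))  # 29
```

Under these assumptions the feature count matches the 29 dimensions stated above (28 pooled pair entries plus one length term), which is the main reason this reading was chosen.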

Random Forest (RF)
Random Forest (RF) is an algorithm in which Breiman et al. [11] combined random feature selection with the Bagging idea to integrate multiple decision trees. In most cases, random forests are trained with Bagging: training samples are drawn at random with replacement. RF is an ensemble classifier built from several decision tree models $\{h(X, \theta_k), k = 1, \ldots, K\}$ in the Bagging manner, where the $\theta_k$ are independent and identically distributed random vectors and $K$ is the number of decision trees in the forest. The input is a data set $D$, where $n$ represents the sample size of $D$, $X$ represents the set of $p$-dimensional feature vectors, and $Y$ represents the category vector. The RF margin function can be expressed as:
$$mr(X, Y) = \frac{1}{K}\sum_{k=1}^{K} I\big(h(X, \theta_k) = Y\big) - \max_{j \neq Y} \frac{1}{K}\sum_{k=1}^{K} I\big(h(X, \theta_k) = j\big)$$
where $j \neq Y$, $mr(X, Y)$ represents the margin function, $I(\cdot)$ represents the indicator function, and $h(X, \theta_k)$ represents the sequence of classification models.
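As a small illustration of the margin function (the share of trees voting for the correct class minus the largest share obtained by any other class), the sketch below computes it for a single sample from the votes of the individual trees. The function name `rf_margin` and the vote list are hypothetical. A positive margin means the forest classifies the sample correctly, and a larger margin means a more confident vote.

```python
from collections import Counter


def rf_margin(votes, true_label):
    """Share of trees voting for the true class minus the largest share
    obtained by any other class."""
    k = len(votes)
    tally = Counter(votes)
    correct = tally.get(true_label, 0) / k
    wrong = max(
        (n / k for label, n in tally.items() if label != true_label),
        default=0.0,  # no tree voted for another class
    )
    return correct - wrong


# Ten hypothetical trees voting on one protein pair (1 = interacting).
votes = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(round(rf_margin(votes, true_label=1), 2))
```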

Benchmark data set and Non-redundant data set
The benchmark data set and the non-redundant data set were provided by Pan et al. [12]. The benchmark data set includes a positive data set and a negative data set. The positive data set is from the Human Protein Reference Database (HPRD, 2007), and the negative data set is constructed from subcellular localization information. Most protein sequences range in length from 100 to 1000 residues. Protein pairs containing fewer than 50 residues, or sequences with uncommon amino acids (B, J, O, U, X and Z), are deleted. The resulting data set includes 36591 pairs of positive samples and 36324 pairs of negative samples. Each time, 30000 positive samples and 30000 negative samples are randomly selected to form the training set, and the rest are used as the test set. The non-redundant data set is obtained from the benchmark data set by deleting protein sequence pairs with sequence identity ≥ 25%. It contains 3899 positive protein pairs and 4262 negative protein pairs.
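The sampling scheme for the benchmark data set can be sketched as follows. Only the counts (36591 positive pairs, 36324 negative pairs, 30000 training pairs per class) come from the text; the function name `split_pairs`, the fixed seed and the placeholder pair identifiers are hypothetical.

```python
import random


def split_pairs(positives, negatives, n_train_each, seed=0):
    """Draw n_train_each pairs per class for training; the rest form the test set."""
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]  # copy so the inputs stay untouched
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = [(p, 1) for p in pos[:n_train_each]] + [(p, 0) for p in neg[:n_train_each]]
    test = [(p, 1) for p in pos[n_train_each:]] + [(p, 0) for p in neg[n_train_each:]]
    return train, test


# Placeholder identifiers standing in for the real protein pairs.
positives = [("pos", i) for i in range(36591)]
negatives = [("neg", i) for i in range(36324)]
train, test = split_pairs(positives, negatives, n_train_each=30000)
print(len(train), len(test))  # 60000 12915
```

With these counts the test set holds 6591 positive and 6324 negative pairs, so it stays roughly balanced. Changing `seed` gives a different random draw for each repetition.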

External Datasets
To verify the prediction performance of the RT-MOS model, four datasets from different species are selected as external datasets, in addition to the benchmark data set and the non-redundant data set. See Table 2 for details of these datasets. Table 2 shows that, for each of the four species, the positive and negative samples are in a 1:1 ratio. The training set accounts for about 60% of the total samples, and the remaining 40% is used as the test set.

Experimental Design
The RT-MOS model is designed and implemented based on the Keras framework. It is written in Python and supports both CPU and GPU. The flow chart of the experimental design is shown in Fig. 1 and mainly includes data acquisition, data processing and model building. Data acquisition refers to obtaining protein interaction data sets from the HPRD, Swiss-Prot, PIR and UniProt databases. Data processing refers to using MOS coding to extract features from the protein interaction data sets, converting letter sequences into feature vectors that a computer can process. Model building feeds the coded feature vectors into the machine learning algorithm, trains the protein interaction prediction model while adjusting and optimizing parameters, tests the model on the test sets, and finally evaluates and compares the model.

Prediction performance on benchmark data set
To study the prediction performance of the RT-MOS model, three comparison models, KNN-MOS, DT-MOS and AB-MOS, were constructed by combining K-Nearest Neighbor (KNN), Decision Tree (DT) and Adaptive Boosting (AB) with the MOS feature extraction method. The average prediction performance of the four models is shown in Table 3. It can be seen from Table 3

Prediction performance on non-redundant data set
To evaluate the generalization performance of the RT-MOS model, we tested it on the non-redundant data set and obtained 91.34% accuracy. The specific prediction results are shown in Table 5. Table 5 shows that the accuracy, recall and AUC of RT-MOS on the non-redundant data set are 91.34%, 95.52% and 93.76%, respectively. By comparison, Gui et al. [10] (DNN-MOS) and Shen et al. [13] (SAE-AC) obtained 88.29% and 85.84% accuracy on the non-redundant data set, respectively. The accuracy of RT-MOS on the non-redundant data set is therefore better than that of DNN-MOS and SAE-AC. The RT-MOS model remains effective on a low-similarity data set and can be used to predict protein interactions there. At the same time, the prediction performance of the RT-MOS, DNN-MOS and SAE-AC models on the benchmark data set is better than that on the non-redundant data set. This shows that, in protein interaction prediction, the sequence identity of the data set has a great impact on model performance: reducing the sequence identity of the data set leads to a decline in prediction performance.
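For reference, accuracy, recall and precision can be computed from confusion-matrix counts as sketched below. The function name `classification_metrics` is hypothetical, and the counts in the example are made up for illustration; they are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Illustrative counts only (not taken from the paper's tables).
m = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print({name: round(value, 4) for name, value in m.items()})
```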

Prediction performance on external datasets
To further verify the generalization performance of the RT-MOS model, it is applied to four external data sets (C. elegans, Drosophila, E. coli and H. sapiens) (see Table 6). Table 6 shows that the accuracy of RT-MOS on the four external data sets of C. elegans, Drosophila, E. coli and H. sapiens is 96.21%, 97.86%, 97.54% and 97.75%, respectively. Among them, the accuracy on Drosophila, E. coli and H. sapiens is higher than that on the benchmark data set (97.18%). These results show that RT-MOS also achieves good prediction performance for protein interactions in other species and has good generalization ability.

Comparison with existing methods
To verify the effectiveness of the RT-MOS model in protein interaction prediction, it is compared with existing methods, and the comparison results are shown in Table 7. The datasets used by all methods in Table 7 are human datasets downloaded from the Human Protein Reference Database (HPRD). Table 7 shows that the accuracy of the existing methods is between 83.90% and 94.10%, with the best result achieved by the CS-SVM model. RT-MOS and DNN-MOS adopt the same coding method, but RT-MOS achieves 97.18% accuracy, which is better than DNN-MOS (93.35%) by 3.83 percentage points. This comparison shows that RT-MOS improves on the prediction performance of existing methods, saves time, resists overfitting and achieves high accuracy, and is suitable for large-scale protein interaction prediction.

Conclusion
The random forest algorithm has been applied in many fields. Although it has also been applied to protein interaction prediction, its prediction performance still needs to be improved. Therefore, we combined the random forest algorithm with the MOS protein sequence coding method to build the RT-MOS protein interaction prediction model, and obtained 97.18% accuracy, 97.19% AUC, 95.96% recall and 96.44% precision. Compared with models constructed with other machine learning algorithms, RT-MOS performs better than KNN-MOS, DT-MOS and AB-MOS. The good prediction performance of RT-MOS may be because random forest is an ensemble learning method: integrating multiple decision trees avoids the defects of a single decision tree, and because bagging amounts to sampling both samples and features, it helps avoid overfitting. RT-MOS still achieves good accuracy with low-dimensional feature vectors, avoiding problems such as large error, low accuracy and overfitting. Moreover, RF trains quickly, gives accurate predictions and can handle a large number of inputs. In addition, MOS coding is low-dimensional and time-saving, so RT-MOS is suitable for processing large-scale sample data. In view of these advantages, the RT-MOS model can be a useful complement to existing protein interaction prediction methods.