MSPoisDM: A Novel Peptide Identification Algorithm Optimized for Tandem Mass Spectra

Tandem mass spectrometry (MS/MS) plays an extremely important role in proteomics research. Modern experiments generate thousands of spectra, and interpreting LC-MS/MS data remains a challenging problem in tandem mass spectra analysis. Our peptide identification algorithm, MSPoisDM, integrates peak intensity information derived from target-decoy statistics, a kind of information that is often undervalued. To make better use of the intensity information, we propose a novel scoring model based on the Poisson distribution. Compared with the commonly used commercial software Mascot and Sequest at 1% FDR, MSPoisDM proves robust and versatile across datasets obtained from different instruments. We expect MSPoisDM to be broadly applicable in proteomics studies.


Introduction
In biological sample analysis, mass spectrometry (MS)-based proteomics has evolved into an indispensable approach [7,11]. In a proteomics experiment, proteins are cleaved into peptides by a selected enzyme, then separated and introduced into the mass spectrometer for subsequent analysis [7,12]. Modern proteomics experiments generate thousands of fragmentation spectra, so inferring peptide sequences is challenging and peptide identification algorithms are a necessity [1,12]. The algorithmic model greatly affects the efficiency and accuracy of spectrum searching [10]. The scoring function is the core of a protein identification algorithm, and current algorithms can be divided into four categories as follows [14,16,19].
(1) Correlation matching model: protein digestion and mass spectrometric detection are simulated mathematically, so that theoretical enzymatic peptides are transformed into predicted spectra; the degree of correlation between the predicted and experimental spectra is then evaluated mathematically to obtain the search results. Representative algorithms include Sequest [6] and pFind [12].
(2) Probabilistic matching model: the credibility of a match is expressed as a statistical probability, obtained by counting how frequently a given value occurs within a given error range in the protein database [15,17]; a model built on these probabilities is then used to search for the correct peptides. Representative algorithms include Mascot, OMSSA [2], X!Tandem [4], Andromeda and ProVerB [18].
(3) Random matching model: using the mass distribution of proteins and of theoretical enzymatic peptides, the spectrum is divided into several intervals, and the identification model is built by calculating the probability that each interval matches the selected peptides at random. Representative algorithms include SCOPE [3] and Probity [8].
(4) Empirical weight matching model: different empirical weights are assigned to key ions, consecutive occurrences, intensities, pair-wise amino acid patterns, etc. Representative algorithms include MassWiz [19] and SQID [13].
Integrating feature information into scoring algorithms is necessary, because it can improve both the confidence of search results and search efficiency. The m/z value is the main feature used by the mainstream search engines, including Mascot, Sequest, X!Tandem and OMSSA. Peak intensity is often undervalued because of its unreliability, so integrating intensity information requires a sound way of selecting peaks. SQID and ProVerB use different approaches to select informative peaks and study fragmentation intensity patterns, and the identification algorithms built on them achieved excellent results. Dispec proposed a novel feature based on peptide matching discriminability (PMD) [1], which richly reflects the properties of the experimental spectrum. Hence, adding richer feature information can improve the efficiency and reliability of identification results [6]. In this paper, we propose a novel peptide identification algorithm, named MSPoisDM, which integrates a brand-new feature, Peak Intensity Identification-ability (PII). PII measures the degree of real matching, and we built a novel scoring model to incorporate it. To validate the reliability and accuracy of MSPoisDM, we tested it on diverse datasets from various mass spectrometer platforms; compared with Mascot and Sequest at the 1% FDR level [9], MSPoisDM was more robust and achieved higher identification.

MSPoisDM Identification Algorithm
MSPoisDM uses the Poisson distribution to construct a novel scoring model and incorporates the PII information. It was implemented in Matlab (version 8.1.0.604, R2013a). Obtaining the PII feature from the training experimental spectra is crucial for the algorithm. The design of the algorithm is described through the following aspects.
(1) Isotope discarding: isotopes are abundant in nature, and isotope peaks are therefore abundant in experimental spectra. Isotope peaks lead to more random matches rather than real matches, so the key step is to judge correctly whether a peak is an isotope peak. The rule is as follows: if two peaks lie closer together than a set threshold, they are considered isotope peaks and the lower-intensity peak is discarded. This treatment reduces random matches and enhances accuracy.
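The isotope rule above can be sketched in a few lines. This is a minimal Python illustration, not the authors' Matlab code: the function name `discard_isotope_peaks` and the `min_gap` parameter are hypothetical, and because the threshold used by MSPoisDM is not stated here it is left as an argument.

```python
def discard_isotope_peaks(peaks, min_gap):
    """Drop the weaker peak of any pair closer than min_gap in m/z.

    peaks:   list of (mz, intensity) tuples.
    min_gap: hypothetical m/z threshold (the value used by MSPoisDM
             is not given in the text, so the caller must supply it).
    """
    kept = []
    for mz, intensity in sorted(peaks):
        if kept and mz - kept[-1][0] < min_gap:
            # Too close to the last kept peak: keep only the stronger one.
            if intensity > kept[-1][1]:
                kept[-1] = (mz, intensity)
        else:
            kept.append((mz, intensity))
    return kept
```

The sketch compares each peak only with the most recently kept peak, which is one simple way to resolve chains of closely spaced peaks; other tie-breaking choices are possible.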
(2) Peak selection: different peptide identification algorithms select informative peaks in different ways. Sequest and SQID select the strongest 200 and 80 peaks in each experimental spectrum, respectively; OMSSA divides the spectrum into several bins and selects the top 5 peaks in each bin; MS Amanda selects the most intense peaks in each window. Here we adopted the dynamic approach reported by ProVerB: MSPoisDM selects the top 6 peaks in each window.
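As an illustration of window-based peak selection, the following Python sketch keeps the most intense peaks in each m/z window. It assumes fixed-width windows; the default width of 100 m/z units and the function name `select_top_peaks` are assumptions made for the example, and the dynamic scheme reported by ProVerB may differ. Only the top-6 setting comes from the text.

```python
from collections import defaultdict

def select_top_peaks(peaks, window_width=100.0, top_n=6):
    """Keep the top_n most intense peaks in each fixed-width m/z window.

    peaks:        list of (mz, intensity) tuples.
    window_width: assumed window size (not specified in the text).
    top_n:        peaks kept per window (6 in MSPoisDM).
    """
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // window_width)].append((mz, intensity))

    selected = []
    for index in sorted(windows):
        # Rank by intensity, keep the strongest, then restore m/z order.
        strongest = sorted(windows[index], key=lambda p: p[1], reverse=True)[:top_n]
        selected.extend(sorted(strongest))
    return selected
```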
(3) Extracting the PII feature: integrating rich feature information is necessary to enhance the accuracy of a scoring algorithm. PII measures whether a match is real, and its extraction involves three aspects.
(a) Training dataset: using a sound dataset for training is extremely important. The training spectra of MSPoisDM were those identified by all of Mascot, Sequest and ProVerB at the 1% FDR level. Hence, we considered the training spectra to be correctly identified.
(b) Statistical method: different statistical methods generate different results. To obtain the PII feature, the statistical process comprised three steps. First, the key ion types were fixed; only a selected set of ion types was treated as key ion types. Second, the normalized peak intensities were divided into 12 intervals, as detailed in Table 1. Intensities were normalized according to Eq. (1), in which the reference quantity is the mean of the top three peaks, which enhances the reliability of the statistics. Third, the training data set was searched against the forward and the reverse reference sequences separately, and the matching results were recorded.
(c) Quantification: quantifying the statistical results above is crucial for MSPoisDM. We adopted Eq. (2) for this purpose, which not only retains variation but also yields smoother quantitative results.
In Eq. (2), the symbols denote, in turn: the number of key ion types; the number of intensity intervals; the forward reference sequence; the reverse reference sequence; the number of fragment ion matches of a given ion type and intensity interval obtained with the forward sequence; the corresponding number obtained with the reverse sequence; and the resulting PII value, which reflects the degree of real matching for that ion type and intensity interval. Table 1 lists the calculated PII values.
(4) False discovery rate (FDR) calculation: no search engine can ensure that all identifications are correct, so the peptide-spectrum matches (PSMs) were exported to calculate the FDR threshold. We used in-house Matlab code to extract results from the Mascot and MSPoisDM output files, subject to a peptide-length criterion; Sequest results were extracted from the output files by keeping the highest-ranked PSMs that met a further criterion as well as the peptide-length criterion. The FDR was then calculated by Käll's method for each engine, as given in Eq. (3).
(5) Scoring algorithm: the scoring function is the heart of a peptide identification algorithm. First, MSPoisDM considers three features: fragment matches, consecutive fragment matches and b/y-ion matches [1]. Second, a scoring function based on the Poisson distribution is constructed for each feature. Finally, the PII information is integrated into the scoring model. The details are as follows.
(a) Fragment matches: it is hard to propose a universal scoring function for various strategies. We addressed this problem by using the Poisson distribution to build an appropriate function.
The Poisson distribution is given by
$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \qquad (4)$$
where $x$ is the number of fragment matches; $P(x)$ is the probability of observing $x$ matches, which embodies the confidence of the fragment matches; and $\lambda$ is the theoretical mean number of fragment matches, whose value is calculated from the following formula.
The preliminary score of fragment matches is calculated from the following formula, in which the more fragments are matched, the higher the score. To integrate the PII information, the fragment matches then need to be re-scored.
The re-scored value is the final score of fragment matches, and the accompanying term reflects the confidence of the matching.
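The Poisson step for fragment matches can be illustrated with a short Python sketch. The names `x` and `lam` (number of matches and theoretical mean) are chosen for the example, and the negative log10 transform in `hypothetical_score` is an assumption used only to show how a probability can be turned into a score; the preliminary and final scoring formulas of MSPoisDM, including the PII re-scoring, are not reproduced here.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x random matches under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def hypothetical_score(x, lam):
    """Illustrative score: the less likely x matches are by chance, the higher the score.

    This -log10 transform is an assumption, not the paper's scoring formula.
    """
    return -math.log10(poisson_pmf(x, lam))
```

With this illustrative transform the score increases with the number of matches only once `x` exceeds `lam`, because the Poisson probability decreases monotonically beyond its mean.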
(b) Consecutive fragment matches: consecutive fragment matches are hard to integrate into scoring algorithms. We treated fragment matches and consecutive fragment matches as two independent pieces of information and scored them separately, which improves the efficiency of MSPoisDM. Two fragment matches are termed consecutive if they satisfy two conditions: they belong to the same ion type, and their mass difference equals the mass of exactly one residue. The scoring formula is defined in terms of the number of consecutive fragment matches, the probability of consecutive fragment matches, and the theoretical mean of consecutive fragment matches. The theoretical mean is calculated from the random consecutive-match probability reported by ProVerB and the number of theoretical consecutive matches. At a particular number of matches the probability reaches its maximum (11). As with fragment matches, the preliminary score of the consecutive matches is calculated first (12). The final score then integrates the PII information: if two matches form a consecutive pair, their respective PII values are combined into a single term, calculated by a further formula, that enters the final score.
(c) b/y-ion matches: b- and y-ions are the dominant ion types under CID. Evaluating the b/y-ion matches improves the robustness and accuracy of a peptide identification algorithm, so we included them in the scoring model as independent information (15), defined in terms of the number of b/y-ion matches, the probability of b/y-ion matches, and the theoretical mean of b/y-ion matches. The theoretical mean is calculated from Eq. (16), where 0.02 is the random b/y-ion match probability and the other quantity is the number of theoretical b/y-ion matches.
At a particular number of b/y-ion matches, the probability reaches its maximum value.
Similarly, the preliminary score of b/y-ion matches is obtained from Eq. (18), and the final score of b/y-ion matches from Eq. (19).
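The consecutive-match condition described in (b) can be sketched as follows. The Python example assumes each matched fragment is recorded as an (ion type, position) pair, for example ("b", 3) for the b3 ion; under that assumption, two matches of the same ion type at adjacent positions differ by exactly one residue mass. The representation and the function name are hypothetical.

```python
def count_consecutive_matches(matched_ions):
    """Count pairs of matches with the same ion type at adjacent positions.

    matched_ions: iterable of (ion_type, position) tuples, e.g. ("b", 3).
    Adjacent positions within one ion series differ by one residue mass.
    """
    matched = set(matched_ions)
    return sum(1 for ion_type, position in matched
               if (ion_type, position + 1) in matched)
```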
(d) Score of the candidate peptide: the score of a candidate peptide is calculated by Eq. (20). It measures the degree of similarity between the experimental and theoretical spectra. The highest-scoring candidate peptide is reported as the final search result.
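The FDR filtering of step (4) can be illustrated with a generic target-decoy estimate. The Python sketch below assumes separate lists of scores for forward (target) and reverse (decoy) PSMs and estimates the FDR at a threshold as the ratio of decoy to target hits at or above it; this simple ratio and the function names are assumptions for illustration, and Eq. (3) of the paper is not reproduced.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimated FDR: decoy hits divided by target hits at or above threshold."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return decoys / targets if targets else 0.0

def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest target score whose estimated FDR does not exceed max_fdr.

    Returns None if no threshold satisfies the limit.
    """
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            best = threshold
    return best
```

PSMs scoring at or above the returned threshold would then be retained at the chosen FDR level (1% in this paper).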

MS/MS Datasets and Search Engine
We used data sets from different instrument platforms. Standard mixtures of 18 proteins were obtained from four types of MS instruments: Thermo Finnigan LTQ-FT, Thermo Finnigan LCQ DECA, Thermo Finnigan LTQ and Micromass/Waters QTOF Ultima. For convenience, these are abbreviated as FT, LCQ, LTQ and QTOF; the data are publicly available at https://regis-web.systemsbiology.net//PublicDatasets/. The public E.coli data sets were downloaded from http://macrottelab.org/MSdata/Data_03/ and comprise three sub-datasets, named E.coli1, E.coli2 and E.coli3. The S.pneumoniae D39 data set, acquired on an LTQ-Orbitrap, was obtained from http://bioinformatics.jnu.edu.cn/software/proverb/; it served both as the training data set for extracting the PII feature and as a test data set. The yeast data set, generated by HCD, was obtained from the ophthalmology center of Sun Yat-sen University. Mascot and Sequest are widely used in proteomics research and were adopted for comparison with MSPoisDM; their versions were 2.3 and 28.13, respectively. For Mascot searches, the dta files were converted into Mascot generic format (MGF) files by merge.pl, downloaded from the official Mascot website. The dta files were used directly as input for Sequest and MSPoisDM. The following search criteria were applied to Mascot, Sequest and MSPoisDM: carbamidomethylation of cysteine (+57.021464 Da) as a fixed modification, oxidation of methionine (+15.994915 Da) as a variable modification, and full tryptic specificity. Other parameters are listed in Table 2.
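For readers unfamiliar with the two file formats, the dta-to-MGF conversion can be sketched in Python. This is not merge.pl; the function name is hypothetical. The sketch assumes the usual dta layout, in which the first line holds the singly protonated precursor mass (MH+) and the charge, followed by one "m/z intensity" pair per line, and it writes a minimal MGF record with the precursor m/z in PEPMASS.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da

def dta_to_mgf(dta_text, title):
    """Convert the text of one dta file into a single MGF ion block."""
    lines = [ln for ln in dta_text.strip().splitlines() if ln.strip()]
    mh_plus, charge = lines[0].split()[:2]
    charge = int(charge)
    # dta stores MH+; MGF expects the precursor m/z at the observed charge.
    precursor_mz = (float(mh_plus) + (charge - 1) * PROTON_MASS) / charge

    out = ["BEGIN IONS",
           f"TITLE={title}",
           f"PEPMASS={precursor_mz:.4f}",
           f"CHARGE={charge}+"]
    out.extend(" ".join(ln.split()[:2]) for ln in lines[1:])
    out.append("END IONS")
    return "\n".join(out)
```

Concatenating the blocks produced for each dta file yields one MGF file that can be submitted to a Mascot search.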

Results
MSPoisDM was compared with Mascot and Sequest after FDR calculation. On the S.pneumoniae D39 data set, MSPoisDM overlapped more with Mascot than with Sequest: the overlap between MSPoisDM and Mascot was 88.3%, whereas that between MSPoisDM and Sequest was only 74.0%. Figure 1 shows the pairwise overlaps among Mascot, Sequest and MSPoisDM.
For the standard mixtures of 18 proteins, acquired on the FT, QTOF, LCQ and LTQ instruments, MSPoisDM identified the most peptides of the three algorithms on every instrument except LTQ, which demonstrates its robustness and stability. MSPoisDM also identified more spectra than Sequest on every instrument. Figures 2 and 3 show the identified peptides and spectra, respectively, for the four instruments.
For the E.coli data sets, comprising the three subsets E.coli1, E.coli2 and E.coli3, MSPoisDM identified the most peptides and spectra of the three search engines, showing its high identification capability. Details are given in Figures 4 and 5.
To verify the accuracy of MSPoisDM, we calculated the pairwise overlap between all search engines. Peptides identified by at least two search engines were defined as high-confidence peptides. Figure 6 and Table 3 show that MSPoisDM yielded more high-confidence peptides than the other engines. For convenience, in Tables 3 and 4, Mascot, Sequest, MSPoisDM, high-confidence peptides and high-confidence spectra are abbreviated as M, S, MP, H_P and H_S, respectively.
To verify the generality of MSPoisDM, we used the yeast data set generated by HCD. The search results show that MSPoisDM identified more peptides than Mascot. Figures 8 and 9 show the peptides and spectra identified by Mascot and MSPoisDM, respectively.
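The definition of high-confidence peptides used in the comparison (peptides identified by at least two search engines) translates directly into a set computation. The Python sketch below is illustrative only; the function name and the peptide sequences in the usage example are made up.

```python
from collections import Counter

def high_confidence(*engine_results, min_engines=2):
    """Return the peptides reported by at least min_engines search engines.

    engine_results: one collection of peptide sequences per search engine.
    """
    counts = Counter()
    for peptides in engine_results:
        counts.update(set(peptides))  # count each peptide once per engine
    return {peptide for peptide, n in counts.items() if n >= min_engines}

# Hypothetical usage with made-up peptide sequences:
mascot = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"}
sequest = {"PEPTIDEB", "PEPTIDED"}
mspoisdm = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEE"}
confident = high_confidence(mascot, sequest, mspoisdm)
```

The number of high-confidence peptides contributed by one engine is then the size of the intersection between that engine's result set and `confident`.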

Discussion
We have proposed MSPoisDM, a novel peptide identification algorithm optimized for tandem mass spectra that integrates the PII feature. Tests on diverse data sets from different MS instruments showed that MSPoisDM is robust, accurate and stable. To verify its generality, we also tested it on the yeast HCD data set [5], on which MSPoisDM identified more peptides than Mascot. Hence, MSPoisDM is a general-purpose peptide identification algorithm for tandem mass spectra.