Summary

In high-throughput proteomics, the development of computational methods and novel experimental strategies and their application often rely on each other. However, most computational approaches still make the assumption that each MS/MS spectrum comes from one peptide, while there are numerous situations where one MS/MS spectrum can contain fragment ions corresponding to two or more peptides. Examples include mixture spectra from co-eluting peptides in complex samples, spectra generated from data-independent acquisition methods and spectra from peptides with complex PTMs. We propose a new database search tool (MixDB) that is able to identify mixture MS/MS spectra from more than one peptide. We show that peptides can be reliably identified with up to 95% accuracy from mixture spectra while considering only a small fraction of all possible peptide pairs (speedup of four orders of magnitude). Comparison with current database search methods indicates that our approach has better or comparable sensitivity and precision at identifying single-peptide spectra while being able to identify 20% more mixture spectra at significantly higher precision.
Run MixDB as follows:
java -Xmx1200M -jar MixDB.jar [fasta file] [query file] [precursor mass tolerance] [outputfile]
This searches the sequence database and finds the pair of peptides that best matches the query spectrum. The precursor mass tolerance is given in Da. A relatively large tolerance such as 3 Da is usually advisable, even when the query comes from high-accuracy MS data, so that mixture spectra can still be identified.
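For scripted runs, the command line above can be assembled programmatically. The following is a minimal Python sketch, not part of MixDB itself: the function names `build_mixdb_command` and `run_mixdb`, the example file names, and the default output file name are all hypothetical, and the only arguments used are the four positional ones listed above.

```python
import subprocess
from typing import List


def build_mixdb_command(
    fasta_file: str,
    query_file: str,
    tolerance_da: float = 3.0,                # 3 Da, the tolerance suggested above
    output_file: str = "mixdb_results.txt",   # hypothetical default name
    jar_path: str = "MixDB.jar",
    heap: str = "1200M",
) -> List[str]:
    """Return the argument list for the MixDB invocation described above."""
    if tolerance_da <= 0:
        raise ValueError("precursor mass tolerance must be positive")
    return [
        "java", f"-Xmx{heap}", "-jar", jar_path,
        fasta_file, query_file, f"{tolerance_da:g}", output_file,
    ]


def run_mixdb(command: List[str]) -> None:
    """Run the search; raises CalledProcessError if java exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # File names here are placeholders, not files shipped with MixDB.
    cmd = build_mixdb_command("proteins.fasta", "spectra.mgf")
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.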
After the search, MixDB uses an SVM to determine whether a match is significant.
- SVM classification is done using the svm-light package. Please download the binaries from http://svmlight.joachims.org/. Then put the appropriate binaries ("svm_learn" and "svm_classify") into the "svm_light_linux" or "svm_light_windows" folder, depending on your system.
Use the mixdbSVMClassify.pl script to perform the classification.
Run the script as follows:
./mixdbSVMClassify.pl [search result file] [outputFile]
Note: you might need to change the first line in the mixdbSVMClassify.pl script to specify the correct path for the svm_light binary.
Outputs are in tab-delimited format, and the columns have the meanings listed below. We denote the query spectrum by M and the pair of peptides best matched to M by A and B. For mixture matches, some result columns contain two values separated by a "!".
Column  Content
1-6     Query spectrum scan number
7       Query spectrum precursor mass
8       Precursor of A
9       Precursor of B
10      Peptide A (number after . is charge of peptide)
11      Peptide B
12      Protein name for peptide A
13      Protein name for peptide B
14      Raw score between M and A+B
15      Raw score between M and A only
16      Raw score between M and B only
17      Raw score between M and A divided by length of A
18      Raw score between M and B divided by length of B
19      Total explained intensity by A and B in M
20      Fraction of b ions present in M for A
21      Fraction of y ions present in M for A
22      Fraction of b ions present in M for B
23      Fraction of y ions present in M for B
24      Longest consecutive series of b ions for A
25      Longest consecutive series of y ions for A
26      Longest consecutive series of b ions for B
27      Longest consecutive series of y ions for B
28      Average fragment mass error for A
29      Average fragment mass error for B
30      svm-score for matches; a high score means at least one of peptides A and B is a significant match to M
31      svm-score for mixture matches; a higher score means both A and B are significant matches to M
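The column layout above can be read with a few lines of Python. This is a sketch under stated assumptions rather than a tool shipped with MixDB: the names `COLUMNS`, `parse_results` and `is_mixture` are hypothetical, the classified output is assumed to contain all 31 columns, and the score threshold of 0.0 is an illustrative placeholder, since no cutoff is specified here.

```python
from typing import Dict, Iterable, Iterator, List

# 1-based column numbers taken from the column table above.
COLUMNS = {
    "precursor_mass": 7,
    "peptide_a": 10,
    "peptide_b": 11,
    "protein_a": 12,
    "protein_b": 13,
    "svm_score": 30,
    "svm_score_mixture": 31,
}


def parse_results(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one record per result line; every value is split on '!'."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 31:
            continue  # skip blank or incomplete lines
        yield {
            name: fields[number - 1].split("!")
            for name, number in COLUMNS.items()
        }


def is_mixture(record: Dict[str, List[str]], threshold: float = 0.0) -> bool:
    """True if the mixture svm-score exceeds the (assumed) threshold."""
    try:
        return float(record["svm_score_mixture"][0]) > threshold
    except ValueError:
        return False  # non-numeric field, e.g. a header line


if __name__ == "__main__":
    # "classified_results.txt" is a placeholder for the script's output file.
    with open("classified_results.txt") as handle:
        for rec in parse_results(handle):
            if is_mixture(rec):
                print(rec["peptide_a"][0], rec["peptide_b"][0])
```

Each field is returned as a list so that single-valued and "!"-separated columns can be handled the same way; choose the threshold after inspecting the score distribution of your own data.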