Computational Mass Spectrometry

The computational mass spectrometry group, headed by Professors Vineet Bafna and Pavel Pevzner, focuses on developing algorithms to process mass spectrometry data. Our lab has developed a number of tools for computational proteomics, each with its own purpose and setting. These tools are free to download and are also integrated into a web server.


MxDB March 11, 2014

Chemical cross-linking and mass spectrometry have recently been shown to constitute a powerful tool to study protein-protein interactions and to help elucidate the structure of large protein complexes. However, computational methods to interpret the complex MS/MS spectra from linked peptides are still in their infancy, thus making the high-throughput application of this approach largely impractical. Due to the lack of large annotated datasets, most current approaches do not capture the specific fragmentation patterns of linked peptides and therefore are not optimal for identification of cross-linked peptides. Here we propose a generic approach to address this problem and demonstrate it using disulfide-bridged peptide libraries to 1) efficiently generate large mass spectral reference data for linked peptides at a low cost and 2) automatically train an algorithm that can efficiently and accurately identify linked peptides from MS/MS spectra. We show that using this approach we can identify thousands of MS/MS spectra from disulfide-bridged peptides against proteome-scale sequence databases and significantly improve the sensitivity of identifying cross-linked peptides. This allows us to identify 60% more direct pairwise interactions between the protein subunits in the 20S proteasome complex than existing tools on cross-linking studies of the proteasome complexes. The basic framework of this approach and the MS/MS reference dataset generated should be a valuable resource for the future development of new tools for the identification of linked peptides.

Specialize February 14, 2014

Detection of complex post-translational modifications (PTMs) remains a challenging open problem in proteomics. Simple PTMs (e.g., methylation and deamidation) can be readily identified in tandem mass spectrometry (MS/MS) by considering characteristic mass shifts in peptide fragment ions. However, more complex PTMs such as glycosylation and small ubiquitin-like modification (SUMOylation) present a more difficult problem because the modification substantially changes the fragmentation pattern of the substrate peptide, making current database search methods unsuitable for identifying MS/MS spectra from these modified peptides. We propose a novel approach to enable the expedited development of new algorithms for any type of PTM peptide fragmentation. Using SUMOylation as an example, we demonstrate how to generate large and reliable MS/MS training data from SUMOylated peptides and derive algorithms (Specialize) that learn PTM-specific fragmentation from the training data and apply appropriate false discovery rate (FDR) procedures. Benchmarking our new methods on several datasets of varying complexity from SUMOylation studies, we show that our methods are significantly more sensitive than current state-of-the-art methods.
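
The summary above does not say which FDR procedure Specialize applies. As a rough illustration of the general idea only, the Python sketch below estimates FDR with the widely used target-decoy strategy, which is an assumption here. The function names, the (score, is_decoy) input layout, and the equal-size target and decoy databases are all hypothetical.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate FDR among target matches scoring at or above threshold.

    psms: list of (score, is_decoy) pairs. Assumes target and decoy
    databases of equal size, so decoy hits estimate false target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is within max_fdr, or None."""
    best = None
    # Walk thresholds from strictest to most permissive; keep the last one that passes.
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if target_decoy_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy matches: (score, is_decoy); the values are made up for illustration.
matches = [(10, False), (9, False), (8, True), (7, False), (6, False)]
print(score_cutoff_for_fdr(matches, max_fdr=0.01))
```

Because the estimate is not monotone in the threshold, the sketch scans every observed score rather than stopping at the first failure.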

Tremolo August 12, 2013

Tremolo is a spectral library search tool that leverages the Spectral Library Generating Function (SLGF) concept to identify spectrum-spectrum matches (SSMs). The SLGF models the variability of replicate spectra as compared to reference library spectra. Given a similarity function (in our case cosine), SLGF yields an expected score distribution for each reference library spectrum. Tremolo is able to assign p-values to SSMs and it has been shown to increase the sensitivity of spectral library searches.
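
To make the similarity function concrete, here is a minimal Python sketch of a cosine score between a query spectrum and a library spectrum. The unit bin width, the function names, and the toy peak lists are illustrative assumptions, not Tremolo's actual implementation. The SLGF itself (the expected score distribution for each library spectrum) is not modeled here.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_similarity(query_peaks, library_peaks, bin_width=1.0):
    """Cosine of the angle between two binned intensity vectors, in [0, 1]."""
    q = bin_peaks(query_peaks, bin_width)
    lib = bin_peaks(library_peaks, bin_width)
    dot = sum(q[b] * lib[b] for b in q if b in lib)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_lib = math.sqrt(sum(v * v for v in lib.values()))
    if norm_q == 0.0 or norm_lib == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_q * norm_lib)


# Toy (m/z, intensity) peak lists; the values are made up for illustration.
library = [(175.1, 30.0), (262.2, 100.0), (375.3, 55.0)]
replicate = [(175.1, 28.0), (262.2, 90.0), (375.3, 60.0), (401.0, 5.0)]
print(round(cosine_similarity(replicate, library), 3))
```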

MixDB December 6, 2011

Most computational approaches assume that each MS/MS spectrum comes from one peptide, yet there are numerous situations where one MS/MS spectrum can contain fragment ions corresponding to two or more peptides. Examples include mixture spectra from co-eluting peptides in complex samples, spectra generated from data-independent acquisition methods, and spectra from peptides with complex PTMs and cross-linked peptides. We propose a new database search tool (MixDB) that is able to identify mixture MS/MS spectra from more than one peptide. We show that peptides can be reliably identified with up to 95% accuracy from mixture spectra while considering only a small fraction of all possible peptide pairs (speedup of four orders of magnitude).

Resurrection of a clinical antibody January 5, 2011

Using GenoMS, we were able to sequence a new mouse hybridoma antibody directed against a member of the TNF-superfamily, lymphotoxin alpha (LT-α). Details of the protein sequencing effort can be found here.

GenoMS June 1, 2010

We developed the tool, GenoMS, for sequencing small samples of proteins. Protein sequence templates (i.e. proteins or genomic sequences that are similar to the target protein) are identified using the database search tool InsPecT. The templates are then used to recruit, align, and de novo sequence regions of the target protein that have diverged from the database or are missing. We used GenoMS to reconstruct the full sequence of an antibody by using spectra acquired from multiple digests using different proteases. Antibodies are a prime example of proteins that confound standard database identification techniques. The mature antibody genes result from large-scale genome rearrangements with flexible fusion boundaries and somatic hypermutation. Using GenoMS we automatically reconstruct the complete sequences of two immunoglobulin chains with accuracy greater than 98% using a diverged protein database. Using the genome as the template, we achieve accuracy exceeding 97%. More details can be found here.

Nonribosomal Peptides Dereplication and Sequencing August 7, 2009

Nonribosomal peptides (NRPs) are of great pharmacological importance, but there is currently no technology for high-throughput NRP 'dereplication' and sequencing. We used multistage mass spectrometry followed by spectral alignment algorithms for sequencing of cyclic NRPs. We also developed an algorithm for comparative NRP dereplication that establishes similarities between newly isolated and previously identified similar but nonidentical NRPs, substantially reducing dereplication efforts. The homepage for this project can be found here.

Arabidopsis Proteogenomics January 8, 2009

Our study of the Arabidopsis proteome through tandem mass spectrometry revealed over 18,000 novel peptides not in the TAIR7 genome annotation release. Using Inspect, we identified over 144,000 peptides from three sequence databases: the six-frame translation of the Arabidopsis genome, an exon graph based on ab initio gene predictions, and the TAIR7 proteome. From the novel peptides we predicted over 700 new gene models and over 600 corrections to current gene models. The peptides and predicted models can be accessed here.

Multistage mass spectrometry May 29, 2008

Multistage mass spectrometry (collecting multiple MS^3 spectra from each MS^2 spectrum) and accurate precursor masses (but inaccurate fragment masses) have been demonstrated to lead to significant gains in peptide identification via database search but have had a limited impact in de novo peptide sequencing. Our Multi-stage Spectral Networks package addresses both of these in a rigorous probabilistic framework for analyzing spectra of overlapping peptides, resulting in both accurate de novo peptide sequencing from multistage mass spectra (despite the inferior quality of MS^3 spectra) and improved interpretation of spectral networks. Additional details and the open-source package are available here.

Phosphate Localization Score April 4, 2008

Phosphate Localization Score is an algorithm that determines the confidence of the placement of a phosphate on a given residue. This method is similar to the AScore, and is described in Albuquerque et al., Mol Cell Proteomics 2008. The program is integrated with the Inspect package, which is available for download here. A tutorial for using the program is in the Inspect documentation, here.

MS-Dictionary November 30, 2007

MS-Dictionary is software that generates all plausible de novo interpretations of a tandem mass spectrum (the spectral dictionary) and quickly matches them against a protein database. It enables proteogenomic searches in six-frame translations of genomic sequences that may be prohibitively time-consuming for existing database search approaches.
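
For readers unfamiliar with the term, a six-frame translation reads a DNA sequence in three reading frames on each strand. The Python sketch below shows the idea using the standard genetic code. The function names and frame labels are illustrative assumptions, and it says nothing about how MS-Dictionary itself builds or searches such a database.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse-complement strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3; labels are an assumption) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```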

MS-GeneratingFunction November 28, 2007

MS-GF is software for computing the generating function of a tandem mass spectrum. The generating functions and their derivatives represent new features of tandem mass spectra that improve peptide identifications. Further, they enable one to rigorously compute error rates of peptide identifications and to get a better sensitivity-specificity trade-off from existing MS/MS search tools.
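
As a toy illustration of what a score-generating function captures, the Python sketch below uses dynamic programming to count, for one spectrum, how many peptides of a given mass reach each score. Everything specific in it is a simplifying assumption rather than MS-GF's method: the three-residue alphabet with nominal integer masses, the score (number of prefix masses that coincide with a peak), and the function names.

```python
from collections import Counter

# Toy alphabet with nominal integer residue masses; an assumption made to
# keep the example small.
RESIDUE_MASSES = {"G": 57, "A": 71, "S": 87}


def score_histogram(peaks, peptide_mass):
    """Count peptides of a given total mass by score.

    Score here is the number of prefix masses (including the full mass)
    that coincide with a peak. Returns a Counter {score: number of peptides}.
    """
    peaks = set(peaks)
    table = [Counter() for _ in range(peptide_mass + 1)]
    table[0][0] = 1  # the empty prefix has score 0
    for mass in range(1, peptide_mass + 1):
        hit = 1 if mass in peaks else 0
        for residue_mass in RESIDUE_MASSES.values():
            prev = mass - residue_mass
            if prev < 0:
                continue
            # Extend every prefix of mass `prev` by one residue.
            for score, count in table[prev].items():
                table[mass][score + hit] += count
    return table[peptide_mass]


def tail_fraction(peaks, peptide_mass, threshold):
    """Fraction of equal-mass peptides scoring at least threshold."""
    hist = score_histogram(peaks, peptide_mass)
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(c for s, c in hist.items() if s >= threshold) / total


# Peptides of mass 128 over {G, A} are GA and AG; only GA has a prefix at 57.
print(dict(score_histogram({57}, 128)))
print(tail_fraction({57}, 128, threshold=1))
```

The tail fraction plays the role of an error estimate in this toy setting: the fewer peptides that can reach a score by chance, the more significant a match with that score.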

MS-Clustering November 26, 2007

MS-Clustering is a new program aimed at improving the analysis of large MS/MS datasets by removing many of their redundant or low-quality spectra. MS-Clustering is capable of reducing the number of spectra submitted for analysis from a large dataset of 10+ million spectra by 90% while increasing the number of peptide/protein identifications by up to 10%.

Spectral Networks October 24, 2007

Spectral networks are a novel approach to the identification of MS/MS spectra that detects and combines spectra from overlapping peptides or modified variants of the same peptide. This approach allows for the blind identification of unexpected post-translational modifications and highly modified peptides. The spectral networks software package is now available in open-source and Windows-binary versions.

PepNovo October 4, 2007

A new version of PepNovo has been released. It contains optional quality filtering and models for several MS instrument types.

Web server July 25, 2007

The web server hosting all of our software is up and running. Users may sign up for an account and search spectra. The server posts jobs to a large compute grid.

Phosphorylation search July 10, 2007

Inspect has been trained to score MS/MS spectra of phosphorylated peptides. The new scoring function was trained on data from LTQ machines and works well.

Proteogenomics Consortium

HHMI supports the Bioinformatics [Under]graduate Research Consortium in Comparative Proteogenomics at UCSD. Proteogenomics is a new research area that uses whole-genome MS/MS datasets to better characterize genomic and proteomic annotations on a global scale. The consortium gives undergraduate and new graduate students an opportunity to get hands-on research experience with real, unsolved bioinformatics problems in this emerging field. More information.

Latest Releases

Inspect, MS-Alignment

Spectral Networks, Sept 2007


Media Coverage

Nonribosomal Peptide Dereplication and Sequencing (Scientific American, Genetic Engineering News, Natural Products Industry Insider and Genome Web Daily News)

A powerful tool for PTM discovery (Jan 2008, Journal of Proteome Research, Vol. 7, Issue 1)

From spectral networks to shotgun sequencing (June 2007, Nature Methods, Vol. 4 No. 6)

Identifying peptides without a database (May 2007, Journal of Proteome Research)

UCSD Computer Scientist Wins Young Investigator Award, Research on Snake Venom Proteins Highlighted (Nov 2006, UCSD)