Bacterial Proteogenomic Annotation

Contact: Nitin Gupta [ngupta (at) ucsd.edu]

Summary

While bacterial genome annotations have significantly improved in recent years, techniques for bacterial proteome annotation (including post-translational chemical modifications, signal peptides, proteolytic events, etc.) are still in their infancy. At the same time, the number of sequenced bacterial genomes is rising sharply, far outpacing our ability to validate the predicted genes, let alone annotate bacterial proteomes. In this project, we use tandem mass spectrometry (MS/MS) to annotate bacterial proteomes. In particular, we provide a comprehensive map of post-translational modifications in a bacterial genome, including a large number of chemical modifications, signal peptide cleavages, and cleavages of N-terminal methionine residues. We also detect multiple genes that were missed or assigned incorrect start positions by gene prediction programs, and suggest corrections to improve the gene annotation. We believe that complementing every genome sequencing project with an MS/MS project would significantly improve both genome and proteome annotations at a reasonable cost.

Documentation

The major steps in the proteogenomic annotation of a bacterium are as follows:

  • Prepare a six-frame translation of the whole genome, and run an Inspect search against it (including common contaminant proteins).

  • Use a randomized decoy database of the same size as the regular database.

  • Select peptides so that the false discovery rate stays below 5% at the peptide level. Note that this typically corresponds to an error rate of less than 1% at the spectrum level.

  • Analyze the position of the identified peptides in the genome to find candidates for gene corrections.

  • Identify non-covered peptides that may represent signal peptides or other proteolytic events.

  • Run an MS-Alignment search against the protein database to identify PTMs.
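
The decoy database and false discovery rate steps above can be illustrated with a minimal Python sketch. It is not the procedure implemented in Inspect; it assumes that shuffling each sequence is an acceptable way to randomize the decoy database, that every search hit carries a score where higher is better, and that the peptide-level FDR is estimated as the number of decoy peptides divided by the number of target peptides above a cutoff (reasonable when both databases have the same size). The names Hit, shuffled_decoy, best_hit_per_peptide and score_cutoff_for_fdr are hypothetical.

```python
import random
from typing import Dict, Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    peptide: str    # identified peptide sequence
    score: float    # match score; higher is assumed to be better
    is_decoy: bool  # True if the match came from the decoy database


def shuffled_decoy(sequence: str, rng: random.Random) -> str:
    """Return a shuffled copy of a sequence (same length and composition)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def best_hit_per_peptide(hits: Iterable[Hit]) -> List[Hit]:
    """Collapse spectrum-level hits to the best-scoring hit for each peptide."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.peptide)
        if current is None or hit.score > current.score:
            best[hit.peptide] = hit
    return list(best.values())


def score_cutoff_for_fdr(hits: Iterable[Hit], max_fdr: float = 0.05) -> Optional[float]:
    """Return the lowest score cutoff whose estimated peptide-level FDR
    (decoy peptides / target peptides at or above the cutoff) is <= max_fdr.
    Returns None if no cutoff satisfies the bound."""
    ranked = sorted(best_hit_per_peptide(hits), key=lambda h: h.score, reverse=True)
    targets = decoys = 0
    cutoff = None
    for i, hit in enumerate(ranked):
        if hit.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate a cutoff only once all peptides sharing this score are counted.
        tie_continues = i + 1 < len(ranked) and ranked[i + 1].score == hit.score
        if not tie_continues and targets > 0 and decoys / targets <= max_fdr:
            cutoff = hit.score
    return cutoff
```

Peptides whose best score is at or above the returned cutoff would then be retained; the 5% bound from the step above is the default value of max_fdr.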

Downloads

Inspect and MS-Alignment, for running database searches. The Inspect toolkit also provides several post-processing scripts.

  • Script to format FASTA files (one line per sequence).

  • Script for generating a six-frame translation of a genome (you also need this). The genome FASTA file must be formatted using the previous script before using this one.
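
For readers who want to see what these two scripts do in principle, here is a minimal Python sketch. It is not the code of the downloadable scripts: the function names are hypothetical, the standard codon assignments are assumed (the bacterial translation table uses the same assignments and differs only in its start codons), and codons containing ambiguous bases are translated to X.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, Tuple

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines
    so that each sequence ends up on a single line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; stop codons become '*', unknown codons 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Translate three frames on the forward strand and three on the reverse strand."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

A typical use would be to open the genome file, pass the file object to read_fasta, and call six_frame_translation on each sequence. How the resulting frames are split at stop codons and written to the search database is left out here, since it depends on what the search tool expects.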

Publications

Comparative Proteogenomics: Combining Mass Spectrometry and Comparative Genomics to Analyze Multiple Genomes.
N. Gupta, J. Benhamida, V. Bhargava, D. Goodman, E. Kain, I. Kerman, N. Nguyen, N. Ollikainen, J. Rodriguez, J. Wang, M.S. Lipton, M. Romine, A. Osterman, V. Bafna, R.D. Smith and P.A. Pevzner.
In preparation.

Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation.
Nitin Gupta, Stephen Tanner, Navdeep Jaitly, Joshua Adkins, Mary Lipton, Robert Edwards, Margaret Romine, Andrei Osterman, Vineet Bafna, Richard D. Smith and Pavel Pevzner.
Genome Res. 2007 Sep;17(9):1362-77.