Contact: Natalie Castellana [ncastell (at) ucsd.edu]
Gene annotation underpins genome science. Most often, protein-coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides were identified by using Inspect to search against three different representations of the genome: a six-frame translation, an exon splice graph, and the currently annotated proteome. Using the gene-finding program AUGUSTUS and the novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models.
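For readers unfamiliar with the term, a six-frame translation turns a stretch of genomic DNA into six candidate protein sequences: three reading frames on the forward strand and three on the reverse complement. The following is a minimal sketch of the idea, assuming the standard genetic code and plain A/C/G/T input; it is not the procedure Inspect uses internally, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy input invented for illustration; '*' marks a stop codon.
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Searching spectra against all six frames lets peptides be found in regions with no annotated gene, at the cost of a much larger search space than the annotated proteome alone.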
The newly predicted gene models, along with the support each model received from ESTs, peptides, and the current annotation, can be downloaded here: AUGUSTUS_Corrected_Genes.gff.
- Tracks of the peptides (both novel and those confirming current models) can be uploaded to TAIR GBrowse for visualization.
- GFF-formatted files of all novel and TAIR peptides can be downloaded here.
- Building and searching the exon splice graph is functionality uniquely provided by Inspect. The documentation can be found here.
- The complete list of non-novel peptides, their mappings to TAIR9 genes, and supporting spectra can be accessed here.
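The GFF downloads listed above are tab-delimited text, so they can be inspected with a few lines of code. The sketch below assumes the general nine-column GFF layout (sequence, source, feature type, start, end, score, strand, phase, attributes) and leaves the attribute column unparsed, since its exact syntax in these files is not described here. The record type, the function names, and the example lines are all hypothetical and are not taken from the actual files.

```python
from collections import Counter, namedtuple

# One record per feature line; field names follow the usual GFF column order.
GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)


def parse_gff_lines(lines):
    """Yield GffRecord objects, skipping blank lines, comments, and malformed rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a nine-column feature line
        seqid, source, ftype, start, end, score, strand, phase, attrs = fields
        yield GffRecord(
            seqid, source, ftype, int(start), int(end), score, strand, phase, attrs
        )


def count_feature_types(lines):
    """Count how many features of each type (gene, CDS, ...) appear."""
    return Counter(record.type for record in parse_gff_lines(lines))


# Invented example lines, for illustration only.
EXAMPLE_LINES = [
    "##gff-version 3",
    "Chr1\texample\tgene\t100\t900\t.\t+\t.\tID=hypothetical_gene_1",
    "Chr1\texample\tCDS\t100\t400\t.\t+\t0\tParent=hypothetical_gene_1",
    "Chr1\texample\tCDS\t600\t900\t.\t+\t2\tParent=hypothetical_gene_1",
]

if __name__ == "__main__":
    print(count_feature_types(EXAMPLE_LINES))
```

To run this on a downloaded file, pass an open file handle instead of the example list, for instance `count_feature_types(open("AUGUSTUS_Corrected_Genes.gff"))`.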
A powerful tool for PTM discovery (Jan 2008, Journal of Proteome Research, Vol. 7, Issue 1)
From spectral networks to shotgun sequencing (June 2007, Nature Methods, Vol. 4 No. 6)
Identifying peptides without a database (May 2007, Journal of Proteome Research)
UCSD Computer Scientist Wins Young Investigator Award, Research on Snake Venom Proteins Highlighted (Nov 2006, UCSD)