Inspect: A Proteomics Search Toolkit
Copyright 2007, The Regents of the University of California
Inspect requires a database (a file of protein sequences) in order to interpret spectra. You
can specify one or more databases in the Inspect input file. Databases can be stored in one
of two formats: a .trie file (bare-bones format with sequence data only), or a .ms2db file
(simple XML format with exon linkage information). These two formats are discussed below.
Sequence Databases (FASTA)
For efficiency reasons, Inspect processes FASTA files into its own internal format before
searching. A database is stored as two files, one with the extension ".trie" (which holds peptide sequences),
and one with the extension ".index" (which holds protein names and other meta-data). To prepare
the database, first copy the protein sequences of interest into a FASTA file in the Database
subdirectory. Then, from the Inspect directory, run the Python script PrepDB.py as follows:
python PrepDB.py FASTA MyStuff.fasta
Replace "MyStuff.fasta" with the name of your FASTA database. After PrepDB has run, the database
files MyStuff.trie and MyStuff.index will be ready to search. PrepDB.py also handles
Swiss-Prot ".dat" file format as input.
Inspect can perform this processing automatically
(see the "SequenceFile" option in the searching documentation). Running
PrepDB.py is the preferred method since it creates a database file which can be re-used by many searches.
Note: The database should include all proteins known to be in the sample, otherwise some spectra
will receive incorrect (and possibly misleading) annotations. In particular, most databases should
include trypsin (used to digest proteins) and human keratins (introduced during sample processing).
The file "CommonContaminants.fasta", in the Inspect directory, contains several protein sequences you
can append to your database.
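Appending the contaminant sequences amounts to concatenating two FASTA files before running PrepDB.py. The following is a minimal sketch of one way to do that, not part of Inspect itself; the helper names append_fasta and count_records are hypothetical, and the sketch assumes both inputs are well-formed FASTA text.

```python
from pathlib import Path


def append_fasta(database_path, extra_path, output_path):
    """Write the records of database_path followed by those of extra_path."""
    with open(output_path, "w") as out:
        for source in (database_path, extra_path):
            text = Path(source).read_text()
            out.write(text)
            # Make sure the next record's ">" header starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")


def count_records(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, append_fasta("Database/MyStuff.fasta", "CommonContaminants.fasta", "Database/MyStuffPlus.fasta") would write a combined file (the output name is illustrative only), which can then be passed to PrepDB.py as described above. Comparing count_records on the inputs and the output is a quick check that no record was lost.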
Decoy records (ShuffleDB)
Databases including "decoy proteins" (shuffled or reversed sequences) are emerging as the
gold standard for computing false discovery rates. Inspect can compute p-values in two ways:
Compute the empirical false discovery rate by counting the number of hits to
invalid proteins. This is the recommended method. Given an f-score cutoff,
Inspect counts the shuffled-protein hits above that threshold; these hits
are all invalid. Inspect
then estimates the number of invalid hits which happen to fall within valid proteins.
This count provides an empirical false discovery rate, which is reported as the
By fitting the distribution of F-scores as a mixture model, in the manner of
PeptideProphet. This is how the initial p-values output by Inspect are computed.
Use PValue.py without the "-S" option to compute p-values using this method.
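The arithmetic behind the first (empirical) method can be sketched as follows. This is a simplified illustration and not the code in PValue.py: it assumes the decoy half of the database is the same size as the valid half, so each decoy hit stands for roughly one invalid hit among the valid proteins. The function name empirical_fdr and the (protein name, F-score) input layout are hypothetical.

```python
def empirical_fdr(hits, cutoff, decoy_prefix="XXX"):
    """Estimate the false discovery rate among valid-protein hits.

    hits: iterable of (protein_name, f_score) pairs.
    Only hits with f_score >= cutoff are counted.
    Assumption: decoy and valid halves of the database are equal in size,
    so the number of decoy hits estimates the number of invalid hits
    that landed in valid proteins.
    """
    decoy_hits = 0
    valid_hits = 0
    for name, score in hits:
        if score < cutoff:
            continue
        if name.startswith(decoy_prefix):
            decoy_hits += 1
        else:
            valid_hits += 1
    if valid_hits == 0:
        return 0.0
    # Cap at 1.0: a rate above 100% has no meaning.
    return min(1.0, decoy_hits / valid_hits)
```

With one decoy hit and two valid-protein hits above the cutoff, the estimate is 0.5; raising the cutoff until no decoy hits remain drives the estimate to 0.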
To compute empirical false discovery rates:
Use the script ShuffleDB.py to append decoy records to a database before searching. Decoy records have the
flag "XXX" prefixed to their name.
After searching, use the script PValue.py (including the "-S" option) to carry out this analysis.
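To illustrate what a decoy record is, the sketch below shuffles the residues of each protein and prefixes "XXX" to the name. It shows the general idea only and should not be read as the algorithm ShuffleDB.py uses, which may shuffle differently; the function names read_fasta and with_decoys and the fixed random seed are assumptions made for the example.

```python
import random


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def with_decoys(records, seed=0):
    """Return the original records followed by one shuffled decoy each."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    records = list(records)
    decoys = []
    for name, sequence in records:
        residues = list(sequence)
        rng.shuffle(residues)  # same amino acid composition, new order
        decoys.append(("XXX" + name, "".join(residues)))
    return records + decoys
```

Because each decoy keeps the length and composition of its source protein, hits to "XXX" records give a like-for-like estimate of how often a spectrum matches a sequence by chance.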
The MS2DB file format is a simple, extensible XML format for storing proteins. The main benefits of
using MS2DB format instead of FASTA files are:
Reduced redundancy - Each exon is stored once, and only once
Splice information - All isoforms (and sequence variants) corresponding to a locus are grouped
as one Gene, which reduces the usual confusion between proteins and records.
Site-specific modifications - Known modifications, such as phosphorylation, can be
explicitly indicated. Considering these site-specific modifications is much cheaper than
a search that attempts to discover new modifications.
Rich annotations - The format has places to store information such as accession numbers from
sequence repositories, species name, etc.
You can use the script BuildMS2DB.jar to generate an MS2DB file. As input, you will need:
One or more files in GFF3 format containing exon predictions
A FASTA file containing the sequences on which the exons are predicted
For more details on using BuildMS2DB.jar (and MS2DBShuffler.jar for building a decoy database) please read the information on proteogenomics found here