Inspect: A Proteomics Search Toolkit
Copyright 2007, The Regents of the University of California
Table of Contents
Overview
Copyright information
Installation
Database
Searching
Analysis
Basic Tutorial
Advanced Tutorial
Unrestricted Search Tutorial
Analysis
Inspect writes search results to a tab-delimited file. Up to ten search hits are written for each spectrum,
but typcially all but the first can be discarded. The most important fields in the output are
columns 1 and 2 (the spectrum searched), column 3 (the annotation), and column 13 (the p-value,
a measure of match confidence).
The p-value is computed using the F-score. The F-score is a weighted sum of two factors. First is the
MQScore, or match quality score (in column 6). Second
is the delta-score (in column 14), the difference in MQScore between this match and the best alternative.
Because delta-score is highly dependent on database size and search parameters, Inspect takes the ratio of
the delta-score to the average delta-score for all top-scoring matches.
There are two methods to compute the p-value for a match. The preferred method is to employ a decoy
database. This method requires you to generate shuffled protein records before search using the "ShuffleDB" script
(see the Database section for details). Then, run the PValue script to compute the empirical false discovery
rate for a given f-score cutoff. This empirical false discovery rate is reported as the match's p-value. One
p-value curve is used for singly- and doubly-charged spectra, another for triply-charged spectra.
The second method for computing p-values is to fit a mixture model to the empirical distribution of
F-scores. The method is based very closely on PeptideProphet. This method is the one used by the Inspect
executable, and is also available using the PValue.py script.
Sometimes you may wish to use a common p-value distribution for several different results-files. This is
appropriate, for example, if the search of a very large sample has been split into many runs. In this case,
use the script PValue.py (see below) to compute p-values for all results using a common distribution.
Below is a list of all the columns and their meaning:
SpectrumFile - The file searched
Scan# - The scan number within the file; this value is 0 for .dta files
Annotation - Peptide annotation, with prefix and suffix and (non-fixed) modifications indicated.
Example: K.DFSQIDNAP+16EER.E
Protein - The name of the protein this peptide comes from. (Protein names are stored to the .index file
corresponding to the database .trie file)
Charge - Precursor charge. If "multicharge" is set, or if no charge is specified in the source file, Inspect
attempts to guess the charge.
MQScore - Match quality score, the main measure of match quality.
CutScore - Score for each cut point (between amino acids), based upon a Bayesian network modeling
fragmentation propensities
IntenseBY - Fraction of high-intensity peaks which are b or y fragments. For a length-n peptide, the top n*3
peaks are considered.
BYPresent - The fraction of b and y fragments which are observed. Fragments that fall outside the dynamic
range are not included in this count.
NTT - Number of tryptic termini (or Unused, if no protease was specified). Note that the N- and C-terminus of
a protein are both considered to be valid termini.
p-value - Probability that a match with this MQScore and DeltaScore is spurious; the lower this number, the
better the match. The p-value is calibrated by fitting the score distribution with mixture model; see also
"PValue.py", below.
DeltaScore - Difference between the MQScore of this match and the best alternative
DeltaScoreOther - Difference between the MQScore of this match and the best alternative from a different locus.
To see the difference between this and the previous column, consider a search that finds similar matches
of the form "M+16MALGEER" and "MM+16ALGEER". In such a case, DeltaScore would be very small, but DeltaScoreOther
might still be large.
RecordNumber - Index of the protein record in the database
DBFilePos - Byte-position of this match within the database
SpecFilePos - Offset, in the input file, of this spectrum; useful for passing to the "Label" script (see below)
PrecursorMZ - The precursor m/z given in the spectrum file.
PrecursorError - The difference (in m/z units) between the precursor m/z given in the file and the theoretical m/z of the identified peptide.
Post-processing
Python scripts for performing various analyses are included in the distribution.
Run a script with no command-line parameters to print a list of available arguments.
Label.py - Given a spectrum and a peptide annotation, label the spectrum peaks with
their associated fragments. Produces a .png image for a spectrum, with associated peptide interpretation. Requires
the Python Imaging Library (PIL). Sample command:
Label.py Shewanella.mzXML:6200392 R.SNGSIGQNQ+14TPGR.V
PValue.py - Given one or more Inspect output files, fit the distribution of scores with a mixture model,
similar to that used by PeptideProphet. The F-score used for modeling is a weighted sum of the MQScore, and
the ratio of the DeltaScoreOther to the average value DeltaScoreOther across the entire search.
(Note that DeltaScoreOther is highly dependent on database size, and taking the ratio helps correct
for this database-dependence). Inspect performs such a model fitting with each run, so running PValue.py is
generally not required. However, sometimes it makes sense to fit the same p-value distribution to many
search output files, and in that case, you use PValue.py to do so.
Summary.py - Given Inspect output, produce an html-format summary of the results. The report provides
a "protein-level" look at the results. This script is also used when
producing a "second-pass" protein database, containing the proteins identified with high confidence.
PTMChooser.py - This script examines output from MS-Alignment (Inspect run in "blind" mode), and
highlights the most plausible evidence for PTMs. The script iteratively selects the most common
post-translational modifications, and report the selections. These selections require manual curation
and/or validation.
PTMChooserLM.py - Performs the same task as PTMChooser, but with Low Memory usage. Useful
when analyzing millions of annotations at once.
For further details, consult the code-level documentation.