Contacts
Sangtae Kim [sak008 (at) ucsd.edu]
Summary
A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to delta-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of “one-hit-wonders” in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
Documentation
- Introduction
MS-GF is a software tool that generates rigorous p-values (spectral probabilities) for spectral interpretations. Given peptide-spectrum matches (e.g. InsPecT results) as input, MS-GF outputs p-values that can be used to discriminate between correct and false identifications. Currently, MS-GF supports InsPecT output and a simplified version of InsPecT output (described below).
- Installation
Download MSGF.jar and place it in any folder. JRE 1.6 or greater must be installed to run MS-GF.
- Run
usage: java -Xmx2000M -jar MSGF.jar
    -i ResultFile
    -d SpecDir
    [-o OutputFileName] (Default: stdout)
    [-m FragmentationMethod 0/1] (0: CID (default), 1: ETD)
    [-e Enzyme 0/1/2/3/4/5/6/7] (0: No enzyme, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N)
    [-fixMod 0/1/2] (0: NoCysteineProtection, 1: CarbamidomethylC (default), 2: CarboxymethylC)
    [-aaSet AASetFileName (default: standard amino acids)]
    [-x 0/1] (0: all (default), 1: OnePerSpec)
    [-p SpecProbThreshold] (Default: 1)
    [-param ScoringParamFile]
– “-Xmx2000M”: specifies the maximum heap size available to the JVM. For MS-GF, more than 1000M is recommended.
– ResultFile: tab-delimited file containing peptide-spectrum matches (PSMs). An InsPecT output file may be used as is. To use results from other tools (e.g. Mascot or SEQUEST), one must create a “simplified InsPecT result file” using the following rules:
- The file must be tab-delimited.
- The first line must be the header starting with “#” and include #SpectrumFile, Scan#, Annotation and Charge. It may have other (optional) fields.
- Each line (other than the first line) represents a peptide-spectrum match.
- #SpectrumFile represents the spectrum file name. Only the file name (not the full path) is recognized; the file is later looked up in the directory specified with the “-d” option. Currently mzXML (*.mzXML), mgf (*.mgf), ms2 (*.ms2) and pkl (*.pkl) formats are supported.
- Scan# represents the spectrum position in the file. In the mzXML format, it must be the scan number. For all other formats, it is the one-based sequential number of the spectrum in the file (e.g. the first spectrum in a mgf file will have value 1).
- Annotation represents the peptide identification. It must have flanking amino acids delimited by ‘.’. To represent the start/end position of a protein, ‘*’ (or any non-alphabetic character) must be used (e.g. *.ACDEFGHIR.A).
- Charge represents the precursor charge of the spectrum. It must be a positive integer.
=========
Example input file:
#SpectrumFile	Title	Scan#	Annotation	Charge	Protein
/data/spec1.mgf	Frac1	456	R.PEPTIDEK.G	2	Prot1
# Lines starting with # will be ignored.
spec2.mzXML	Frac2	512	*.ACDEFGHIR.A	3	Prot2
...
=========
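The rules above can be sketched as a small Python helper that formats PSMs exported from another search tool into a simplified InsPecT result file. This is only an illustrative sketch: the function name `format_simplified_result`, the tuple layout, and the optional “Protein” column are assumptions made for the example, and the annotation check covers only the flanking-residue layout, not the peptide itself.

```python
import re

# Required columns from the rules above, followed by one optional field.
HEADER = ["#SpectrumFile", "Scan#", "Annotation", "Charge", "Protein"]

# Layout check only: one flanking character, '.', the peptide, '.',
# one flanking character (e.g. R.PEPTIDEK.G or *.ACDEFGHIR.A).
ANNOTATION_RE = re.compile(r"^.\..+\..$")


def format_simplified_result(psms):
    """Return the text of a simplified InsPecT result file (hypothetical helper).

    psms is an iterable of (spectrum_file, scan, annotation, charge, protein).
    """
    lines = ["\t".join(HEADER)]
    for spectrum_file, scan, annotation, charge, protein in psms:
        if not ANNOTATION_RE.match(annotation):
            raise ValueError("Annotation needs flanking residues: %r" % annotation)
        if int(charge) <= 0:
            raise ValueError("Charge must be a positive integer: %r" % charge)
        fields = [spectrum_file, str(int(scan)), annotation, str(int(charge)), protein]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


example = format_simplified_result([
    ("spec1.mgf", 456, "R.PEPTIDEK.G", 2, "Prot1"),
    ("spec2.mzXML", 512, "*.ACDEFGHIR.A", 3, "Prot2"),
])
```

Writing `example` to a file gives an input that can be passed with “-i”. For mgf, ms2 and pkl files the caller is still responsible for supplying the one-based spectrum position as the scan value.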
– SpecDir: the directory path (absolute or relative) that contains all the spectrum files.
– OutputFileName: output file name. MS-GF adds a “SpecProb” column to each PSM. If no file name is specified, output is sent to stdout. SpecProb multiplied by the number of residues in the database that was searched to obtain the PSM approximates the E-value.
=========
Example output:
#SpectrumFile	Title	Scan#	Annotation	Charge	Protein	SpecProb
/data/spec1.mgf	Frac1	456	R.PEPTIDEK.G	2	Prot1	1.23E-4
spec2.mzXML	Frac2	512	*.ACDEFGHIR.A	3	Prot2	4.12E-12
=========
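As a rough illustration of the E-value approximation mentioned above (SpecProb times the number of database residues), the following Python sketch reads tab-delimited MS-GF output and multiplies each SpecProb by a database size. The function name and the database size of 1,000,000 residues are assumptions chosen for the example, not values taken from MS-GF.

```python
def approximate_e_values(output_text, num_db_residues):
    """Return a list of (fields, e_value) pairs, one per PSM row (hypothetical helper)."""
    header = None
    results = []
    for line in output_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # The first '#' line is the header; later '#' lines are ignored.
            if header is None:
                header = line.split("\t")
            continue
        if header is None:
            raise ValueError("Missing header line starting with '#'")
        fields = line.split("\t")
        spec_prob = float(fields[header.index("SpecProb")])
        results.append((fields, spec_prob * num_db_residues))
    return results


sample_output = (
    "#SpectrumFile\tTitle\tScan#\tAnnotation\tCharge\tProtein\tSpecProb\n"
    "/data/spec1.mgf\tFrac1\t456\tR.PEPTIDEK.G\t2\tProt1\t1.23E-4\n"
    "spec2.mzXML\tFrac2\t512\t*.ACDEFGHIR.A\t3\tProt2\t4.12E-12\n"
)

# Assumed database size, for illustration only.
e_values = approximate_e_values(sample_output, 1000000)
```

With the assumed database size, the first PSM gets an E-value of about 123 and the second about 4.12E-6, so only the second would look significant.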
– FragmentationMethod: fragmentation method. Scoring parameters will be chosen depending on this.
– Enzyme: enzyme used.
– “-fixMod”: fixed modification.
– AASetFileName: a file containing amino acid information (e.g. see “AASetC57.txt”).
– “-x”: if “-x 1” is specified, only the best interpretation per spectrum will be printed.
– “-p SpecProbThreshold”: if specified, only PSMs with SpecProb less than or equal to “SpecProbThreshold” will be printed.
– ScoringParamFile: path to a custom scoring parameter file. See below for how to create scoring parameter files.
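To make the option list concrete, here is a minimal Python sketch that assembles an MS-GF command line from the flags documented above. The function name, its keyword arguments, and the file names in the example call are hypothetical; only the flags themselves come from the usage text, and the sketch builds the argument list without running Java.

```python
def build_msgf_command(result_file, spec_dir, output_file=None,
                       fragmentation=0, enzyme=1, fix_mod=1,
                       one_per_spec=False, spec_prob_threshold=None,
                       heap_mb=2000, jar="MSGF.jar"):
    """Return the MS-GF invocation as an argument list (hypothetical helper)."""
    if fragmentation not in (0, 1):
        raise ValueError("-m must be 0 (CID) or 1 (ETD)")
    if enzyme not in range(8):
        raise ValueError("-e must be between 0 and 7")
    if fix_mod not in (0, 1, 2):
        raise ValueError("-fixMod must be 0, 1 or 2")
    cmd = ["java", "-Xmx%dM" % heap_mb, "-jar", jar,
           "-i", result_file, "-d", spec_dir,
           "-m", str(fragmentation), "-e", str(enzyme),
           "-fixMod", str(fix_mod)]
    if output_file is not None:
        cmd += ["-o", output_file]
    if one_per_spec:
        cmd += ["-x", "1"]  # print only the best interpretation per spectrum
    if spec_prob_threshold is not None:
        cmd += ["-p", str(spec_prob_threshold)]
    return cmd


# Example: ETD spectra, Lys-C digestion, best PSM per spectrum (file names assumed).
cmd = build_msgf_command("psms.txt", "spectra", output_file="psms_msgf.txt",
                         fragmentation=1, enzyme=3, one_per_spec=True,
                         spec_prob_threshold=0.001)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java and MSGF.jar are available.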
- Generating a scoring parameter file
usage: java -Xmx2000M -cp MSGF.jar msscorer.ScoringParameterGenerator
    -i AnnotatedMgf (*.mgf)
    -o OutputParamFile (e.g. CID_Tryp.param)
    [-fixMod 0/1/2] (0: NoCysProtection, 1: CarbamidomethylC (default), 2: CarboxymethylC)
– AnnotatedMgf: mgf file containing annotations as “SEQ=” fields.
==========
Example:
BEGIN IONS
TITLE=022008_F11_ETD_1.mzXML:5111 interprophet=1.0
SEQ=GHGFEGVTHRWGTK
PEPMASS=523.59644
SCANS=5111
CHARGE=+3
143.3189 5.750927
158.71844 1.6126348
169.16019 2.5660455
....
END IONS
==========
– OutputParamFile: output parameter file name. The resulting parameter file can be used with the “-param” option when running MS-GF.
– “-fixMod”: fixed modification.
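For readers assembling their own training data, the following Python sketch formats one annotated spectrum in the layout of the example above, with the peptide stored in a “SEQ=” field. The function name and argument layout are assumptions made for illustration; the field order and the “CHARGE=+3” style simply mirror the example and are not a statement of what the parameter generator requires.

```python
def format_annotated_mgf_entry(title, seq, pepmass, scans, charge, peaks):
    """Return one BEGIN IONS ... END IONS block with a SEQ= annotation (hypothetical helper).

    peaks is a list of (mz, intensity) pairs.
    """
    if not seq or not seq.isalpha():
        raise ValueError("SEQ must be a non-empty string of letters: %r" % seq)
    if int(charge) <= 0:
        raise ValueError("Charge must be a positive integer: %r" % charge)
    lines = [
        "BEGIN IONS",
        "TITLE=%s" % title,
        "SEQ=%s" % seq,
        "PEPMASS=%s" % pepmass,
        "SCANS=%s" % scans,
        "CHARGE=+%d" % int(charge),  # sign-first style, as in the example above
    ]
    for mz, intensity in peaks:
        lines.append("%s %s" % (mz, intensity))
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


entry = format_annotated_mgf_entry(
    title="022008_F11_ETD_1.mzXML:5111",
    seq="GHGFEGVTHRWGTK",
    pepmass=523.59644,
    scans=5111,
    charge=3,
    peaks=[(143.3189, 5.750927), (158.71844, 1.6126348)],
)
```

Concatenating such entries into a single *.mgf file gives an input that can be passed with “-i” to the parameter generator.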