MS-GF

Contacts

Sangtae Kim [sak008 (at) ucsd.edu]

Summary

A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to delta-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of ‘one-hit-wonders’ in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.

Documentation

  • Introduction
    MS-GF is a software tool that computes rigorous p-values (spectral probabilities) for spectral interpretations. Given peptide-spectrum matches (PSMs; e.g. an InsPecT result) as input, MS-GF outputs p-values that can be used to discriminate between correct and false identifications. Currently, MS-GF supports InsPecT output and a simplified version of InsPecT output (described below).
  • Installation
    Download MSGF.jar and place it in any folder. JRE 1.6 or greater must be installed to run MS-GF.
  • Run
    usage: java -Xmx2000M -jar MSGF.jar
        -i ResultFile
        -d SpecDir
        [-o OutputFileName] (Default: stdout)
        [-m FragmentationMethod 0/1] (0: CID (default), 1: ETD)
        [-e Enzyme 0/1/2/3/4/5/6/7] (0: No enzyme, 1: Trypsin (default),
            2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C,
            7: Asp-N)
        [-fixMod 0/1/2] (0: NoCysteineProtection,
            1: CarbamidomethylC (default), 2: CarboxymethylC)
        [-aaSet AASetFileName (default: standard amino acids)]
        [-x 0/1] (0: all (default), 1: OnePerSpec)
        [-p SpecProbThreshold] (Default: 1)
        [-param ScoringParamFile]

    – “-Xmx2000M”: specifies the maximum heap size available to the JVM. For MS-GF, more than 1000M is recommended.

    – ResultFile: tab-delimited file containing peptide-spectrum matches. An InsPecT output file may be used as is. To use results from other tools (e.g. Mascot or SEQUEST), one must make a “simplified InsPecT result file” using the following rules:

    1. The file must be tab-delimited.
    2. The first line must be the header starting with “#” and include #SpectrumFile, Scan#, Annotation and Charge. It may have other (optional) fields.
    3. Each line (other than the first line) represents a peptide-spectrum match.
    4. #SpectrumFile represents the spectrum file name. Only the file name (not the full path) is used; the file is then looked up in the directory specified with the “-d” option. Currently mzXML (*.mzXML), mgf (*.mgf), ms2 (*.ms2) and pkl (*.pkl) formats are supported.
    5. Scan# represents the spectrum position in the file. In the mzXML format, it must be the scan number. For all other formats, it is the one-based sequential number of the spectrum in the file (e.g. the first spectrum in an mgf file will have value 1).
    6. Annotation represents the peptide identification. It must have flanking amino acids delimited by ‘.’. To represent the start/end position of a protein, ‘*’ (or any non-alphabetic character) must be used (e.g. *.ACDEFGHIR.A).
    7. Charge represents the precursor charge of the spectrum. It must be a positive integer.
    =========
    Example input file
    #SpectrumFile	Title	Scan#	Annotation	Charge	Protein
    /data/spec1.mgf	Frac1	456	R.PEPTIDEK.G	2	Prot1
    # Lines starting with # will be ignored.
    spec2.mzXML	Frac2	512	*.ACDEFGHIR.A	3	Prot2
    ...
    =========
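
    The rules above can be sketched in Python for users converting results from another search tool. This is only a minimal sketch and not part of MS-GF: the function names format_psm_row and write_result_file are hypothetical, the optional Protein column is included purely for illustration, and the annotation check assumes unmodified peptides written in upper-case letters.

```python
import io
import os
import re

# Required columns from rule 2, plus one optional column (Protein).
HEADER = ["#SpectrumFile", "Scan#", "Annotation", "Charge", "Protein"]

# One flanking character, '.', the peptide, '.', one flanking character.
# A non-alphabetic flanking character such as '*' marks a protein terminus.
# Assumption: unmodified peptides only; modification syntax is not handled.
ANNOTATION_RE = re.compile(r"[^.]\.[A-Z]+\.[^.]")


def format_psm_row(spectrum_path, scan, peptide, charge,
                   prev_aa="*", next_aa="*", protein=""):
    """Return the fields of one PSM line as a list of strings."""
    annotation = f"{prev_aa}.{peptide}.{next_aa}"
    if not ANNOTATION_RE.fullmatch(annotation):
        raise ValueError(f"unexpected annotation format: {annotation!r}")
    if int(charge) < 1:
        raise ValueError("charge must be a positive integer")
    # Only the file name is used, so the directory part is dropped here.
    file_name = os.path.basename(spectrum_path)
    return [file_name, str(int(scan)), annotation, str(int(charge)), protein]


def write_result_file(rows, handle):
    """Write the header line followed by one tab-delimited line per PSM."""
    handle.write("\t".join(HEADER) + "\n")
    for row in rows:
        handle.write("\t".join(row) + "\n")


# Usage: build two rows and print the resulting file content.
rows = [
    format_psm_row("/data/spec1.mgf", 456, "PEPTIDEK", 2, "R", "G", "Prot1"),
    format_psm_row("spec2.mzXML", 512, "ACDEFGHIR", 3, "*", "A", "Prot2"),
]
buffer = io.StringIO()
write_result_file(rows, buffer)
print(buffer.getvalue(), end="")
```

    For formats other than mzXML, the scan argument would be the one-based position of the spectrum in its file, as described in rule 5.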

    – SpecDir: the directory path (absolute or relative) that contains all the spectrum files.

    – OutputFileName: output file name. MS-GF adds a “SpecProb” column to each PSM line. If not specified, output is sent to stdout. Multiplying SpecProb by the number of residues in the database that was searched to obtain the PSM approximates the E-value.

    =========
    Example output:
    #SpectrumFile	Title	Scan#	Annotation	Charge	Protein	SpecProb
    /data/spec1.mgf	Frac1	456	R.PEPTIDEK.G	2	Prot1	1.23E-4
    spec2.mzXML	Frac2	512	*.ACDEFGHIR.A	3	Prot2	4.12E-12
    =========
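
    The E-value approximation mentioned above (SpecProb times the number of database residues) can be illustrated with a short Python sketch. The helper names read_spec_probs and approximate_e_value are hypothetical, and the database size of 10,000,000 residues is an invented figure used only to make the arithmetic concrete. The sketch also assumes that any “#” line after the header can be skipped.

```python
def approximate_e_value(spec_prob, num_db_residues):
    """Approximate the E-value as SpecProb times the database size in residues."""
    return spec_prob * num_db_residues


def read_spec_probs(lines):
    """Yield one dict per PSM line, with SpecProb converted to float."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if header is None:
            # The first non-empty line is the header; it starts with '#'.
            header = line.split("\t")
            continue
        if line.startswith("#"):
            # Assumption: later '#' lines carry no PSM and are skipped.
            continue
        row = dict(zip(header, line.split("\t")))
        row["SpecProb"] = float(row["SpecProb"])
        yield row


EXAMPLE_OUTPUT = (
    "#SpectrumFile\tTitle\tScan#\tAnnotation\tCharge\tProtein\tSpecProb\n"
    "/data/spec1.mgf\tFrac1\t456\tR.PEPTIDEK.G\t2\tProt1\t1.23E-4\n"
    "spec2.mzXML\tFrac2\t512\t*.ACDEFGHIR.A\t3\tProt2\t4.12E-12\n"
)

NUM_DB_RESIDUES = 10_000_000  # hypothetical database size

for psm in read_spec_probs(EXAMPLE_OUTPUT.splitlines()):
    e_value = approximate_e_value(psm["SpecProb"], NUM_DB_RESIDUES)
    print(psm["Annotation"], psm["SpecProb"], f"{e_value:.3g}")
```

    With this assumed database size, the first PSM has an approximate E-value far above 1 and the second far below 1, which shows why the same SpecProb threshold can mean different things for databases of different sizes.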

    – FragmentationMethod: fragmentation method. Scoring parameters will be chosen depending on this.

    – Enzyme: enzyme used.

    – “-fixMod”: fixed modification.

    – AASetFileName: a file containing amino acid information (e.g. see “AASetC57.txt”).

    – “-x”: if “-x 1” is specified, only the best interpretation per spectrum will be printed.

    – “-p SpecProbThreshold”: if specified, only PSMs whose SpecProb is less than or equal to “SpecProbThreshold” will be printed.

    – ScoringParamFile: path to a custom scoring parameter file. See below for how to generate scoring parameter files.
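
    When MS-GF is run from a script, the options described above can be assembled programmatically. The following Python sketch builds the argument list using only the flags from the usage message. The function name build_msgf_command and the file names in the usage lines are hypothetical, and the sketch only prints the command rather than launching Java.

```python
import shlex


def build_msgf_command(result_file, spec_dir, output=None, frag_method=None,
                       enzyme=None, fix_mod=None, aa_set=None, best_only=False,
                       spec_prob_threshold=None, param_file=None,
                       heap="2000M", jar="MSGF.jar"):
    """Return the MS-GF command line as a list of arguments."""
    if frag_method is not None and frag_method not in (0, 1):
        raise ValueError("-m must be 0 (CID) or 1 (ETD)")
    if enzyme is not None and enzyme not in range(8):
        raise ValueError("-e must be an integer from 0 to 7")
    if fix_mod is not None and fix_mod not in (0, 1, 2):
        raise ValueError("-fixMod must be 0, 1 or 2")

    cmd = ["java", f"-Xmx{heap}", "-jar", jar, "-i", result_file, "-d", spec_dir]
    optional = [
        ("-o", output),
        ("-m", frag_method),
        ("-e", enzyme),
        ("-fixMod", fix_mod),
        ("-aaSet", aa_set),
        ("-x", 1 if best_only else None),
        ("-p", spec_prob_threshold),
        ("-param", param_file),
    ]
    # Options left as None are omitted so that MS-GF applies its defaults.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Usage: ETD spectra, Lys-C digestion, best PSM per spectrum only.
command = build_msgf_command("psms.txt", "spectra", output="psms_msgf.txt",
                             frag_method=1, enzyme=3, best_only=True,
                             spec_prob_threshold="1e-10")
print(" ".join(shlex.quote(arg) for arg in command))
# To actually run it: subprocess.run(command, check=True)
```

    Passing the arguments as a list, rather than as one shell string, avoids quoting problems when file or directory names contain spaces.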

  • Generating a scoring parameter file
    usage: java -Xmx2000M -cp MSGF.jar msscorer.ScoringParameterGenerator
        -i AnnotatedMgf (*.mgf)
        -o OutputParamFile (e.g. CID_Tryp.param)
        [-fixMod 0/1/2] (0: NoCysProtection, 1: CarbamidomethylC (default),
            2: CarboxymethylC)

    – AnnotatedMgf: mgf file containing annotations as “SEQ=” fields.

    ==========
    Example:
    BEGIN IONS
    TITLE=022008_F11_ETD_1.mzXML:5111 interprophet=1.0
    SEQ=GHGFEGVTHRWGTK
    PEPMASS=523.59644
    SCANS=5111
    CHARGE=+3
    143.3189        5.750927
    158.71844       1.6126348
    169.16019       2.5660455
    ....
    END IONS
    ==========

    – OutputParamFile: output parameter file name. The resulting parameter file can be used with the “-param” option when running MS-GF.

    – “-fixMod”: fixed modification.
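
    An annotated mgf entry like the example above can be produced from already identified spectra with a few lines of Python. This is a sketch under assumptions: the function name format_annotated_mgf_entry is hypothetical, the field order and the “CHARGE=+3” style simply mirror the example, and a tab is assumed as the separator between m/z and intensity.

```python
def format_annotated_mgf_entry(title, seq, pepmass, charge, peaks, scans=None):
    """Return one BEGIN IONS ... END IONS block with a SEQ= annotation."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"SEQ={seq}",          # peptide annotation read by the generator
        f"PEPMASS={pepmass}",
    ]
    if scans is not None:
        lines.append(f"SCANS={scans}")
    lines.append(f"CHARGE=+{charge}")  # same style as the example above
    for mz, intensity in peaks:
        # Assumption: m/z and intensity separated by a single tab.
        lines.append(f"{mz}\t{intensity}")
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


# Usage: reproduce the first peaks of the example entry.
entry = format_annotated_mgf_entry(
    title="022008_F11_ETD_1.mzXML:5111",
    seq="GHGFEGVTHRWGTK",
    pepmass=523.59644,
    charge=3,
    peaks=[(143.3189, 5.750927), (158.71844, 1.6126348)],
    scans=5111,
)
print(entry, end="")
```

    Entries for several spectra can be concatenated into one file and passed to the generator with “-i”.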