GenoMS: Sequencing whole proteins using a template database


Usage Instructions

java -jar GenoMS.jar
Required Parameters:
    -i [FILE] The input configuration file.  More details can be found below
    -o [FILE] The output file
    
Optional Parameters:
    -r [DIR] The resource directory to look for things like DBs
    -k [DIR] Project directory to write all output
    -e [DIR] An execution directory where all executables can be found (e.g. InsPecT, MSGF.jar, PepNovo).  This is only
            needed if the full path is not provided in the configuration file.
    -x Force a rerun of database searches
    -p Use a missing peak penalty when scoring HMM Match states
    -s Require all2all similarity for Multiple Spectrum Alignment
    -a Continue to add eligible match states, even after all spectra are added to HMM
    -w [NUM] FDR cutoff for peptide-spectrum matches 
    -f Search database allowing mutations
    -l [FILE] Write output to a log file (default is to stdout)
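
As an illustration of how the parameters above combine, the following Python sketch assembles
a GenoMS command line. It uses only flags listed above. The helper name build_genoms_command
and all file names are hypothetical, and it assumes GenoMS.jar is in the current directory.

```python
import shlex


def build_genoms_command(config_file, output_file, resource_dir=None,
                         fdr_cutoff=None, log_file=None, force_rerun=False):
    """Assemble the argument list for one GenoMS run (hypothetical helper)."""
    # -i and -o are the two required parameters.
    cmd = ["java", "-jar", "GenoMS.jar", "-i", config_file, "-o", output_file]
    if resource_dir is not None:
        cmd += ["-r", resource_dir]      # resource directory (e.g. DBs)
    if fdr_cutoff is not None:
        cmd += ["-w", str(fdr_cutoff)]   # FDR cutoff for peptide-spectrum matches
    if force_rerun:
        cmd.append("-x")                 # force a rerun of database searches
    if log_file is not None:
        cmd += ["-l", log_file]          # write output to a log file instead of stdout
    return cmd


# File names below are placeholders, not files shipped with GenoMS.
cmd = build_genoms_command("genoms_config.txt", "genoms_out.txt",
                           fdr_cutoff=0.01, log_file="genoms.log")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the run.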
    

The configuration file

The configuration file contains all of the information needed for GenoMS to run.
The lines of the file must obey the following rules:

Required Lines:
inspectspectra,SPECTRUM_FILE,INSPECT_INPUT_FILE
OR
spectra,SPECTRUM_FILE
You may use either InsPecT or MSGF-DB to perform peptide identification. If you are using InsPecT,
you must use the inspectspectra parameter and specify both the full path to the spectrum file and
the path to the InsPecT input file for that spectrum file. See Preprocessing for more information
on creating this file. If you are using MSGF-DB, use the spectra parameter; you need only specify
the full path to the spectra.
prms,PRM_FILE_NAME
OR
pepnovoexec,PEPNOVO_EXECUTABLE
AND
pepnovomodeldir,PEPNOVO_MODEL_DIR
For each PRM file created by PepNovo, you must specify the full path to the file.
For more information, see Preprocessing. Alternatively,
you can specify the path to PepNovo, and GenoMS will generate the PRMs using the
default parameters and those specified in the config file. Specifically, files
are assumed to be from an LTQ machine, and the digest is inferred from the spectrum
file name (must include 'tryp' or 'trypsin' to be identified as a tryptic digest).
inspectpath,PATH_INSPECT_EXECUTABLE
OR
msgfdbpath,PATH_TO_MSGFDB_JAR_FILE
If you plan to use InsPecT for peptide identification, you must compile InsPecT and provide the full path to the InsPecT executable. Alternatively, you may use MSGF-DB and must provide the full path to the MSGF-DB jar file. Depending on which database search tool you plan to use, the spectrum parameter will change.
genomeseq,GENOME_FILE
OR
dbrootname,ROOT_NAME
OR
templateconstraintfile,FILE_NAME
AND
dbcombined,DB_FILE_NAME
One of three different types of databases must be specified: a genome sequence,
the root name of a set of sequence files for further construction, or a fully
constructed template database file. If a fully constructed template database is
provided, the constraint file is also needed. For more details about these
database types, see Template Databases.
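
To make the required lines concrete, here is a minimal Python sketch that generates them for
one possible combination: MSGF-DB for peptide identification, an existing PRM file, and a
genome sequence as the database. The helper name required_config_lines and all paths are
hypothetical. The line order simply follows the order used in this document; whether GenoMS
requires a particular order is not stated here.

```python
def required_config_lines(spectrum_file, prm_file, msgfdb_jar, genome_file):
    """Return the required config lines for an MSGF-DB run (hypothetical helper)."""
    return [
        f"spectra,{spectrum_file}",    # MSGF-DB route: only the spectrum path is needed
        f"prms,{prm_file}",            # PRM file already generated by PepNovo
        f"msgfdbpath,{msgfdb_jar}",    # full path to the MSGF-DB jar file
        f"genomeseq,{genome_file}",    # database given as a genome sequence
    ]


# All paths below are placeholders.
lines = required_config_lines(
    "/data/run1/sample_tryp.mzXML",
    "/data/run1/sample_tryp.prm",
    "/opt/msgfdb/MSGFDB.jar",
    "/data/db/genome.fasta",
)
config_text = "\n".join(lines) + "\n"
print(config_text, end="")
```

Writing config_text to a file gives a configuration file that can be passed to GenoMS with -i.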


Optional lines:
contaminants,COMMON_CONTAMINANTS_FILE
A database file (FASTA format) of common contaminants.
fixedmod,AA,MASS
A modification which occurs on all of the specified amino acids
(e.g. fixedmod,C,57)
msgfdir,DIR
The directory containing the MSGF.jar file for rescoring using MSGF. MSGF is only used when running InsPecT. It should not be used with MSGF-DB.
tolerance_pm,NUM
The parent mass tolerance in Daltons of the mass spectra (Default is 3.0 Da).
tolerance_peak,NUM
The fragment ion tolerance in Daltons of the mass spectra (Default is 0.5 Da).
digest,STRING
Specifies the protease used for digestion. Accepted case-insensitive values
are trypsin, chymotrypsin, other. If no digest is specified, then it is guessed
from the spectrum file name.
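
The file-name rule for guessing the digest can be sketched in Python as follows. This is only
an illustration of the rule as described in this document, not the GenoMS implementation. The
function name is hypothetical, and two details are assumptions: that matching is
case-insensitive, and that only the file name (not its directory) is examined.

```python
import os


def guess_digest_from_name(spectrum_path):
    """Guess tryptic vs. non-tryptic from a spectrum file name (illustrative only)."""
    # Assumption: only the base name is inspected, ignoring case.
    name = os.path.basename(spectrum_path).lower()
    # 'trypsin' contains 'tryp', so a single substring check covers both keywords.
    return "tryptic" if "tryp" in name else "non-tryptic"


print(guess_digest_from_name("/data/run1/sample_tryp_01.mzXML"))
print(guess_digest_from_name("/data/run1/sample_gluc_01.mzXML"))
```

Under this naive substring reading, a file name containing 'chymotrypsin' also matches 'tryp'.
How GenoMS treats that case is not stated here, so setting the digest line explicitly avoids
the ambiguity.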

GenoMS includes a Java program, CreateConfigFile.jar, to create the configuration
file automatically.
    java -jar CreateConfigFile.jar version 2010.12.07
    Creates the config file to be input to GenoMS given the set of spectra.
    -s [FILE or DIR] Spectrum file(s) to be used in the experiment
    -x [DIR] Path to InsPecT executable
    -o [FILE] Config file to write
    
Must include one of the following template database options (See Template Databases):
    -c [FILE] Combined database name (either FASTA or trie)
    AND
    -t [FILE] Template constraint file
    OR
    -n [DBName] Prefix name of the DB to be created
    OR
    -g [FILE] FASTA file containing genomic data to create a 6frame DB

    
Optional parameters:
    (-d [FILE] Contaminants file)
    (-m [STRING] Fixed modification of the form C+99, or M-16.  Can be used multiple
                times to specify multiple fixed modifications.  These are modifications which
                occur on all instances of the amino acid such as a cysteine protecting group.)
    (-p [FILE or DIR] File or directory containing the PRM files that have already been generated)
    (-i [FILE or DIR] Directory containing the InsPecT input files for the spectra)
    (-f [DIR] Model directory for PepNovo (Note: You must specify either the PepNovo directory and executable, or a directory of PRM spectra with -p))
    (-r [FILE] Executable for PepNovo)
    (-q [DIR] Directory containing MSGF jar file)
    (-a [NUM] Parent mass tolerance (Da) (Default 3.0 Da))
    (-e [NUM] Fragment mass tolerance (Da) (Default 0.5 Da))
    (-h [trypsin/chymotrypsin/other] Enzyme used for digestion (Default: infer tryptic/non-tryptic from spectrum file names))
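
The Python sketch below shows one way to assemble a CreateConfigFile command line, including
fixed modifications written in the C+99 / M-16 form. It uses only flags listed above. The
helper names and all paths are hypothetical, CreateConfigFile.jar is assumed to be in the
current directory, and mass shifts are restricted to integers because that is the only form
shown in this document.

```python
def format_fixed_mod(amino_acid, mass_shift):
    """Format a fixed modification as e.g. 'C+57' or 'M-16' (integer shifts only)."""
    return f"{amino_acid}{mass_shift:+d}"


def build_create_config_command(spectra, inspect_path, config_out,
                                genome_fasta, prm_dir, fixed_mods=()):
    """Assemble a CreateConfigFile command using the -g database option (hypothetical helper)."""
    cmd = ["java", "-jar", "CreateConfigFile.jar",
           "-s", spectra,        # spectrum file(s) or directory
           "-x", inspect_path,   # path to InsPecT
           "-o", config_out,     # config file to write
           "-g", genome_fasta,   # genomic FASTA used to create a 6frame DB
           "-p", prm_dir]        # PRM files that have already been generated
    for amino_acid, mass_shift in fixed_mods:
        # -m can be repeated to specify multiple fixed modifications.
        cmd += ["-m", format_fixed_mod(amino_acid, mass_shift)]
    return cmd


# All paths below are placeholders.
create_cmd = build_create_config_command(
    "/data/run1/spectra", "/opt/inspect/inspect", "genoms_config.txt",
    "/data/db/genome.fasta", "/data/run1/prms", fixed_mods=[("C", 57)])
print(" ".join(create_cmd))
```

The -p option is used here so that no PepNovo executable or model directory has to be given.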
    

Details of optional parameters to GenoMS.jar

-r [DIR]
The resource directory to look for things like DBs
-x
Force a rerun of InsPecT searches. The default behavior is to re-run
InsPecT only if the results files are missing.
-p
Use a missing peak penalty when scoring HMM Match states. The
penalty is equal to the log likelihood of the average PRM score.
-s
Require all2all similarity for Multiple Spectrum Alignment.
This will reduce the number of false positive extensions, but may
significantly reduce the final predicted protein length.
-a
Continue to add eligible match states, even after all spectra are
added to the HMM. This is a useful option if your spectral dataset is
fairly small and you expect little overlap of peptides.
-w [NUM]
Peptide-spectrum match false discovery rate (FDR) cutoff for InsPecT
results (or rescored MS-GF results). The default cutoff is 0.01.
-f
Search database allowing a single amino acid mutation per peptide.