GenoMS: Sequencing whole proteins using a template database

Usage Instructions

java -jar GenoMS.jar

Required Parameters:

    -i [FILE] The input configuration file.  More details can be found below
    -o [FILE] The output file

Optional Parameters:

    -r [DIR] The resource directory to look for things like DBs
    -k [DIR] Project directory to write all output
    -e [DIR] An execution directory where all executables can be found (e.g. InsPecT, MSGF.jar, PepNovo).  This is only
            needed if full paths are not provided in the configuration file.
    -x Force a rerun of database searches
    -p Use a missing peak penalty when scoring HMM Match states
    -s Require all2all similarity for Multiple Spectrum Alignment
    -a Continue to add eligible match states, even after all spectra are added to HMM
    -w [NUM] FDR cutoff for peptide-spectrum matches 
    -f Search database allowing mutations
    -l [FILE] Write output to a log file (default is to stdout)
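
As an illustration of how the flags above combine, here is a minimal Python sketch that assembles a GenoMS command line. The helper name build_genoms_command and the file names are hypothetical. Only flags listed above are used, and GenoMS.jar is assumed to be in the current directory.

```python
def build_genoms_command(config_file, output_file, resource_dir=None,
                         fdr_cutoff=None, force_rerun=False, log_file=None):
    """Assemble the argument list for a GenoMS run (hypothetical helper)."""
    # -i and -o are the two required parameters.
    cmd = ["java", "-jar", "GenoMS.jar", "-i", config_file, "-o", output_file]
    if resource_dir is not None:
        cmd += ["-r", resource_dir]      # resource directory (DBs etc.)
    if force_rerun:
        cmd.append("-x")                 # force a rerun of database searches
    if fdr_cutoff is not None:
        cmd += ["-w", str(fdr_cutoff)]   # FDR cutoff for peptide-spectrum matches
    if log_file is not None:
        cmd += ["-l", log_file]          # write the log to a file instead of stdout
    return cmd


# Hypothetical file names, for illustration only.
command = build_genoms_command("experiment.cfg", "results.txt",
                               fdr_cutoff=0.01, log_file="genoms.log")
print(" ".join(command))
```

The returned list can be passed to subprocess.run(command, check=True) to launch the run. Building a list rather than a single string avoids quoting problems with paths that contain spaces.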

The configuration file

The configuration file contains all of the information needed for GenoMS to run.
The lines of the file must obey the following rules:

Required Lines:

inspectspectra,SPECTRUM_FILE,INSPECT_INPUT_FILE

spectra,SPECTRUM_FILE

You may use either InsPecT or MSGF-DB to perform peptide identification. If you are
using InsPecT, you must use the inspectspectra parameter and specify both the full
path to the spectrum file and the path to the InsPecT input file for that spectrum
file. See Preprocessing for more information on creating this input file. If you are
using MSGF-DB, use the spectra parameter; you need only specify the full path to
the spectrum file.

prms,PRM_FILE_NAME

pepnovoexec,PEPNOVO_EXECUTABLE

AND

pepnovomodeldir,PEPNOVO_MODEL_DIR

For each PRM file created by PepNovo, you must specify the full path to the file.
For more information, see Preprocessing. Alternatively,
you can specify the path to PepNovo, and GenoMS will generate the PRMs using the
default parameters and those specified in the config file. Specifically, files
are assumed to be from an LTQ machine, and the digest is inferred from the spectrum
file name (must include 'tryp' or 'trypsin' to be identified as a tryptic digest).
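
The file-name rule above can be illustrated with a short Python sketch. This is not GenoMS's own code: the function name looks_tryptic, the example file names, and the case-insensitive comparison are all assumptions made for illustration.

```python
import os


def looks_tryptic(spectrum_path):
    """Return True if the spectrum file name suggests a tryptic digest.

    Checking for 'tryp' also covers 'trypsin', since the latter contains the former.
    """
    name = os.path.basename(spectrum_path).lower()  # assumption: case is ignored
    return "tryp" in name


# Hypothetical file names, for illustration only.
print(looks_tryptic("/data/sample1_trypsin.mzXML"))
print(looks_tryptic("/data/sample1_gluc.mzXML"))
```

Note that a plain substring test like this one also matches a name containing 'chymotrypsin'. To avoid relying on file names at all, set the digest explicitly in the configuration file.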

inspectpath,PATH_INSPECT_EXECUTABLE

msgfdbpath,PATH_TO_MSGFDB_JAR_FILE

If you plan to use InsPecT for peptide identification, you must compile InsPecT and provide the full path to the InsPecT executable. Alternatively, you may use MSGF-DB, in which case you must provide the full path to the MSGF-DB jar file. The spectrum line (inspectspectra or spectra) depends on which database search tool you use.

genomeseq,GENOME_FILE

dbrootname,ROOT_NAME

templateconstraintfile,FILE_NAME

AND

dbcombined,DB_FILE_NAME

One of three different types of databases must be specified: a genome sequence,
the root name of a set of sequence files for further construction, or a fully
constructed template database file. If a fully constructed template database is
provided, the constraint file is also needed. For more details about these
database types, see Template Databases.
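
The required-line rules above can be summarised as a small checker. The following Python sketch is hypothetical (GenoMS performs its own validation, which may differ); the function name check_required_lines and the example paths are assumptions. It only tests which line types are present and does not check that the spectrum line matches the chosen search tool.

```python
def check_required_lines(config_lines):
    """Return a list of problems found in a GenoMS-style config (hypothetical checker)."""
    # The first comma-separated field of each non-empty line is the line type.
    keys = {line.split(",", 1)[0].strip() for line in config_lines if line.strip()}
    problems = []
    if not keys & {"inspectspectra", "spectra"}:
        problems.append("no spectrum line (inspectspectra or spectra)")
    if "prms" not in keys and not {"pepnovoexec", "pepnovomodeldir"} <= keys:
        problems.append("no PRM source (prms, or pepnovoexec with pepnovomodeldir)")
    if not keys & {"inspectpath", "msgfdbpath"}:
        problems.append("no search tool (inspectpath or msgfdbpath)")
    has_template = {"templateconstraintfile", "dbcombined"} <= keys
    if not (has_template or keys & {"genomeseq", "dbrootname"}):
        problems.append("no database (genomeseq, dbrootname, "
                        "or dbcombined with templateconstraintfile)")
    return problems


# Hypothetical paths, for illustration only.
example = [
    "spectra,/data/run1_trypsin.mzXML",
    "prms,/data/run1_trypsin.prm",
    "msgfdbpath,/tools/MSGFDB.jar",
    "genomeseq,/db/genome.fasta",
]
print(check_required_lines(example))      # complete configuration
print(check_required_lines(example[:3]))  # database line removed
```

Running a check like this before launching GenoMS can catch a missing line early, for example a dbcombined line given without its templateconstraintfile.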

Optional lines:

contaminants,COMMON_CONTAMINANTS_FILE	A database file (FASTA format) of common contaminants.
fixedmod,AA,MASS	A modification which occurs on all of the specified amino acids (e.g. fixedmod,C,57)
msgfdir,DIR	The directory containing the MSGF.jar file for rescoring using MSGF. MSGF is used only when running InsPecT; it should not be used with MSGF-DB.
tolerance_pm,NUM	The parent mass tolerance in Daltons of the mass spectra (Default is 3.0 Da).
tolerance_peak,NUM	The fragment ion tolerance in Daltons of the mass spectra (Default is 0.5 Da).
digest,STRING	Specifies the protease used for digestion. Accepted case-insensitive values are trypsin, chymotrypsin, other. If no digest is specified, then it is guessed from the spectrum file name.
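
To show how the optional lines and their defaults fit together, here is a minimal Python sketch that reads them from a list of configuration lines. The function name read_optional_settings and the dictionary layout are assumptions made for illustration; this is not how GenoMS itself parses the file.

```python
# Defaults as stated above: 3.0 Da parent mass, 0.5 Da fragment ion tolerance.
DEFAULTS = {"tolerance_pm": 3.0, "tolerance_peak": 0.5}
ALLOWED_DIGESTS = {"trypsin", "chymotrypsin", "other"}


def read_optional_settings(config_lines):
    """Collect optional settings from config lines, applying the documented defaults."""
    settings = dict(DEFAULTS)
    settings["fixedmods"] = []   # list of (amino acid, mass) pairs
    settings["digest"] = None    # None means: guess from the spectrum file name
    for line in config_lines:
        fields = [f.strip() for f in line.split(",")]
        key = fields[0]
        if key in ("tolerance_pm", "tolerance_peak"):
            settings[key] = float(fields[1])
        elif key == "fixedmod":
            settings["fixedmods"].append((fields[1], float(fields[2])))
        elif key == "digest":
            value = fields[1].lower()  # digest values are case-insensitive
            if value not in ALLOWED_DIGESTS:
                raise ValueError(f"unsupported digest: {fields[1]}")
            settings["digest"] = value
    return settings


print(read_optional_settings(["fixedmod,C,57", "tolerance_peak,0.3", "digest,Trypsin"]))
```

Lines of other types are ignored by this sketch, so the same list of lines can hold both required and optional entries.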

GenoMS includes a Java program, CreateConfigFile.jar, that creates the configuration
file automatically.

    java -jar CreateConfigFile.jar version 2010.12.07
    Creates the config file to be input to GenoMS given the set of spectra.
    -s [FILE or DIR] Spectrum file(s) to be used in the experiment
    -x [DIR] Path to InsPecT executable
    -o [FILE] Config file to write

Must include one of the following template database options (See Template Databases):

    -c [FILE] Combined database name (either FASTA or trie)
    AND
    -t [FILE] Template constraint file
    
    -n [DBName] Prefix name of the DB to be created
    
    -g [FILE] FASTA file containing genomic data to create a 6frame DB

Optional parameters:

    (-d [FILE] Contaminants file)
    (-m [STRING] Fixed modification of the form C+99, or M-16.  Can be used multiple
                times to specify multiple fixed modifications.  These are modifications which
                occur on all instances of the amino acid such as a cysteine protecting group.)
    (-p [FILE or DIR] File or directory containing the PRM files that have already been generated)
    (-i [FILE or DIR] Directory containing the InsPecT input files for the spectra)
    (-f [DIR] Model directory for PepNovo (Note: You must specify either the PepNovo directory and executable, or a directory of PRM spectra with -p))
    (-r [FILE] Executable for PepNovo)
    (-q [DIR] Directory containing MSGF jar file)
    (-a [NUM] Parent mass tolerance (Da) (Default 3.0 Da))
    (-e [NUM] Fragment mass tolerance (Da) (Default 0.5 Da))
    (-h [trypsin/chymotrypsin/other] Enzyme used for digestion (Default: infer tryptic/non-tryptic from spectrum file names))
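
As an illustration, the following Python sketch assembles a CreateConfigFile.jar command line using the -s, -x and -o flags, the -g database option, and repeated -m flags. The helper name build_create_config_command, the paths, and the exact pattern used to check the C+99 / M-16 form are assumptions; the tool's own parsing may be stricter or more lenient.

```python
import re

# Assumption: one upper-case residue letter, a sign, then a (possibly decimal) mass.
MOD_PATTERN = re.compile(r"^[A-Z][+-]\d+(\.\d+)?$")


def build_create_config_command(spectra, inspect_path, config_out,
                                genome_fasta, fixed_mods=()):
    """Assemble the argument list for CreateConfigFile.jar (hypothetical helper)."""
    cmd = ["java", "-jar", "CreateConfigFile.jar",
           "-s", spectra,        # spectrum file or directory
           "-x", inspect_path,   # path to the InsPecT executable
           "-o", config_out,     # config file to write
           "-g", genome_fasta]   # genomic FASTA used to create a 6-frame DB
    for mod in fixed_mods:
        if not MOD_PATTERN.match(mod):
            raise ValueError(f"fixed modification not of the form C+99 or M-16: {mod}")
        cmd += ["-m", mod]       # -m may be given several times
    return cmd


# Hypothetical paths, for illustration only.
print(" ".join(build_create_config_command(
    "/data/spectra", "/tools/inspect", "experiment.cfg",
    "/db/genome.fasta", fixed_mods=["C+57", "M-16"])))
```

To use a prebuilt template database instead, the -g pair would be replaced by -c together with -t, or by -n for a database root name.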

Details of optional parameters to GenoMS.jar

-r [DIR]	The resource directory to look for things like DBs
-x	Force a rerun of InsPecT searches. The default behavior is to re-run InsPecT only if the results files are missing.
-p	Use a missing peak penalty when scoring HMM Match states. The penalty is equal to the log likelihood of the average PRM score.
-s	Require all2all similarity for Multiple Spectrum Alignment. This will reduce the number of false positive extensions, but may significantly reduce the final predicted protein length.
-a	Continue to add eligible match states, even after all spectra are added to the HMM. This is a useful option if your spectral dataset is fairly small and you expect little overlap of peptides.
-w [NUM]	Peptide-spectrum match false discovery rate (FDR) cutoff for InsPecT results (or rescored MS-GF results). The default cutoff is 0.01.
-f	Search database allowing a single amino acid mutation per peptide