Meta-SPS
Contact: Adrian Guthals [aguthals (at) cs.ucsd.edu]
Summary
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our Shotgun Protein Sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But while SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Utilizing low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS datasets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
Documentation (Meta-SPS)
Quick Start Instructions
See the "Docs" directory in the download package for full
documentation on executing the binaries and interpreting results. A
heavily summarized version is given here, as well as
default parameter files that should be used for running MetaSPS
as described in the papers.
Execution: MetaSPS automatically generates a number of
statically-named sub-directories and files wherever it is
run, so it is best to run from a clean directory by calling:
<installation dir>/bin/main_specnets <params file> [OPTIONS]
Call "<installation dir>/bin/main_specnets --help" for a list of
available options. Apart from the "-lf" and "-ll" options, which
control log output, the default option values will run MetaSPS
from start to finish. Input parameter files can further control
execution, but the attached "MetaSPS_*" files contain all the
necessary default parameter values, so all that should be needed
is to specify the path to the installation directory, the input
MS/MS spectra, the database of contaminant and (optionally)
homologous proteins, and the peak/parent mass tolerances. Please use the
appropriate default parameter file for the type of input spectra:
"MetaSPS_IT_CID" is for CID MS/MS spectra where MS/MS fragments were
collected in the Ion Trap (low-res), "MetaSPS_FT_CID" is for CID
MS/MS spectra where MS/MS fragments were collected in the Orbitrap
(high-res), and "MetaSPS_FT_HCD" is for HCD MS/MS spectra.
"MetaSPS_FT_CID_HCD_ETD" is for processing paired or triplet
CID/HCD/ETD scans as described in the JPR paper (see section
"CID/HCD/ETD").
Output: See the installation package for how to read the suite of
output HTML reports. However, one can also convert the set of
meta-contig de novo sequences to MGF format (as PRM spectra) by
issuing the following command from the project directory with the
supplied parameters file. The file "meta_contigs.mgf" will be
output to the same directory.
<installation dir>/bin/main_execmodule ExecMergeConvert convertContigs.params
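The execution and conversion calls above can be wrapped in a small script. The following is a minimal sketch, not part of the MetaSPS package: the function names (specnets_command, convert_command, run_metasps) are hypothetical, and it assumes only the two command lines quoted above. It refuses to start in a non-empty directory, since MetaSPS writes statically-named files wherever it is run.

```python
import os
import subprocess
from pathlib import Path


def specnets_command(install_dir, params_file, options=()):
    """Argument list for: <installation dir>/bin/main_specnets <params file> [OPTIONS]"""
    exe = Path(install_dir) / "bin" / "main_specnets"
    # The params path is made absolute because the process is started
    # from a different working directory (the project directory).
    return [str(exe), os.path.abspath(params_file), *options]


def convert_command(install_dir, convert_params="convertContigs.params"):
    """Argument list for the ExecMergeConvert call that writes meta_contigs.mgf."""
    exe = Path(install_dir) / "bin" / "main_execmodule"
    return [str(exe), "ExecMergeConvert", str(convert_params)]


def run_metasps(install_dir, params_file, project_dir, options=()):
    """Run MetaSPS with project_dir as working directory.

    Raises RuntimeError if project_dir already contains files, so that
    output from an earlier run is never mixed with a new one.
    """
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    if any(project.iterdir()):
        raise RuntimeError(f"{project} is not empty; use a clean directory")
    return subprocess.run(
        specnets_command(install_dir, params_file, options),
        cwd=project,
        check=True,
    )
```

Once a run has finished and convertContigs.params is present in the project directory, the conversion could be issued in the same way, for example with subprocess.run(convert_command(install_dir), cwd=project_dir, check=True).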
Execution on an SGE grid: If the number of input MS/MS spectra is much higher than 20,000, SPS may take days to compute the all-to-all alignment of spectra on a single thread. See the release documentation for how to run this step in parallel if you have SGE installed on your system. If SGE is not available, you can instead add the parameter PARTIAL_OVERLAPS=0 to the parameters file. This may sacrifice de novo sequencing coverage/length, but it reduces running time by orders of magnitude and makes it more feasible to run the alignment step on a single thread.
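As an illustration of adding PARTIAL_OVERLAPS=0, the sketch below sets a parameter in the text of a parameters file. It assumes the file consists of plain KEY=VALUE lines, as in the REPORT_SERVER example in this documentation; set_param is a hypothetical helper, not a MetaSPS tool.

```python
def set_param(params_text, key, value):
    """Return params_text with KEY=value set.

    An existing KEY=... line is replaced in place; otherwise a new
    line is appended. Assumes one KEY=VALUE pair per line.
    """
    lines = params_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Typical use would be to read the default parameter file, pass its contents through set_param(text, "PARTIAL_OVERLAPS", 0), and write the result to the parameters file used for the run.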
CID/HCD/ETD
If you have paired CID/ETD, HCD/ETD, or triplet CID/HCD/ETD, the first step is to use the MetaSPS_FT_CID_HCD_ETD.params file. Inside that file, there is a parameter "NUM_CONSECUTIVE", which must be set to 2 for paired CID/ETD or HCD/ETD, and set to 3 for triplet CID/HCD/ETD. The most critical aspect of processing this type of data is that the input spectra have properly assigned activation fields so MetaSPS knows which spectra are CID, HCD, or ETD. If you have mzXML-formatted spectra that are directly converted from .RAW (or some other vendor-specific format), those fields should be set. But if you inspect the mzXMLs and you do not see activationMethod set for ALL MS/MS scans (activationMethod="CID", activationMethod="HCD", activationMethod="ETD"), then you can add in the fields yourself or use MGF format. If you use MGF format, then each spectrum must have a parameter named ACTIVATION set to "CID", "HCD", or "ETD" somewhere after each BEGIN IONS:
BEGIN IONS
PEPMASS=617.80536
CHARGE=2+
MSLEVEL=2
ACTIVATION=ETD
TITLE=Scan Number: 2
166.530609 171.586746
171.898941 194.881088
...
END IONS
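Because correct activation fields are the most critical requirement for this type of data, it can be worth checking an MGF file before starting a run. The following is a minimal sketch under the assumptions stated above (each spectrum is delimited by BEGIN IONS / END IONS and carries an ACTIVATION=CID, HCD, or ETD line); the function names are hypothetical, and the check is deliberately strict about spelling and case.

```python
VALID_ACTIVATIONS = {"CID", "HCD", "ETD"}


def read_activations(mgf_text):
    """Return one entry per spectrum: its ACTIVATION value, or None if absent."""
    activations = []
    current = None
    inside = False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            inside, current = True, None
        elif line == "END IONS":
            if inside:
                activations.append(current)
            inside = False
        elif inside and line.startswith("ACTIVATION="):
            current = line.split("=", 1)[1].strip()
    return activations


def check_activations(mgf_text):
    """Return a list of problems; an empty list means every spectrum is labeled."""
    return [
        f"spectrum {i + 1}: ACTIVATION is {value!r}"
        for i, value in enumerate(read_activations(mgf_text))
        if value not in VALID_ACTIVATIONS
    ]
```

Calling check_activations(open("spectra.mgf").read()) on a hypothetical input file would list every spectrum whose activation field is missing or not one of the three accepted values.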
Deployment Instructions
Unzip the Spectral Networks package to a directory <installation directory> (any directory of your choice). This directory should then contain the following directories:
- sps/bin - Contains the binary executables
- sps/cgi - Contains CGI scripts used by the program
- sps/Doc - Documentation
- sps/example - Test project
Web Server
SPS is a command line tool that outputs reports in HTML format.
Report pages may be viewed by opening the generated HTML report files
in a web browser. The files generated by SPS should be served by
a web server such as Apache.
To enable interactivity in protein sequencing reports (see results
documentation), the following CGI scripts must be included in the
web server's configuration file:
- <installation directory>/sps/cgi/specplot.cgi
- <installation directory>/sps/cgi/contplot.cgi
- <installation directory>/sps/cgi/spsReports.cgi
Configuration
The following changes should be made in the installed scripts:
- Edit <installation directory>/cgi/spsReports.cgi at line 12. The line should be:
  $ENV{'LD_LIBRARY_PATH'} = "<installation directory>/sps/bin/libs";
- Edit <installation directory>/cgi/spsReports.cgi at line 8. The line should be:
  $SPS_DIR = "<installation directory>/sps/";
- Edit <installation directory>/cgi/specplot.cgi at line 32. The line should be:
  $ENV{'LD_LIBRARY_PATH'} = "<installation directory>/sps/bin/libs";
- Edit <installation directory>/cgi/specplot.cgi at line 27. The line should be:
  $TMP = "<TMP_DIRECTORY>";
  where TMP_DIRECTORY is a directory in the file system where the server process has write permissions
- Edit <installation directory>/cgi/specplot.cgi at line 28. The line should be:
  $SPS_DIR = "<installation directory>/sps/";
- Edit <installation directory>/cgi/contplot.cgi at line 32. The line should be:
  $ENV{'LD_LIBRARY_PATH'} = "<installation directory>/sps/bin/libs";
- Edit <installation directory>/cgi/contplot.cgi at line 27. The line should be:
  $TMP = "<TMP_DIRECTORY>";
  where TMP_DIRECTORY is a directory in the file system where the server process has write permissions
- Edit <installation directory>/cgi/contplot.cgi at line 28. The line should be:
  $SPS_DIR = "<installation directory>/sps/";
Testing the Installation
SPS installation may be tested by downloading the SPS test package.
- Download the package
- Unzip it to <installation directory>
- Open a shell
- cd to <installation directory>/sps/example
- Edit the sps.params file:
  - EXE_DIR should point to <installation directory>/sps/bin (should be an absolute path).
  - REPORT_DIR defines the output directory for report files. It should be in the web server path so that report pages can be served by the web server (e.g. Apache).
  - GRID_SGE_EXE_DIR should point to where the SGE binaries are located (qstat, qsub, etc.).
  - GRID_EXE_DIR should point to where the SPS binaries (the same ones pointed to by EXE_DIR) are seen on SGE.
  - REPORT_SERVER should point to the server's CGI directory. Example:
    REPORT_SERVER=http://myserver.com/cgi-bin/
  - REPORT_DIR_SERVER should point to the project directory on the server.
- Run go.sh on Linux systems or go.bat on Windows systems.
- In a web browser, open '<URL path in webserver>/index.html', which is located inside the specified report directory (taking your web server's path configuration into account). The initial report page should be displayed.
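Before running the test project, the sps.params settings listed above can be reviewed with a short script. This is a hedged sketch: it assumes sps.params holds one KEY=VALUE pair per line (as in the REPORT_SERVER example), the function names are hypothetical, and the only rules it checks are the ones stated in this section. Parameters reported as "not set" are not necessarily errors; for instance, whether the GRID_* parameters are required without SGE is not covered here.

```python
import os

# Parameters named in the test instructions above.
LISTED_KEYS = [
    "EXE_DIR",
    "REPORT_DIR",
    "GRID_SGE_EXE_DIR",
    "GRID_EXE_DIR",
    "REPORT_SERVER",
    "REPORT_DIR_SERVER",
]


def parse_params(text):
    """Parse KEY=VALUE lines into a dict; lines without '=' are ignored."""
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


def review_params(text):
    """Return a list of notes about the listed parameters."""
    params = parse_params(text)
    notes = [f"{key} is not set" for key in LISTED_KEYS if not params.get(key)]
    exe_dir = params.get("EXE_DIR", "")
    if exe_dir and not os.path.isabs(exe_dir):
        notes.append("EXE_DIR should be an absolute path")
    return notes
```

For example, review_params(open("sps.params").read()) returns an empty list when all six parameters are present and EXE_DIR is absolute.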
Sample Report
6-prot_meta-contigs
Downloads
Installation Packages
- Linux 32-bit (72.6 MB)
- Linux 64-bit (73.3 MB)
- Windows 32/64-bit (97.8 MB)
Publications
Shotgun protein sequencing with meta-contig assembly
Guthals A, Clauser KR, Bandeira N
Molecular & Cellular Proteomics (2012) 10:1084-96.
Sequencing-Grade De novo Analysis of MS/MS Triplets (CID/HCD/ETD) From Overlapping Peptides
Guthals A, Clauser KR, Frank AM, Bandeira N
J Proteome Res. 2013 Jun 7;12(6):2846-57. doi: 10.1021/pr400173d.