Enabling High-Throughput Big Data Proteomics
Decoding Proteomics Big Data
We face a “Big data, many tools, no solution” conundrum in the MS community. New analysis tools are continually being developed and made available, including by us at CCMS. However, the vast majority of these tools require an extensive series of manual steps to translate data into meaningful results. As a result, many smaller labs are forced to forgo innovative tools in favor of stable software platforms with limited search capabilities. The CCMS ProteoSAFe platform meets this challenge with a computational framework built for easy integration of new tools (Flexibility), which can be easily accessed (Accessibility) and executed on laptops or large compute clusters (Scalability).
In addition to software solutions, there is also a pressing need for structured access to, and reuse of, raw data. Today, a researcher analyzing the human kidney proteome at Harvard does not benefit from spectra released by a researcher at MIT who might also be analyzing the human (let alone mouse) kidney proteome. Even the simple question of whether a spectrum (identified or not) has been seen before, and under what circumstances, cannot be easily answered today. This widespread introversion is troublesome. Consider genomics, where GenBank and other databases link every gene sequence to all publications that make use of the sequence. Similarly, we could potentially annotate each spectrum with knowledge from all laboratories that generated it. However, there are few solutions available to capitalize on the value of “Big Data” in proteomics. The CCMS MassIVE repository is designed to meet this challenge by providing a petabyte-scale platform for mass spectrometry data sharing. The MassIVE repository already enables the sharing of multiple terabytes of mass spectrometry data, including all the public data that could be recovered from the now-defunct Tranche repository.

