MSACL 2018 US Abstract

Topic: Data Science

Community-Scale Translation of Mass Spectrometry Big Data into Crowdsourced Proteomics and Metabolomics Resources

Nuno Bandeira (Presenter)
University of California, San Diego

Bio: Nuno Bandeira received his B.S. in Computer Science (1997) and M.Sc. in Applied Artificial Intelligence (2001) from the New University of Lisbon, Portugal, and his Ph.D. in Computer Science and Bioinformatics (2007) from the University of California, San Diego. Awards include the 2006 Human Proteome Organization (HUPO) Young Investigator Award, the 2007 Ph.D. Dissertation Award (CSE/UCSD), Genome Technology’s Tomorrow’s PI (2010), Molecular BioSystems’ Emerging Investigator (2012) and a 2013 Sloan Research Fellowship. Nuno Bandeira, Ph.D. is an Associate Professor of Computer Science and Engineering and of the Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California, San Diego, and is also the founding and current Executive Director of the NIH/NIGMS Center for Computational Mass Spectrometry at UCSD. His lab’s research focuses on big data algorithms and systems for worldwide interpretation of proteomics and metabolomics mass spectrometry data, including data from endogenous and digested peptides, discovery and localization of post-translational modifications, protein-protein interactions, sequencing of non-linear peptides with unknown amino acids, and characterization of microbial, marine, reptile and plant natural products. As Executive Director of the UCSD Center for Computational Mass Spectrometry, his research further extends to distributed algorithms for large-scale data analysis (ProteoSAFe), sharing (MassIVE) and crowdsourced, community-wide interpretation (GNPS) of all publicly available mass spectrometry data.

Authorship: Mingxun Wang(1,2), Laurence Bernstein(1,2), Julie Wertz(1,2), Seungjin Na(1,2), Jeremy Carver(1,2), Nuno Bandeira(1,2)
(1) Center for Computational Mass Spectrometry, (2) University of California, San Diego

Short Abstract

Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics and metabolomics events requires the development of novel community-scale computational workflows. We show how advanced distributed computing algorithms can be used to process tens of terabytes of public data to reveal hundreds of millions of new identifications, including the discovery of novel proteins and hypermodified peptides with over 100 modification variants. We further show how large scale reanalyses can be reliably aggregated into community-scale crowdsourced spectral libraries enabling high-throughput detection and discovery in newly acquired data.

Long Abstract

Introduction

The overwhelming majority of the human proteomics mass spectrometry data deposited in public repositories either has no associated annotations or lacks statistical controls on the reliability of the data analysis. As such, most of the deposited data is difficult to access even for mass spectrometry experts and is nearly inaccessible to most biologists. To address the pressing need to make mass spectrometry big data both more accessible (to assess previous claims) and more reusable (to enable new discoveries), the Mass spectrometry Interactive Virtual Environment (MassIVE, http://massive.ucsd.edu) approaches these challenges using distributed computing ProteoSAFe workflows (http://proteomics.ucsd.edu/ProteoSAFe) and crowdsourced curation interfaces such as the GNPS platform (http://gnps.ucsd.edu) for the analysis of metabolomics and natural products mass spectrometry data.

Methods

We show how to expand coverage of the human proteome and metabolome in three ways. First, we make public data readily searchable with a variety of algorithms, such as spectral library search and proteogenomics. Second, we use spectral alignment algorithms to reveal numerous unexpected post-translational modifications and previously unidentified connections between datasets. Third, we show how the resulting knowledge can be aggregated in an open platform and made easily accessible for verification and reuse by the whole community.
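The spectral library search mentioned above can be sketched in minimal form: each spectrum is binned into an intensity vector and compared against library spectra by normalized cosine similarity. This is a simplified illustration only; the function names, binning scheme, and score threshold are illustrative assumptions, not the actual ProteoSAFe scoring.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a sparse vector keyed by bin index."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned

def cosine_score(query_peaks, ref_peaks, bin_width=1.0):
    """Normalized dot product between two binned spectra (ranges 0 to 1)."""
    q = bin_spectrum(query_peaks, bin_width)
    r = bin_spectrum(ref_peaks, bin_width)
    dot = sum(q[i] * r[i] for i in q if i in r)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

def best_library_match(query_peaks, library_spectra, min_score=0.7):
    """Return (name, score) of the best-scoring library spectrum, or
    (None, score) if no match clears the threshold."""
    best_name, best_score = None, 0.0
    for name, peaks in library_spectra.items():
        score = cosine_score(query_peaks, peaks)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= min_score else (None, best_score)
```

In practice, production workflows add precursor mass filtering, peak preprocessing, and statistical significance estimates on top of the raw similarity score.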

Results

First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). Our large-scale reanalysis of tens of terabytes of human data has now increased the total number of properly controlled public PSMs by over 10-fold, to over 320 million PSMs covering over 85% of public human HCD data.

Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled over 2 million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, figures that at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LCMS runs and 19,610 total proteins when including the recent ProteomeTools data.
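The strict statistical controls mentioned above commonly take the form of target-decoy false discovery rate (FDR) estimation. The sketch below is a generic illustration of that idea, assuming score-ranked PSMs labeled as target or decoy; the function and parameter names are hypothetical and do not reflect the actual MassIVE-KB implementation.

```python
def filter_psms_at_fdr(psms, max_fdr=0.01):
    """
    psms: list of (score, is_decoy) pairs; higher score = better match.
    Walks down the score-ranked list and keeps the largest prefix whose
    estimated FDR (decoys accepted / targets accepted) is <= max_fdr.
    Returns the scores of the accepted target PSMs.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets, decoys = 0, 0
    accepted, best = [], []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            accepted.append(score)
        # Snapshot the accepted set whenever the running FDR is acceptable.
        if targets and decoys / targets <= max_fdr:
            best = list(accepted)
    return best
```

Assembling a KB from many independent searches additionally requires propagating these controls across datasets so that aggregate error rates stay bounded, which is what makes community-scale synthesis nontrivial.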

Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly-correlated spectra. These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at ~1.3 variants per protein region) is far below what is observable in experimental data.
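One simplified way such unexpected modifications can surface, sketched below, is as recurring offsets between observed precursor masses and the theoretical masses of already-identified peptides. This is an illustrative toy, not the spectral alignment algorithm used in this work; the function name, tolerance, and threshold are assumptions.

```python
from collections import Counter

def putative_modification_offsets(pairs, tolerance=0.01, min_count=3):
    """
    pairs: list of (observed_precursor_mass, theoretical_peptide_mass).
    Returns mass offsets (snapped to a tolerance-sized grid) that recur
    at least min_count times -- candidates for unexpected modifications.
    """
    counts = Counter()
    for observed, theoretical in pairs:
        delta = observed - theoretical
        if abs(delta) > tolerance:  # skip unmodified matches
            counts[round(delta / tolerance) * tolerance] += 1
    return {offset: n for offset, n in counts.items() if n >= min_count}
```

For example, an offset recurring near +15.99 Da across many spectra would suggest widespread oxidation; combinatorial accumulation of such offsets on one protein region is what produces the 100+ variant counts reported above.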

Finally, we present an open platform for community sharing of MS data and knowledge. Building on large-scale automated reanalysis and distributed computing workflows designed to integrate public and private data, we show how thousands of users from over 100 countries have embraced open knowledge to analyze billions of spectra from millions of mass spectrometry runs, thereby enabling the concept of ‘living data’ and improving identification rates by over 20-fold.

Conclusions & Discussion

MS big data can be transformed into ‘living data’ through automated integration of advanced algorithms and community contributions.


References & Acknowledgements:


Financial Disclosure

Description     Y/N    Source
Grants          no
Salary          yes
Board Member    no
Stock           yes    Digital Proteomics
Expenses        no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above:

no