MSACL 2016 EU Abstract

From Thousands of Mass Spectrometry Profiles to Biomarkers Within a Day: Arion 4 Omics, a Novel Solution to Facilitate and Accelerate Omics-based Decisions

Dr Doroteya Staykova (Presenter)
Multicore Dynamics Ltd

Bio: With a background in physics, Dr. Staykova has over twenty years of experience in the development of bespoke software applications for high-throughput analysis of spectral data generated by experimental techniques including NMR and mass spectrometry. Dr. Staykova has participated in a number of EU-funded projects where she has worked on cutting-edge analytics for biomolecular research and Omics-based Medicine. She has formed strong collaborations with academic, industrial and governmental bodies to facilitate their respective fields of research, e.g. the University of Cambridge, Bruker BioSpin GmbH, General Hospital and the University of Southampton. As one of the directors of the bioinformatics startup Multicore Dynamics Ltd, Dr Staykova is currently working on the development of high-performance, parallel computing solutions for accelerating the path to new discoveries in health science.

Authorship: Doroteya K. Staykova, Matthew E. Lea
Multicore Dynamics Ltd, UK

Short Abstract

In an era where data science, biology and chemistry have become finely interlinked, and where high-throughput instruments generate staggering amounts of omics data for disease profiling and biomarker discovery, much has been written about the problems of storing, processing and analysing that data. Here we present an advanced, fully benchmarked high-performance computing solution that directly addresses key challenges surrounding the manipulation and analysis of large-scale proteomics data. We demonstrate how ‘Arion 4 Omics’ can analyse and discover biomarkers from many thousands of samples, with results delivered in less than a day.

Long Abstract

Introduction.

In this era of Terabytes to Zettabytes, where high-throughput instruments generate staggering amounts of omics data, a number of authors [1, 2] have addressed the problems of storing, processing and analysing this data. Indeed, much has been written about the diverse range of multi-disciplinary skills necessary to see a clinical trial through from start to finish [2, 3].

Whilst there is a growing trend towards the profiling of disease using omics data generated from population-based studies, the management, integration and interpretation of this data, together with escalating costs, are recognised as the main contributing factors that adversely affect the transition from conventional to personalised medicine [2, 3].

Recent advancements in liquid chromatography-tandem mass spectrometry (LC-MS/MS) enable short measurement times and high sensitivity for the detection of low-abundance biomolecules in clinically accessible fluids (e.g. blood, saliva). These achievements have led to the absolute quantification of an increased number of proteins in both simple and complex samples [4].

Large-cohort clinical studies can utilise such high-throughput technology to profile disease groups on a molecular level, or for the discovery and validation of biomarkers [5, 6]. If, therefore, the stratification of patients on a molecular level is the way forward, the real problem lies not in the acquisition or generation of the data, but in its accurate interpretation.

Here we present an advanced, fully benchmarked high-performance computing solution that addresses a number of key challenges surrounding the manipulation and analysis of large-scale proteomics data. ‘Arion 4 Omics’ can analyse and discover biomarkers from many thousands of samples, with results delivered in less than a day.

Methods.

The process of generating large-scale proteomics datasets begins with data acquisition, which can take an extended period of time depending upon the availability of the clinical samples. The larger the clinical study, the larger the resulting volume of data, which in turn requires quality checking, integration and transformation into ‘analysis ready’ datasets.

In order to provide an automated quality assessment of the data and subsequent transformations to support clinicians and researchers in their quest to discover biomarkers, ‘Arion 4 Omics’ combines domain expertise with several proven and tested technologies for large-scale data manipulation. The following are integral components of the ‘Arion 4 Omics’ analytical platform, hereafter defined as ‘pillars’.

Pillar 1: Database Technologies.

Large clinical trials inevitably create large volumes of data. Managing this data efficiently, and permitting its systematic analysis and conversion into applied knowledge, requires a suitable repository.

‘Arion 4 Omics’ employs advanced database technologies that enable the parallel processing of key database operations. This translates into ultra-rapid query results that facilitate and accelerate the process of data exploration. All user interactions are recorded as ‘system events’ stored in an auditing database. This information is designed to adhere to stringent security protocols, or simply to co-operate with a client’s security and auditing standards.

As well as being performance driven, the ‘Arion 4 Omics’ database offers scalability and flexibility, with data integrity built in as standard, all geared towards big data analytics.

Pillar 2: Highly Parallel Architecture.

In order to access and process dataset information both reliably and quickly, ‘Arion 4 Omics’ employs a scalable, parallel architecture containing thousands of processing cores. Novel domain-specific algorithms have been exhaustively benchmarked to ensure high accuracy of repeatable results, consistent performance and the ability to provide maximum insight into the data. Most are designed to execute in parallel in order to make full use of the processing power available.
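Arion’s internal algorithms are proprietary and not described here; purely as a hypothetical illustration, the per-sample parallel pattern described above can be sketched in Python, where an illustrative processing step (total-intensity scaling, chosen only as a stand-in) is mapped across a batch of samples on all available cores:

```python
from multiprocessing import Pool

def process_sample(intensities):
    """Toy per-sample step: scale intensities to sum to 1 (illustrative only;
    it stands in for whatever domain-specific algorithm runs per sample)."""
    total = sum(intensities)
    return [x / total for x in intensities]

if __name__ == "__main__":
    # Hypothetical batch of per-sample intensity vectors.
    samples = [[10.0, 20.0, 70.0], [5.0, 5.0, 90.0]]
    # Pool() defaults to one worker per CPU core; each sample is
    # processed independently, so the work parallelises trivially.
    with Pool() as pool:
        scaled = pool.map(process_sample, samples)
    print(scaled)
```

Because each sample is independent, this "embarrassingly parallel" mapping scales close to linearly with core count until I/O becomes the bottleneck.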

In addition, ‘Arion 4 Omics’ incorporates performance-oriented hardware technologies to maximise the transfer of data between storage, memory and processor.

Pillar 3: Pipeline.

Mass spectrometry data analysis entails a long pipeline in which the conditions under which the data is first acquired can have a significant impact on biomarker delivery. Removing ‘platform’-specific sources of variability, such as systematic biases, is a critical step in the processing chain.

In addition, a high proportion of missing values can pose a major challenge to the statistical analysis of mass spectrometry-based data. Therefore, the careful selection of normalization and imputation methods to deal with data bias and missing values is a critical step before the data can be subjected to further analysis [7].
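The specific normalization and imputation methods used by the pipeline are not detailed here; as a minimal sketch of the kind of steps involved, the example below applies two methods common in label-free proteomics, median normalization and half-minimum imputation (both chosen for illustration, not taken from the abstract), to a hypothetical sample-by-protein intensity matrix with NaN marking missing values:

```python
import numpy as np

def median_normalize(X):
    """Scale each sample (row) so its median intensity matches the
    global median, reducing sample-to-sample systematic bias.
    NaNs (missing values) are ignored when computing medians."""
    row_med = np.nanmedian(X, axis=1, keepdims=True)
    return X * (np.nanmedian(X) / row_med)

def half_min_impute(X):
    """Replace each missing value with half the minimum observed
    intensity of that protein (column) -- a common stand-in for
    'present but below the detection limit'."""
    X = X.copy()
    col_min = np.nanmin(X, axis=0)          # per-protein minimum
    rows, cols = np.where(np.isnan(X))      # locations of missing values
    X[rows, cols] = col_min[cols] / 2.0
    return X

# Hypothetical 3-sample x 4-protein intensity matrix, one value missing.
X = np.array([[100., 200., 300., np.nan],
              [110., 190., 310., 400.],
              [ 90., 210., 290., 420.]])
clean = half_min_impute(median_normalize(X))
```

After these two steps the matrix is complete and bias-reduced, i.e. ‘analysis ready’ for downstream statistics.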

The cleaning, slicing and dicing of mass spectrometry data to extract biomarkers is typically performed by computer scientists skilled in bioinformatics. We have taken these challenging steps and created a computer-assisted transformation pipeline focused on the fast, error-free data manipulations that are critically important [2].

A modular approach in the underlying core design provides a user-defined, flexible approach to the exploration of complex datasets. Performance may be increased through the addition of hardware, and functionality through the addition of machine learning and statistics modules.

Pillar 4: Costs.

Cost optimisation is achieved through the integration of performance-based hardware and software technologies together with domain expertise (pillars 1-3).

Whilst highly skilled staff are an integral and essential part of any clinical study, ‘Arion 4 Omics’ is a cost-effective solution that will complement existing skills and support research teams by providing functionality to overcome many of the challenges typically encountered during a research project.

Results.

Initial testing using thousands of proteomics samples has consistently shown that processed datasets and biomarkers can be delivered in less than a day.

For benchmarking, 50,000 samples of artificial protein data (ProteinLynx Global Server (PLGS) format) [8] were generated from an entire set of 152,493 human proteins available in the UniProt database [9] as of April, 2016.
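The exact generator used for the benchmark is not described in the abstract; as a hypothetical sketch of how such artificial samples might be produced, the example below draws a synthetic per-sample quantification table from a protein accession list, with log-normal intensities and a random fraction of proteins dropped to mimic missing identifications (the accession list, intensity distribution and drop rate are all illustrative assumptions, and the real PLGS export format is not reproduced):

```python
import random

def make_synthetic_sample(proteins, seed, drop_rate=0.1):
    """Generate one synthetic sample as (accession, intensity) pairs,
    loosely mimicking a quantified protein export. Intensities are
    log-normal; ~drop_rate of proteins are omitted as 'not identified'."""
    rng = random.Random(seed)  # seeded for reproducible benchmark data
    return [(acc, rng.lognormvariate(10, 1))
            for acc in proteins
            if rng.random() > drop_rate]

# Hypothetical accession list standing in for the full UniProt human set.
proteins = ["P12345", "Q67890", "O11111", "P99999"]
samples = [make_synthetic_sample(proteins, seed=i) for i in range(5)]
```

Scaling this to 50,000 samples over 152,493 proteins is then just a loop (or a parallel map) over seeds, which makes the benchmark dataset cheap to regenerate deterministically.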

The resulting data was then imported, and performance measures were collected for the pipeline modules, e.g. data merging, normalization and imputation. Execution times were observed for both intensive and less intensive database and processing operations, with results ranging from a couple of minutes to several hours (< 24 hrs in total).

It should be noted that all benchmarks were performed on a prototype system running an entry-level configuration. The production model will see lower benchmark times, as its hardware configuration will be of a higher specification.

Whilst the test results were generated using proteomics data, the Arion pipeline is fully adaptable, accepting other types of omics data through the introduction of instrument-specific plugins.

Conclusions.

In order for translational medicine to become a viable and practical reality, researchers and analysts require the seamless integration of advanced technologies to support the analysis of increasingly large datasets with near instantaneous, quantitative results.

We present an intuitive and future-proof platform that effectively manages the challenging ‘data science’ aspect of the analysis pipeline.

‘Arion 4 Omics’ incorporates carefully selected technologies and offers supercomputing power in a sensibly priced, integrated solution. It is a state-of-the-art pre-processing and analysis pipeline designed for managing both small and large-scale, structurally complex, omics-based datasets.

Using 50,000 samples, our benchmark results show that initial biomarker discovery can be achieved in less than a day. This progressive solution will aid the development of personalised medicine in Europe and encourage the data-driven clinical decision making that, until now, has largely been held back.

Arion 4 Omics provides:

* Flexible choice of pre-defined processing steps.

* An integrated system bridging the gap between individual samples and large sample sets ready for mining.

* Simple and intuitive process to update and re-process existing sample sets with further samples.

* Database-driven integration of clinical and omics data.

* A reliability- and performance-driven platform.

* Full traceability of processes and actions performed – system auditing.

* Automated data checks.

* Reproducibility of results.


References & Acknowledgements:

[1] Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. (2015) Big Data: Astronomical or Genomical?. PLoS Biol 13(7):e1002195. doi:10.1371/journal.pbio.1002195

[2] Alyass A, Turcotte M, Meyre D. (2015) From big data analysis to personalized medicine for all: challenges and opportunities. BMC Medical Genomics 8(33) doi:10.1186/s12920-015-0108-y

[3] Mardis ER (2010) The $1,000 genome, the $100,000 analysis?. Genome Medicine, 2(84), http://genomemedicine.com/content/2/11/84

[4] Silva JC, Gorenstein MV, Li G-Z, Vissers JPC, Geromanos SJ. (2006) Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol Cell Proteomics 5(1):144-56

[5] Lesur A, Gallien S, Domon B. (2016) Hyphenation of fast liquid chromatography with high-resolution mass spectrometry for quantitative proteomics analyses. Trends in Analytical Chemistry, In Press

[6] Wheelock CE, Goss VM, Balgoma D, Nicholas B, Brandsma J, Skipp PJ, et al. (2013) Application of ‘omics technologies to biomarker discovery in inflammatory lung diseases. Eur Respir J, 42(3):802-25

[7] Karpievitch YV, Dabney AR, Smith RD. (2012) Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinformatics 13(Suppl 16):S5 http://www.biomedcentral.com/1471-2105/13/S16/S5

[8] http://www.waters.com/waters/en_GB/ProteinLynx-Global-SERVER-%28PLGS%29/nav.htm?cid=513821&locale=en_GB

[9] http://www.uniprot.org/uniprot/?query=human&fil=organism%3A%22Homo+sapiens+%28Human%29+[9606]%22&sort=score


Financial Disclosure

Description     Y/N     Source
Grants          no
Salary          no
Board Member    no
Stock           no
Expenses        no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above: no