MSACL 2018 US Abstract

Topic: Data Science

Data Mining at Scale – Turning the Data Gaze into the Data Result

Adam Zabell (Presenter)
Indigo BioAutomation

Bio: Adam Zabell graduated with his PhD in Medicinal Chemistry from Purdue University in the Fall of 2000 and worked as the Principal Scientist of a pharmaceutical software support team prior to joining Indigo BioAutomation in 2011. His work has enabled Indigo to expand into pharmaceutical and genetic screening client groups, and his R&D focus is on pattern recognition and simplifying user interface design.

Authorship: Adam P.R. Zabell (1), Randall K. Julian (1)
(1) Indigo BioAutomation, Indianapolis, IN

Short Abstract

When data retention rates scale from thousands of observations into billions, the opportunity to discover an outlier event goes up even as the ability to identify it goes down. The common methods of outlier detection in LC-MS/MS are designed to provide answers to an established question on a limited timescale. Expanding that timescale to use all of the available data is as much about the answer as it is the discovery of what question to ask. It is in this middle ground where data gazing, focused on human-centered pattern recognition, is essential prior to question codification and answer automation. We present several recent examples showing different patterns in the clinical setting, and work backwards to show the necessarily human process of ‘query and revise.’ In this intermediate step, it is the capacity for data curation which causes the biggest gap in understanding the system.

Long Abstract


A single batch of large-panel LC-MS/MS data from a typical clinical laboratory contains 80 samples and monitors 70 compounds, producing roughly ten thousand chromatograms which must be analyzed and compared before delivering customer data. Although reportable subject results are a fraction of that number, all of the information must be collected and analyzed prior to report generation. To establish trust in the reliability of this report, several quality control steps are invoked and turned into a trend analysis like the Levey-Jennings plot, or an outlier detection method like the Westgard Rules. These data mining tools are built to find an outlier, or a trend, or a scale for an event. However, per-batch and per-month analytic tools remain confined to the scale of the ad hoc analysis. To expand the scope of questions in the hope of becoming predictive instead of simply prescriptive, the analyses being performed need to provide a deeper focus along a broader timescale.
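As an illustration of the per-batch outlier checks mentioned above, the following is a minimal sketch of two of the Westgard Rules (1_3s and 2_2s) applied to a series of control measurements. The function name, the input layout, and the choice of these two rules are assumptions made for illustration; the abstract does not describe a specific implementation, and the control mean and standard deviation are assumed to be established beforehand.

```python
def westgard_flags(values, mean, sd):
    """Flag 1_3s and 2_2s violations in a series of control measurements.

    values: control results in run order (hypothetical input layout)
    mean, sd: established control mean and standard deviation
    Returns a list of (index, rule_name) tuples.
    """
    z = [(v - mean) / sd for v in values]
    flags = []
    for i, zi in enumerate(z):
        # 1_3s: a single observation beyond 3 SD from the mean
        if abs(zi) > 3:
            flags.append((i, "1_3s"))
        # 2_2s: two consecutive observations beyond 2 SD on the same side
        if i > 0 and ((zi > 2 and z[i - 1] > 2) or (zi < -2 and z[i - 1] < -2)):
            flags.append((i, "2_2s"))
    return flags


if __name__ == "__main__":
    controls = [100, 102, 98, 105, 105.5, 100, 93]
    print(westgard_flags(controls, mean=100, sd=2))
```

Checks of this kind answer an established question within one control series, which is the limitation the paragraph above describes.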


With over three billion chromatograms, we followed a “data gazing” protocol to expressly take advantage of human pattern recognition on the various observations for individual peaks (e.g. peak area) and the common relationships between peaks (e.g. ion ratio). Three basic patterns – the outlier spike, the trendline shift, and the scaled range – were broadly classified as inflection points. Rather than judge any one pattern as correct, we worked from the assumption that a change in the observation is sufficient to identify a changing system and to justify additional investigation using non-instrumental data sources. Because standard statistical models often break down with large quantities of data, we worked with robust statistics using medians, trimmed means, and quartile ranges.
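The robust statistics named above can be sketched as follows. This is an illustrative example only: the function names, the 10% trimming proportion, and the use of Tukey-style fences (1.5 times the interquartile range) to flag an outlier spike are assumptions, not the method used in this work.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def robust_summary(values):
    """Median, trimmed mean, and quartile range for one observation series."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "trimmed_mean": trimmed_mean(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def spike_indices(values, k=1.5):
    """Indices of values outside the quartile fences (assumed spike criterion)."""
    s = robust_summary(values)
    low = s["q1"] - k * s["iqr"]
    high = s["q3"] + k * s["iqr"]
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    peak_areas = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]  # made-up values
    print(robust_summary(peak_areas))
    print(spike_indices(peak_areas))
```

Unlike a mean and standard deviation, the median and quartiles are barely moved by the single extreme value, so the limits used to flag it are not inflated by the spike itself.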


Each inflection point was correlated to one of two scenarios: an independently erroneous sample or a system maintenance event. The former cases tend to be well characterized and identified by current per-sample and per-batch quality control checks. The latter cases were not always identified in maintenance logs, and relied on otherwise anecdotal evidence from the laboratory floor. A system maintenance event did not necessarily correlate with an increased failure rate for per-sample quality control checks, was not typically visible when a batch was taken as the largest data size, and was not managed by standard Levey-Jennings control plots.

Conclusions & Discussion

Knowing that each data field has a particular use associated with reporting a clinical result, it was expected that each would reveal information about a different portion of overall system stability. This analysis is often called “data mining” when working with billions of records, but in practice this kind of analysis is more akin to “data gazing” and starts from the assumption that an inflection point is present. The biggest challenge is not the identification process but the curation of data to track between different scales of information – per compound, sample, batch, or instrument – and different sources of information – instrument, log book, or LIMS. Automatic curation requires access to both the scales and the sources. Based on this work we recommend each instrument retain a running mean of three basic parameters: retention time, internal standard peak area, and regression slope. The ion ratio between the quantifier and qualifier peak of a compound was too variable to be of particular merit. Once the instrument has “gone live” these results should be kept for the lifetime of the method, and retained afterwards as a means to measure the improvement in any future method.
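One way to retain the recommended running means is sketched below. The class and field names are hypothetical, and the sketch assumes a cumulative mean updated once per batch; a fixed-window moving mean would be an equally valid reading of “running mean.”

```python
from dataclasses import dataclass, field


@dataclass
class RunningMean:
    """Cumulative mean updated incrementally, without storing past values."""
    count: int = 0
    mean: float = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


@dataclass
class InstrumentTracker:
    """Per-instrument running means of the three recommended parameters."""
    retention_time: RunningMean = field(default_factory=RunningMean)
    internal_standard_area: RunningMean = field(default_factory=RunningMean)
    regression_slope: RunningMean = field(default_factory=RunningMean)

    def record_batch(self, retention_time, internal_standard_area, regression_slope):
        self.retention_time.update(retention_time)
        self.internal_standard_area.update(internal_standard_area)
        self.regression_slope.update(regression_slope)


if __name__ == "__main__":
    tracker = InstrumentTracker()
    tracker.record_batch(2.0, 1000.0, 1.0)  # made-up batch values
    tracker.record_batch(4.0, 3000.0, 3.0)
    print(tracker.retention_time.mean, tracker.regression_slope.mean)
```

The incremental update keeps only a count and a mean per parameter, so the record can be carried for the lifetime of a method without storing every batch.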

References & Acknowledgements:

Financial Disclosure

Salary: yes, Indigo BioAutomation
Board Member: no
Stock: yes, Indigo BioAutomation

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above: