MSACL 2018 US Abstract

Topic: Informatics & Analytics

The Importance of Reproducible Research in High-Throughput Biology: Case Studies in Forensic Bioinformatics

Keith Baggerly (Presenter)
UT MD Anderson Cancer Center

Bio: Keith Baggerly is the Ransom Horne, Jr., Professor of Cancer Research in the Department of Bioinformatics and Computational Biology at the UT MD Anderson Cancer Center, where he has worked since 2000. He has worked extensively with data from a wide variety of high-throughput biomedical assays. His work has been featured in Science, Nature, on the front page of the New York Times, and on the US TV news program 60 Minutes, and prompted an Institute of Medicine (IOM) review of the evidence that should be required before omics-based assays are used to guide therapy. He is a Fellow of the American Statistical Association.

Authorship: Keith Baggerly
UT MD Anderson Cancer Center

Short Abstract

We present case studies in reverse engineering high-profile results from high-throughput biology (forensic bioinformatics), illustrating how simple errors in experimental design and data analysis may have put patients at risk. We discuss these in the context of ongoing efforts by the NIH to enhance research reproducibility.

Long Abstract


Modern high-throughput biological assays let us ask detailed questions about how diseases operate, and promise to let us personalize therapy. Our intuition about what the answers “should” look like in high dimensions is very poor, so careful data processing is essential. When documentation of such processing is absent or incomplete, we must apply “forensic bioinformatics” to work backwards from the raw data and the reported results to infer what the methods must have been.


We consider examples from both mass spectrometry and gene expression studies in which important claims of clinical utility were made. We tracked down the underlying raw data and annotation files from both primary and secondary sources and applied a series of sanity checks amounting to positive and negative controls: (1) Do simple approximate analyses give similar results (e.g., do contrasts using two-sample t-tests flag the same genes as important, in the same direction)? (2) Do plots of related data over time (a) show the effects claimed and (b) not show larger effects associated with undiscussed and biologically uninteresting factors? (3) Do counts of responders and nonresponders match those in the source literature?
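The first sanity check above can be sketched in a few lines. This is an illustrative example only, using simulated data (the matrix, group sizes, and threshold are all assumptions, not the original studies' values): run a two-sample t-test per gene and flag the genes a published list ought to broadly reproduce.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical expression matrix: 100 genes x 10 samples (5 per group).
expr = rng.normal(size=(100, 10))
expr[0, 5:] += 5.0  # spike gene 0 as a known positive control

group_a, group_b = expr[:, :5], expr[:, 5:]
# One two-sample t-test per gene (row).
t, p = stats.ttest_ind(group_a, group_b, axis=1)

# Genes flagged by the simple approximate analysis; a reported list
# should overlap these, with effects in the same direction (sign of t).
flagged = np.where(p < 0.01)[0]
print(flagged)
```

The point is not that the simple analysis is the right one, but that a sound, more elaborate analysis should not contradict it wholesale.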


Genes identified with t-tests show no overlap with the lists initially reported, but almost perfect overlap with lists obtained by looking "a few rows down" in the complete initial list, suggesting indexing errors. Plots of data as a function of assay run date show far stronger effects than those associated with the nominal biological contrast, though the two factors are initially collinear; follow-up over time shows the time (batch) effect to be the primary driver. Counts of responders and nonresponders are reversed relative to the initial sources, suggesting label reversal.
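The "few rows down" diagnosis can be made mechanical: slide the reported list along the full ranked list and see where the overlap peaks. The names and offsets below are hypothetical illustrations, not the original study's data.

```python
# Full ranked gene list and a reported top-50 list that was
# accidentally shifted by two rows (simulated indexing error).
full_ranked = [f"gene{i}" for i in range(1000)]
reported = full_ranked[2:52]

def overlap_at_offset(reported, full, k, n=50):
    """Fraction of the reported list matched by rows k..k+n of the full list."""
    return len(set(full[k:k + n]) & set(reported)) / n

for k in range(5):
    print(k, overlap_at_offset(reported, full_ranked, k))
```

An overlap that peaks at a nonzero offset points to an indexing error rather than a genuinely different analysis.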

Conclusions & Discussion

The most common errors we uncover are simple ones, often involving mislabeling of rows, columns, or variables. These errors are easy to make, but when documentation is adequate, they may also be easy to fix. Incomplete documentation is, however, pervasive in much of the scientific literature, and fixing the mistakes discussed here took years. Fortunately, new tools (many from the open-source community) introduced in the past few years make documentation much easier.

References & Acknowledgements:

Baggerly et al., Bioinformatics, 2004

Baggerly and Coombes, Ann. Appl. Statist., 2009

IOM, 2012

NIH, 2016

Financial Disclosure

Board Member: no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above: