MSACL 2018 US Abstract

Topic: Data Science

Data Mining Routine Results for Reference Intervals: Common Errors and Modern Techniques

Daniel Holmes (Presenter)
St. Paul's Hospital, Univ. of British Columbia

Bio: Daniel Holmes completed his undergraduate degree in Chemical Physics at the University of Toronto. He went to medical school at the University of British Columbia (UBC), where he also did his residency in Medical Biochemistry. He is a Clinical Professor of Pathology and Laboratory Medicine at UBC and Division Head of Clinical Chemistry at St. Paul's Hospital in Vancouver. His interests include laboratory medicine statistics, clinical endocrinology with a focus on secondary hypertension, clinical lipidology, and clinical mass spectrometry. His assay development efforts in the last five years have focused on novel uses of mass spectrometry for specialized endocrine testing.

Authorship: Holmes, Daniel Thomas
(1) St. Paul's Hospital, Vancouver, BC, Canada (2) University of British Columbia, Vancouver, BC, Canada

Short Abstract

Commonly employed graphical techniques (Hoffmann and Bhattacharya) for parameter estimation of Gaussian mixture models will be demonstrated and discussed as applied to “mining” reference intervals from the results of routine analyses. A frequent error in the implementation of the Hoffmann method, and its ramifications, will be demonstrated. The modern approach to this parameter estimation problem is maximum likelihood (ML) via the expectation maximization algorithm. This approach is implemented in a number of packages for the R programming language and will also be demonstrated. A benefit of the ML approach is that it is not necessarily constrained by the assumption that the underlying mixture is Gaussian. This allows skewed distributions to be fitted without normalizing transformations, affording better results.

Long Abstract


There is a large body of literature on the topic of “data mining” reference intervals from routine patient results. While some approach this problem by querying the electronic health record and excluding results from patients who have medical conditions potentially affecting the analyte of interest, most approaches are agnostic to diagnostic information and entirely data-oriented. For the most part, the methods employed in clinical chemistry are graphical approaches to Gaussian mixture model decomposition. The simplest and most widely adopted in the field was suggested by Hoffmann [1]. Another commonly employed method, developed by Bhattacharya [2], is more difficult to apply but affords better parameter estimates. However, both methods have significant weaknesses that can easily elude the user. Additionally, many papers inadvertently implement the Hoffmann method incorrectly, leading to entirely avoidable errors in reference interval estimation. The modern approach to mixture model decomposition [3,4,5], maximum likelihood via the expectation maximization algorithm, has not been discussed in the clinical mass spectrometry literature. This approach produces better parameter estimates for normal mixtures and is not constrained by assumptions of normality: skewed distributions can be fitted directly, without normalizing transformations.
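For orientation, the mixture model underlying these methods can be written down briefly. The notation below (weights \pi_k, means \mu_k, standard deviations \sigma_k, histogram bin width h) is introduced here for illustration and does not come from the abstract. The second relation is the standard property of a Gaussian density that makes a log-difference plot of adjacent histogram bins approximately linear wherever a single component dominates, which is the idea behind Bhattacharya-style graphical resolution; it is a sketch of the principle, not a full statement of the published procedure.

```latex
% Finite Gaussian mixture with K components; the weights sum to one
f(x) = \sum_{k=1}^{K} \pi_k \, \phi(x;\mu_k,\sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1

% In a region dominated by one component (\mu, \sigma), for bin width h:
\ln f(x+h) - \ln f(x) \approx \frac{h}{\sigma^{2}}\left(\mu - \frac{h}{2} - x\right)
```

Under this relation, a straight-line segment with slope -h/\sigma^2 that crosses zero at x = \mu - h/2 yields estimates of \sigma and \mu for that component.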


The Hoffmann and Bhattacharya methods were coded in the R statistical programming language and applied to the original authors’ data sets to confirm accuracy. Three R implementations of maximum likelihood estimation via the expectation maximization algorithm were employed: mixtools, mclust, and mixdist. All approaches were applied to randomly generated data sets with varying degrees of overlap between the healthy and diseased modes and with varying dispersion, and the performance of each method was assessed. Apparently Gaussian and skewed distributions of real routine chemistry and mass-spectrometric results were also fitted. Goodness-of-fit was explored with QQ-plots.
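The simulation-and-fit workflow described above can be illustrated with a minimal sketch. It is written in standard-library Python rather than R and does not reproduce mixtools, mclust, or mixdist; the function name `em_two_gaussians`, the starting values, and the simulated parameters (80% healthy at mean 5, SD 1; 20% diseased at mean 10, SD 2) are all assumptions chosen for illustration. Treating the lower-mean component as the healthy population is likewise an assumption.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=500, tol=1e-8):
    """Fit a two-component Gaussian mixture by expectation maximization.

    Returns two (weight, mean, sd) tuples ordered by increasing mean.
    """
    xs = sorted(data)
    n = len(xs)
    # Crude starting values (an assumption): quartiles as means, overall SD.
    mu = [xs[n // 4], xs[(3 * n) // 4]]
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    sigma = [sd_all, sd_all]
    weight = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each mode.
        resp = []
        ll = 0.0
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)  # guard against underflow
            ll += math.log(total)
            resp.append((p[0] / total, p[1] / total))
        # M-step: responsibility-weighted updates of weight, mean, and SD.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(max(var, 1e-12))
        # Stop once the log-likelihood no longer improves appreciably.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    order = sorted(range(2), key=lambda k: mu[k])
    return [(weight[k], mu[k], sigma[k]) for k in order]


# Simulated data: illustrative parameters only, not from the abstract.
rng = random.Random(2018)
sample = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.8 else rng.gauss(10.0, 2.0)
    for _ in range(2000)
]

healthy, diseased = em_two_gaussians(sample)
w_h, mu_h, sd_h = healthy
# Central 95% interval of the component assumed to be healthy.
lower, upper = mu_h - 1.96 * sd_h, mu_h + 1.96 * sd_h
print(f"healthy weight={w_h:.3f} mean={mu_h:.3f} sd={sd_h:.3f}")
print(f"estimated reference interval: {lower:.2f} to {upper:.2f}")
```

Because all modes are fitted simultaneously, the diseased component's parameters are available from the same call. Any such fit would still need the goodness-of-fit and clinical review that the abstract calls for.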


The Hoffmann method performs reasonably when implemented as originally proposed but will predictably overestimate the upper limit of normal (assuming healthy patients have lower analyte results) when the proportion of diseased individuals exceeds ~30%. However, if the Hoffmann method is inadvertently employed without a normal QQ-plot, the parameter estimates are unreliable, and the resulting errors are mathematically predictable. ML approaches consistently outperform both the Hoffmann and Bhattacharya methods and allow for simultaneous fitting of all modes of the distribution. For the decomposition of distributions showing skewed modes, the mixdist package produces better results, as assessed by goodness-of-fit testing, by fitting gamma, lognormal, and other distributions.
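The role of the normal QQ-plot can be made concrete with a short sketch in standard-library Python. It regresses the sorted results on standard normal quantiles over an assumed linear portion of the plot, then reads the mean from the intercept and the SD from the slope. The function name `hoffmann_estimate`, the fixed fitting window (`p_low`, `p_high`), and the simulated data are assumptions for illustration; in the graphical method the linear portion is chosen by inspection, and this is not the presenter's R code.

```python
import random
from statistics import NormalDist


def hoffmann_estimate(data, p_low=0.05, p_high=0.50):
    """Estimate mean and SD of the healthy mode from a normal QQ-plot.

    Fits a least-squares line to (normal quantile, sorted result) pairs whose
    cumulative proportion lies in [p_low, p_high]. The window is an assumed
    stand-in for the linear portion normally selected by eye.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # plotting position for the i-th ordered result
        if p_low <= p <= p_high:
            points.append((std_normal.inv_cdf(p), x))
    m = len(points)
    z_bar = sum(z for z, _ in points) / m
    x_bar = sum(x for _, x in points) / m
    szz = sum((z - z_bar) ** 2 for z, _ in points)
    szx = sum((z - z_bar) * (x - x_bar) for z, x in points)
    slope = szx / szz                  # estimate of the SD
    intercept = x_bar - slope * z_bar  # estimate of the mean
    return intercept, slope


# Simulated data: 90% healthy (mean 5, SD 1), 10% diseased (mean 10, SD 1.5).
# These parameters are illustrative only.
rng = random.Random(7)
sample = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.9 else rng.gauss(10.0, 1.5)
    for _ in range(2000)
]

mean_est, sd_est = hoffmann_estimate(sample)
lower, upper = mean_est - 1.96 * sd_est, mean_est + 1.96 * sd_est
print(f"mean={mean_est:.2f} sd={sd_est:.2f}")
print(f"estimated reference interval: {lower:.2f} to {upper:.2f}")
```

The key step is `inv_cdf`, which places the cumulative proportions on a normal quantile scale before the line is fitted. Because the cumulative proportions are computed over the whole sample, including the diseased fraction, the fitted line can drift away from the true healthy parameters as that fraction grows.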

Conclusions & Discussion

The Hoffmann method as originally proposed and the Bhattacharya method can produce good parameter estimates, particularly when the proportion of healthy individuals is high and the presence of healthy and diseased populations is obvious in the histogram. ML approaches outperform these graphical methods but must still be subjected to clinical “sanity checking” and goodness-of-fit evaluation. All methods in their most commonly employed forms are hampered by assumptions of normality. While normalizing transformations have been proposed to mitigate this problem, ML approaches that fit skewed distributions directly can avoid it entirely. All approaches, no matter how sophisticated, require careful clinical review.

References & Acknowledgements:

[1] Hoffmann RG. Statistics in the practice of medicine. JAMA. 1963 Sep 14;185(11):864-73.

[2] Bhattacharya CG. A simple method of resolution of a distribution into Gaussian components. Biometrics. 1967 Mar;23(1):115-35.

[3] Benaglia T, Chauveau D, Hunter D, Young D. mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software. 2009 Oct;32(6):1-29.

[4] Fraley C, Raftery AE, Scrucca L. Normal mixture modeling for model-based clustering, classification, and density estimation. Department of Statistics, University of Washington. 2012;23:2012.


Financial Disclosure

Board Member: no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above: