Large-scale mass spectrometry-based lipidomics studies are steadily becoming the norm, and as a result, running samples in multiple batches is unavoidable. Integration across numerous analytical runs is critical to enhance the power of the study and to confirm lipid identification and differences among test groups. By leveraging variables specific to untargeted lipidomics, such as retention times, gradient information, and internal standards, it is possible to build a data-dependent model with a wide range of applicability. One such variable, retention time, is a key factor for accurate feature matching across batches. It is batch specific and influenced by multiple factors, including chromatography column stability, age, and performance. Alignment of retention time is indispensable in batch correction, as it is a vital variable in feature identification and matching. Currently, most retention time correction tools either address only intra-batch drift or apply a default method with pre-defined parameters, which can be inaccurate when retention time shifts across batches exceed one minute. After feature alignment, correction of technical variation can be done. Herein, we describe a bioinformatics approach to determine retention time shift among batches and to improve feature alignment accuracy, data imputation, and batch effect correction.
To evaluate the model, we utilized clinical samples from a large cohort study of environmental enteropathy (EE) and malnutrition in Pakistani children. Untargeted ultra-performance liquid chromatography-mass spectrometry was performed on 421 serum samples assayed across 11 analytical batches. Each batch was individually pre-processed in XCMS for initial peak extraction and intra-batch retention time correction. XCMS parameters were adjusted to maximize the number of features detected.
One of the major challenges in multi-batch data is the identification of common features across batches. To align features from different batches, retention time shifts must first be corrected. We therefore propose performing retention time alignment using an internal standard cocktail. With multiple internal standards, the retention time shift can be mapped across the runs, and the retention times of the features can be adjusted based on this trend. After the retention time shift is addressed, alignment is performed by matching retention times and calculating mass accuracy, in our case within a 10 ppm error. After feature alignment, in addition to the features found across all batches, there will be a significant number of features that are missing from 1 or 2 of the 11 batches. Inclusion of these features is necessary, but imputation needs to be done before batch effect correction to avoid skewing the algorithm. Imputation would be modeled using the variation in internal standard distributions and the feature's distribution in the other batches.
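The retention time mapping and the 10 ppm matching described above can be illustrated with a minimal Python sketch. It is not the tool itself: the piecewise-linear interpolation between internal standards, the 0.2 min retention time tolerance, the function names, and all numeric values are assumptions chosen for illustration only.

```python
from bisect import bisect_right


def rt_shift_model(ref_rts, batch_rts):
    """Build a function that maps a batch retention time onto the reference scale.

    ref_rts and batch_rts hold the retention times (minutes) of the same
    internal standards in the reference batch and in the batch to correct.
    Assumption: the shift is interpolated linearly between standards.
    """
    pairs = sorted(zip(batch_rts, ref_rts))
    xs = [b for b, _ in pairs]
    shifts = [r - b for b, r in pairs]

    def correct(rt):
        # Outside the range covered by the standards, hold the nearest shift.
        if rt <= xs[0]:
            return rt + shifts[0]
        if rt >= xs[-1]:
            return rt + shifts[-1]
        i = bisect_right(xs, rt)
        frac = (rt - xs[i - 1]) / (xs[i] - xs[i - 1])
        return rt + shifts[i - 1] + frac * (shifts[i] - shifts[i - 1])

    return correct


def ppm_error(mz, mz_ref):
    """Mass error of mz relative to mz_ref, in parts per million."""
    return abs(mz - mz_ref) / mz_ref * 1e6


def match_features(ref_features, batch_features, correct, rt_tol=0.2, ppm_tol=10.0):
    """Match (mz, rt) features of one batch to reference features.

    Returns (reference_index, batch_index) pairs. rt_tol (minutes) is a
    hypothetical tolerance; ppm_tol follows the 10 ppm window in the text.
    """
    matches = []
    for j, (mz_b, rt_b) in enumerate(batch_features):
        rt_corr = correct(rt_b)
        best = None
        for i, (mz_r, rt_r) in enumerate(ref_features):
            rt_diff = abs(rt_corr - rt_r)
            if ppm_error(mz_b, mz_r) <= ppm_tol and rt_diff <= rt_tol:
                if best is None or rt_diff < best[1]:
                    best = (i, rt_diff)
        if best is not None:
            matches.append((best[0], j))
    return matches


# Illustrative values only: three internal standards eluting later in the batch.
correct = rt_shift_model([1.0, 5.0, 10.0], [1.5, 6.0, 11.5])
reference = [(760.5851, 5.0), (786.6007, 10.0)]
batch = [(760.5855, 6.05), (786.6100, 11.5), (500.0, 3.0)]
print(match_features(reference, batch, correct))
```

In this toy example the second batch feature has the right corrected retention time but falls outside the 10 ppm window, so only the first feature is matched. A smoother trend (for example a polynomial or LOESS fit through the standards) could replace the linear interpolation without changing the matching step.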
In this study of serum samples, the XCMS output of each individual batch contained on average ~25,000 ion features in negative mode, with roughly 60-75% removed during noise filtration. Feature alignment across all 11 batches resulted in over 350 common features. An additional ~250 features were missing in only 1 batch, and another ~250 features were missing in 2 batches. Data imputation for these features in the missing batches would add ~500 features that would otherwise be excluded. After the addition of these features, batch effect correction would be applied to remove the technical variation.
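The bookkeeping behind these counts, that is, tallying how many batches each aligned feature is missing from and retaining those missing from at most 2 batches as imputation candidates, could look like the following sketch. The data structure (a mapping from a feature identifier to the set of batches in which it was detected) and the function name are hypothetical.

```python
from collections import Counter


def coverage_summary(presence, n_batches, max_missing=2):
    """Summarize batch coverage of aligned features.

    presence: dict mapping a feature id to the set of batch indices in
    which the feature was detected (hypothetical structure).
    Returns a Counter of {number_of_missing_batches: feature_count} and the
    sorted ids of features missing from at most max_missing batches.
    """
    counts = Counter()
    keep = []
    for feature_id, batches in presence.items():
        missing = n_batches - len(batches)
        counts[missing] += 1
        if missing <= max_missing:
            keep.append(feature_id)
    return counts, sorted(keep)


# Toy example with 4 batches instead of 11.
presence = {
    "F1": {0, 1, 2, 3},  # found in every batch
    "F2": {0, 1, 3},     # missing from 1 batch
    "F3": {0, 1},        # missing from 2 batches
    "F4": {2},           # missing from 3 batches, excluded
}
counts, keep = coverage_summary(presence, n_batches=4)
print(dict(counts), keep)
```

Features in the retained list that are missing from 1 or 2 batches are the ones that would be passed on to imputation before batch effect correction.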
Our preliminary data show the usability of the tool for feature alignment in large-scale untargeted metabolomics data analysis, with a specific example of a large clinical study of malnutrition. Future directions for this tool include additional validation studies to assess accuracy, including feature identification and statistical analysis.