MSACL 2016 US Abstract

Automated Tumor Typing of Tissue Sections Based on MALDI Mass Spectrometry Imaging Data and Machine Learning Using Characteristic Spectral Patterns

Tobias Boskamp (Presenter)
University of Bremen

Authorship: Tobias Boskamp (1,2), Delf Lachmund (1), Janina Oetjen (1), Rita Casadonte (3), Jan Hendrik Kobarg (2), Jörg Kriegsmann (4), Peter Maass (1,2)
(1) University of Bremen, Bremen, Germany (2) SCiLS GmbH, Bremen, Germany (3) Proteopath GmbH, Trier, Germany (4) Center for Histology, Cytology and Molecular Diagnostic, Trier, Germany

Short Abstract

We present an automated classification method for MALDI mass spectrometry imaging data with applications to tumor typing of FFPE tissue sections. The proposed method consists of a) data pre-processing, b) identification of characteristic spectral patterns using non-negative matrix factorization (NMF), and c) classification by linear discriminant analysis (LDA). We apply this method to the discrimination of breast, lung, colon and pancreatic cancer. MALDI data were acquired from eight tissue micro arrays (TMAs), two for each tumor type, with a total of 943 cores from 285 patients. Four TMAs were used for training and the remaining four for validation. A core-level sensitivity of 100.0% (lung), 99.5% (pancreas), 100.0% (colon), and 100.0% (breast) was achieved. The different pre-processing variants (normalization, filtering) had only limited effects.

Long Abstract

Introduction

Matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging is receiving increasing attention in clinical research because it can analyze the proteomic fingerprint of formalin-fixed paraffin-embedded (FFPE) tissue samples. A promising application is the automatic discrimination between different tumor types based on MALDI imaging data, as a diagnostic support tool in clinical histopathology.

Typical automated classification methods include pre-processing the data (normalization, baseline removal, denoising), identifying individual ion masses (m/z values) relevant to the classification task at hand, and using the measured spectral intensities at those m/z values to train a classification model. We propose to replace the selection of individual m/z values by the identification of characteristic spectral patterns (CSPs) that represent a set of ‘building blocks’ from which the original data can be approximated. This approach reduces the dimensionality of the feature data used for the classification model, and thus helps to avoid overfitting and to improve the classifier’s robustness and accuracy.

Methods

We investigated eight tissue micro arrays (TMAs) of needle core biopsies from lung, pancreatic, colon and breast tumors (two TMAs for each type, 943 cores from 285 patients in total). Tissue sections of 5 µm thickness were digested in situ with trypsin (0.1 µg/µl) and sprayed with α-cyano-4-hydroxycinnamic acid matrix solution (7 mg/ml in 50/50 acetonitrile / 0.5% TFA) using an ImagePrep sprayer (Bruker Daltonik). Spectral data were acquired using an Autoflex Speed TOF-TOF system (Bruker Daltonik) in reflector mode at a spatial resolution of 100 µm and sampled to 25,250 m/z bins in the mass range 519 to 4,478 Da.

The data sets were imported into MATLAB (MathWorks), pre-processed, and divided into a training set and a validation set of four TMAs each. Non-negative matrix factorization (NMF) was used to extract CSPs from the training set spectra. Lower-dimensional feature data were obtained by linear projection of the measured spectra onto each of the CSPs. The resulting set of feature vectors was used to train a linear discriminant analysis (LDA) classification model.
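The feature extraction step can be illustrated with a minimal sketch in plain Python. It is not the authors' MATLAB implementation: the abstract does not say which NMF algorithm was used, so the multiplicative update rule below is an assumption, and the function names (nmf, project) and the toy data are hypothetical. Rows of X stand for spectra, columns for m/z bins, and rows of H for the extracted patterns. The LDA training step is omitted.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    """Frobenius norm of the residual X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5

def nmf(X, k, n_iter=200, seed=0, eps=1e-12):
    """Approximate X (spectra x bins) by W (spectra x k) times H (k x bins),
    keeping all entries non-negative. Uses multiplicative updates (an
    assumption; the abstract does not name the algorithm)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

def project(spectrum, patterns):
    """Linear projection: one dot product per pattern gives one feature each."""
    return [sum(s * p for s, p in zip(spectrum, pattern)) for pattern in patterns]

# Toy "spectra" mixed from two hypothetical patterns over four m/z bins.
p1 = [1.0, 2.0, 0.0, 0.0]
p2 = [0.0, 0.0, 3.0, 1.0]
mix = [(1, 0), (0, 1), (2, 1), (1, 3), (0.5, 0.5)]
X = [[a * u + b * v for u, v in zip(p1, p2)] for a, b in mix]

W, H = nmf(X, k=2)                      # rows of H play the role of CSPs
features = [project(x, H) for x in X]   # 2 features per spectrum instead of 4 bins
```

In the setting described above, X would hold the training spectra (25,250 bins each), k would be 50, and the same H would be reused to project the validation spectra.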

Validation was performed by projecting all validation set spectra onto the CSPs extracted from the training data and applying the trained LDA model. For each TMA core in the validation set, the tumor type was predicted by selecting the class assignment occurring most often among all spectra within the respective core.
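The per-core majority vote can be sketched as follows. The function name and the example labels are hypothetical, and the tie-breaking behavior (the label seen first wins) is an assumption, since the abstract does not say how ties were resolved.

```python
from collections import Counter, defaultdict

def predict_core_labels(core_ids, spectrum_labels):
    """Assign each core the class predicted most often among its spectra.

    core_ids[i] is the core that spectrum i belongs to and
    spectrum_labels[i] is the class predicted for that spectrum.
    Ties go to the label encountered first (an assumption).
    """
    votes = defaultdict(Counter)
    for core, label in zip(core_ids, spectrum_labels):
        votes[core][label] += 1
    return {core: counts.most_common(1)[0][0] for core, counts in votes.items()}

# Hypothetical example: two cores with three and four spectra.
core_ids = ["A", "A", "A", "B", "B", "B", "B"]
labels = ["lung", "colon", "lung", "pancreas", "colon", "pancreas", "pancreas"]
core_prediction = predict_core_labels(core_ids, labels)
```

Because a core is labeled correctly as long as the correct class is the most frequent one among its spectra, core-level accuracy can exceed spectrum-level accuracy.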

In addition, we analyzed the effects of typical normalization and data filtering steps by performing the above training and validation process on six different versions of pre-processed data (two normalization methods combined with three filtering options). For normalization, either total ion count (TIC) or median normalization was used. Data filtering was done by applying either a three-element mean filter, a three-element maximum filter, or no filter at all. All computational experiments were repeated using the alternative data configuration, in which the roles of training set and validation set were exchanged.
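The six pre-processing variants can be sketched as below. Several details are assumptions not stated in the abstract: the filter is taken to run along the m/z axis, the window is truncated at the spectrum edges, TIC normalization divides by the sum of intensities, and filtering is applied before normalization. All function names are hypothetical.

```python
from itertools import product
from statistics import mean, median

def filter3(spectrum, method):
    """Three-element sliding filter along the m/z axis (assumed axis).
    method is "max", "mean", or None for no filtering.
    Windows are truncated at the edges (an assumption)."""
    if method is None:
        return list(spectrum)
    reducer = {"max": max, "mean": mean}[method]
    return [reducer(spectrum[max(0, i - 1):i + 2]) for i in range(len(spectrum))]

def normalize(spectrum, method):
    """Scale a spectrum by its total ion count ("tic") or its median ("median")."""
    scale = {"tic": sum, "median": median}[method](spectrum)
    if scale == 0:
        return list(spectrum)  # leave all-zero (or zero-median) spectra unscaled
    return [v / scale for v in spectrum]

def preprocess(spectrum, norm_method, filter_method):
    """Filter first, then normalize (the order is an assumption)."""
    return normalize(filter3(spectrum, filter_method), norm_method)

# Two normalizations times three filter options give the six variants.
variants = list(product(["tic", "median"], ["max", "mean", None]))
example = preprocess([1.0, 5.0, 2.0, 0.0], "tic", "max")
```

Each variant would then be passed through the same training and validation procedure so that the resulting sensitivities can be compared.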

Results

The best classification performance was achieved after applying three-element maximum filtering to the raw data, with either TIC or median normalization. Using a set of 50 CSPs, all 464 TMA cores of the validation set (50 lung, 167 pancreas, 166 colon, 81 breast tumor samples) were correctly classified (sensitivity = 100.0% for all tumor types). In the alternative configuration, where the validation set consisted of 479 TMA cores (97 lung, 207 pancreas, 94 colon, 81 breast tumor samples), only one pancreas tumor sample was misclassified as colon tumor (sensitivity = 99.5% for pancreas, 100.0% for lung, colon, and breast). At the spectrum level, sensitivity on the validation set with 15,629 spectra was 95.7% (lung), 92.5% (pancreas), 97.9% (colon), and 98.6% (breast); in the alternative configuration with 15,123 spectra, it was 98.5%, 95.9%, 96.9%, and 94.3% for lung, pancreas, colon, and breast, respectively.
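As a check on the arithmetic, the core-level sensitivities of the alternative configuration can be recomputed from the counts given above. The sketch below takes sensitivity to be the per-class fraction of correctly classified cores; the function name is hypothetical, and the label lists are rebuilt from the reported counts rather than taken from the actual data.

```python
from collections import Counter

def sensitivity_per_class(true_labels, predicted_labels):
    """Fraction of samples of each true class that were predicted correctly."""
    total, correct = Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}

# Core counts of the alternative configuration, as reported in the text.
counts = {"lung": 97, "pancreas": 207, "colon": 94, "breast": 81}
true_labels = [cls for cls, n in counts.items() for _ in range(n)]
predicted = list(true_labels)
# One pancreas core misclassified as colon, as reported in the text.
predicted[true_labels.index("pancreas")] = "colon"

sens = sensitivity_per_class(true_labels, predicted)
pancreas_percent = round(100 * sens["pancreas"], 1)  # 206 / 207 -> 99.5
```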

The different pre-processing methods had only minor effects on classification performance when the maximum number of 50 CSPs was used for feature extraction. The choice of normalization method influenced the classification results at the spectrum level, but not at the core level. When mean filtering or no filtering was applied instead of maximum filtering, the minimum core-level sensitivity over all tumor types and both training/validation configurations dropped from 99.5% to 98.8% and 97.5%, respectively.

More significant differences were observed, however, when analyzing the number of features necessary to achieve a certain classification performance level. With the combination of TIC normalization and maximum filtering, only the first 35 out of 50 CSPs were required to achieve a minimum sensitivity of 98.0% over all tumor types and configurations. Using any of the other combinations of normalization and data filtering, this number increased to values between 43 and 49 CSPs.

Conclusions

Our results demonstrate that NMF-based feature extraction methods can be used to design robust and highly accurate classification algorithms for automated tumor typing. The resulting sensitivity and specificity levels outperform previous approaches in which sets of individual m/z values, rather than characteristic spectral patterns, were used to extract spectral features for classification and prediction. Moreover, the proposed method allows the number of extracted features to be limited to between 35 and 50 CSPs, thus increasing robustness and reducing the risk of overfitting.

The normalization and data filtering variants had only minor effects on the resulting classification accuracy, although the exact number of features required for a given accuracy level may depend on the pre-processing method.


References & Acknowledgements:

This work has been partly funded by the German Federal Ministry of Education and Research under the “SME innovative” programme, contract number 13GW0081.


Financial Disclosure

Description     Y/N    Source
Grants          no
Salary          yes    SCiLS GmbH, Bremen, Germany
Board Member    no
Stock           no
Expenses        no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above:

no