Introduction : SpiderMass is an ambient mass spectrometry technology that offers a promising approach to in vivo cancer diagnosis and prognosis. Predicting cancer accurately from its spectra relies on supervised machine learning, and the choice of classification model can have a significant impact on the accuracy of the results. The objective of this research was to develop a reliable pipeline that automatically selects the most efficient classification model and extracts potential biomarkers.
Methods : Several machine learning classification models were compared to find the one with the highest predictive accuracy and the shortest training time. The automated pipeline was developed in Python with open-source packages, including scikit-learn, pandas, NumPy, SciPy, lazypredict, eli5, matplotlib and seaborn. It consists of three main modules: 1) selection of the optimal classifier, 2) extraction of the specific ions that influence the prediction of each class, and 3) comparison of the relative abundances of all m/z peaks between classes to identify those that differ significantly, with a p-value of at most 0.05. The input data were acquired from cancer tissues, including ovarian cancer, esophagogastric cancer, and glioma.
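As an illustration of the first module, the following is a minimal sketch of how candidate classifiers might be ranked by accuracy and training time using scikit-learn alone. It is not the pipeline's actual implementation: the synthetic data from `make_classification`, the three candidate models, the hold-out split, and all variable names are assumptions made for the example.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a spectra matrix (rows = spectra, columns = m/z bins).
# Real input would be the measured peak intensities with tissue-class labels.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical shortlist of candidate models (an assumption for this sketch).
candidates = {
    "LinearSVC": LinearSVC(max_iter=10000, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = []
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)  # fraction of correct predictions
    results.append((name, accuracy, fit_seconds))

# Rank by accuracy (highest first), breaking ties by training time (shortest first).
results.sort(key=lambda r: (-r[1], r[2]))
best_name, best_accuracy, best_fit_seconds = results[0]
print(f"best: {best_name} accuracy={best_accuracy:.3f} fit={best_fit_seconds:.3f}s")
```

A single hold-out split keeps the sketch short; cross-validation would give a more stable ranking on small tissue cohorts.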
Results : The pipeline identified the most efficient model for each type of data and extracted biomarkers with high statistical significance. It highlighted the key features involved in characterizing each class and compared them with the results of the third module, which tested all features between the different classes. Features that were consistent between the two modules were considered potential biomarkers. For example, for glioma, the linear SVC model was the most effective classification model, with a 93% correct classification rate. The analysis revealed potential biomarkers in the form of ions whose abundance differs between cancerous, healthy, and necrotic tissue (e.g. a high abundance of phosphatidic acids (PAs) in cancerous tissue, compared with phosphatidylcholines (PCs) and phosphatidylserines (PSs) in healthy tissue, and an absence of phosphatidylinositols (PIs) in necrotic tissue).
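The consistency check between the second and third modules can be sketched as a simple set intersection. This is a hedged illustration only: the ion names, class labels, and p-values below are invented placeholders, not results from the study, and the `ALPHA` threshold simply mirrors the 0.05 level stated in the Methods.

```python
ALPHA = 0.05  # significance threshold, as stated in the Methods

# Hypothetical output of module 2: ions that drive the prediction of each class.
top_ions_by_class = {
    "cancer": {"ion_a", "ion_b", "ion_c"},
    "healthy": {"ion_d", "ion_e"},
    "necrotic": {"ion_f"},
}

# Hypothetical output of module 3: p-value per ion from the between-class comparison.
p_values = {
    "ion_a": 0.001,
    "ion_b": 0.200,
    "ion_c": 0.030,
    "ion_d": 0.049,
    "ion_e": 0.700,
    "ion_f": 0.004,
}

# Ions whose abundance differs significantly between classes.
significant_ions = {ion for ion, p in p_values.items() if p <= ALPHA}

# Keep only the ions that both modules agree on, per class.
potential_biomarkers = {
    label: sorted(ions & significant_ions)
    for label, ions in top_ions_by_class.items()
}

for label, ions in potential_biomarkers.items():
    print(label, ions)
```

Requiring agreement between a model-driven ranking and an independent statistical test reduces the chance of retaining an ion that matters to only one of the two analyses.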
Discussion : These findings emphasize the importance of selecting an efficient classification model and of using feature extraction techniques for cancer diagnosis and prognosis. The potential biomarkers identified in this study could be incorporated into a large database to advance personalized medicine and aid the development of new drug treatments. Furthermore, combining these results with other mass spectrometry technologies and with other omics data could provide a more comprehensive view of the disease, leading to more accurate diagnosis and prognosis.