Olivier Pible, Francois Allain, Guylaine Miotello, Jean Armengaud
CEA-Marcoule, DSV/IBICTEC-S/SPI/Li2D, Laboratory “Innovative technologies for Detection and Diagnostic”
Mass spectrometry is a powerful tool to identify pathogens. However, some issues such as mixture handling are usually beyond the reach of whole-cell MALDI-TOF approaches. We developed a tandem mass spectrometry approach which not only can address complicated samples such as mixtures of any organisms, but can also give access to relative quantitation of pathogens. It is based on the analysis of the peptide content of the sample and the extraction of phylogenetic information. A universal organism signature has been characterized using this molecular information. The identification problem is then reduced to the search for the linear combination of organism signatures which best matches the overall mass spectrometry signal.
Whole-cell MALDI-TOF mass spectrometry has revolutionized clinical microbial identification. However, pure single-species colonies are required so that the acquired data can be matched to reference spectra from the database. This limitation has given rise to numerous efforts to bridge the gap towards complicated real-life samples that could be analyzed without culture (PMID: 12510738).
We propose here to use the wealth of tandem mass spectrometry (tandem MS) data, which allows the quasi-sequencing of tens of thousands of peptides in a single run, coupled to the explosion of next-generation sequencing (NGS) data, to open a new field in clinical sample processing. In a single run, tandem MS gives access both to information on the proteins in the sample and to taxonomical information. A major question to be addressed is how to decipher the taxonomical linkage of more or less conserved peptides, which can be found in the proteomes of hundreds of different organisms.
We introduce a new method to handle this issue. It uses additional phylogenetic information to characterize an organism signature which depicts, for any taxon, the expected level of occurrence of associated spectra given the putative presence of a given organism.
To illustrate the method and uncover its potential in terms of relative quantitation, we present results obtained on artificial mixtures prepared with varying weight ratios of peptide extracts from two closely related pathogens, both from the Enterobacteriaceae family: Shigella flexneri and Salmonella bongori. These clinically relevant pathogens share 55 to 70% of the spectra associated with expressed peptides.
Mixtures of digested peptides from both species were prepared in triplicate, with the following Shigella:Salmonella ratios: 1:0, 1:0.05, 1:0.1, 1:0.2, 1:0.5, 1:1, 0.5:1, 0.2:1, 0.1:1, 0.05:1 and 0:1.
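As an illustration of the mixture design above, the following minimal sketch converts each Shigella:Salmonella weight ratio into the corresponding Shigella weight fraction. The names `RATIOS` and `shigella_fraction` are hypothetical helpers introduced here for illustration only; they are not part of the pipeline described in this work.

```python
# The eleven Shigella:Salmonella weight ratios listed in the text.
RATIOS = ["1:0", "1:0.05", "1:0.1", "1:0.2", "1:0.5", "1:1",
          "0.5:1", "0.2:1", "0.1:1", "0.05:1", "0:1"]


def shigella_fraction(ratio: str) -> float:
    """Return the Shigella weight fraction for a 'shigella:salmonella' ratio."""
    shigella, salmonella = (float(part) for part in ratio.split(":"))
    return shigella / (shigella + salmonella)


# Map every ratio to its Shigella weight fraction (hypothetical helper output).
fractions = {ratio: shigella_fraction(ratio) for ratio in RATIOS}

for ratio, fraction in fractions.items():
    print(f"{ratio:>7}  Shigella fraction = {fraction:.3f}")
```

The two extreme mixed samples, 1:0.05 and 0.05:1, correspond to a minor species present at one part in twenty relative to the major one.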
NanoLC-MS/MS experiments were performed with an LTQ-Orbitrap XL hybrid mass spectrometer (ThermoFisher) coupled to an UltiMate 3000 LC system (Dionex-LC Packings). Peptides were resolved using a 90-min gradient from 4 to 50% CH3CN solvent.
The NCBInr database (13 February 2015 release) was used for peptide inference, taxonomical attribution and phylogenetic processing. The database was preprocessed with a Python script to extract phylogenetic information, using a selected set of proteins that are found ubiquitously across all superkingdoms of life in the proteomes of the taxa present in NCBInr.
The number of MS/MS spectra across all these experiments was 9814 ± 502. Peptide assignment was performed with the MASCOT engine using the NCBInr database, with an assignment level of 3052 ± 415 across all experiments at a p-value threshold of 0.05, and an average processing time of 89 minutes. A post-processing pipeline written in Python was then applied to assign spectra to taxa and to phylogenetic information, and then to identify and relatively quantify both organisms, with an average processing time below 15 minutes.
The first output of the method is the determination of the organisms present in a sample. This analysis is based on the examination of MS/MS spectra pertaining to taxa at a given taxonomical level, starting at the superkingdom level and then gradually descending through the taxonomical levels. Spectra from validated taxa are excluded and the next best taxon is searched for, until each clade is populated with a number of spectra coherent with the higher taxonomical level. Taxon-specific spectra are also used for taxa validation or confidence assessment.
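The exclusion step described above can be pictured, at a single taxonomical level, as a greedy selection. The sketch below is only an approximation under stated assumptions, not the pipeline used in this work: the function name `select_taxa`, the stopping threshold `min_new_spectra` and the spectrum identifiers are all hypothetical, and the coherence check against the higher taxonomical level is replaced by a simple count threshold.

```python
def select_taxa(spectra_by_taxon, min_new_spectra=5):
    """Greedy sketch: repeatedly validate the taxon explaining the most
    not-yet-explained spectra, then exclude those spectra from the others.

    spectra_by_taxon maps a taxon name to the set of spectrum identifiers
    assigned to it. min_new_spectra is an assumed stopping threshold.
    """
    remaining = {taxon: set(ids) for taxon, ids in spectra_by_taxon.items()}
    validated = []
    while remaining:
        # Sorting first makes ties resolve alphabetically (deterministic).
        best = max(sorted(remaining), key=lambda taxon: len(remaining[taxon]))
        explained = remaining.pop(best)
        if len(explained) < min_new_spectra:
            break  # the next best taxon adds too few new spectra
        validated.append((best, len(explained)))
        # Exclude spectra of the validated taxon before the next round.
        for taxon in remaining:
            remaining[taxon] -= explained
    return validated


# Made-up spectrum identifiers, for illustration only.
example = {
    "Shigella": {1, 2, 3, 4, 5, 6, 7, 8},
    "Salmonella": {5, 6, 7, 8, 9, 10, 11},
    "Escherichia": {1, 2, 3, 4, 5, 6},
}
print(select_taxa(example, min_new_spectra=3))
```

In this toy input, the spectra assigned to Escherichia are all shared with Shigella, so once Shigella is validated Escherichia is left with no spectra of its own, whereas Salmonella retains spectra that Shigella cannot explain.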
The second output is the use of these taxa to fit a linear combination of the corresponding proteomic signatures to the overall metaproteomic signal, i.e. the number of spectra per taxon. In the case of a mixture of Shigella flexneri and Salmonella bongori, the fitted contributions of each signature are used to estimate the ratio. Without this fit, in the samples containing only Shigella material, the number of spectra attributed to Salmonella is on average 55% of the number of spectra attributed to Shigella. With Salmonella-only material, the Shigella signal is 70% of the Salmonella signal. Using the fitted data, the estimated species ratios are in line with the ratios used to prepare the samples. The results for the 1:0.05 and 0.05:1 mixtures show that one species can be detected at a 1 to 20 ratio relative to the other.
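The fit can be illustrated with a minimal two-signature, non-negative least-squares sketch. The cross-attribution levels (0.55 and 0.70) are the ones reported above; everything else is an assumption made for illustration: the function name `fit_two_signatures`, the reduction of each signature to two taxa, and the spectrum counts in the example, which are invented and are not measured data.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def fit_two_signatures(signal, sig_a, sig_b):
    """Non-negative least-squares weights (wa, wb) such that
    wa * sig_a + wb * sig_b approximates signal.
    Assumes both signatures are non-zero vectors."""
    aa, bb, ab = dot(sig_a, sig_a), dot(sig_b, sig_b), dot(sig_a, sig_b)
    ay, by = dot(sig_a, signal), dot(sig_b, signal)
    det = aa * bb - ab * ab
    if det > 1e-12 * aa * bb:  # signatures are not collinear
        # Solve the 2x2 normal equations of the unconstrained problem.
        wa = (ay * bb - by * ab) / det
        wb = (by * aa - ay * ab) / det
        if wa >= 0 and wb >= 0:
            return wa, wb

    # Otherwise the optimum lies on a boundary: only one signature is used.
    def residual(weights):
        return sum((y - weights[0] * a - weights[1] * b) ** 2
                   for y, a, b in zip(signal, sig_a, sig_b))

    candidates = [(max(ay / aa, 0.0), 0.0), (0.0, max(by / bb, 0.0))]
    return min(candidates, key=residual)


# Signatures over the taxa [Shigella, Salmonella], per unit of each organism,
# built from the cross-attribution levels quoted in the text.
SHIGELLA_SIGNATURE = [1.0, 0.55]
SALMONELLA_SIGNATURE = [0.70, 1.0]

# Invented spectrum counts per taxon for a mixed sample (illustration only).
signal = [1140.0, 750.0]
w_shigella, w_salmonella = fit_two_signatures(
    signal, SHIGELLA_SIGNATURE, SALMONELLA_SIGNATURE)
print(round(w_shigella, 1), round(w_salmonella, 1))
print("naive ratio :", round(signal[1] / signal[0], 2))
print("fitted ratio:", round(w_salmonella / w_shigella, 2))
```

In this invented example the raw counts suggest a Salmonella:Shigella ratio of about 0.66, whereas the fitted weights give about 0.2, because the spectra shared between the two species are redistributed by the fit. Reading the fitted weights directly as weight ratios is itself an assumption of this sketch.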
The method that we present is expected to benefit from improvements in three fields: (i) the data used are the same as in standard proteomics, so improvements in protocols or peptide inference in that field should be applicable to this method; (ii) NGS-derived sequence databases are currently growing exponentially, allowing for better species coverage and fit quality; (iii) tandem mass spectrometry is also rapidly evolving, with speed and sensitivity improvements that translate into direct gains for the method.
The signature model includes the flexibility to address variations in the expression level of peptides of varying inter-species conservation, which makes the fit robust to intra- or inter-species fluctuations. It is also capable of indicating whether a taxon detected from the database is the correct assignment or only a related organism.
The outstanding potential of this method is expected to allow the direct exploration of samples of clinical interest without the need for, and the limitations of, time-consuming cultivation.