MSACL 2016 US Abstract

Identification of Bacteria by MALDI by Matching to Translated DNA Databases

Kenneth Parker (Presenter)
SimulTOF Systems

Bio: Ken Parker is a senior scientist at SimulTOF Systems in Sudbury, a company founded by Marvin Vestal. Dr. Parker has written peptide mass fingerprinting programs capable of matching multiple proteins to complex protein mixtures. He was previously director of a proteomics core lab at Harvard Partners. Before that, he worked at NIH on measuring the binding of peptides to MHC class I molecules and on predicting relative binding affinities from these data. He received his PhD from Harvard in 1984 in the lab of Jack Strominger.

Authorship: Kenneth C. Parker
SimulTOF Systems

Short Abstract

Software has been written that enables bacterial identification starting from MALDI spectra of colony extracts by mapping directly against a database of mostly ribosomal protein sequences extracted from complete proteomes in public repositories. Every organism whose protein sequences have been deposited can potentially be identified using this approach, whether or not that organism has ever been grown in culture. Each of several thousand bacterial strains receives scores, together with tables listing identified protein sequences. The results for identifying certain Gammaproteobacteria and Firmicutes derived from ATCC collections will be shown. So far, most species included in the downloaded database that have been tested have been identified using this approach.

Long Abstract


In recent years, MALDI-TOF has been accepted as a rapid and reliable means of identifying pathogens, typically starting from bacterial colonies (Ryzhov and Fenselau, 2001; Patel, 2015; Karlsson et al., 2015; Singhal et al., 2015). In this approach, MALDI spectra are matched against library spectra gathered from identified colonies. We have developed an alternative approach in which MALDI spectra gathered on the SimulTOF 100 linear mass spectrometer are mapped directly against a database of mostly ribosomal protein sequences downloaded from public repositories. Numerous papers document that many of the strong signals in MALDI spectra correspond to ribosomal proteins (Ryzhov and Fenselau, 2001), yet this knowledge has not been used directly to attempt identification of bacteria from all known phyla. In our case, the protein sequence database to be searched is housed in an SQLite (sqlite3) database, which allows users to query the organisms and determine how many organisms and strains encode any particular protein sequence. The protein database can be replaced with newer releases as desired. Each strain is mapped to a complete taxonomic key by the public database. So far, every colony tested that can be identified by the standard extract library matching approach can also be matched by direct comparison to the downloaded protein sequence database.
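As an illustration of the kind of query described above, the following is a minimal sketch using Python's standard sqlite3 module. The table layout, column names, strain names, and toy sequences are all assumptions made for this example; the schema of the actual database is not described here.

```python
import sqlite3

# Hypothetical schema: one row per (strain, protein) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    strain   TEXT NOT NULL,
    species  TEXT NOT NULL,
    name     TEXT NOT NULL,
    sequence TEXT NOT NULL
);
""")

# Toy rows; the sequences are invented and far shorter than real proteins.
rows = [
    ("strain_A1", "species_A", "50S ribosomal protein L33", "MAKGIREKIKL"),
    ("strain_A2", "species_A", "50S ribosomal protein L33", "MAKGIREKIKL"),
    ("strain_B1", "species_B", "50S ribosomal protein L33", "MAKGIREKIKL"),
    ("strain_C1", "species_C", "50S ribosomal protein L33", "MAKGVREKIKL"),
]
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", rows)


def count_carriers(conn, sequence):
    """Return (number of species, number of strains) encoding an exact sequence."""
    cur = conn.execute(
        "SELECT COUNT(DISTINCT species), COUNT(DISTINCT strain) "
        "FROM protein WHERE sequence = ?",
        (sequence,),
    )
    return cur.fetchone()


print(count_carriers(conn, "MAKGIREKIKL"))
```

With the toy rows above, the first sequence is shared by three strains in two species, while the variant carried by strain_C1 is unique to one strain.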


MALDI spectra were gathered from ATCC-typed bacterial strains provided to us by Chris Cox and Kent Voorhees, Colorado School of Mines. Some bacteria were also isolated from our lab environment. The TrEMBL database of protein sequences was downloaded from the UniProt FTP site, and ribosomal protein sequences were extracted from it using a C# program. This ribosomal protein database can be supplemented if desired with other proteins found to be readily detectable in MALDI spectra, for example, DNA-binding protein HU. Some bacteria in our extract collection were absent from TrEMBL; their sequences were downloaded separately as organism-specific FASTA files from NCBI.
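The extraction step can be sketched as a filter over FASTA records. The original program was written in C#; the Python below is only an illustrative stand-in. It assumes UniProt-style headers in which the protein description precedes an `OS=` organism field, and it keeps any record whose description contains "ribosomal protein", which is a cruder rule than a production filter would likely use. The accessions and sequences in the example are invented.

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ribosomal_records(handle):
    """Yield (organism, description, sequence) for ribosomal protein records."""
    for header, seq in read_fasta(handle):
        description = header.split(" OS=")[0]
        if re.search(r"ribosomal protein", description, re.IGNORECASE):
            # Organism name runs from "OS=" to the next two-letter field tag.
            match = re.search(r"OS=(.+?)(?= [A-Z]{2}=|$)", header)
            organism = match.group(1) if match else "unknown"
            yield organism, description, seq


# Invented records in UniProt-like layout, for demonstration only.
EXAMPLE = """\
>tr|X00001|X00001_TOYAA 50S ribosomal protein L33 OS=Toy organism A GN=rpmG PE=3 SV=1
MAKGIREKIKL
VSSAG
>tr|X00002|X00002_TOYAA DNA gyrase subunit A OS=Toy organism A GN=gyrA PE=3 SV=1
MSDLAREITPV
>tr|X00003|X00003_TOYBB 30S ribosomal protein S21 OS=Toy organism B PE=3 SV=1
MPVIKVRENEP
"""

kept = list(ribosomal_records(io.StringIO(EXAMPLE)))
for organism, description, seq in kept:
    print(organism, "|", description, "|", len(seq))
```

Supplementing the database with other readily detected proteins, such as DNA-binding protein HU, would amount to widening the description pattern.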

A MALDI matching program was written in C# to provide matching statistics for each organism in the ribosomal protein database. This program allows the user to restrict the mass range as desired for both the input MALDI spectra and the protein sequence database. Matching scores are calculated from the percentage of the spectrum intensity that is matched, the percentage of the proteins in the database that are matched, and the average mass error of the matches. The program returns these statistics for every organism in the database, if desired, and also allows users to view which proteins are matched to each peak. The matching process generates a calibration file from the top organism hit, which can be used to internally calibrate the spectrum if desired. In cases of borderline identification, reducing the mass tolerance is found to increase organism selectivity. If reducing the mass tolerance window, as appropriate to the quality of the spectrum, fails to increase organism selectivity, this is evidence of identification failure, or at least of identification ambiguity. The software can tolerate peak lists with up to several hundred components. A wide range of mass tolerances can be investigated. The software superimposes the theoretical masses on top of the raw mass spectrum prior to peak detection, if desired.
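The three statistics named above can be sketched for a single organism as follows. This is a simplified Python illustration, not the C# program itself: the nearest-mass matching rule, the tolerance expressed in ppm, and all names are assumptions, and the way the statistics are combined into a final score is not specified here, so none is shown.

```python
def match_statistics(peaks, protein_masses, tolerance_ppm):
    """Compute matching statistics for one organism.

    peaks: list of (m/z, intensity) tuples from the spectrum.
    protein_masses: list of predicted m/z values for the organism.
    tolerance_ppm: maximum allowed relative mass error (assumed unit: ppm).
    """
    total_intensity = sum(intensity for _, intensity in peaks)
    if not protein_masses or total_intensity <= 0:
        return {"pct_intensity_matched": 0.0,
                "pct_proteins_matched": 0.0,
                "mean_abs_error_ppm": None}

    matched_intensity = 0.0
    matched_proteins = set()
    errors_ppm = []
    for mz, intensity in peaks:
        # Find the predicted mass closest to this peak.
        best = min(range(len(protein_masses)),
                   key=lambda i: abs(protein_masses[i] - mz))
        error_ppm = (mz - protein_masses[best]) / protein_masses[best] * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            matched_intensity += intensity
            matched_proteins.add(best)
            errors_ppm.append(abs(error_ppm))

    return {
        "pct_intensity_matched": 100.0 * matched_intensity / total_intensity,
        "pct_proteins_matched": 100.0 * len(matched_proteins) / len(protein_masses),
        "mean_abs_error_ppm": (sum(errors_ppm) / len(errors_ppm)
                               if errors_ppm else None),
    }


# Invented peak list and predicted masses.
peaks = [(5000.0, 60.0), (7000.0, 30.0), (9000.0, 10.0)]
predicted = [5001.0, 7007.0, 8000.0, 12000.0]

tight = match_statistics(peaks, predicted, tolerance_ppm=500)
loose = match_statistics(peaks, predicted, tolerance_ppm=1500)
print(tight)
print(loose)
```

In this toy case, tightening the tolerance from 1500 ppm to 500 ppm drops the peak at m/z 7000, which sits about 1000 ppm from its nearest predicted mass. Running the same comparison across organisms is one way to see whether a narrower window separates the top hit from its competitors.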

Depending on the version of the database, there may be from several dozen to thousands of strains in the database that are keyed to commonly studied organisms like E. coli and S. aureus. The software provides the user with tables of mapped ribosomal proteins for the top 20 hits, and also allows the user to generate a dendrogram that shows strain relationships for any desired number of matches, based on which protein sequences are matched. In this fashion, the user can get a sense of how much strain differentiation is possible based on the protein sequences in the database.
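One way such a dendrogram could be built is sketched below in Python. The choice of Jaccard distance between sets of matched sequences and of single-linkage clustering is an assumption made for illustration; the distance measure and linkage actually used by the software are not stated. Strain names and sequence labels are invented.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage(matched):
    """Agglomerate strains by single linkage.

    matched: dict mapping strain name -> set of matched protein sequences.
    Returns the merges in order as (cluster_1, cluster_2, distance) tuples,
    which is the information needed to draw a dendrogram.
    """
    clusters = [frozenset([name]) for name in sorted(matched)]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(jaccard_distance(matched[x], matched[y])
                    for x in c1 for y in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), d))
    return merges


# Labels such as "L33a" stand for distinct sequence variants of a protein.
matched = {
    "strain_1": {"L33a", "S21a", "L29a"},
    "strain_2": {"L33a", "S21a", "L29a"},
    "strain_3": {"L33a", "S21b", "L29a"},
    "strain_4": {"L33b", "S21c", "L29b"},
}

merges = single_linkage(matched)
for left, right, distance in merges:
    print(left, "+", right, "at", distance)
```

Strains that merge at distance 0.0 share every matched sequence and therefore cannot be distinguished by this protein set.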


The results for matching extracts from certain Firmicutes and Gammaproteobacteria will be shown. The data returned from the database matching method indicate that certain organism ‘complexes’ like Escherichia coli and nominal Shigella species can be readily separated into clades that do not correspond to the named Shigella species or Escherichia coli boundaries. It is well known that Shigella is not monophyletic with respect to E. coli (Lan and Reeves, 2002), but the ribosomal repertoire of Shigella species indicates additional confusion in the assignment of completely sequenced Shigella strains to particular species. In most cases examined, clades that are separable by ribosomal protein sequences are also separable upon consideration of the whole proteome. These results indicate that some of the limitations of MALDI in mapping spectra to particular species are due to the lack of a scientific consensus on the delineation of these species. With this caveat, most colonies that we have studied have been identified to an appropriate clade using the translated proteome approach.

Prior to matching, the masses of the predicted ribosomal proteins are systematically adjusted according to the N-end rule (Hirel et al., 1989) for excision of the N-terminal methionine when it is followed by glycine, alanine, serine, valine, cysteine or proline. Matching is performed for both the singly charged and doubly charged form of each protein. We have collected some evidence that the N-end rule ought to be adjusted for certain taxa. If desired, we can also select for matching to proposed modifications of any particular kind, for example, methylation. Consistent with the literature (Ryzhov and Fenselau, 2001), we have found that ribosomal protein L33 appears to be constitutively methylated in many Enterobacteria. We are in the process of determining how widespread this modification is among the bacterial species available to us, and how many other ribosomal proteins may be similarly modified.
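The mass adjustment described above can be sketched as follows. This Python fragment is an illustration under stated assumptions: the residue masses are approximate average values, the excision rule uses exactly the six residues listed in the text, the optional modification shift (for example, about 14.03 Da for one methyl group) is applied as a simple additive offset, and all function names are hypothetical.

```python
# Approximate average residue masses in Da (values vary slightly by source).
AVG_RESIDUE_MASS = {
    "G": 57.0513, "A": 71.0779, "S": 87.0773, "P": 97.1152, "V": 99.1311,
    "T": 101.1039, "C": 103.1429, "L": 113.1576, "I": 113.1576, "N": 114.1026,
    "D": 115.0874, "Q": 128.1292, "K": 128.1723, "E": 129.1140, "M": 131.1961,
    "H": 137.1393, "F": 147.1739, "R": 156.1857, "Y": 163.1733, "W": 186.2099,
}
WATER = 18.0153    # added once per chain for the free termini
PROTON = 1.00728   # mass of the charge carrier
METHYL_SHIFT = 14.0266  # approximate average mass added by one methylation

# Penultimate residues that trigger methionine removal, as listed in the text.
EXCISION_RESIDUES = set("GASVCP")


def mature_sequence(seq):
    """Remove the initiator methionine when the second residue permits it."""
    if len(seq) > 1 and seq[0] == "M" and seq[1] in EXCISION_RESIDUES:
        return seq[1:]
    return seq


def average_mass(seq, modification_shift=0.0):
    """Average neutral mass of a chain, plus an optional modification offset."""
    return sum(AVG_RESIDUE_MASS[r] for r in seq) + WATER + modification_shift


def mz(mass, charge):
    """m/z of the protonated ion carrying the given positive charge."""
    return (mass + charge * PROTON) / charge


# Toy peptide; real ribosomal proteins are several kDa.
mass = average_mass(mature_sequence("MAKR"))
print(round(mass, 4), round(mz(mass, 1), 4), round(mz(mass, 2), 4))
```

Each predicted protein then contributes two candidate m/z values, one per charge state, and more if modified forms such as a methylated variant are also included in the search.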

In conclusion, it is possible to identify bacteria by MALDI-TOF starting from DNA databases without making spectral libraries from bacterial colonies that have been identified by other means. Some of the perceived limitations in identifying bacteria by MALDI appear to be a consequence of confusion in the delineation of bacterial species. We expect that this methodology will be useful in characterizing environmental bacterial samples, and in refining techniques for extracting proteins from bacterial samples. Until bacterial strains are better mapped to species, this methodology appears to be useful in deducing limitations in understanding strain identifications. In the clinic, this methodology may be most useful in identifying rarely encountered bacterial strains.

References & Acknowledgements:

Acknowledgments. We thank Chris Cox and Kent Voorhees (Colorado School of Mines) for providing colony extracts from ATCC-typed bacteria.


1.) Ryzhov, V. and Fenselau, C. (2001), “Characterization of the protein subset desorbed by MALDI from whole bacterial cells”, Anal. Chem. 73, 746–750.

2.) Patel, R. (2015) MALDI-TOF MS for the diagnosis of infectious diseases, Clin Chem. 61:100-11. PMID: 25278500

3.) Karlsson R, Gonzales-Siles L, Boulund F, Svensson-Stadler L, Skovbjerg S, Karlsson A, Davidson M, Hulth S, Kristiansson E, Moore ER. (2015) Proteotyping: Proteomic characterization, classification and identification of microorganisms--A prospectus. Syst Appl Microbiol. 38:246-57. PMID: 25933927.

4.) Singhal N, Kumar M, Kanaujia PK, Virdi JS. (2015) MALDI-TOF mass spectrometry: an emerging technology for microbial identification and diagnosis. Front Microbiol 6:791. PMID: 26300860

5.) Lan R, Reeves PR (2002) Escherichia coli in disguise: molecular origins of Shigella. Microbes Infect. 4:1125-32.

6.) Ruelle, V., El Moualij, B., Zorzi, W., Ledent P. and De Pauw, E. (2004) Rapid identification of environmental bacterial strains by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 18: 2013–2019.

7.) Hirel PH, Schmitter MJ, Dessen P, Fayat G, Blanquet S. (1989) Extent of N-terminal methionine excision from Escherichia coli proteins is governed by the side-chain length of the penultimate amino acid. Proc Natl Acad Sci U S A. 86:8247-51.

Financial Disclosure

Salary: yes, SimulTOF Systems
Board Member: no
Stock: yes, SimulTOF Systems
Expenses: no

IP Royalty: no

Planning to mention or discuss specific products or technology of the company(ies) listed above: