Spectral similarity scores are commonly used as a proxy for structural similarity. But in many cases the two differ significantly. This can call the results of your library comparisons into question. Florian and his team have built a deep learning model that successfully predicts structural similarity directly from tandem mass spectra.
Spectral comparisons are part of every metabolomics workflow
Comparing your mass-spectral measurements against library measurements is a crucial step in any metabolomics analysis pipeline. You need reliable spectral comparisons to:
- Identify your compound: search a reference library for a compound matching your fragmentation pattern.
- Search for analogs: find structurally related compounds through similar fragmentation patterns.
- Aid in structure elucidation: link your molecule to a molecular class, to narrow potential assignments.
With tandem mass spectrometry, you typically compare mass spectra by calculating the cosine score between them. A high cosine score tells you that the fragmentation profiles of two compounds overlap well.
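To make the idea concrete, here is a minimal sketch of a cosine score between two spectra. It assumes each spectrum is a plain list of (m/z, intensity) pairs and pairs peaks greedily within a fixed m/z tolerance. The function name, the tolerance and the toy peak lists are illustrative assumptions; production tools usually add steps such as intensity weighting and peak filtering.

```python
import math


def cosine_score(spec_a, spec_b, tolerance=0.01):
    """Cosine score between two spectra given as lists of (m/z, intensity).

    Peaks are paired greedily, largest intensity product first,
    and each peak is used at most once.
    """
    # Collect every pair of peaks whose m/z values agree within the tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))
    candidates.sort(reverse=True)

    # Greedily accept pairs, never reusing a peak from either spectrum.
    used_a, used_b = set(), set()
    matched = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            matched += product
            used_a.add(i)
            used_b.add(j)

    # Normalize by the intensity norms of both spectra.
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return matched / (norm_a * norm_b)


# Toy spectra (made-up values): two of the three peaks line up.
reference = [(100.0, 0.2), (150.0, 1.0), (200.0, 0.5)]
query = [(100.0, 0.2), (150.005, 1.0), (214.0, 0.5)]
print(round(cosine_score(reference, query), 3))
```

The score depends only on which peaks line up and how intense they are. Nothing in the calculation knows about the molecules behind the spectra.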
You might assume that a high cosine score implies high structural similarity. Unfortunately, cosine scores only tell you how similar two spectra are. When you want to annotate your compounds, you’re interested in structural similarity. And that’s a crucial difference.
Spectral and structural similarity: Apples and oranges
Let’s look at the differences:
- Spectral similarity, for example cosine similarity, measures the overlap between two mass spectral profiles.
- Structural similarity, for example the Tanimoto score, quantifies the overlap between two molecular fingerprints, which are vector representations derived from the 2D molecular structures.
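As a rough illustration of the structural side, the Tanimoto score is the number of fingerprint bits two molecules share divided by the number of bits set in either one. The sketch below assumes each fingerprint is stored as a set of "on" bit positions. The bit positions are invented for the example; real fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto score between two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return len(fp_a & fp_b) / union


# Hypothetical fingerprints: the analog differs from the parent in a single bit.
parent_fp = {1, 4, 7, 9, 12, 15, 20, 23}
analog_fp = {1, 4, 7, 9, 12, 15, 20, 31}
print(round(tanimoto(parent_fp, analog_fp), 3))
```

Here 7 bits are shared out of 9 set in total, so a small structural change only lowers the score a little.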
Spectral similarity can differ significantly from structural similarity. This can lead you to the wrong conclusions from your library searches. Let's see how that can affect your work.
Take, for example, the antibiotic daptomycin (top) and a hypothetical analog (bottom):
- High structural similarity: The structure of the analog (bottom) is still highly similar to its unmodified form, daptomycin (top).
- Low spectral similarity: The modifications in the analog carry over to many of the fragments and shift many of the mass-spec peaks. As a result, few overlapping peaks remain, so the spectral similarity score is low.
This means that if your sample included a daptomycin analog, you would NOT get a hit from either a library search or an analog search, even if you had a daptomycin reference spectrum in the library!
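The effect is easy to reproduce with toy numbers. The sketch below is not the daptomycin data: it uses an invented four-peak spectrum, an arbitrary 14.0 Da shift, and a simple binned cosine (1 Da bins) as a stand-in for the cosine score. The helper names are hypothetical. Every fragment that carries the modification moves by the same mass shift.

```python
import math


def binned_cosine(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity after summing intensities into fixed-width m/z bins."""
    def to_bins(spectrum):
        bins = {}
        for mz, intensity in spectrum:
            key = int(mz // bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = to_bins(spec_a), to_bins(spec_b)
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def shift_modified_fragments(spectrum, carries_modification, mass_shift):
    """Move every fragment that contains the modified site by mass_shift."""
    return [
        (mz + mass_shift if modified else mz, intensity)
        for (mz, intensity), modified in zip(spectrum, carries_modification)
    ]


# Invented parent spectrum; three of its four fragments contain the modified site.
parent = [(120.0, 0.3), (250.0, 1.0), (380.0, 0.6), (510.0, 0.8)]
carries_modification = [False, True, True, True]
analog = shift_modified_fragments(parent, carries_modification, mass_shift=14.0)

print(round(binned_cosine(parent, analog), 3))
```

Only the one unmodified fragment still overlaps, so the score collapses even though the two hypothetical molecules differ by a single small modification.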
So if you rely solely on cosine similarity:
- You can’t be sure that matches with the highest cosine score are really the best structural matches.
- Your library might contain a close structural match, but you won’t see it because the match received a very low spectral similarity score.
However, Florian Huber and his team have found a way to predict structural similarity directly from tandem mass spectra.
How MS2DeepScore predicts structural similarity from tandem mass spectra
MS2DeepScore can predict when compounds are closely related structural analogs, even when other comparison methods, like cosine similarity, may fail.
Florian Huber and his team trained MS2DeepScore using over 100,000 spectra from over 15,000 unique compounds. Once trained, the MS2DeepScore model predicts the Tanimoto similarity scores between two compounds directly from their respective MS/MS profiles.
Using what it learned, MS2DeepScore consistently outperforms both cosine similarity and Spec2Vec in relating structurally similar compounds, even when the compounds’ fragmentation profiles are shifted. Importantly, the compounds on which the scores’ accuracy was compared were not included in the MS2DeepScore training dataset.
In our example, the cosine score between daptomycin and our analog was only 0.24. From this, we would assume that these compounds are not structurally close. But the MS2DeepScore model predicts a Tanimoto score of 0.95, very close to the actual Tanimoto score of 0.98.
Working with MS2DeepScore
Getting started with MS2DeepScore is simple, and you can begin comparing spectra with it in minutes.
You can use MS2DeepScore no matter what instrument you collect with. MS2DeepScore works with the matchms package, so you can load your data from several open formats, such as .mgf, .mzXML and .mzML.
Florian and his team designed MS2DeepScore as a Python package that includes Jupyter notebook tutorials to get you started. You can begin comparing your spectra using the model pre-trained on 109,000 spectra of 15,000 compounds. To learn more about MS2DeepScore and its development, be sure to check out the preprint publication.
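In practice, a comparison could look roughly like the sketch below. It follows the pattern shown in the package README: load spectra with matchms, load a pre-trained model, and pass the MS2DeepScore similarity to calculate_scores. The wrapper function and its file-path arguments are hypothetical, and import paths and model file formats have changed between releases, so treat this as an assumption to check against the documentation of the version you install. Spectra may also need the metadata cleaning and filtering steps described in the tutorials before they are scored.

```python
def score_with_ms2deepscore(reference_file, query_file, model_file):
    """Return a matchms Scores object with predicted Tanimoto scores.

    All three arguments are paths you supply: two .mgf files with spectra
    and a pre-trained MS2DeepScore model file.
    """
    # Imported inside the function so this sketch can be defined and read
    # even when matchms and ms2deepscore are not installed.
    from matchms import calculate_scores
    from matchms.importing import load_from_mgf
    from ms2deepscore import MS2DeepScore
    from ms2deepscore.models import load_model

    # load_from_mgf yields spectra lazily, so collect them into lists.
    references = list(load_from_mgf(reference_file))
    queries = list(load_from_mgf(query_file))

    # Wrap the pre-trained model in a matchms-compatible similarity measure.
    model = load_model(model_file)
    similarity_measure = MS2DeepScore(model)

    # Score every query spectrum against every reference spectrum.
    return calculate_scores(references, queries, similarity_measure)
```

Because MS2DeepScore plugs into matchms as a similarity measure, you can swap it in wherever you would otherwise use a cosine-based score.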
What’s next for MS2DeepScore?
We’re building new ways to leverage MS2DeepScore for library matching, including an API, an easy-to-use interface, and continuously updated models.
This new version will provide:
- A new model trained on over 230,000 spectra;
- Library matching against GNPS public databases;
- A new negative mode comparison.
If you want to try it out, don’t hesitate to contact us to receive beta access.