Calculating the similarity between two MS/MS mass spectra is an essential step in most untargeted metabolomics workflows.
However, it’s long been known that cosine similarity – the most popular similarity measure – has significant weaknesses.
Justin van der Hooft and his team have developed a new approach to calculating spectral similarity: Spec2Vec. It is particularly useful when you analyze untargeted metabolomics results in biomarker discovery or natural product research.
Let's see how Spec2Vec works and why it’s a good alternative to cosine similarity.
Why mass spectral similarity metrics matter
A datapoint in an untargeted metabolomics dataset is, at first, just a measured entity. We don’t know yet what molecule it might be. Hence, we can’t interpret its significance: Is it something new? What is its metabolic function?
But if you can match this entity against a database, like the Human Metabolome Database (HMDB), then you can add (annotate) important information to that entity. And to match your measurement against the entries in the database, you need a similarity metric.
This is how mass spectral similarity metrics are mainly used:
- Spectral library matching: Matching a compound against a spectral library (such as HMDB).
- Analogue search: Finding the top hits in a large structural library, to look for structurally related molecules.
- Mass spectral networks: Clustering related compounds in a mass spectral network.
Any mistake can be costly. If you can’t find a match, your compound might seem to be an exciting new drug candidate when, in reality, it’s a well-known molecule. If you only discover this at a much later stage of your research, you’ve wasted a lot of effort.
The better your similarity metric, the less often such mistakes happen.
The drawbacks of cosine similarity
Cosine similarity measures how much one mass spectrum overlaps with another mass spectrum. If two mass spectra have near perfect overlap, then the cosine similarity is high – and their molecular structures are likely very similar.
Sometimes though, you have structurally similar molecules whose mass spectra do not overlap – and in those cases, cosine similarity will likely fail. Here’s how that happens:
- Large molecules modified in several places: Large molecules – as often found in plants and microorganisms – can easily get several small modifications in their molecular structures.
- Several fragments are shifted: If you fragment a molecule that has several modifications, then not one, but several of its fragments also retain a modification. As a result, several fragment peaks shift on the mass spectrum.
- Few peaks overlap perfectly: If lots of fragments are shifted, the overall spectrum no longer overlaps neatly with the unmodified molecule.
- Low cosine similarity: In terms of cosine similarity (overlap), the two molecules therefore seem unrelated and you’ll see a low cosine score.
- Wrong conclusions: As a result, you might think a library has no molecule similar to your compound, when in fact it does. This could provoke the conclusion that you’ve found a novel molecule, even though what you’ve found is just a slight modification of a well-described molecule.
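The failure mode described in the list above can be reproduced with a few lines of Python. This is a minimal sketch, not the scoring code of any particular tool: it assumes a simplified cosine score that bins peaks by rounding m/z to one decimal, and the peak lists and the mass shift are made-up values chosen for illustration.

```python
import math

def cosine_score(spec_a, spec_b, decimals=1):
    """Simplified cosine score between two spectra given as [(mz, intensity), ...].

    Peaks are binned by rounding m/z; only peaks that land in the same bin
    count as overlap. Real tools match peaks within a tolerance instead.
    """
    def binned(spec):
        bins = {}
        for mz, intensity in spec:
            key = round(mz, decimals)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical fragment peaks of an unmodified molecule
parent = [(85.0, 0.4), (127.1, 1.0), (189.1, 0.7), (245.2, 0.5)]

# Hypothetical analogue: three of the four fragments carry a modification
# that shifts them by the same mass (value chosen for illustration only)
shift = 15.995
modified = [parent[0]] + [(mz + shift, intensity) for mz, intensity in parent[1:]]

print(cosine_score(parent, parent))    # identical spectra: score close to 1
print(cosine_score(parent, modified))  # shifted fragments: score below 0.1
```

Only one peak still overlaps in the second comparison, so the score collapses even though the two peak lists differ by nothing more than a constant shift.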
This is a well-known problem, but until Spec2Vec, no fundamentally different mass spectral similarity metric had been proposed to solve it.
Spec2Vec: Unsupervised learning of spectral similarities
At the basis of Spec2Vec is a powerful new assumption: If two fragmentation peaks often show up together, across thousands of mass spectra, then that's probably because they come from the same molecular substructure.
So those two peaks are related in some way. And now, if you only see one of these peaks in one molecule and the other in another molecule, you can make a guess that the same substructure, maybe with small modifications, is present in both molecules.
Spec2Vec implements precisely this idea, but on a large scale: it takes a dataset of many thousands of mass spectra and learns from it which peaks are related. Then you can use what Spec2Vec learned to calculate the spectral similarity between any two mass spectra.
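To make the co-occurrence idea concrete, here is a toy sketch in plain Python. It is not the Spec2Vec implementation: the actual package trains a Word2Vec model on peak "words", whereas this sketch swaps in raw co-occurrence counts as peak vectors. The peak names and training spectra are hypothetical.

```python
import math
from itertools import combinations

def learn_peak_vectors(documents):
    """Map each peak 'word' to its vector of co-occurrence counts."""
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for doc in documents:
        words = sorted(set(doc))
        for word in words:                      # count the peak itself
            vectors[word][index[word]] += 1.0
        for w1, w2 in combinations(words, 2):   # count every co-occurring pair
            vectors[w1][index[w2]] += 1.0
            vectors[w2][index[w1]] += 1.0
    return vectors

def embed(doc, vectors):
    """Spectrum embedding: sum of the vectors of all known peaks."""
    total = [0.0] * len(next(iter(vectors.values())))
    for word in doc:
        if word in vectors:  # peaks never seen in training are skipped
            total = [t + v for t, v in zip(total, vectors[word])]
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical training spectra, each reduced to a list of peak words
training = [
    ["peak@127.1", "peak@143.1", "peak@85.0"],
    ["peak@127.1", "peak@143.1", "peak@85.0"],
    ["peak@127.1", "peak@143.1"],
    ["peak@143.1", "peak@85.0"],
    ["peak@301.2", "peak@77.0"],
    ["peak@301.2", "peak@77.0", "peak@55.0"],
]
vectors = learn_peak_vectors(training)

# Three query spectra that share no peak with each other
spec_a, spec_b, spec_c = ["peak@127.1"], ["peak@143.1"], ["peak@301.2"]
related = cosine(embed(spec_a, vectors), embed(spec_b, vectors))
unrelated = cosine(embed(spec_a, vectors), embed(spec_c, vectors))
print(related, unrelated)
```

Although `spec_a` and `spec_b` have no peak in common, their embeddings are close because their peaks often appeared together in the training data. `spec_c` stays unrelated.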
If this works, then a molecule with several small modifications – like the previous example – shouldn’t fool Spec2Vec. Now, does this work?
Comparing Spec2Vec with cosine similarity
Let’s look at the results of two tests. For both tests, the team used a large molecular library from GNPS that contains molecular structures and their spectra.
Test 1: How well do Spec2Vec scores correlate with structural similarity?
First you need a ground truth against which you can compare Spec2Vec and cosine similarity. Justin and his team took 12,797 unique compounds and used their known molecular structures to directly calculate the structural similarity score (Tanimoto score) for each pair of molecules.
For each of the same pairs (81,875,206 pairs in total), they also calculated cosine, modified cosine, and Spec2Vec scores, but this time, only using the library spectra, not the known structures. This was the result:
- Spec2Vec correlates considerably better with structural similarity;
- Only when spectra overlap almost perfectly do all metrics perform similarly well.
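The ground truth in Test 1 is the Tanimoto score, and the pair count follows directly from the number of compounds. The sketch below assumes that each molecular fingerprint is represented as a set of "on" bits. The article does not say which fingerprint type was used, and the two example fingerprints are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score: shared on-bits divided by all on-bits of both fingerprints."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

# Hypothetical fingerprints of two similar molecules
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 10, 12}
print(tanimoto(fp_a, fp_b))  # 4 shared bits out of 6 distinct bits

# Number of unique pairs among the 12,797 compounds: n * (n - 1) / 2
n_compounds = 12797
n_pairs = n_compounds * (n_compounds - 1) // 2
print(n_pairs)  # 81875206
```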
Spec2Vec’s high correlation with structural similarity (Tanimoto scores) is very promising, but does this translate to better library matches?
Test 2: Can Spec2Vec find the right matches in a library search?
In a second test, Justin’s team selected 1,000 unique molecules from the spectral library that also had at least four planar structural equivalents (these look the same in 2D) in the library. These near-equivalent molecules remained in the library.
The task then was to figure out if the molecules that had the same 2D structure would get the highest similarity scores.
The results were in line with the first test: Again, Spec2Vec found considerably more reliable matches than cosine similarity, across all score ranges.
But Spec2Vec has even more advantages.
Spec2Vec is much faster
Cosine similarity calculations are very expensive on mass spectra because of the preprocessing they need. They’re so expensive that it’s usually impractical to make comparisons against an entire large database.
A Spec2Vec model, once it’s trained, is extremely fast: roughly 100 times faster than cosine similarity scoring. Even if you include the time to train the Spec2Vec model, it’s still around 10 times faster overall (without taking any possible optimizations in cosine similarity into account).
This means that with Spec2Vec, all-vs-all searches against large libraries become practical.
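One way to see where such a speed-up can come from: once every spectrum is a fixed-length vector, each vector is normalized a single time, and every pairwise score is then a plain dot product with no peak matching involved. The sketch below illustrates this with hypothetical three-dimensional embeddings. It is an illustration of the principle, not a benchmark and not the package's own code.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def all_vs_all(embeddings):
    """Cosine similarity of every embedding against every other embedding."""
    unit = [normalize(vec) for vec in embeddings]  # done once per spectrum
    n = len(unit)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = sum(a * b for a, b in zip(unit[i], unit[j]))
            scores[i][j] = scores[j][i] = score    # the matrix is symmetric
    return scores

# Hypothetical spectrum embeddings (real ones have a few hundred dimensions)
embeddings = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
scores = all_vs_all(embeddings)
print(scores[0][1], scores[0][2])
```

With a numerical library, the double loop collapses into a single matrix product, which is why scoring scales well to large libraries.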
And Spec2Vec isn’t difficult to implement, even though it's a much more advanced approach to similarity scoring.
How to integrate Spec2Vec into your workflow
You can implement Spec2Vec with an easy-to-use, open-source Python package that is freely accessible on GitHub: https://github.com/iomega/spec2vec. You can install it with Anaconda (recommended) or pip.
You can even download the pre-trained Spec2Vec model that Justin’s team used for the tests above from Zenodo, so you can calculate mass spectral similarities without training your own model.
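As a rough sketch of what this can look like in code, the function below follows the usage pattern shown in the package README (installation via `pip install spec2vec`). Treat it as an outline under assumptions rather than a reference implementation: the three file paths are hypothetical, the parameter values are examples, the usual spectrum cleaning steps are left out, and the API may differ between package versions.

```python
def spec2vec_scores(model_file, reference_mgf, query_mgf):
    """Score query spectra against reference spectra with a pre-trained model.

    All three arguments are hypothetical local file paths.
    """
    # Imported inside the function so it can be defined even when the
    # optional packages (gensim, matchms, spec2vec) are not installed.
    import gensim
    from matchms import calculate_scores
    from matchms.importing import load_from_mgf
    from spec2vec import Spec2Vec, SpectrumDocument

    # Load the pre-trained Word2Vec model (for example, the one from Zenodo)
    model = gensim.models.Word2Vec.load(model_file)

    # Turn spectra into "documents" of peak words. n_decimals has to match
    # the peak binning the model was trained with (assumed here: 2).
    # Real workflows first clean and normalize the spectra with matchms filters.
    references = [SpectrumDocument(s, n_decimals=2) for s in load_from_mgf(reference_mgf)]
    queries = [SpectrumDocument(s, n_decimals=2) for s in load_from_mgf(query_mgf)]

    # Example settings; tune them for your own data
    similarity = Spec2Vec(model=model,
                          intensity_weighting_power=0.5,
                          allowed_missing_percentage=5.0)
    return calculate_scores(references, queries, similarity)
```

A call such as `spec2vec_scores("model.model", "library.mgf", "queries.mgf")` would return a matchms scores object holding all reference-vs-query similarities. The file names are placeholders.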
>> Important to note: Spec2Vec is an unsupervised algorithm. This means you can train it on any set of MS/MS spectra you have, without any additional information.
This is just the beginning
Spec2Vec is a great approach and can probably be easily extended:
- GC-MS: So far, Spec2Vec has only been tested on LC-MS data. It might be even more valuable for GC-MS studies because in GC-MS measurements, you can't limit library searches by pre-filtering by the similarity of precursor ions.
- Supervised: It's great that Spec2Vec is unsupervised. But you could also build a supervised model: For example, you could train a model that predicts the Tanimoto score directly from the spectral embeddings. This might give you even more accurate models.
- Improved Mass2Motifs: Spec2Vec vector embeddings could potentially be used to train more powerful Mass2Motifs.
- Improved Network Annotation Propagation (NAP): Similarity scores are essential in building mass spectral networks, so if Spec2Vec produces more useful networks, it could also improve your annotation and speed up structural elucidation.
In general, any tool that currently uses cosine similarity to compare spectra can probably benefit from integrating Spec2Vec as an alternative similarity metric.
Further reading
If you want to learn more about Spec2Vec:
- Try the Spec2Vec Python package;
- Try Spec2Vec within GNPS;
- Read the Spec2Vec paper; and
- Check out Florian Huber’s Spec2Vec Tutorial.