In a previous article, GNPS Data with Python and Pandas, we downloaded the GNPS dataset as a .json file and showed you how to clean and analyze the data using Python and Pandas.
We’ll demonstrate an alternative method in this article. Instead of using the .json file, we'll use an .mgf file (also from GNPS). And instead of Pandas, we'll use the specialized Mass Spectrometry libraries matchms (for Cosine distance) and Omigami (for Spec2Vec and MS2DeepScore) to find similar spectra.
This will get you started with finding clusters of similar spectra. And you’ll be more familiar with data formats and tools built specifically for metabolomics data analysis.
[You can find a Jupyter Notebook version of this article here.]
Overview
To follow along, you should have Python and Jupyter Notebook installed. You should also be comfortable using pip or conda to install third-party Python packages. You'll need a computer with at least 8GB of RAM to comfortably load the entire GNPS dataset into memory. This is what we’ll do:
- Download the ALL_GNPS.mgf file;
- Look at the MGF file and see how it compares to JSON;
- Install matchms and use it to read the MGF file;
- Use matchms and Cosine to find similar spectra, with an example spectrum;
- Use Omigami to find similar spectra, using the Spec2Vec and MS2DeepScore algorithms.
Downloading the ALL_GNPS.mgf file
We'll download the .mgf version of the ALL_GNPS dataset. As in the previous tutorial, we'll use wget again to view progress as the download is quite large.
If you don’t have wget, install it first with:
Then download the file by running the following:
JSON vs MGF for the GNPS dataset
The JSON file we worked with before looked like the excerpt shown below. JSON can be read and edited by humans. But it’s primarily a machine-readable format: You'll notice all the data on one line, making it hard to understand and manipulate.
In contrast, the MGF (Mascot Generic Format) file format is primarily a human-readable format. Machines can read it, too. But it’s laid out over more lines and doesn’t use braces or other special characters. This means it’s less space-efficient but easier for humans to read and edit.
An MGF file will use more disk space than an equivalent JSON file. It’s also less interoperable: Generic libraries like pandas can read JSON files, but probably can’t read MGF files. Here is an example excerpt of an MGF file below:
Each piece of metadata has its own line, with the name, followed by an equals sign, the data, and a newline character. Each spectrum is defined in a block between a BEGIN IONS and END IONS statement. And a blank line separates each block.
Installing matchms and parsing the MGF file
You can install the matchms library using pip by running the following:
Built into matchms is a load_from_mgf function. This makes it easy to read our ALL_GNPS file into memory, as follows:
You'll notice that this code executes very quickly. It hasn't actually read the entire file into memory (RAM) yet. Instead it’s created a Python object: This acts like a Python list, but doesn't actually contain any data. It simply knows where to get the data when it’s needed.
A generator is very efficient. But it can also be inconvenient. You can iterate through it like you would a list. But you can’t, for example, access the third spectrum using the syntax specs[2]. For convenience, you can force it to load everything into memory by converting it into a list as follows:
Because data is now being transferred off your hard disk and into RAM, this takes a couple of minutes. You can check your total number of records by looking at the length of the list:
Understanding the matchms Spectrum object
Each item in the list is a Spectrum object. This is something the matchms library provides. On their own, these objects don't look like much. But every Spectrum object has a .metadata property which contains all the metadata for that spectrum. Run the following code to see this:
This output is a Python dictionary containing all the metadata for the first spectrum. When we previously used the JSON file, we noticed that the data for the ionmode attribute was not normalized. And there were several variations. Take a look at this same attribute using matchms as follows:
This will output all the values across the whole dataset, along with their frequency counts:
You'll see that matchms doesn't do much automatic normalization: There are still inconsistencies in casing (Positive vs positive). But the extra spaces present in the JSON version aren’t seen here.
To update a piece of metadata, you have to overwrite the entire .metadata dictionary. We're not going to walk through cleaning up the dataset again. But here's an example of how to change the Positive label for the first spectrum in the dataset to positive:
Now in the output you can see that the ionmode value is lowercase:
Now you’re more familiar with matchms and MGF files, let’s walk through using matchms to find clusters of similar spectra.
Finding similar spectra using matchms
Stenothricin spectra form an interesting gene cluster. Matchms includes built-in functionality to find the similarity between given spectra by looking at the cosine distance between the peak data.
You can use Stenothricin C, ID CCMSLIB00000075068, as a starting point. It exists in the GNPS dataset. To pull it out, you can loop through the specs variable until you find it as follows:
Now you have the spectrum of interest in your stenothricin_c variable.
To calculate the cosine distance (which represents similarity) to all the other spectra in the dataset, you can use the matchms calculate_scores function with the included CosineGreedy similarity function. Run the following:
The queries argument we pass in is a list because you can ask for the similarity of several query spectra at once. In this example, we're just using stenothricin_c. So you'll pass in a list with only a single item. It should take one minute to calculate the similarity scores for all other spectra in the dataset (around 500,000). You can find the results in the returned scores attribute. Take a look at its shape (rows and columns), as follows:
This score’s object is an ndarray (multidimensional array). It returns a 2D matrix with a column per query we submitted. In our case, our matrix only has one column as we only passed in a single spectrum as a query.
This has calculated the similarity score between our spectrum and every other spectra in the database. But we’re only interested in the scores that have a high similarity rating. We can find the indices of the best matches using numpy argpartition as follows:
This shows us the indices of our best matches in the scores array. We can use this to look up the actual scores in the scores array. And we can get the data about the matching spectra from specs. Run the following:
This should show you the top ten matches:
In the for loop, we use [::-1] to reverse the list as the more similar spectra are towards the end. We print out the similarity between our “query” spectrum (Stenothricin C) and each of the top ten matches. Because we used an example from the dataset as a query spectrum, we get one perfect match (with itself) and then some other close matches with Hydroxysprengerinin C and Stenothricin G.
If we happen to know the spectrumids of some other spectra in the Stenothricin cluster, we can find the similarity to each as follows:
And you should see the similarity to each of these:
These other spectra are part of the Stenothricin cluster. But interestingly, most of their cosine similarity scores are low.
Using Omigami to compute Spec2Vec similarity scores
"Similarity" is often thought of as a fixed metric. In reality, there are many different ways for things to be similar to – or different from – each other. Cosine distance is one way to measure how similar to one another spectra are. But there are many other ways to do this.
Spec2Vec, which we've written about in detail, is often a better measure than Cosine distance. But it's also harder to calculate.
Luckily, Omigami makes it easy. Head over to Omigami.com and sign up for an account. You'll need this to use their API.
Once you've signed up at omigami.com, head over to the Omigami GitHub and follow the installation and configuration instructions.
Now you can submit your spectra to Omigami directly. This skips the need to download the GNPS dataset at all. Omigami works directly with MGF files, so first save our query spectrum to a new file as follows:
This will create a file called "query.mgf" with a single spectrum in it: the Stenothricin C sample we used above. Now you can submit that query to Omigami as follows:
The object you get back from Omigami is already nicely formatted. So you don’t have to do custom formatting as you did for the Cosine similarity scores. Your output should look like this below:
As with the Cosine score, we see Stenothricin C as the top match: It's exactly the same sample so we expected this. Unlike before, in second place we now have Stenothricin D with a match of 70% – using Cosine distance gave us only a 10% match to this spectra.
Stenothricin G does not appear in Spec2Vec’s top matches though. This shows how tricky similarity can be.
You can request more matches from Omigami by changing the value of the n_best_matches variable in the example code above. And if you want more metadata than just the name of matches returned, you can add more items to the include_metadata list.
Trying out MS2DeepScore with Omigami
If you want to try out different similarity algorithms, you can simply switch out the import and client initialization. For example, to try MS2DeepScore instead of Spec2Vec, you'd only need to make two changes to the code.
The rest of the code remains identical as Omigami provides a common interface to different algorithms.
You should see something like this:
Note that in both examples we see the same spectrum (Ginsenoside Rb1) returned several times, with slightly different similarities. In a real use-case, it would be important to do some more cleaning and duplicate removal. This prevents a single match from clogging up the returned results. You can adjust the `n_best_matches` variable to get more results and then deduplicate based on the spectrum name.
MS2DeepScore is what we recommend for most real-world use cases. The scores it returns are not directly comparable to Spec2Vec or the Cosine scores as MS2DeepScore is based on tanimoto similarity scores. For more information, take a look at our MS2DeepScore article and our Spec2Vec article.
Where next?
You've got a good grasp of how to work with GNPS data using matchms and you’ve seen the basics of how Omigami works.
For more advanced analysis, take a look at the Omigami documentation. We’re constantly testing and adding the latest metabolomics algorithms, through the same easy and scalable API.