MS2Mol: a transformer encoder-decoder model for exploring the dark chemical space of nature

"Translating Nature to Medicine" by Stable Diffusion XL

This is a continuation of a series of blogs on teaching computers the language of chemistry. Previous posts are here, and here. A preprint of the paper we describe below can be found here.

In many ways, life is chemistry. Metabolism, DNA replication, cell division: all of life's essential processes are, at their core, a series of chemical reactions. These reactions are incredibly complex, and each species has evolved to perform them in both shared and distinct ways. Their constituents include the well-known cellular macromolecules (proteins, RNA, and the genome), but also a huge number of highly diverse small molecules, or metabolites. In a sense, the build-up and break-down of these metabolites is the definition of being alive, and it is one that we are only just beginning to understand.

Metabolites: the stuff of life

There are two main types of metabolites in nature: primary metabolites and secondary metabolites. Primary metabolites are the pillars of the main functions of cellular life (biosynthesis, energy, reproduction), but in some sense they’re also the boring molecules. All organisms build molecules up and break them down to consume and store energy and to create the building blocks for their physical structures. These pathways are largely shared across all of biology. Classic examples of primary metabolites include Krebs cycle intermediates, amino acids, nucleosides, and ATP.

Secondary metabolites, on the other hand, are more interesting because they’re how the majority of organisms interact with the world, and they’ve evolved in myriad specialized ways. Remember, the vast majority of organisms on the planet (plants, fungi, bacteria, sea sponges) don’t really move or talk, so their primary means of communication is to evolve specialized molecules that protect them, compete with neighbors, signal to their kin about danger or resources, or otherwise shape their environment to suit their needs.

Secondary metabolites are very diverse and there are a lot of them. Exactly how many exist is a subject for debate, but estimates run into the billions. These molecules play an important role in giving organisms their unique properties (what actually makes garlic, garlic?) and it’s of incredible value to understand what they are and what they do. This “chemical annotation” is key to understanding disease and to creating new drugs, fuels, flavors, pesticides, and biomaterials. So how do we get a handle on them all? This is the central question of the field of metabolomics, the study of all small molecules in biological samples.

Over many decades, chemists have been isolating and characterizing the most abundant and accessible molecules in nature, but isolating a molecule and determining its structure is time-consuming and expensive, and we have only made a tiny dent in the total. The most comprehensive open repository of metabolite structures in the world is called COCONUT, and it currently has around 400,000 structures. That’s a lot, but still just a fraction of a fraction of a percent of the billions of unique metabolites thought to exist. The rest is what we call dark chemical space, and it accounts for almost everything in nature.

Identifying metabolites with mass spectrometry

The workhorse technology of metabolomics is tandem mass spectrometry (MS/MS). This technique helps you identify a compound by measuring its mass and the masses of its constituent fragments. The first step is to separate all the individual molecules in a complex sample, usually with liquid or gas chromatography. The individual molecules are then ionized and their total mass is measured; this is the MS1 spectrum. These molecules are then broken apart into fragments, and the masses of the fragments are measured. This is the MS2, or fragmentation, spectrum.

Mass spectrometry data is great in that it is readily available and fairly easy to obtain. But it is also a very challenging data type, because sophisticated expertise is needed to piece the measured fragments back together into a structure. As such, a second, more descriptive technology called NMR is usually needed to determine the structure and identity of an unknown molecule. While NMR is the definitive method for identifying a chemical structure, it is expensive, time-consuming, and requires a substantial amount of highly purified sample, which is often difficult to obtain for unknown secondary metabolites.

There are two categories of existing methods for using MS data to identify a compound and predict its structure, and both are based on database lookup. The first, called spectrum reference lookup, compares the experimental MS2 to a database of other, known MS2s and looks for matches. The second, called structure retrieval, compares the experimental MS2 to a database of predicted or hypothesized MS2s for all the known compounds in the major chemical repositories (like COCONUT). While these methods have their benefits, the major problem with both is that they can only help you rediscover known chemistry within your sample, i.e., what is already represented in the databases. This is the bottleneck that has prevented the use of metabolomics data for the exploration of nature’s chemistry.

Summary of three broad approaches to predicting structure of molecules from mass spectra, with the size of natural chemical space accessible by each.
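
To make the database-lookup idea concrete, here is a minimal sketch of spectrum reference lookup using plain cosine similarity between binned spectra. It is an illustration only: the compound names, peak values, bin width, and function names are all made up for this example and are not taken from any real library or from the tools discussed in this post.

```python
import math

def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into fixed-width m/z bins (a sparse vector)."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_peaks, library):
    """Return (name, score) for the best-matching reference spectrum."""
    query = bin_spectrum(query_peaks)
    scored = [
        (name, cosine_similarity(query, bin_spectrum(peaks)))
        for name, peaks in library.items()
    ]
    return max(scored, key=lambda item: item[1])

# Hypothetical reference library: name -> list of (m/z, intensity) peaks.
library = {
    "compound_A": [(77.04, 30.0), (105.03, 100.0), (122.04, 45.0)],
    "compound_B": [(91.05, 100.0), (119.05, 60.0), (147.04, 20.0)],
}
query = [(77.04, 25.0), (105.03, 100.0), (122.04, 50.0)]

best_name, best_score = lookup(query, library)
print(best_name, round(best_score, 3))
```

The limitation described above is visible in the code: `lookup` can only ever return a name that is already a key in `library`, no matter how good the query spectrum is.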

Our goal with this project was to create an AI algorithm that can predict the structure of unknown molecules from the MS2 without relying on any existing databases, in order to make initial metabolite characterization fast, inexpensive, and effective for drug hunters and to expand our collective knowledge of biological chemistry. This is called de novo structure generation, and a few other groups have recently attempted to create models with similar goals. We believe our approach is a major advance relative to these models because we use the latest transformer-based neural networks, train the models end-to-end, and improve the ways that input spectra and output structures are represented in the model. The preprint detailing this method and reviewing the relevant literature can be found here. Below, I walk through how we built this tool.

Generative AI to go beyond database lookups

One thing machine learning has excelled at in the past few years is generating realistic, never-before-seen data. Think ChatGPT for dialogue. You could, for example, treat chatbot question answering as a database lookup problem. Your chatbot gets a question, and you look in a repository of questions and answers, find the closest question that has been asked before, and return the associated answer. This is answer retrieval. ChatGPT is another realm entirely [1]: it generates brand new answers in perfect English (or whatever language) because the model knows the language, can interpret the meaning of the query, and can output a relevant answer whose exact text has never existed.

That’s what we set out to do with mass spectra in our new model. Instead of retrieving the molecular structure from databases, we taught a model to interpret the language of mass spectra and the language of chemical structures, and asked it to translate directly between them. A machine that could do that would not be limited by the small amount already known about nature’s chemistry, but could potentially tell us about the vast dark chemical space as well.

MS2Mol: translating mass spectra into chemical structures

Let’s dive into the details. MS2Mol is a transformer-based encoder/decoder model, built from the same kind of transformer components that power ChatGPT and adapted from a sequence-to-sequence encoder/decoder framework called BART. MS2Mol takes an MS/MS fragmentation spectrum as input and outputs a chemical structure in the form of a SMILES string. SMILES (simplified molecular-input line-entry system) is a way to write a chemical structure in a linear form, which can easily be converted into a structure diagram.

Why transformers?

Transformers are a good fit for learning the language of mass spectra for a couple of reasons, as we have written before here.  First, transformers work through a method called self-attention, wherein each element of the input learns to be represented as a combination of itself and the other elements or words of the input. This is important for learning languages, because what you say before or after a certain word can change the meaning of the word. To borrow an example, if I say:

“The cat didn’t cross the street because it was tired.”

What does the word “it” refer to? It refers to the cat. But if I say:

“The cat didn’t cross the street because it was raining.”

Now the word “it” refers to the weather. The meaning of the word “it” depends on the context of the other words. Transformers, unlike previous models, can “attend” to a rather long sequence rather than just the words immediately before and after. Think of how ChatGPT is able to refer back to previous things it has said to generate long answers that stay cohesive. This is made possible by self-attention.

A similar problem exists with mass spectra. Each mass represents some collection of atoms in a fragment, but those atoms could be arranged any number of ways and still result in the same mass. Groups of masses, regardless of how similar the mass values are, show up together when they’re associated with certain molecular motifs. In the example below, from ms2lda, a ferulic acid substructure (red) has a distinct set of peaks (also red). Transformers like MS2Mol are able to learn and leverage these relationships to predict how fragments are ordered into rational chemical structures.

In both natural language and mass spectrometry, the individual tokens have meanings that depend on the context of the tokens around them. Transformers excel at learning these relationships. Image credit left: Illustrated Transformer. Image credit right: ms2lda
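
As a rough illustration of self-attention, the sketch below computes scaled dot-product attention over a handful of token vectors in plain Python. It is deliberately simplified: real transformers apply learned query, key, and value projections and use several attention heads, whereas here the projections are the identity and the three-dimensional "embeddings" are invented for the example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Returns the context-mixed vectors and the attention weights.
    """
    dim = len(embeddings[0])
    outputs, weights = [], []
    for query in embeddings:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: weighted average of all token vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(row, embeddings)) for i in range(dim)]
        )
    return outputs, weights

# Three made-up token vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, and the first token puts more weight on the second (similar) token than on the third, which is the sense in which a token's representation depends on its context.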

Transformers can learn the language of mass spectrometry

To test the ability of MS2Mol to learn this new language, we ran an established testing paradigm called “Masked Language Modeling” (MLM). In masked language modeling, we randomly “mask out” a word of a sentence and ask the model to predict the missing word. If the model can predict the missing word, we can surmise that it “understands” how the meaning of each word depends on the others. You can do the same with mass spectra. We trained a masked language model on spectra from a set of molecules, and then asked the model to predict the missing peaks in spectra from new molecules it had never encountered before. We were pleased to see that MS2Mol could predict the correct mass from thousands of possible masses for around 20-40% of the peaks. We further found that we could predict the fragments of higher-quality spectra better than those of spectra with lots of noise and nonsense fragments, which is an important sanity check because it indicates the model is extracting real signal rather than somehow memorizing noise.

Left: a diagram of masked language modeling in the BERT transformer. A model’s ability to predict a masked-out word in a sentence is evidence it is learning the grammatical structure of that language. Right: we can do the same with mass spectra, masking out real fragments (top, red), then asking the model to predict the missing ones (green, bottom). Models can learn to predict the correct fragment an impressive amount of the time, which is evidence that they are learning the “grammar” of mass spectrometry.
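
The data side of masked modeling is simple to sketch. The snippet below hides a random subset of peak tokens and scores a set of predictions against the hidden values. The `[MASK]` symbol, the masking fraction, the peak values, and the function names are assumptions for illustration; the post does not specify how MS2Mol's masking was configured.

```python
import random

MASK = "[MASK]"

def mask_peaks(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens; return the masked list and the answers."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # remember the true peak
        masked[pos] = MASK           # hide it from the model
    return masked, targets

def masked_accuracy(predictions, targets):
    """Fraction of masked positions where the prediction equals the true token."""
    correct = sum(1 for pos, truth in targets.items() if predictions.get(pos) == truth)
    return correct / len(targets)

# Made-up fragment masses, written as string tokens for simplicity.
tokens = ["65.039", "77.039", "91.054", "105.034", "119.049", "122.037", "135.044", "150.068"]
masked, targets = mask_peaks(tokens, mask_fraction=0.25, seed=0)
print(masked)
print(targets)
```

A trained model would fill in `predictions` for the masked positions; `masked_accuracy` then gives the kind of per-peak accuracy figure quoted above.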

The architecture of MS2Mol

The first step of MS2Mol is an embedding step where the model encodes what it has learned about the context of each fragment mass in a dense vector of numbers. MS2Mol starts off by tokenizing masses in a unique way, by representing each fragment as two tokens: an integer and a fractional part, rather than by representing each mass singly. This allows the model to know the precise mass values that the fragments take without having to create a large and unwieldy vocabulary of possible “words.” [2]  
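
A minimal sketch of this two-token scheme is shown below, assuming masses are recorded to 0.001 Da. The token names (`INT_…`, `FRAC_…`) and the example masses are invented here for illustration and are not MS2Mol's actual vocabulary.

```python
def tokenize_mass(mz, decimals=3):
    """Split one fragment mass into an integer token and a fractional token."""
    scale = 10 ** decimals
    total = round(mz * scale)                       # e.g. 105.034 -> 105034
    integer_part, fractional_part = divmod(total, scale)
    return f"INT_{integer_part}", f"FRAC_{fractional_part:0{decimals}d}"

def tokenize_spectrum(masses):
    """Turn a list of fragment masses into a flat token sequence."""
    tokens = []
    for mz in masses:
        tokens.extend(tokenize_mass(mz))
    return tokens

tokens = tokenize_spectrum([77.039, 105.034, 122.037])
print(tokens)

# Vocabulary size for masses from 0 up to 1000 Da at 0.001 Da precision.
split_vocab = 1000 + 10 ** 3     # integer tokens + fractional tokens
joint_vocab = 1000 * 10 ** 3     # one token per distinct mass value
print(split_vocab, joint_vocab)
```

The trade-off is that each peak costs two positions in the input sequence instead of one, in exchange for a vocabulary that is several hundred times smaller.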

The next step is to pass these vectors through the transformer encoder/decoder, generating an output one token at a time. To ensure high quality predictions, MS2Mol generates multiple outputs for each prediction simultaneously using a technique called beam search. Invalid molecules (those that break the rules of chemistry) are thrown out, then a second model, called a reranker, prioritizes which of the outputs is most likely to be correct. In this way, MS2Mol provides not just a predicted structure, but the best predicted structure.

MS2Mol architecture
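
The generate, filter, and rerank pipeline can be sketched with a toy decoder. Everything below is a stand-in: `toy_decoder` replaces the trained transformer, `is_valid` only checks balanced parentheses (a real check would parse the string with a cheminformatics toolkit), and the final ranking simply reuses the generator's score where MS2Mol uses a separate learned reranker.

```python
import math

def beam_search(next_token_logprobs, beam_width, max_len, end_token="<eos>"):
    """Keep the beam_width best partial sequences at every decoding step."""
    beams = [([], 0.0)]              # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append(("".join(seq[:-1]), score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

def toy_decoder(prefix):
    """Hypothetical stand-in for the trained decoder's next-token distribution."""
    if len(prefix) >= 3:
        return {"<eos>": 0.0}
    return {"C": math.log(0.5), "(": math.log(0.3), "O": math.log(0.2)}

def is_valid(smiles):
    """Toy validity check: parentheses must be balanced."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

finished = beam_search(toy_decoder, beam_width=5, max_len=4)
valid = [(s, score) for s, score in finished if is_valid(s)]
# Stand-in for the reranker: pick the valid candidate with the best generator score.
best = max(valid, key=lambda c: c[1])
print(finished)
print(best)
```

In this toy run most of the highest-scoring beams contain an unclosed parenthesis and are discarded, which shows why the validity filter sits between generation and reranking.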

One of the interesting things about MS2Mol is how it generates the output molecules. As mentioned above, a molecular structure can be written as a SMILES string of letters and symbols, which is typically dozens of characters long even for a fairly modest-sized biological molecule. We made the output sequence shorter by training a tokenizer called a byte pair encoder (BPE). BPE works by finding characters in the SMILES strings that frequently occur together and combining them into a single token. If the combined token also frequently co-occurs with an adjacent token, it combines those as well. It keeps on combining tokens iteratively until the vocabulary contains not just the individual SMILES characters but also common substructures that appear in the data. This allows the model to do things like attach entire rings or extended functional groups in a single step rather than having to spell them out character by character. By shortening the output sequences in this way, we’re able to increase the fidelity of the output. We found BPE to be extremely useful; when we take it out of the model, performance drops significantly.

An illustration of byte pair tokenization for deepSMILES, the variant of SMILES strings used in MS2Mol. The learned vocabulary tokens are long runs of characters that correspond to common motifs in the training data. Ultimately, this allows the model to do things like attach phosphates, phenyl rings, or hydrocarbon chains with a single token rather than dozens.
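
Here is a small sketch of how byte pair merges can be learned from a handful of strings. It works on ordinary SMILES rather than deepSMILES, starts from single characters (so a two-letter atom such as Cl would initially be split), and uses four example molecules chosen for this illustration; none of these choices are taken from the MS2Mol implementation.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one combined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(strings, num_merges):
    """Repeatedly merge the most frequent adjacent token pair in the corpus."""
    corpus = [list(s) for s in strings]      # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:                        # nothing left worth merging
            break
        merges.append(best_pair)
        corpus = [merge_pair(tokens, best_pair) for tokens in corpus]
    return merges, corpus

# Benzene, toluene, phenol, and benzoic acid as SMILES strings.
smiles = ["c1ccccc1", "Cc1ccccc1", "Oc1ccccc1", "OC(=O)c1ccccc1"]
merges, tokenized = learn_bpe(smiles, num_merges=4)
print(merges)
for original, tokens in zip(smiles, tokenized):
    print(original, "->", tokens)
```

After a few merges the shared aromatic ring is covered by far fewer tokens than characters, so a decoder would need fewer steps to emit it.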

MS2Mol is better at guessing structures in dark chemical space than database alternatives

We benchmarked MS2Mol performance against the current state of the art in both spectrum lookup and structure lookup. While a few other lookup-independent methods for structural elucidation have been published, there are no publicly available, working implementations of these models to benchmark against. For spectrum lookup we used a method called modified cosine similarity, which is designed to find not only exact matches but also nearby structural analogs. We used a training set containing about a million spectra from around 50,000 structures, which we gathered by merging 15 major databases. For structure lookup we used CSI:FingerID, part of the powerful SIRIUS suite of tools, which consistently performs at or near the top spot in structure elucidation competitions like CASMI.
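
For readers unfamiliar with modified cosine similarity, the sketch below shows the core idea: a peak may match either at the same m/z or at an m/z shifted by the difference in precursor mass, so fragments that carry a structural modification still line up. This is a simplified version with greedy one-to-one peak matching, and the tolerance, peak lists, and precursor masses are made up; it is not the exact implementation used in the benchmark.

```python
import math

def modified_cosine(peaks_a, precursor_a, peaks_b, precursor_b, tol=0.01):
    """Cosine score where peaks may match directly or after a precursor-mass shift."""
    shift = precursor_a - precursor_b
    pairs = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            direct = abs(mz_a - mz_b) <= tol
            shifted = abs(mz_a - shift - mz_b) <= tol
            if direct or shifted:
                pairs.append((int_a * int_b, i, j))
    # Greedy assignment: take the strongest matches first, each peak used once.
    pairs.sort(reverse=True)
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical analog pair: spectrum B looks like A with two fragments shifted by 14.02 Da.
peaks_a = [(77.04, 30.0), (105.03, 100.0), (122.04, 45.0)]
peaks_b = [(77.04, 30.0), (119.05, 100.0), (136.06, 45.0)]

with_shift = modified_cosine(peaks_a, 123.04, peaks_b, 137.06)
without_shift = modified_cosine(peaks_a, 123.04, peaks_b, 123.04)
print(round(with_shift, 3), round(without_shift, 3))
```

With the precursor shift taken into account the two spectra score as near-identical, while a plain comparison (no shift) only matches the one unmodified fragment.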

We compared model performance on three independent datasets. The first is a dataset that we created expressly for this purpose, called EnvedaDark, which is intended to simulate discovery of the structures of novel molecules from dark chemical space. The 226 naturally occurring molecules in this dataset aren’t found in COCONUT or PubChem, the major repositories of known natural products. We built this dataset in our internal mass spectrometry lab, generating spectra at multiple collision energies.

Since we were comparing against database retrieval methods that, by definition, can’t get a structure exactly right if it isn’t in a database, we asked whether the predicted structure is an “exact match”, a “close match”, or a “meaningful match.” The thresholds for these definitions were set from blinded expert annotations of how useful a prediction would be for discovery purposes. You can think of a “close match” as getting the core structure basically correct, and a “meaningful match” as the dividing line between a prediction that tells you something useful about the actual structure and one that is essentially wrong.

Examples of expert-annotated pairs of actual and predicted structures. Many of these annotations together were used to determine thresholds of molecule similarity that correspond to a prediction being “meaningfully similar” or “closely matched” to the actual test compound.

We found that MS2Mol predicts close-match structures for about 20% of spectra, compared to 11% and 7% for CSI:FingerID and modified cosine similarity respectively, and meaningfully similar structures for 62%, compared to 42% and 31% for CSI:FingerID and modified cosine. This represents a relative improvement of roughly 50% or more over these established methods. MS2Mol even gets some structures exactly correct. While an exactly correct structure is the ideal, dark chemical space is so big that even incremental improvements at this initial characterization stage go a long way toward telling us what the chemical space is and allow us to prioritize molecules for drug discovery purposes.

Accuracy predicting chemical structures from a dark chemical space test set.  Exact match (top), close match (middle) and meaningfully similar (bottom) thresholds are shown, as well as for just the top-ranked predictions (right) or top-k predictions (left).
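
To put the headline comparison in relative terms, the short calculation below works out the improvement of MS2Mol over each baseline from the rounded percentages quoted in the text. Because the inputs are rounded, the ratios are approximate.

```python
# Percent of test spectra at each threshold, as quoted in the text (rounded).
results = {
    "close match": {"MS2Mol": 20, "CSI:FingerID": 11, "modified cosine": 7},
    "meaningful match": {"MS2Mol": 62, "CSI:FingerID": 42, "modified cosine": 31},
}

def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`, e.g. 0.5 means 50% better."""
    return (new - baseline) / baseline

improvements = {}
for threshold, rates in results.items():
    for method in ("CSI:FingerID", "modified cosine"):
        gain = relative_improvement(rates["MS2Mol"], rates[method])
        improvements[(threshold, method)] = gain
        print(f"{threshold}, vs {method}: {gain:.0%}")
```

The smallest gain is on meaningful matches against CSI:FingerID, the strongest baseline, and the largest is on close matches against modified cosine.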

For completeness, we also looked at how well MS2Mol identifies already-known molecules, using two datasets. One is a dataset we dub EnvedaLight, a set of known molecules that are present in databases but that we excluded from the training dataset. The other consists of a single selected MS2 spectrum from each molecule of the most recent CASMI competition. We saw that, even though CSI:FingerID can narrow its search space to known molecules, MS2Mol performed roughly comparably. Structure retrieval is still better if you a) already know the molecule is in a queryable database and b) need an exact, rather than an approximate, structure. But the fact that MS2Mol produces close-match molecules at roughly the same rate as CSI:FingerID without using a database is exciting, and suggests that generative AI models may eventually be the only models needed to predict both known and unknown structures as accurately as possible. This has direct implications for our work, because when we look at a sample from nature, we do not know which molecules are known and which are unknown. If a single model can predict both effectively, we will be able to massively streamline our prediction processes.

Where do we go from here?

In 2022 we released a preprint describing MS2Prop, our first machine learning model for predicting nature’s chemistry. In that case, we were predicting the chemical properties of unknown molecules. MS2Mol is the next logical step: generating the structure of those unknown molecules. While property prediction is highly useful, the structure alone can tell you whether a molecule can be modified by medicinal chemists, whether it has structural complexity linked to successful drugs, whether it can be synthetically derived using current chemistry methods, and so on. At Enveda, we’re systematically profiling nature’s chemical space, starting with the plant kingdom, and measuring its bioactivity to discover novel drugs. We are using both models internally to help prioritize which bioactive molecules to turn into medicines.  

At Enveda, we use both MS2Prop (preprint and blog here) and MS2Mol to rank hits from our bioassay screens so that we follow up on the best drug candidates first. MS2Prop predicts chemical properties that may be relevant to drug development, while MS2Mol predicts the structures.

First, as we continue to profile chemical space, we will be able to do more targeted data gathering and active learning: discovering regions of dark chemical space that have not been explored yet, and isolating members from those regions to add to our training set. Furthermore, since every extract contains thousands of unlabeled spectra (i.e. spectra from unknown molecules), as we profile the chemistry of nature we accumulate large numbers of these unlabeled spectra. These spectra are useful because, as we pointed out earlier with masked language modeling, AI models can “learn the language” just from unlabeled data. This is how large foundation models like GPT are trained for natural language processing, and we are working on the same kind of thing, but for chemistry. 

Finally, there is still a lot to do to improve MS2Mol! While our topline results (about 60% meaningful match and 20% close match) are a big step toward being able to identify unknown molecules, there’s obviously still a lot of room to make this technology better, especially under more stringent similarity requirements. It’s well worth the investment though, and not just for drug discovery. Imagine what fields like synthetic biology and cancer diagnostics could do with a technology that could tell you the exact chemical structure of every molecule in any sample. We believe that this ability to rapidly profile the chemical composition of any natural sample has implications across industries, and for expanding our fundamental understanding of biology and chemistry.

We are proud to be on the forefront of this technological innovation.

[1] The same kind of generative AI underpinned machine translation years before ChatGPT: instead of a question you get a query in one language, and the machine must produce a translation even if that exact translation has never existed before. The machine is not looking up its responses in databases, but generating the translation one word at a time because it understands the structure of both the input and output language, and knows how to translate between them.
[2] A note about vocabulary size: if you measure each mass value down to 0.001 Da precision, which is common for modern mass spectrometers, fractional encoding lets you represent all masses between 0.000 and 1000.000 with pairs drawn from about 2000 tokens (1000 integer tokens and 1000 fractional tokens), rather than the million tokens it would take to give each mass between 0.000 and 1000.000 its own token. Small vocabularies are important because they help avoid overfitting to the training set. If your vocabulary is too large relative to the training data, a token may show up only once in the entire training set, and the model could learn to associate that one mass with the exact training structure it came from, which can then cause the model to predict the wrong structure if it sees that mass again in a different context.
