This is a continuation of a series of blog posts on teaching computers the language of chemistry. Previous posts are here and here. A preprint of the paper we describe below can be found here.
In many ways, life is chemistry. Metabolism, DNA replication, cell division: all of life's essential processes are – at their core – a series of chemical reactions. These reactions are incredibly complex, and each species across evolution performs them in ways both shared and distinct. The participants in these reactions include the well-known cellular macromolecules (proteins, RNA, and the genome), but also a huge number of highly diverse small molecules, or metabolites. In a sense, the build-up and break-down of these metabolites is the very definition of being alive, and it is a process we are only just beginning to understand.
There are two main types of metabolites in nature: primary metabolites and secondary metabolites. Primary metabolites are the pillars of the core functions of cellular life – biosynthesis, energy, reproduction – but in some sense they’re also the boring molecules. All organisms build molecules up and break them down to consume and store energy and to create the building blocks of their physical structures, and these pathways are fairly consistent across all of biology. Classic examples of primary metabolites include Krebs cycle intermediates, amino acids, nucleosides, and ATP.
Secondary metabolites, on the other hand, are more interesting: they’re how the majority of organisms interact with the world, and they’ve evolved in myriad specialized ways. Remember, the vast majority of organisms on the planet – plants, fungi, bacteria, sea sponges – don’t really move or talk, so their primary means of communication is to evolve specialized molecules that protect them, compete with neighbors, signal danger or resources to their kin, or otherwise shape their environment to suit their needs.
Secondary metabolites are enormously diverse, and there are a lot of them. Exactly how many exist is a subject of debate, but estimates range into the billions. These molecules play an important role in giving organisms their unique properties (what actually makes a garlic a garlic?), and it’s of incredible value to understand what they are and what they do. This “chemical annotation” is key to understanding disease and to creating new drugs, fuels, flavors, pesticides, and biomaterials. So how do we get a handle on them all? That is the central question of metabolomics, the study of all the small molecules in biological samples.
Over many decades, chemists have isolated and characterized the most abundant and accessible molecules in nature. But isolating a molecule and determining its structure is time-consuming and expensive, and we have made only a tiny dent in the total. The most comprehensive open repository of metabolite structures in the world, COCONUT, currently holds around 400,000 structures. That’s a lot, but still just a fraction of a fraction of a percent of the billions of unique metabolites thought to exist. The rest is what we call dark chemical space, and it covers almost everything in nature.
The workhorse technology of metabolomics is tandem mass spectrometry (MS/MS). This technique helps you identify a compound by measuring its mass and the masses of its constituent fragments. The first step is to separate the individual molecules in a complex sample, usually with liquid or gas chromatography. The individual molecules are then ionized and their total mass is measured; this is the MS1 spectrum. Each molecule is then broken apart, and the masses of its fragments are measured. This is the MS2, or fragmentation, spectrum.
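To make that concrete, here is a minimal sketch of how one might represent an MS2 record in code: the precursor (MS1) mass plus a list of fragment peaks. The class name and the peak values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class MS2Spectrum:
    """One tandem mass spectrum: the intact-ion mass plus its fragment peaks."""
    precursor_mz: float  # mass-to-charge of the intact ion (the MS1 measurement)
    peaks: list[tuple[float, float]] = field(default_factory=list)  # (fragment m/z, relative intensity)

# Illustrative values: a hypothetical molecule that broke into three fragments.
spectrum = MS2Spectrum(
    precursor_mz=195.0877,
    peaks=[(42.0344, 0.12), (110.0713, 0.45), (138.0662, 1.00)],
)
print(f"{len(spectrum.peaks)} fragments observed for a {spectrum.precursor_mz} m/z precursor")
```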
Mass spectrometry data is great in that it is readily available and fairly easy to obtain. But it is also a very challenging data type: reconstructing a structure from its fragment masses requires sophisticated expertise. As a result, a second, more descriptive technology, NMR, is usually needed to determine the structure and identity of an unknown molecule. And while NMR is the definitive method for identifying a chemical structure, it is expensive, time-consuming, and requires a substantial amount of highly purified sample, which is often difficult to obtain for unknown secondary metabolites.
Existing methods for using MS data to identify a compound and predict its structure fall into two categories, both based on database lookup. The first, called spectrum reference lookup, compares the experimental MS2 to a database of other, known MS2 spectra and looks for matches. The second, called structure retrieval, compares the experimental MS2 to predicted or hypothesized MS2 spectra for all the known compounds in the major chemical repositories (like COCONUT). While these methods have their benefits, the major problem with both is that they can only help you rediscover known chemistry within your sample, i.e., what is already represented in the databases. This is the bottleneck that has prevented the use of metabolomics data for the exploration of nature’s chemistry.
Our goal with this project was to create an AI algorithm that can predict the structure of unknown molecules from the MS2, without relying on any existing databases, in order to make initial metabolite characterization fast, inexpensive, and effective for drug hunters, and to expand our collective knowledge of biological chemistry. This is called de novo structure generation, and a few other groups have recently attempted to create models with similar goals. We believe our approach is a major advance relative to these models: we use the latest transformer-based neural networks, train the model end-to-end, and improve the ways that input spectra and output structures are represented in the model. The preprint detailing this method and reviewing the relevant literature can be found here. Below, I walk through how we built this tool.
One thing machine learning has excelled at in the past few years is generating realistic, never-before-seen data. Think ChatGPT for dialogue. You could, for example, treat chatbot question-answering as a database lookup problem: your chatbot gets a question, you look in a repository of past questions and answers, find the closest question that has been asked before, and return the associated answer. This is answer retrieval. ChatGPT is another realm entirely [1]: it generates brand-new answers in fluent English (or whatever language) because the model knows the language, can interpret the meaning of the query, and can output a relevant answer whose exact text has never existed before.
That’s what we set out to do with mass spectra in our new model. Instead of retrieving molecular structures from databases, we taught a model to interpret the language of mass spectra and the language of chemical structures, and to translate directly between them. A machine that can do that is not limited by the small amount already known about nature’s chemistry; it can potentially tell us about the vast dark chemical space as well.
Let’s dive into the details. MS2Mol is a transformer-based encoder/decoder model, akin to the architecture behind ChatGPT and adapted from a machine-translation encoder/decoder framework called BART. MS2Mol takes an MS/MS fragmentation spectrum as input and outputs a chemical structure in the form of a SMILES string. SMILES (simplified molecular-input line-entry system) is a way to write a chemical structure as a single line of text, which can easily be converted into a structure diagram.
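For readers who haven’t met SMILES before, here is a small example using the open-source RDKit toolkit. This just demonstrates the notation; it is not part of MS2Mol:

```python
from rdkit import Chem

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"   # caffeine, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)          # parse the linear text into a molecular graph
print(mol.GetNumAtoms())                  # 14 heavy atoms
print(Chem.MolToSmiles(mol))              # RDKit's canonical SMILES for the same structure
```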
Transformers are a good fit for learning the language of mass spectra, as we have written before here. Transformers work through a method called self-attention, wherein each element of the input learns to be represented as a combination of itself and the other elements of the input. This is important for learning languages, because what comes before or after a certain word can change that word’s meaning. To borrow an example, if I say:
“The cat didn’t cross the street because it was tired.”
What does the word “it” refer to? It refers to the cat. But if I say:
“The cat didn’t cross the street because it was raining,” now the word “it” refers to the weather. The meaning of “it” depends on the context of the other words. Transformers, unlike previous models, can “attend” to a rather long sequence, not just the words immediately before and after. Think of how ChatGPT is able to refer back to things it’s said earlier to generate long answers that stay cohesive. This is made possible by self-attention.
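For the curious, the core computation is compact. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention; the shapes and random weights are illustrative, not MS2Mol’s actual configuration:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # each output row is a context-aware mix

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))                # e.g. 10 tokens: "The cat didn't cross ... tired"
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (10, 16): same length, now contextualized
```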
A similar problem exists with mass spectra. Each mass represents some collection of atoms in a fragment, but those atoms could be arranged any number of ways and still result in the same mass. And groups of masses, regardless of how similar the mass values are, show up together when they’re associated with certain molecular motifs. In the example below from ms2lda.org, a ferulic acid substructure (red) has a distinct set of peaks (also red). Transformers like MS2Mol are able to learn and leverage these relationships to predict how fragments assemble into plausible chemical structures.
To test the ability of MS2Mol to learn this new language, we ran an established testing paradigm called “masked language modeling” (MLM). In masked language modeling, we randomly “mask out” elements of a sentence and ask the model to predict the missing words. If the model can predict a missing word, we can surmise that it “understands” how each word depends on the words around it. You can do the same with mass spectra. We trained a masked language model on spectra from one set of molecules, then asked it to predict the missing peaks in spectra from molecules it had never encountered before. We were pleased to see that MS2Mol could pick the correct mass out of thousands of possible masses for around 20-40% of the peaks. We also found that the model predicted the fragments of higher-quality spectra better than those of spectra full of noise and nonsense fragments, which is an important sanity check: it indicates the model is extracting real signal rather than somehow memorizing noise.
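Schematically, the masked-language-modeling setup looks like the sketch below. The `mask_tokens` helper and the peak tokens are illustrative stand-ins, not Enveda’s training code:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets), where targets maps position -> hidden token."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok      # remember what was hidden
            masked[i] = MASK      # hide it from the model
    return masked, targets

peaks = ["163", "0390", "145", "0284", "117", "0334"]  # illustrative peak tokens
masked, targets = mask_tokens(peaks, seed=42)
# Training minimizes the model's loss on `targets` given `masked`; at test
# time, accuracy on spectra from held-out molecules tells us whether the
# model has learned real structure in the data.
```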
The first step of MS2Mol is an embedding step, in which the model encodes what it has learned about the context of each fragment mass in a dense vector of numbers. MS2Mol tokenizes masses in a unique way, representing each fragment as two tokens, an integer part and a fractional part, rather than as a single token. This allows the model to know the precise mass values of the fragments without having to create a large and unwieldy vocabulary of possible “words.” [2]
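As a sketch, that two-token scheme might look like the following; the number of decimal places kept is an assumption on my part, not a detail from the paper:

```python
def tokenize_mass(mz: float, decimals: int = 4) -> tuple[str, str]:
    """Split a fragment m/z into an integer token and a fractional token."""
    integer = int(mz)
    fraction = round(mz - integer, decimals)
    return str(integer), f"{fraction:.{decimals}f}"[2:]   # drop the leading "0."

print(tokenize_mass(163.0390))   # ('163', '0390')
```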
The next step is to pass these vectors through the transformer encoder/decoder, generating an output one token at a time. To ensure high-quality predictions, MS2Mol generates multiple candidate outputs for each prediction simultaneously using a technique called beam search. Invalid molecules (those that break the rules of chemistry) are thrown out, and then a second model, called a reranker, prioritizes which of the remaining outputs is most likely to be correct. In this way, MS2Mol provides not just a predicted structure, but the best predicted structure.
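Putting those pieces together, the generate-filter-rerank loop looks roughly like this. `beam_search_decode` and the reranker’s `score` method are hypothetical stand-ins for MS2Mol’s internals; the RDKit validity check is a standard way to drop rule-breaking SMILES:

```python
from rdkit import Chem

def best_structure(spectrum, model, reranker, beam_width=10):
    """Decode candidates, discard invalid SMILES, and return the top-ranked one."""
    candidates = model.beam_search_decode(spectrum, beam_width)            # hypothetical API
    valid = [s for s in candidates if Chem.MolFromSmiles(s) is not None]   # None means chemically invalid
    return max(valid, key=reranker.score) if valid else None               # reranker picks the winner
```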
One of the interesting things about MS2Mol is how it generates the output molecules. As mentioned above, structures are written as SMILES strings, which are typically dozens of characters long even for a fairly modest-sized biological molecule. We shortened the output sequences by training a featurizing model called a byte-pair encoder (BPE). BPE works by finding characters in the SMILES strings that frequently occur together and combining them into a single token. If the combined token also co-occurs with adjacent tokens, it combines those as well, and it keeps merging iteratively until the vocabulary contains not just the individual SMILES characters but also the common substructures found in the data. This allows the model to attach an entire ring or extended functional group in a single step rather than spelling it out character by character. By shortening the output sequences in this way, we’re able to increase the fidelity of the output. We found BPE to be extremely useful; when we remove it, model performance drops significantly.
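The merge loop at the heart of BPE is short enough to sketch in full. This toy version operates on a three-molecule corpus; production implementations (and MS2Mol’s actual vocabulary) are more elaborate:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn merge tokens by repeatedly fusing the most frequent adjacent pair."""
    tokenized = [list(s) for s in corpus]            # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(
            (toks[i], toks[i + 1]) for toks in tokenized for i in range(len(toks) - 1)
        )
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent adjacent pair
        merges.append(a + b)
        for toks in tokenized:                        # apply the merge everywhere
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

smiles_corpus = ["CC(=O)O", "CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CC=CC=C1"]
print(bpe_train(smiles_corpus, 5))   # merges such as '=C' or 'CC' emerge as tokens
```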
We benchmarked MS2Mol against the current state of the art in both spectrum lookup and structure lookup. While a few other lookup-independent methods for structure elucidation have been published, no publicly available, working implementations of these models exist to benchmark against. For spectrum lookup, we used a method called modified cosine similarity, which is designed to find not only exact matches but also nearby analogs. We used a training set containing about a million spectra from around 50,000 structures, which we gathered by merging 15 major databases. For structure lookup, we used CSI:FingerID, part of the powerful SIRIUS suite of tools, which consistently performs at or near the top in structure elucidation competitions like CASMI.
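For reference, modified cosine is implemented in the open-source matchms library, so a pairwise comparison can be sketched as below. The spectra here are illustrative; our actual benchmark setup is detailed in the preprint:

```python
import numpy as np
from matchms import Spectrum
from matchms.similarity import ModifiedCosine

query = Spectrum(mz=np.array([110.07, 138.07, 163.04]),
                 intensities=np.array([0.4, 1.0, 0.2]),
                 metadata={"precursor_mz": 195.09})
reference = Spectrum(mz=np.array([110.07, 138.07, 181.05]),
                     intensities=np.array([0.5, 1.0, 0.1]),
                     metadata={"precursor_mz": 213.10})

# Modified cosine also matches peaks shifted by the precursor mass difference,
# which is what lets it find analogs rather than only exact spectral matches.
score = ModifiedCosine(tolerance=0.1).pair(reference, query)
print(score["score"], score["matches"])   # similarity plus number of matched peaks
```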
We compared model performance on three independent datasets. The first is a dataset we created expressly for this purpose, called EnvedaDark, which is intended to simulate the discovery of novel structures from dark chemical space. The 226 naturally occurring molecules in this dataset aren’t found in COCONUT or PubChem, the major repositories of known natural products. We built this dataset in our internal mass spectrometry lab, generating spectra at multiple collision energies.
Since we were comparing against database retrieval methods that, by definition, can’t guess a structure exactly if it isn’t in a database, we asked whether each predicted structure is an “exact match,” a “close match,” or a “meaningful match.” The thresholds for these definitions were set by blinded expert annotations of how useful a prediction would be for discovery purposes. You can think of a “close match” as getting the core structure basically correct, and a “meaningful match” as the dividing line between a prediction that tells you something useful about the actual structure and one that is essentially wrong.
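The exact thresholds are defined in the preprint via expert annotation; purely as an illustration of the kind of structure-to-structure comparison involved, here is a Tanimoto similarity over RDKit Morgan fingerprints. The metric and parameters here are my illustrative assumptions, not the paper’s definitions:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between Morgan fingerprints of two structures."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(*fps)

# Salicylic acid vs. aspirin: same core scaffold, different decoration.
print(similarity("O=C(O)c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"))
```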
We found that MS2Mol predicts close-match structures for about 20% of spectra, compared to 11% and 7% for CSI:FingerID and modified cosine similarity respectively, and meaningfully similar structures for 62%, compared to 42% and 31%. That is at least a 50% improvement over these established methods, and MS2Mol even gets some structures exactly correct. While an exactly correct structure is the ideal, dark chemical space is so big that even incremental improvements at this initial characterization stage go a long way toward telling us what that chemical space contains, and they allow us to prioritize molecules for drug discovery purposes.
For completeness, we also looked at how well MS2Mol identifies already-known molecules, using two datasets. The first, which we dub EnvedaLight, is a set of known molecules that appear in databases but that we excluded from our training data. The second is a single selected MS2 spectrum for each molecule in the most recent CASMI competition. We saw that, even though CSI:FingerID can reduce the search space to known molecules, MS2Mol performed roughly comparably. Structure retrieval is still better if you a) already know the molecule is in a queryable database and b) need an exact, rather than an approximate, structure. But the fact that MS2Mol produces close-match molecules at roughly the same rate as CSI:FingerID without using a database is exciting, and suggests that generative AI models may eventually handle both known and unknown molecules as accurately as any other approach. This has direct implications for our work: when we look at a sample from nature, we do not know which molecules are known and which are unknown. If a single model can predict both effectively, we can massively streamline our prediction processes.
In 2022 we released a preprint describing MS2Prop, our first machine learning model for predicting nature’s chemistry. In that case, we were predicting the chemical properties of unknown molecules; MS2Mol is the next logical step: generating the structures of those unknown molecules. While property prediction is highly useful, the structure alone can tell you whether a molecule can be modified by medicinal chemists, whether it has the structural complexity linked to successful drugs, whether it can be synthesized using current chemistry methods, and so on. At Enveda, we’re systematically profiling nature’s chemical space, starting with the plant kingdom, and measuring its bioactivity to discover novel drugs. We are using both models internally to help prioritize which bioactive molecules to turn into medicines.
So what’s next? First, as we continue to profile chemical space, we will be able to do more targeted data gathering and active learning: discovering regions of dark chemical space that have not been explored yet, and isolating members of those regions to add to our training set. Furthermore, since every extract contains thousands of unlabeled spectra (i.e., spectra from unknown molecules), profiling the chemistry of nature means we accumulate large numbers of these unlabeled spectra. They are useful because, as we pointed out earlier with masked language modeling, AI models can “learn the language” from unlabeled data alone. This is how large foundation models like GPT are trained for natural language processing, and we are working on the same kind of thing for chemistry.
Finally, there is still a lot to do to improve MS2Mol! While our topline results (62% meaningful match and 20% close match) are a big step toward identifying unknown molecules, there is obviously still a lot of room to make this technology better, especially under more stringent similarity requirements. It’s well worth the investment, though, and not just for drug discovery. Imagine what fields like synthetic biology and cancer diagnostics could do with a technology that could tell you the exact chemical structure of every molecule in any sample. We believe that the ability to rapidly profile the chemical composition of any natural sample has implications across industries, and for expanding our fundamental understanding of biology and chemistry.
We are proud to be at the forefront of this technological innovation.