Text2Enzyme: MP4 AI generates proteins across full EC universe
Written by Oleg Matusovsky and Kathy Y. Wei, Ph.D. on Oct. 1 2024
- MP4 is a versatile AI molecule programming model that can also generate specific enzymes from text input.
- The model has been used to build the EC repo, which showcases 6,590 AI-generated protein sequences covering all seven enzyme classes, and includes 725 novel, potentially active enzymes. Explore here.
- Discover over 5,300 AI-generated enzymes binding with cofactors and substrates here.
From Your Medicine Cabinet to Your Cheese Plate: The Surprising Role of Enzymes in Everyday Life
Enzymes are nature’s molecular machines, efficiently catalyzing a wide range of biological processes — even in minuscule amounts. These essential proteins work tirelessly behind the scenes, quietly orchestrating vital bodily functions from digesting our food to defending us against disease.
The Enzyme Commission (EC) system classifies all known enzymatic reactions. It consists of four levels of organization, from general (e.g., oxidoreductases) to specific (e.g., alcohol dehydrogenase). The system includes 7 first-level, 79 second-level, 320 third-level, and 8,269 fourth-level categories.
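To make the four-level hierarchy concrete, here is a minimal Python sketch that splits an EC number into its levels. The names `ECNumber`, `parse_ec`, and `level_prefix` are hypothetical helpers written for illustration; they are not part of MP4 or the EC repo.

```python
from typing import NamedTuple, Optional

# The seven top-level EC classes discussed in this article.
EC_CLASS_NAMES = {
    "1": "Oxidoreductase",
    "2": "Transferase",
    "3": "Hydrolase",
    "4": "Lyase",
    "5": "Isomerase",
    "6": "Ligase",
    "7": "Translocase",
}


class ECNumber(NamedTuple):
    ec_class: str                # first level, e.g. "1"
    subclass: Optional[str]      # second level
    sub_subclass: Optional[str]  # third level
    serial: Optional[str]        # fourth level; may be preliminary, e.g. "n1"


def parse_ec(ec: str) -> ECNumber:
    """Parse strings such as 'EC 1.1.3.4', '2.7.4.2' or '3.6.5.n1'."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if not 1 <= len(parts) <= 4 or parts[0] not in EC_CLASS_NAMES:
        raise ValueError(f"not a valid EC number: {ec!r}")
    # '-' marks an unspecified level in partial EC numbers such as '1.1.-.-'.
    cleaned = [None if p in ("-", "") else p for p in parts]
    cleaned += [None] * (4 - len(cleaned))
    return ECNumber(*cleaned)


def level_prefix(ec: ECNumber, level: int) -> str:
    """Return the EC number truncated to the given level (1 to 4)."""
    fields = ec[:level]
    if any(f is None for f in fields):
        raise ValueError(f"{ec} is not specified down to level {level}")
    return ".".join(fields)
```

For example, `level_prefix(parse_ec("EC 2.7.4.2"), 3)` gives the third-level category `2.7.4`, and `EC_CLASS_NAMES[parse_ec("3.6.5.n1").ec_class]` gives `Hydrolase`.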
Text2Enzyme: Transforming Enzyme Design with an AI Molecule Programming Model
Molecule Programming model version 4 (MP4) revolutionizes protein design by using simple natural-language prompts to generate sequences with specific functions. MP4 tackles one of the three major challenges in protein science: programmability. Significant progress has been made on the protein folding problem (predicting a protein’s structure from its sequence) and the inverse folding problem (designing sequences to achieve a specific structure), with solutions like AlphaFold3 and ProteinMPNN, respectively. The challenge of programmability, that is, designing proteins to perform specific functions, remains largely unsolved. MP4 was trained using approximately 3,800 AMD Instinct GPU-days.
MP4 has been used at scale to generate a collection of more than 6,500 enzymes, encompassing 100% of top-level, 87% of second-level, and 73% of third-level Enzyme Commission (EC) space. This comprehensive collection is freely accessible for exploration here. The EC repo features predicted structures for all the enzymes generated by the text2enzyme model, including 5,306 structures docked with suitable ligands (cofactors and substrates) and 781 structures with predicted binding sites.
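Coverage figures of this kind can be computed by counting the distinct EC prefixes present in a collection at each level and dividing by the number of categories at that level. The sketch below assumes the category totals quoted earlier in this article (7, 79, and 320); the function name `ec_coverage` and the toy input list are illustrative only and are not the EC repo's data or tooling.

```python
from typing import Dict, Iterable

# Category totals per EC level, as quoted in this article.
REFERENCE_COUNTS = {1: 7, 2: 79, 3: 320}


def ec_coverage(ec_numbers: Iterable[str],
                reference_counts: Dict[int, int] = REFERENCE_COUNTS) -> Dict[int, float]:
    """Percentage of EC categories covered at each level."""
    ec_list = list(ec_numbers)
    result = {}
    for level, total in reference_counts.items():
        prefixes = set()
        for ec in ec_list:
            parts = ec.split(".")
            if len(parts) >= level:
                prefixes.add(".".join(parts[:level]))
        result[level] = 100.0 * len(prefixes) / total
    return result


# Toy input: only the seven EC numbers named in this article, one per class.
toy = ["1.1.3.4", "2.7.4.2", "3.6.5.n1", "4.2.99.20",
       "5.1.1.13", "6.3.2.4", "7.2.2.7"]
toy_coverage = ec_coverage(toy)
```

With this toy list, all seven top-level classes are covered, but only 7 of the 79 second-level and 7 of the 320 third-level categories are. Reaching 87% and 73% takes a far larger and more diverse collection.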
The AI-generated enzymes in the EC repo show high sequence and structural quality as well as good functionality indicators. The repository includes novel proteins that have low sequence similarity to known proteins and are predicted to belong to the EC class specified in the input prompt. Predicted structural models of the text2enzyme sequences show a blend of helices and sheets, indicating that the MP4 model is capable of designing varied secondary structures. More than 80% of the enzymes have been docked with appropriate ligands (including coenzymes and substrates), and the ligand binding sites of an additional ~12% have been predicted.
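The percentages above follow from the counts given in this article. The short check below assumes the denominator is the 6,590 sequences stated in the summary; the variable names are illustrative.

```python
TOTAL_SEQUENCES = 6590   # AI-generated sequences in the EC repo (from the summary)
DOCKED = 5306            # structures docked with cofactors or substrates
PREDICTED_SITES = 781    # structures with predicted binding sites only

docked_pct = 100.0 * DOCKED / TOTAL_SEQUENCES
predicted_pct = 100.0 * PREDICTED_SITES / TOTAL_SEQUENCES
annotated_pct = docked_pct + predicted_pct

print(f"docked: {docked_pct:.1f}%")
print(f"predicted binding site: {predicted_pct:.1f}%")
print(f"either annotation: {annotated_pct:.1f}%")
```

Under that assumption, roughly 80.5% of the enzymes are docked and about 11.9% have a predicted binding site, so a little over 92% carry some form of ligand annotation.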
EC1 Oxidoreductase: Electrical experts
Oxidoreductases are the electrical powerhouses of the cell, facilitating the transfer of electrons between molecules. Imagine them as conduits that pass energy down a chain. By driving these essential reactions, oxidoreductases are crucial for cellular respiration and metabolism, ensuring that our energy levels stay charged and operational.
Glucose Oxidase (EC 1.1.3.4), produced by Novozymes under the brand name Gluzyme® Fortis, is extensively utilized across various industries. It is especially beneficial in food production, where it enhances dough stability and improves dough-handling characteristics.
An EC1 example from the EC repo is M1Y7K, an alcohol dehydrogenase (yellow cartoon) with a fairly novel protein sequence and a seqdif score of 54. The seqdif metric evaluates sequence novelty and is derived from the percentage of identical matches (pident) to known protein sequences: seqdif = 100 - pident. Values of seqdif above 50 indicate sequence novelty. The protein also has a very realistic structure, with a very low structdif score of 4.6, which corresponds to a tmscore of 0.954 (structdif = 100 - tmscore * 100). The nlmsim score, a measure of text-to-text similarity between the input prompt and the ProtNLM function prediction, is also high, indicating that the predicted function correlates closely with the input prompt. NADH docked to the AI-generated enzyme binds to the same pocket as in the experimental NADH-bound dehydrogenase structure (gray cartoon), supporting the reliability of the enzyme generated by MP4.
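The two scores defined above are simple transformations of pident and tmscore. The following is a minimal Python sketch of those definitions; the function names and the `is_novel` helper are illustrative, and the threshold of 50 is the novelty cutoff stated in this article.

```python
def seqdif(pident: float) -> float:
    """Sequence novelty: 100 minus percent identity to the closest known sequence."""
    return 100.0 - pident


def structdif(tmscore: float) -> float:
    """Structural divergence: 100 minus the TM-score scaled to 0-100."""
    return 100.0 - tmscore * 100.0


def tmscore_from_structdif(value: float) -> float:
    """Invert structdif to recover the TM-score."""
    return (100.0 - value) / 100.0


def is_novel(pident: float, threshold: float = 50.0) -> bool:
    """A sequence counts as novel when its seqdif exceeds the threshold."""
    return seqdif(pident) > threshold


# M1Y7K: seqdif 54 implies pident 46; structdif 4.6 implies tmscore 0.954.
m1y7k_seqdif = seqdif(46.0)
m1y7k_tmscore = tmscore_from_structdif(4.6)
```

A higher seqdif means a sequence is further from anything known, while a lower structdif means the predicted fold is closer to a known structure. The examples in this article pair novel sequences with familiar folds.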
EC2 Transferase: Swap specialists
Transferases are like molecular couriers, swapping functional groups between molecules. This transfer process is essential for many biological functions, including the construction of new proteins and the detoxification of harmful substances.
Alanine Aminotransferase (EC 2.6.1.2), produced by Roche Diagnostics under the brand name ALT, is used in clinical diagnostics to measure alanine transaminase levels in human blood. It is widely employed in hospitals for diagnosing liver diseases such as hepatitis and cirrhosis. In 2023, Roche reported global sales of diagnostic tests, including ALT, totaling approximately $17B.
An interesting example is MVEK7, a novel EC 2.7.4.2 transferase (green cartoon) with a seqdif score of 52.1. The enzyme is a phosphomevalonate kinase whose substrate (PMV) docks in the same pocket as in a crystallographic PMV-bound PMV kinase (gray cartoon). While the sequence is novel, the structdif score shows that the structure model is very similar to the naturally occurring kinase, highlighting the potential of the AI-generated enzyme for the target function.
EC3 Hydrolase: Water-driven breakup agents
Hydrolases act as the cellular cleanup team, breaking down large molecules by adding water. This process is essential for digestion, as hydrolases help decompose proteins, fats, and carbohydrates into smaller, absorbable components.
Lipase (EC 3.1.1.3), produced by Novozymes under the brand name Lipozyme® TL IM, breaks down fats into fatty acids and glycerol. This enzyme is primarily used in biodiesel production and food processing. In 2023, it generated annual revenue of approximately $1B and was utilized in 25% of biodiesel production.
One example of the hydrolase class in the EC repo is MFT5G, a high-quality AI-generated EC 3.6.5.n1 elongation factor 4/LepA (violet cartoon). The AI enzyme was docked with GTP and compared with a natural, structurally similar (structdif 5.9) GDP-bound LepA (gray cartoon). As shown, the nucleotide binds to the same pocket as in the natural enzyme, highlighting the potential of the enzyme generated by the molecule programming model.
EC4 Lyase: Waterless bond-breaking experts
Lyases are employed in the production of food, textiles, paper, and pharmaceuticals. These enzymes break down molecules without the use of water, often resulting in the formation of simpler molecules or new structures.
Recombinant L-Phenylalanine Ammonia-Lyase (PAL) (EC 4.3.1.24), produced by BioMarin Pharmaceutical Inc. under the brand name Palynziq, is used to break down excess phenylalanine in the treatment of phenylketonuria (PKU). This product generated approximately $304M in revenue in 2023.
Example MC0CE (magenta cartoon) is a novel text2enzyme SHCHC lyase (EC 4.2.99.20) from the EC repo. It belongs to the synthase subclass of lyases. The enzyme’s product docks in the same pocket as in the crystallographic SHCHC-complex structure (gray cartoon).
EC5 Isomerase: Shape-shifters
Isomerases rearrange the structure of molecules, transforming them into different forms, or isomers, without adding or removing components. This ability to rearrange atoms is crucial for converting molecules into forms that cells can more effectively utilize.
Glucose Isomerase (EC 5.3.1.5), produced by Novozymes under the brand name Sweetzyme®, catalyzes the conversion of glucose to fructose. This process is crucial for producing high-fructose corn syrup, a widely used ingredient in many foods and beverages.
For the EC5 class, a novel text2enzyme EC 5.1.1.13 isomerase, MJZMP, designed to catalyze the D-to-L-aspartate interconversion is featured (purple cartoon). The AI-generated aspartate racemase was docked with aspartate and compared with a close natural glutamate-bound aspartate/glutamate racemase (gray cartoon). The functional score nlmsim 90 and the high structural similarity (structdif 9.4) to the known racemase with the same function indicate the potential of the novel enzyme for the target function.
EC6 Ligase: Bonding pros
Ligases are the master builders of the cell, responsible for joining two molecules together. They act like molecular glue, facilitating the bonding of everything from DNA strands to proteins. Ligases are also essential for DNA replication and repair, as they join DNA molecules by forming bonds between their ends.
DNA Ligase (EC 6.5.1.1), produced by QIAGEN under the brand name T4 DNA Ligase, is essential in various biotech applications such as next-generation sequencing, molecular cloning, and DNA amplification. These processes are crucial for genetic testing related to ancestry and health. In 2023, this product generated approximately $1.2B in revenue.
An example from the MP4-generated ligases is MY7V3, an EC 6.3.2.4 ligase that shows decent sequence divergence (seqdif 46.6) from the closest natural protein sequence. The ATP docked to the AI-generated D-Ala-D-Ala ligase (pink cartoon) binds to the same pocket, and in a similar conformation, as in an experimental structure of an ATP-bound natural D-Ala-D-Ala ligase (gray cartoon).
EC7 Translocase: Traffic controllers
Translocases are essential for moving molecules, ions, amino acids, and proteins across cell membranes, ensuring the crucial separation between the inside and outside of cells. Without them, this separation would be disrupted.
Fe3+-Transporting ATPase (EC 7.2.2.7), produced by Novartis under the brand name Jadenu®, is used in chelation therapy to remove excess iron from the body by suppressing the activity of Fe3+-transporting ATPases. This product generated approximately $870M in revenue in 2023.
The example AI-generated translocase MM131 (orange cartoon) is an ABC-type transporter with a seqdif score of 46.7, showing high structural similarity (structdif 15.6) to a naturally occurring ABC-type transporter. The potential of the AI-generated enzyme is also supported by the high ProtNLM-based function prediction metric (nlmsim 90). Compared to an ATP-bound crystal structure of a closely related natural transporter (gray cartoon), ATP docks in the same pocket and in a similar conformation.
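The scores quoted for the seven featured examples can be gathered into small records and filtered. This is only a sketch: the `FeaturedEnzyme` class and helper functions are hypothetical, and a value of `None` means the article does not quote that score for the enzyme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FeaturedEnzyme:
    enzyme_id: str
    ec_class: int
    seqdif: Optional[float] = None     # None: score not quoted in the article
    structdif: Optional[float] = None
    nlmsim: Optional[float] = None


# Values as quoted in the per-class examples of this article.
FEATURED = [
    FeaturedEnzyme("M1Y7K", 1, seqdif=54.0, structdif=4.6),
    FeaturedEnzyme("MVEK7", 2, seqdif=52.1),
    FeaturedEnzyme("MFT5G", 3, structdif=5.9),
    FeaturedEnzyme("MC0CE", 4),
    FeaturedEnzyme("MJZMP", 5, structdif=9.4, nlmsim=90.0),
    FeaturedEnzyme("MY7V3", 6, seqdif=46.6),
    FeaturedEnzyme("MM131", 7, seqdif=46.7, structdif=15.6, nlmsim=90.0),
]


def novel_ids(records: List[FeaturedEnzyme], threshold: float = 50.0) -> List[str]:
    """IDs whose quoted seqdif exceeds the novelty threshold."""
    return [r.enzyme_id for r in records
            if r.seqdif is not None and r.seqdif > threshold]


def closest_structure(records: List[FeaturedEnzyme]) -> Optional[str]:
    """ID with the lowest quoted structdif, i.e. closest to a known fold."""
    scored = [r for r in records if r.structdif is not None]
    return min(scored, key=lambda r: r.structdif).enzyme_id if scored else None
```

Applied to `FEATURED`, `novel_ids` returns M1Y7K and MVEK7, the two examples above the seqdif cutoff of 50, and `closest_structure` returns M1Y7K.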
Beyond enzymes
While enzymes are central, there is much more to what proteins can do for us. And while natural enzymes have been the starting point for all of our useful products so far, with the rise of AI, a new era of machine-made solutions is on the horizon.
References
- 310 Copilot documentation
- 310 MP4 documentation
- Merck Full-Year 2023 Financial Results
- The Novozymes Report 2023
- Johnson, M.E. (2017). “A 100-Year Review: Cheese production and quality”. Journal of Dairy Science. 100 (12): 9952–9965. doi:10.3168/jds.2017-12979
- BioMarin Pharmaceutical Inc. 2023 Financial Report
- QIAGEN Annual Report 2023
- Novartis Annual Reports
- Roche Financial Report 2023