MP4: AI Text2Protein Breakthrough Tackles the Molecule Programming Challenge

Published in 310 AI

Written by Kathy Y. Wei, Ph.D. and Koosh on 16 July 2024

  • 310 AI’s MP4 model compresses billions of years of life into a powerful AI that transforms text prompts into novel and useful protein sequences.
  • Originally developed in 2021, this breakthrough could revolutionize the application of biology in medicine, industry, and beyond.
  • Explore molecule programming with the MP4 repo, featuring over 1,000 AI-generated, scientist-approved proteins.

Protein design and engineering focus on manipulating proteins to achieve specific functions — either by creating entirely new proteins from scratch (design) or by modifying existing ones (engineering). Understanding how to make protein functions controllable and designable has intrigued scientists ever since we first gained the ability to analyze their shapes.

In a major step towards a new chapter in protein sciences, 310 AI’s molecule programming model version 4 (MP4) generates protein sequences from text prompts describing the desired function. The earliest version of MP was built in 2021 and featured in 2023, and the model has since grown significantly in capabilities. With just a simple sentence, it’s now possible to invent de novo sequences that not only match the desired function but also deliver high quality, excellent foldability, and uniqueness.

Many of the world’s greatest opportunities are fundamentally linked to protein functions. A fast, accurate way to create proteins with desired functions will revolutionize:

  • Medicine: Design drugs for currently undruggable disease targets.
  • Industry: Develop high-value chemicals and materials with non-toxic, sustainable, environmentally-friendly processes.
  • Environment: Deploy enzymes or whole microorganisms for environmental cleanup or nutrient recycling.
  • Research: Discover tools to tackle currently unanswerable questions.

In recent years, there has been a fervent focus on leveraging AI in protein sciences. The extraordinary success of AlphaFold has set the stage for a wave of breakthroughs in this dynamic field.

The Programmability Challenge

Protein engineering has long been closely linked to healthcare. Since the introduction in 1982 of Humulin, a recombinant insulin for diabetes treatment, scientists have been eager to explore how modified proteins can lead to even better disease treatment. Fast forward to 2021, when NL-201, the first protein designed from scratch on a computer, entered clinical trials.

AlphaFold and AlphaFold3 represent significant advances in the folding problem (sequence > structure), while others, including ProteinMPNN and ESM3, address the inverse folding problem (structure > sequence). However, none have comprehensively tackled the programming problem (function > sequence) as MP4 has.

Despite these advancements, most protein design technologies, both experimental and computational, have relied on searching increasingly larger portions of sequence space. This approach is inherently unsustainable: for a typical protein of 300 amino acids, with 20 possible amino acids at each position, the number of potential proteins is 20³⁰⁰, vastly exceeding the number of atoms in the universe. This is where modern AI techniques offer the potential to overcome the limitations of traditional search-based methods. While 20³⁰⁰ is an enormous search space, it’s roughly equivalent to the number of options in a 98-word English paragraph (if each word is drawn from a vocabulary of roughly 10,000 words). Modern natural language AI models can easily generate such text with high coherence and readability.
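The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of MP4: the 10,000-word vocabulary is our assumption (the figure that makes a paragraph of about 98 words come out), and the 10⁸⁰ atom count is a commonly cited rough estimate.

```python
import math

AMINO_ACIDS = 20          # possible residues at each position
PROTEIN_LENGTH = 300      # the "typical" protein length used in the text
VOCABULARY_SIZE = 10_000  # assumption: working English vocabulary per word slot
LOG10_ATOMS = 80          # assumption: rough estimate of atoms in the universe

# Work in log10 space so the numbers stay readable.
log10_proteins = PROTEIN_LENGTH * math.log10(AMINO_ACIDS)

# Number of word slots whose combinations match the protein search space.
words_needed = log10_proteins / math.log10(VOCABULARY_SIZE)

print(f"20^300 is about 10^{log10_proteins:.1f}")
print(f"That is 10^{log10_proteins - LOG10_ATOMS:.1f} times the atom estimate")
print(f"Equivalent paragraph length: about {round(words_needed)} words")
```

Changing VOCABULARY_SIZE shifts the equivalent paragraph length, so the 98-word figure should be read as an order-of-magnitude illustration rather than an exact equivalence.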

A Repository of Text2Protein AI Results

At 310 AI, we can take a text prompt such as “Ribosomal protein L33 contains a conserved site and is located in the chloroplast” and generate a full protein sequence such as “MAK GKD ARV TVI LEC TSC ERN GVN KKS TGI SRY ITQ KNR HNT PGR LEL RKF CPY CRK HTI HGE IKK”, using our proprietary molecule programming system.
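Generated sequences like the one above are often displayed in space-separated groups of three residues. Before passing such a sequence to downstream tools, it may help to normalize and validate it. The helper below is a hypothetical sketch of ours (the function name and checks are not part of the MP4 system), assuming only the 20 standard amino acid letters are expected.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(raw: str) -> str:
    """Strip whitespace, uppercase, and reject non-standard residue letters."""
    sequence = "".join(raw.split()).upper()
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Non-standard residue letters: {invalid}")
    return sequence


# The example sequence quoted in the article.
example = (
    "MAK GKD ARV TVI LEC TSC ERN GVN KKS TGI SRY ITQ "
    "KNR HNT PGR LEL RKF CPY CRK HTI HGE IKK"
)
sequence = normalize_sequence(example)
print(len(sequence), sequence)
```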

Examples of text2protein generation

In particular, freeform human-readable specifications of a protein are processed into a proprietary protein language with a vocabulary that describes functions, properties, processes, families, domains, active sites, motifs, taxonomy, and more. While only a fraction of the vast freeform text protein programming space is understood by MP4, it is already successfully generating exciting examples from diverse and complex prompts.

Overview of repo functions

We are releasing our MP4 repo, featuring an initial 1000+ AI-generated sequences. More will be added soon!

Creating Quality Molecules

One of the major challenges in protein design and engineering is the lack of simple metrics to assess the “goodness” of protein sequence, structure, or function. The most well-developed metrics focus on protein structure, applicable to both experimental and computational results. Here, we use three main metrics to evaluate sequence, structure, and function, respectively.

  • aacomp (amino acid composition): Measures the distribution of each amino acid type in a sequence compared to UniProt sequences, with natural sequences scoring between 80–100. While this metric doesn’t consider the order of amino acids, it helps filter out overly repetitive sequences.
  • plddt (average predicted local distance difference test): Calculates the average per-residue local confidence for structure prediction (using ESMFold). A well-defined structure typically scores between 70–100. In many applications, a more structured protein is preferable.
  • nlmsim (ProtNLM function prediction similarity score): A text-based comparison between the input prompt and the ProtNLM output. Scores of 80–100 indicate exact or subset matches, while scores of 60–80 reflect similar-looking words. Note that this measure may miss synonyms or specific examples within a broader category, leading to false negatives.
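As an illustration of how the three score ranges above might be applied together, here is a minimal filtering sketch. The class, function, and threshold table are hypothetical names of ours, and the scores are assumed to be precomputed elsewhere (for example by ESMFold and ProtNLM); this is not the repo's actual filtering code.

```python
from dataclasses import dataclass


@dataclass
class QualityScores:
    aacomp: float  # amino acid composition score, 0-100
    plddt: float   # average pLDDT from structure prediction, 0-100
    nlmsim: float  # prompt vs. predicted-function similarity, 0-100


# Lower bounds taken from the ranges described in the text.
MINIMUM_SCORES = {"aacomp": 80.0, "plddt": 70.0, "nlmsim": 80.0}


def failed_metrics(scores: QualityScores) -> list[str]:
    """Return the names of metrics that fall below their lower bound."""
    return [
        name
        for name, minimum in MINIMUM_SCORES.items()
        if getattr(scores, name) < minimum
    ]


candidate = QualityScores(aacomp=92.0, plddt=85.0, nlmsim=65.0)
print(failed_metrics(candidate))  # lists only the metrics that missed the bound
```

Returning the list of failed metrics, rather than a single pass/fail flag, makes it easier to spot cases where only nlmsim falls short, which the text notes may be a false negative.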

The functional comparison, or nlmsim metric, is of most interest, as this is the self-consistency measure of how well protein sequences generated by MP4 align with the specified prompt. However, due to limitations in available protein function predictors and the challenges of text-to-text comparisons, this quality is the hardest to measure accurately. We look forward to improvements in the accuracy and standardization of this evaluation in the near future.

Overview of repo quality

Unlocking the World Beyond Nature

In addition to generating “good” proteins, we are also interested in creating novel ones. Novel proteins are appealing because they can achieve functions or properties that are not constrained by the forces of natural selection acting on natural proteins. For example, natural proteins rarely encounter conditions of extremely high concentration, while high concentrations are often necessary for medications to be effective. In addition, novel proteins are intriguing from a research perspective, as they expand our understanding of what these molecules can achieve, and may refine our notions of their potential applications. Here, we use two standard metrics to evaluate sequence novelty and structure novelty.

  • seqdif: Sequence novelty score derived from the percentage of identical matches (pident), which assesses sequence similarity between an input query and sequences in the NR/NT database using DIAMOND (a faster alternative to BLASTP). We define seqdif as 100 - pident for consistency, with a cutoff above 50 indicating a novel sequence. Although this threshold is not universally established, it is commonly used in de novo design. Note that this metric is sensitive to parameter settings.
  • structdif: Structure novelty score based on structural similarity (tmscore) between an input query and structures in the Protein Data Bank (PDB) via Foldseek. We calculate structdif as 100 - tmscore × 100 for consistency, with a cutoff above 50 indicating a novel structure, which is a standard threshold.
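The two novelty scores defined above reduce to simple arithmetic once the best-hit pident (from DIAMOND) and tmscore (from Foldseek) are available. The sketch below assumes those values have already been extracted from the search results; the function names and the `NOVELTY_CUTOFF` constant are ours, chosen to mirror the metric names in the text.

```python
NOVELTY_CUTOFF = 50.0  # scores above this are treated as novel, per the text


def seqdif(pident: float) -> float:
    """Sequence novelty: 100 minus percent identity of the best hit (0-100)."""
    if not 0.0 <= pident <= 100.0:
        raise ValueError("pident must be between 0 and 100")
    return 100.0 - pident


def structdif(tmscore: float) -> float:
    """Structure novelty: 100 minus the best-hit TM-score (0-1) scaled to 0-100."""
    if not 0.0 <= tmscore <= 1.0:
        raise ValueError("tmscore must be between 0 and 1")
    return 100.0 - tmscore * 100.0


def is_novel(score: float, cutoff: float = NOVELTY_CUTOFF) -> bool:
    """A score strictly above the cutoff counts as novel."""
    return score > cutoff


# Hypothetical best-hit values for one generated protein.
sequence_score = seqdif(35.0)
structure_score = structdif(0.25)
print(sequence_score, is_novel(sequence_score))
print(structure_score, is_novel(structure_score))
```

Because both scores depend on which hit is considered "best", the search parameters and database version should be recorded alongside the numbers.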

Examples that are novel in both sequence and structure

We have found many notable examples of sequence and structural novelty, demonstrating that MP4 can explore beyond natural limits.

Overview of repo sequence and structural novelty

The Generalist Approach

While traditional biological research often emphasizes reductionism and narrow focus, a generalist approach has proved more successful in AI. Models concentrating on specific enzyme types or properties struggle to generalize to new inputs — an essential requirement for most applications. By developing models that encompass a wide range of potential inputs, we enhance the likelihood that our desired generalizations fall within the model’s input space. This allows for meaningful extrapolation, whereas attempts to generalize outside this space are typically unsuccessful.

We are seeing an explosion in the use of natural language text driven by AI technology, and we predict similar adoption in the scientific community. While the MP4 vocabulary has limitations, it is poised for continuous growth in size, scope, and complexity as more data — both public and private — are compressed and optimized.

MP4 is trained using 138K tokens and 3.2B unique datapoints across 70 synchronized tasks. This foundational model prioritizes molecule programmability and achieves state-of-the-art results with approximately 3,800 AMD-Instinct GPU-days.

MP4 architecture

The Next Horizon

MP4 represents a thrilling leap in making molecules programmable, but this is just the beginning. We’re on the brink of adding groundbreaking new controllers that will reshape our understanding of proteins, DNA, RNA, and small molecules — unlocking both natural and novel functions. Together with our collaborators, we’re diving into specific application areas where AI-driven creations can catalyze disruption in health, environment, agriculture, and more.

Just a few years ago, AI felt like science fiction. Today, it’s at the forefront of innovation. Tomorrow, AI will reshape the future of biology!
