Models for Navigating Vast and Tangled Biological Landscapes

Published in Sestina Bio · 6 min read · Aug 2, 2022

By Matt Biggs

Synthetic biology is overflowing with potential to change our world for the better. But between lab-scale demos and commercial blockbusters looms a vast, tangled, biological design space that has often proven to be a “valley of death”. At Sestina, machine learning (ML) is playing a foundational role in spanning that divide more rapidly and routinely. We see two ML strategies as essential to the next generation of synthetic biology: rich data and efficient design. The end goal of these strategies is to reduce uncertainty and risk along the path from laboratory to commercial production. Predictions from state-of-the-art ML model architectures, fueled by high-quality data, are the key to efficiently navigating that tangled landscape.

Rich Data

Consider a classic challenge in biology: predicting protein three-dimensional structures. Amino acids, the foundational building blocks of life, have a consistent structure across all organisms. But once you join those building blocks into protein chains, the resulting folded structures become frustratingly unique. Every protein has its own structure, which relates directly to its function, and that structure is highly dependent on context. Direct measurements of protein structure through X-ray crystallography and other methods are expensive and slow. Physics simulations that follow the dynamic process of a folding protein all the way to a steady state are still computationally intractable.

A key lesson for synthetic biology is the idea of including more orthogonal sources of data (data richness). In 2020, AlphaFold2 finally demonstrated scientifically useful and scalable protein structure predictions. A key innovation was enriching the input data. Previous models attempted to relate a single protein sequence to its three-dimensional structure. The creators of AlphaFold2 included an additional source of information: an alignment of many related protein sequences. That set of related sequences carries additional information about which pairs of amino acids are co-evolving and therefore have a higher probability of being in spatial contact. With this additional information, AlphaFold2 outperformed previous efforts by a wide margin.
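To make the co-evolution idea concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information, a classic and much simpler covariation statistic. It is not how AlphaFold2 works (AlphaFold2 learns directly from the alignment with a neural network); the function names and the toy alignment are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    i_counts = Counter(seq[i] for seq in msa)
    j_counts = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = i_counts[a] / n
        p_b = j_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def rank_coevolving_pairs(msa):
    """Return all column pairs sorted from most to least covarying."""
    length = len(msa[0])
    scores = {
        (i, j): column_mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: columns 1 and 3 always change together
# (K pairs with E, R pairs with D), while column 2 varies independently.
TOY_MSA = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKVE",
    "ARVD",
]

if __name__ == "__main__":
    for pair, score in rank_coevolving_pairs(TOY_MSA):
        print(pair, round(score, 3))
```

In this toy alignment, columns 1 and 3 rank first because knowing one residue fully determines the other. Real pipelines also have to correct for phylogenetic bias and indirect couplings, which this sketch ignores.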

It is worth noting that AlphaFold2’s success was driven by both richer data and a new model architecture that could effectively leverage that data. AlphaFold2’s attention-based “Evoformer block” architecture lets information from sequence alignments and contact maps intermingle, so the model can learn dependencies between the two. In the same vein, new model architectures such as transformers are fueling advances in protein structure prediction and engineering by enabling efficient representation of higher-order feature interactions. At Sestina, we invest in state-of-the-art model architectures that can specifically leverage our rich data.

Figure 1. Rich data — paired with models that leverage data richness — are key elements to successfully applying state-of-the-art ML models to synthetic biology. High content imaging provides information about cell morphology. Metabolomics brings information about metabolic state. Tank fitness scores connect genomic changes to the ability to grow under manufacturing conditions. Laboratory titer measurements bring information about productivity. Efficiently gathering these measurements, guided by ML models with optimized architectures, allows us to navigate the vast, tangled landscape to commercially excellent strain designs.

We are increasing data richness in multiple, complementary ways (Figure 1).

We are pioneering high-throughput selection experiments where thousands of genetic backgrounds are simultaneously thrown into competition with each other in production conditions. With this data source, our models can connect genotype to fitness in fermentation tanks.
We are piloting a droplet-based metabolomics pipeline which allows us to measure metabolite concentrations across millions of cells, giving dense snapshots of metabolic outcomes. With this data source, we are able to capture more diversity from strain libraries with fewer screening resources.
We are applying state-of-the-art high content imaging (HCI) and image analysis techniques to understand how cell morphology relates to manufacturing performance (in much the same way Recursion analyzes images of human cells for drug discovery). In this case, subtle clues in cell shape and staining texture can be leveraged to understand cell physiology. Given the low cost and high throughput of microscopy, we can complement, or even replace, slow or expensive assays with ML image analysis.
As a final example, we emphasize holistic genotyping from the entire distribution of edited cells, not just the winners. Imagine ignoring the wealth of lessons from failed strains: that is what traditional synthetic biology has been doing with its genotyping strategy! By genotyping weak strains, our ML models will learn to recognize both beneficial and harmful edits.
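As an illustration of how pooled competition data can connect genotype to fitness, the sketch below computes a per-generation log-ratio fitness for each strain relative to a reference strain, using sequencing counts before and after selection. This is a common textbook-style estimate, not a description of Sestina's pipeline; the strain names, counts, pseudocount, and generation number are all hypothetical.

```python
from math import log


def relative_fitness(counts_start, counts_end, reference, generations,
                     pseudocount=0.5):
    """Per-generation log-ratio fitness of each strain versus `reference`.

    Because every strain is compared with the reference from the same
    samples, differences in total sequencing depth cancel out.
    """
    def log_ratio(strain):
        end = counts_end[strain] + pseudocount      # pseudocount avoids log(0)
        start = counts_start[strain] + pseudocount
        return log(end / start)

    ref_change = log_ratio(reference)
    return {
        strain: (log_ratio(strain) - ref_change) / generations
        for strain in counts_start
    }


# Hypothetical read counts for three strains grown together in one tank.
COUNTS_START = {"wild_type": 1000, "edit_A": 1000, "edit_B": 1000}
COUNTS_END = {"wild_type": 1000, "edit_A": 4000, "edit_B": 250}

if __name__ == "__main__":
    scores = relative_fitness(COUNTS_START, COUNTS_END,
                              reference="wild_type", generations=10)
    for strain, score in scores.items():
        print(strain, round(score, 3))
```

Note that the strain that lost the competition (edit_B here) receives a negative score rather than being discarded, which is the kind of signal from weak strains that the last bullet argues for.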

Efficient Design

A second key lesson for synthetic biology is the idea of sampling more efficiently. In 2017, AlphaZero defeated world-champion programs in chess, shogi and Go, programs that were themselves already stronger than the best human players. One of the strategies the creators of AlphaZero used to achieve superhuman play was efficient lookahead: Monte Carlo Tree Search harnessed to deep learning models that had been trained (on millions of self-play games) to accurately evaluate board positions. Tree search is a smarter way to sample from a computationally intractable set of possible moves. The ML models pruned the least promising branches and recognized early glimmers of promise. This efficient consideration of possible actions allowed AlphaZero to select winning moves while considering several orders of magnitude fewer positions than previous world-class programs.
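The selection step at the heart of this kind of search can be sketched in a few lines. The rule below follows the PUCT form described for AlphaGo Zero and AlphaZero, where a learned prior steers exploration toward promising moves; the `Edge` class, the constant `c_puct=1.5`, and the example numbers are illustrative assumptions, not values from those systems.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Edge:
    """Search statistics for one candidate move from the current position."""
    prior: float            # model's prior probability for this move
    visits: int = 0         # how often the search has tried this move
    value_sum: float = 0.0  # sum of evaluations backed up through this move

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge, parent_visits, c_puct=1.5):
    """Mean value plus an exploration bonus that shrinks as visits grow."""
    exploration = c_puct * edge.prior * sqrt(parent_visits) / (1 + edge.visits)
    return edge.mean_value + exploration


def select_move(edges, c_puct=1.5):
    """Pick the move with the highest PUCT score."""
    parent_visits = sum(edge.visits for edge in edges.values())
    return max(edges, key=lambda move: puct_score(edges[move], parent_visits, c_puct))


# Hypothetical statistics part-way through a search.
EXAMPLE_EDGES = {
    "a": Edge(prior=0.6, visits=10, value_sum=2.0),  # well explored, mediocre
    "b": Edge(prior=0.3, visits=1, value_sum=0.5),   # barely explored, promising
    "c": Edge(prior=0.1),                            # unexplored, low prior
}

if __name__ == "__main__":
    print(select_move(EXAMPLE_EDGES))
```

With these numbers the search revisits move "b": it has a good early evaluation and few visits, so its exploration bonus is still large, while the low-prior move "c" is effectively pruned for now.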

Figure 2. Games have advanced from brute-force heuristics to efficient lookahead, resulting in superhuman play (AlphaZero). Protein folding has advanced from brute-force simulations of small peptides to inferences based on evolutionary patterns among related sequences, resulting in accuracies rivaling crystallography (AlphaFold2). Synthetic biology is advancing from simple data sets and regression models, to multidimensional strain optimizations, made possible by Sestina’s rich data, state-of-the-art models, and efficient design.

At Sestina, we are pairing smart search patterns with machine learning (efficient design). To be sure, we are vastly increasing the throughput of our assays (cells, not wells). But strain engineering searches a space of immense complexity: the number of possible edit combinations in a yeast cell outstrips the combined complexity of chess, shogi and Go in the same way those games outstrip tic-tac-toe. And biological measurements are expensive (think transcriptomics, metabolomics, proteomics, and production-scale fermentation runs, each of which costs thousands of dollars per sample). No conceivable throughput could ever be enough to sample that essentially infinite space. It is essential to intelligently constrain the design space.

We approach this challenge like AlphaZero. First, we use Bayesian optimization, a framework for efficiently searching a design space that is analogous to AlphaZero’s tree search. Second, we are training ML models to “prune the least promising branches” by avoiding broken physiologies and recognizing early glimmers of blockbuster strain designs. For example, our HCI technology can recognize many modes of metabolic dysfunction. We are uniquely able to train ML models to learn transferable biological patterns because of our massive, rich data set. Sestina’s efficient search and powerful models combine into an efficient design framework that allows us to select exponentially more winning designs while requiring orders of magnitude fewer experiments.
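For readers unfamiliar with Bayesian optimization, the following is a minimal sketch of one common variant: a Gaussian-process surrogate with an upper-confidence-bound (UCB) rule for choosing the next experiment. It is written in plain Python for a one-dimensional toy problem and is not Sestina's framework; the kernel length scale, `beta`, the jitter term, and the `toy_titer` objective are all assumptions made for illustration.

```python
from math import exp, sqrt


def rbf(x1, x2, length_scale=0.2):
    """Squared-exponential kernel: nearby designs have similar outcomes."""
    return exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / low[j][j]
    return low


def solve_lower(low, b):
    """Solve low @ y = b by forward substitution."""
    y = []
    for i, row in enumerate(low):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def solve_upper(low, y):
    """Solve low.T @ x = y by back substitution."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x


def gp_posterior(xs, ys, x_new, jitter=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    low = cholesky(k)
    alpha = solve_upper(low, solve_lower(low, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, sqrt(var)


def bayes_opt(objective, candidates, n_rounds=8, beta=2.0):
    """Run n_rounds of GP-UCB; return (best design, best value, experiments run)."""
    xs = [candidates[0], candidates[-1]]      # two seed experiments
    ys = [objective(x) for x in xs]
    for _ in range(n_rounds):
        y_mean = sum(ys) / len(ys)
        centered = [y - y_mean for y in ys]   # center data for the zero-mean GP

        def ucb(c):
            mean, std = gp_posterior(xs, centered, c)
            return mean + beta * std          # favor high or uncertain designs

        untested = [c for c in candidates if c not in xs]
        nxt = max(untested, key=ucb)
        xs.append(nxt)
        ys.append(objective(nxt))             # the "expensive" measurement
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


def toy_titer(x):
    """Hypothetical stand-in for an assay: a single peak near x = 0.7."""
    return exp(-((x - 0.7) / 0.15) ** 2)


CANDIDATES = [i / 20 for i in range(21)]      # 21 possible designs

if __name__ == "__main__":
    print(bayes_opt(toy_titer, CANDIDATES))
```

The loop measures only 10 of the 21 candidate designs, spending experiments where the surrogate model is either optimistic or uncertain. A production system would search far higher-dimensional spaces and would typically rely on an established library rather than a hand-rolled Gaussian process.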

Conclusion

Our company’s incredible progress is largely due to the disciplined blending of diverse expertise and cutting-edge technologies. For example, our world-class data science platform benefits from the expertise of our collaborators at Foresite Labs, our strain engineering platform from our collaboration with Inscripta, and our fermentation platform from our partners at Culture Biosciences. Sestina Bio is the crucible where the next generation of synthetic biology is being alloyed.

To close, I gravitated to synthetic biology because of the immense positive impact it will have on human well-being. Biology as a manufacturing platform can heal the environment, improve health and revolutionize agriculture. To make that vision a reality, we have oriented all of our activities to build the data assets and algorithms necessary for ML to blossom as a tool for engineering life (Figure 2). Learning algorithms will help us navigate the complex, high-dimensional, tangled space between where we are and the future we hope for.
