Data Science at Zymergen
Zymergen is an SF Bay Area startup that uses software, robotics, and advanced genetic engineering techniques to make industrial microbes more effective at producing particular chemicals, or even to create brand new compounds. I recently joined Zymergen to manage the Core Infrastructure and Data Science teams.
Effectively achieving our company goals requires applying a variety of disciplines that comprise “data science” to interesting, difficult problems. This post lays out some questions we are working on, and the kinds of backgrounds a data scientist might need to tackle them. My hope is that this will be useful to people who’ve chosen a data science career (especially ones considering applying to Zymergen!), and perhaps also an interesting read for people with a general interest in the field.
Industrial fermentation & genetic engineering
We encounter products created by microbes on a daily basis. They are involved in making bread, cheese, wine, and chocolate. Your romantic wine country getaway is basically powered by microbes.
In an industrial setting, “industrial fermentation” is used to produce a number of common chemicals. Most commercial ethanol, penicillin, citric acid, and insulin are produced by specialized microbes. The range of applications is very wide — fuels, adhesives, pigments, medicines, and more.
Insulin is an example of a well-known product made by genetically engineered microbes: bacteria have a segment of human DNA inserted into their genome that makes them produce insulin for use by diabetic patients.
Consider the process the scientists had to go through to get this technique to work, and how such approaches can be made repeatable for other interesting chemicals. How do we figure out which bit of DNA to insert, and how do we validate that the whole thing worked? Once we have the first compound-of-interest-producing bacterium, can we make other changes to its DNA that would make it produce more, faster? Historically, this has been a very labor-intensive, trial-and-error process.
Zymergen radically speeds up the process of creating and improving specialized strains. We massively parallelize introduction of small genetic changes into microbes and evaluation of results of those changes. More trials, more errors, more successes, less human effort. Lots of data. We can do this because:
- Recent scientific advances make it possible to introduce specific DNA changes to specific locations in the target DNA, and to do so with high precision.
- Robots perform operations faster, with higher parallelism, and more consistently than humans.
- Software can automate orchestration and tracking of complex multi-stage processes, guiding said robots.
- We carefully collect data on all experiments we run, which we turn into insights into what is most likely to work next, and which avenues are most fruitful to explore.
- With a powerful, parallelized test factory at our disposal, we can turn from hypothesizing which specific changes are most likely to result in the desired effect, to hypothesizing what kind of changes will work — then try many such changes, and learn from empirical evidence.
We believe that our focus on complete end-to-end automation, data collection, and an empiricism-driven approach to organism engineering sets us apart. This belief is validated by the fact that all this actually works: we have delivered new, improved strains to industry partners.
The Role of Data Science
So, what do data scientists DO at Zymergen? After all, we are not trying to get people to click on ads, and the only growth hacking we do is the Petri dish kind.
We tend to think about our work as a large design, build, test, and analyze cycle (“DBTA”). New strains are designed using custom software; they are then built in our “factory”; this is followed by testing to determine results of applying the change; all data is collected and analyzed, leading to another design pass, and the cycle repeats.
Data science is applied across the whole DBTA cycle.
1. Design: choosing candidate strains
The first thing to do is to design some new versions of microbes to try out.
Rather than propose specific mutations, our research scientists design strategies for finding effective changes that are generally applicable. For a given project, the scientists can choose from available strategies, and auto-generate a large volume of changes to try. The strategies are informed by expert domain knowledge; different strategies work better in different situations.
For example, we might consider an “insert an arbitrary 3-letter sequence at the beginning of a known gene” strategy (note: this is not a very good strategy). The DNA alphabet consists of 4 letters A, C, T, and G. There are 64 possible 3-letter combinations of 4 letters. For a microbe that has 4,000 known genes, this strategy would mean trying 256,000 different changes.
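The arithmetic behind that toy strategy can be enumerated in a few lines. The gene names below are hypothetical placeholders, just to make the counting concrete:

```python
from itertools import product

# The DNA alphabet and the (deliberately naive) strategy above:
# insert every possible 3-letter sequence at the start of each gene.
ALPHABET = "ACGT"

def candidate_changes(genes, insert_len=3):
    """Yield one (gene, insert) pair per possible change."""
    for gene in genes:
        for letters in product(ALPHABET, repeat=insert_len):
            yield gene, "".join(letters)

# Hypothetical gene names for a microbe with 4,000 known genes.
genes = [f"gene_{i:04d}" for i in range(4000)]
n_candidates = sum(1 for _ in candidate_changes(genes))
print(n_candidates)  # 4**3 * 4,000 = 256,000
```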
Our system is built for high throughput, but we are still fundamentally bound by time and space: the machines take a certain amount of time to set up and carry out their tasks; the cultures take a certain amount of time to grow. We can increase throughput by adding more machines; still, the robots are pretty expensive, and so is San Francisco Bay Area real estate.
Let’s say we can try 1,000 genetic changes at once. We are immediately confronted with a problem:
- How do we choose 1,000 changes out of the total possible set?
Although we can eventually run through all 256,000 options, ideally we can try the most promising ones first — if we find a sufficiently large improvement early, we may not need further rounds. Some notion of ranking of candidate changes would be helpful here.
- In absence of experimental data, what features can be used to identify mutations most likely to succeed?
- Can we use data and outcomes from past experiments to better inform ranking of untried changes?
Ranking and search approaches for a model like this have some peculiarities. For example, unlike the Internet search case, we care less about the exact ordering of candidates than about overall precision within the top 1,000 items. The features such a model would use are non-obvious, and many are noisy.
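As a minimal sketch of that metric, here is precision-within-the-top-1,000 evaluated on entirely synthetic data. The 5% base success rate and the score distributions are made up for illustration; the point is that ordering inside the batch is irrelevant, since every selected candidate gets built and tested:

```python
import random

def precision_at_k(scores, successes, k=1000):
    """Fraction of the top-k ranked candidates that actually succeed.
    Unlike web-search metrics, the ordering inside the top k does not
    matter: the whole batch is built and tested together."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(successes[i] for i in top_k) / k

# Synthetic data: a 5% base success rate and a model whose scores are
# only weakly informative (a 0.5-sigma shift for true successes).
random.seed(0)
truth = [random.random() < 0.05 for _ in range(256_000)]
scores = [random.gauss(0.5 if t else 0.0, 1.0) for t in truth]
print(precision_at_k(scores, truth))
```

Even a weak signal concentrates successes near the top: precision in the first batch lands well above the 5% base rate.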
2. Build: making new organisms
Once we know what we want to build, we need to build it!
As you can see from the DBTA cycle diagram, the blue (build) process involves several discrete steps — creating DNA fragments, assembling them, and incorporating them into microbe DNA. Many sophisticated techniques, robots and operators are involved in making this possible.
The process can be monitored and improved over time by improving machine utilization, evolving operating procedures, and otherwise optimizing for process efficiency.
- How can we best allocate the workloads to individual machines?
- What is the best schedule for starting individual stages of experiments to optimize overall throughput and eliminate bottlenecks?
- Some machines can be used for multiple purposes — when is it appropriate to pay the set-up costs of switching them between different roles?
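The flavor of these questions can be illustrated with a toy version of machine allocation: the classic longest-processing-time-first greedy heuristic, applied to hypothetical task names and durations (real factory scheduling adds precedence constraints, set-up costs, and stochastic durations, which is what makes the questions above interesting):

```python
import heapq

def greedy_schedule(task_durations, n_machines):
    """Longest-processing-time-first greedy: assign each task, longest
    first, to the machine that currently finishes earliest. A classic
    heuristic with a known approximation guarantee on makespan."""
    machines = [(0.0, i) for i in range(n_machines)]  # (finish_time, machine_id)
    heapq.heapify(machines)
    assignment = {}
    for task, dur in sorted(task_durations.items(), key=lambda kv: -kv[1]):
        finish, m = heapq.heappop(machines)
        assignment[task] = m
        heapq.heappush(machines, (finish + dur, m))
    makespan = max(t for t, _ in machines)
    return assignment, makespan

# Hypothetical prep/growth/QC steps with durations in hours.
tasks = {"prep_A": 3.0, "prep_B": 2.0, "grow_A": 8.0, "grow_B": 7.0, "qc": 1.0}
assignment, makespan = greedy_schedule(tasks, n_machines=2)
print(makespan)  # 11.0 hours across two machines
```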
Even with the best process, some of the steps involved in making changes to microbial DNA have variable outcome quality, and we need to screen out cases in which the outcome does not meet our standards. When outliers are detected, we will want to investigate root causes (Contamination? Unexpected side effect? Poorly calibrated instrument? Process issue? Bad luck?).
In some cases, directly validating output quality is very expensive. Instead, we can perform cheaper, approximate measurements that are indicative of success. We then rely on a machine learned model to determine whether these measurements combine to suggest that the operation went according to plan.
Data scientists help design and evaluate different ways of assessing quality of our process outcomes and the processes themselves.
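Here is a deliberately simple sketch of the proxy-measurement idea. The two cheap signals (a culture-density reading and a colorimetric readout) and all the numbers are hypothetical, and the tiny hand-rolled logistic regression is just a stand-in for whatever model is actually used:

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid overflow on extreme inputs.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Tiny logistic regression via stochastic gradient descent,
    combining cheap proxy measurements into one pass/fail score."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical cheap proxies: culture density and a colorimetric signal,
# for builds that are known (from expensive validation) to be good or bad.
random.seed(1)
good = [[random.gauss(1.0, 0.1), random.gauss(0.8, 0.1)] for _ in range(200)]
bad = [[random.gauss(0.4, 0.1), random.gauss(0.3, 0.1)] for _ in range(200)]
predict = train_logistic(good + bad, [1] * 200 + [0] * 200)
print(predict([1.0, 0.8]), predict([0.4, 0.3]))
```

A model like this lets the expensive direct validation be reserved for the ambiguous middle of the score distribution.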
We also talk about data quality, which addresses slightly different, self-directed questions.
- Are we measuring the right things?
- What is our measurements’ accuracy, and can / should we increase it?
- Is there other data we should be tracking?
Thinking about data quality can lead to fundamental changes in our approach to Quality Control. For example, most assays are taken at set time points in a process, which leads to a QC approach structured around the concept of gating tests. However, some measurements can be taken continuously, either out of the box (like freezer temperature) or through clever automation (like extracting signals from a video stream).
A data scientist might start wondering:
- What extra information can we get through converting a batch process to a streaming approach?
- Is it worth the amount of work involved in automation and implementation of the new process?
- How do we evaluate potential gains from introducing such a change?
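A minimal sketch of what the streaming approach buys us: flag readings that deviate from a rolling baseline, on a simulated freezer-temperature trace (all numbers are invented):

```python
import random
from collections import deque
from statistics import mean, stdev

def streaming_alerts(readings, window=20, n_sigmas=3.0):
    """Flag readings that deviate from a rolling baseline -- the kind
    of continuous check a periodic gating test would miss entirely."""
    recent = deque(maxlen=window)
    alerts = []
    for i, x in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(x - mu) > n_sigmas * sigma:
                alerts.append(i)
        recent.append(x)
    return alerts

# Simulated freezer trace: stable near -80 C, with a brief warm-up
# event at t=60 that a once-a-day gating check could easily miss.
random.seed(2)
trace = [-80 + random.gauss(0, 0.2) for _ in range(100)]
trace[60] += 5.0
print(streaming_alerts(trace))
```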
Process-related questions like these fall under the headings of “Operations Research” and “Industrial Engineering.” These are fairly well-developed fields with a long history and a large and fascinating body of research. A lot of work has recently been published in this area in the context of optimizing order fulfillment for large online retailers like Amazon (here’s a blog post from some folks at Zalando who threw — what else — deep learning at the problem). Optimizing the process, controlling the quality of builds, and continuously improving the quality and kind of data we collect are critical to having an efficient, repeatable factory.
3. Test: experiment design
Having designed and built the strains we want to try out, we need to actually measure their performance.
Biological processes themselves are famously nonlinear and subject to variation. The amount of product the same strain of the same organism produces is best described by a distribution rather than a constant.
- Concurrently running multiple experiments on a shared set of robots can subtly bias results. What is the appropriate strategy for carrying out concurrent experiments?
- How many copies of each strain do we need to test in order to trust our results? For that matter, what does “trust” mean at different steps of the process, and should we vary our passing criteria?
- How do answers to these questions change for different base organisms and organism design approaches?
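As a baseline for the replicate-count question, the textbook normal-approximation sample-size formula gives a feel for the numbers involved. The effect size and noise level below are hypothetical:

```python
import math
from statistics import NormalDist

def replicates_needed(effect, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sample z-test, via the standard
    normal-approximation formula n = 2 * ((z_a + z_b) * sigma / effect)**2."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_b = z.inv_cdf(power)          # desired statistical power
    return math.ceil(2 * ((z_a + z_b) * sigma / effect) ** 2)

# Hypothetical numbers: detect a 5% titer gain over the parent strain
# when run-to-run noise has a standard deviation of 10% of titer.
print(replicates_needed(effect=0.05, sigma=0.10))  # 63 replicates per group
```

Halving the noise, or only caring about larger effects, cuts the required replicates by a factor of four; this is why assay precision and passing criteria have to be designed together.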
The majority of the new strains we build are tested in a high-throughput manner, which favors the ability to test many things at once over perfectly accurate results. Scaling industrial processes down to this setting requires some changes, which introduces a degree of divergence from full-scale conditions. The final test step listed in the DBTA diagram, “cultivate in bioreactors”, involves moving the “best” strains to a lower-throughput environment, where experiments are more costly but more accurate.
It is important that by the time we move to testing a strain in larger volumes, we have a high degree of confidence that it is worth the investment. Since many fewer experiments can be run in large tanks at the same time, and the per-experiment cost is much higher, we want a high degree of accuracy in our strain selection process at this point. Ideally, all of the strains we test would be high performers, and none of the strains we don’t test would perform better.
Experiment design is of course a classic Statistics problem, with many different methods and approaches developed over the years. Zymergen data scientists need to find the appropriate testing methods that match our particular operational and business conditions: a large number of treatments, a constrained supply of trials for each treatment, potentially correlated batches of robot “runs”, etc.
4. Analyze: incorporating new data
Recall that in the Design phase, we had 256,000 candidate changes, 1000 of which we decided to try out. At the end of the Test phase, we have results from these changes, which we can use to update our picture of the world in preparation for the next Design phase.
We have some decisions to make.
- Do we try another 1,000 changes from the original list?
- Should we try pair-wise combinations of the 1,000 individual changes we already tried? That’s about 500,000 possible combinations. How do we rank these?
- Do we try some mix of changes from both lists? What are the appropriate ratios?
Even the simplest-seeming of these questions — ranking combinations of changes we already observed — is fiendishly difficult. Combining two mutations can result in the sum of their individual effects, in the min or max of the two, or even in canceling out! When it comes to genetic changes, 1 + 1 may equal -1, 0, 1, or 2. In fact, 0 + 0 might equal 1.
These outcomes are not arbitrary — they are governed by complex, hard to model, and often unknown processes. Size of combined effect is extremely hard to predict without explicitly measuring it, even when effects of changes in isolation are known.
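Both the scale and the difficulty can be made concrete with made-up effect sizes:

```python
import math

# Pairwise combinations of the 1,000 changes already tried:
print(math.comb(1000, 2))  # 499,500 -- "about 500,000"

# Toy epistasis example with hypothetical titer deltas: the combined
# effect is not the sum of the parts, and the gap (the interaction
# term) is exactly what a naive additive ranking model gets wrong.
effect = {"mutA": 1.0, "mutB": 1.0}
combined_observed = -1.0  # 1 + 1 = -1
additive_prediction = effect["mutA"] + effect["mutB"]
interaction = combined_observed - additive_prediction
print(interaction)  # -3.0
```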
To further complicate matters, we may be interested in trying some changes that are not likely to result in a significant performance boost, but will give us some new information about the organism and how it reacts to certain classes of changes. Rather than optimizing for maximal gain at every pass through the DBTA loop, we want to find the “efficient frontier” between performance gain and knowledge gain. Depending on our objectives at any given time, we can choose where on the frontier we want to land — exploit our existing knowledge, explore promising unknowns to expand our body of knowledge, or perform some combination thereof.
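One deliberately simple way to pick a point on that frontier is an epsilon-greedy batch: spend most of the batch on top-ranked candidates and reserve a fraction for exploration. The candidate names and scores below are synthetic, and a real system would likely use more principled bandit or Bayesian-optimization methods:

```python
import random

def choose_batch(candidates, scores, batch_size, explore_frac=0.2, seed=0):
    """Fill most of the batch with the top-scored candidates (exploit)
    and the remainder with random draws from the rest (explore).
    explore_frac sets where on the gain-vs-knowledge frontier we land."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    n_explore = int(batch_size * explore_frac)
    exploit = ranked[: batch_size - n_explore]
    explore = rng.sample(ranked[batch_size - n_explore :], n_explore)
    return exploit + explore

# Synthetic candidate pool with made-up model scores.
scores = {f"change_{i}": random.Random(i).random() for i in range(10_000)}
batch = choose_batch(list(scores), scores, batch_size=1000)
print(len(batch))  # 1000: 800 exploit + 200 explore
```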
The better our model for predicted effect of changes is, the faster we can get to most effective microbe strains, and fundamental new discoveries. In fact, our models themselves, past a certain level of accuracy, become discoveries, as they implicitly encode previously unknown patterns in microbial genetics!
We are hiring
This post is an attempt to briefly discuss the ways that Data Scientists contribute to Zymergen’s mission across all four parts of the Design-Build-Test-Analyze cycle. The problems we tackle are full of challenge and nuance, each with significant implications and potential impact. Zymergen is placing a large bet on the potential of Data Science to change our business and our industry.
We are looking for great data scientists to join the team. We do not require a background in biology. Great candidates might have a background in applied statistics, machine learning, operations research, biostatistics, astrophysics, or a number of other fields. Generally, they will have a wealth of experience applying numerical analysis methods to real problems, and dealing with complex, noisy data sets. The problems they have solved in the past might range from social network analysis, to recommendation engines, to machine translation, to industrial process monitoring, to vehicle fleet routing.
In building the data science team that will tackle the variety of challenges we have before us, we are pursuing a simple strategy: hiring people with different backgrounds, deep expertise in their specific sub-specialties, and a healthy amount of interest and demonstrated ability in working outside of their domain.
Much thanks to reviewers who gave feedback, especially my wife, who doesn’t hesitate to tell it like it is, but is very nice about it.