Biology at Scale: 5 Reasons Why We’re (Finally) Past the Hype

Shoman Kasbekar
12 min read · Aug 25, 2020


Technology and scale have a well-known love affair, but their embrace of biology has had its rough patches. Why is now different?


From the early days of high throughput compound screening, to the more recent promises of genomics in precision medicine, the biotech industry has tended to overestimate the impact of “scale” on discovery.

This is scale in experimentation, process, synthesis, and computation. Scale that has emerged in the forms of crafty chemistry combinatorics to create millions of compounds, gargantuan screening facilities capable of speedy readouts, massive omics datasets, and much more.

And those are the legitimate efforts. I’m not even going to get into the unproven claims of scale that can pop up in biotech, à la a certain company ending in “heranos”.

While not entirely in vain, attempts at transforming biological discovery with scale have often fallen short, or at the very least taken far longer than expected to buck the hype cycle and make it out of the trough of disillusionment. The massive data sets traditionally produced (or pieced together) as a result of biology at scale have been in many ways flawed, resulting in limited translational value. Further still, there has been a gap between the massive data we currently have available, and our ability to translate that data into something we can run tangible experiments against.

Biology at scale promises data, info, and knowledge…when does this become insight, wisdom, and impact? Source: Gaping void

In other words, More is not Better, biology is complex, and the biotech industry has the costly clinical failures to show for it. Adding insult to injury, the term “Eroom’s Law” (Moore’s Law spelled backwards) has been coined to describe the steady decline in drug discovery productivity, a mirror image of the ever-upward climb of compute power under Moore’s Law.

But there is light. Just in the past decade, something of a reversal in the success rate of drug development has occurred.

Eroom’s Law may be on the verge of reversal. Source: Ringel et al

There are many reasons for this potential about-face (a friendlier FDA, a focus on rare genetic diseases, etc.), but the industry’s re-imagining of what scale means is one that is becoming ever more relevant. We are now at a turning point in how we generate and interpret biology at scale.

We are moving out of an era of brute force in biological experimentation, and into a new era of relevant, intelligent, and validated scale in biology.

This application of scale is poised to genuinely disrupt biological discovery productivity. In this new era, terms like “high throughput”, “massive”, and “automated” will carry real merit, instead of triggering immediate skepticism in the eyes of your friendly neighborhood pharma executive.

We will see functional genomics platforms identifying and validating biological targets at unprecedented speeds; the advent of relevant computational approaches quickly narrowing solution spaces; the rapid, intelligent optimization of technologies that up-level our control over biology. And much more.

A healthy amount of caution is necessary though — scale will never be the engine of biological discovery on its own. It will be the fuel that powers it.

But if we proceed vigilantly and point our ship in the right direction, an era of relevant, intelligent, validated scale is poised to revolutionize biological experimentation, process, and computation to transform our knowledge of sub-cellular omics, cells, systems, and bodies.

What is catalyzing this era of relevant, intelligent, and validated scale?

  1. Biologically relevant datasets are revealing subtle insights
  2. Multiparametric and multiplex experimentation is transforming the validation and throughput of biology
  3. “Intelligent automation” is enabling reproducible and optimized biology
  4. “Full-stack” biotechs are modularizing data generation
  5. Unbiased data and computational advances are converging to create predictive models of biology

#1: Biologically relevant datasets are revealing subtle insights

A certain founder of Microsoft was once quoted as saying —

Automation applied to an efficient operation will magnify the efficiency. Automation applied to an inefficient operation will magnify the inefficiency.

A parallel can be applied to biology —

Applying scale to generate irrelevant biological data will distort your findings. Applying scale to create relevant biological data will enable subtle insight.

Biologically relevant data is that which provides the most representative view of how our bodies function. This data is free of artifacts, is limited in noise, and is often derived from models with genetic, regulatory, metabolic, spatial, and temporal characteristics representative of our own internal machinery. Our ability to obtain this biologically relevant data has been greatly enhanced in recent years due to advancements leveraging novel integrations of chemistry, microfluidics, and microscopy. Such technologies have already enabled massive progress in the field of genomics, and we are now in a golden age of their application to all areas of biotechnology. The result has been the generation of biologically relevant data at scale, including —

  • Physiologically relevant data, produced from representative models such as pluripotent stem cells, primary cells, co-cultures, and organoids.
  • High-resolution data, using techniques such as single-cell analysis, spatial omics, and high content imaging.
  • Genetically validated data, particularly enabled by the advent of genome perturbation tools such as CRISPR. Coming from a time when we tested many hypotheses against few genetic backgrounds, we are now flipping the script and realizing the biological implications of genetic diversity.
  • Temporal data, captured over time rather than as a single, inconclusive snapshot. This could mean more frequent timepoints during a gene expression experiment, or tracking patient biomarkers against disease progression over 10 years.

Among many others. Through such data, we will be able to tease out signal from noise, and actually derive meaning from biology at scale instead of compounding the confusion.

#2: Multiparametric and multiplex experimentation is transforming the validation and throughput of biology

I know, this article started by calling out the shortcomings of some of today’s massive, multi-dimensional datasets. Sure, these datasets have aggregated various measures of the genome, epigenome, phenome, metabolome, and more, but they have done so in a highly piecemeal way. When even the way a pipette is held can impact an experiment, the batch effects, lack of standardization, and inherent variability in broadly aggregated datasets can make their findings indicative, rather than conclusive.

This is not to say existing troves of data are useless — they are invaluable, but only to a point. Systematic, incremental data generation will be critical for realizing the full value of aggregated datasets. Such incremental data generation 1) validates insights from aggregate datasets, and 2) fills the gaps.

Novel advances in multiparametric and/or multiplex experimentation platforms are addressing this need for incremental data generation, through the collection of massive, richly described, standardized datasets in single experiments.

  1. Multiparametric experimentation involves the collection of multiple, potentially orthogonal readouts at once. We are now frequently seeing tandem measures of cell morphology, cellular motility, gene expression, spatio-temporal variability, and more collected simultaneously (a simple data sketch follows this list).
  2. Multiplex experimentation involves the simultaneous processing of many biological events or components of a single type (e.g. sequencing of many pieces of DNA in parallel, identification of many cell surface markers at once, detection of many different metabolites in unison). In particular, we are seeing an emergence of “library on library” screening approaches, in which target libraries (usually proteins) are screened against a modifying entity (antibody, small molecule, T cell receptor, etc.).
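
To make the first of these concrete, here is a minimal sketch (in Python with pandas) of how a multiparametric dataset can be organized: each perturbation carries several orthogonal readouts, measured in one standardized experiment, in a single table. The well IDs, compound names, and readout values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical multiparametric screen: each row is one well (one perturbation),
# and the columns hold several orthogonal readouts measured in the same experiment.
wells = pd.DataFrame({
    "well_id":        ["A01", "A02", "A03"],
    "perturbation":   ["DMSO", "compound_17", "compound_42"],  # invented names
    "cell_area_um2":  [512.0, 430.5, 610.2],   # morphology readout
    "motility_um_hr": [12.4, 8.1, 15.9],       # live-cell imaging readout
    "tp53_expr_tpm":  [34.2, 51.7, 29.8],      # gene expression readout
    "viability_frac": [0.97, 0.88, 0.93],      # viability readout
})

# Because all readouts share one standardized context, they can be analyzed jointly,
# e.g. flagging perturbations that shrink cells without hurting viability.
hits = wells[(wells["cell_area_um2"] < 450) & (wells["viability_frac"] > 0.85)]
print(hits[["well_id", "perturbation"]])
```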

These “multi-experimentation” approaches are well-suited to increase the standardization, throughput, and validation of generated data. By maximizing efficiency and minimizing confounding variability, such datasets are ideal for validating the implications — and filling in the gaps — of our existing knowledge. Multi-experimentation goes a step further in the value it adds: when multiple variables or multiplex measures are collected in parallel, the dataset keeps yielding value beyond the immediate study, since it can be re-queried for questions that were not asked when it was generated.

Notably, multiparametric/multiplex scale is valuable in the context of our modern understanding of disease. We now know that most complex diseases are a result of many genetic, epigenetic, and environmental variables.

Multiexperimentation can provide the scale and context to both identify and drug multi-factorial causes of disease.

Multiparametric and multiplex platforms are reinventing the value of scale. Don’t take my word for it though — take a look at just a few of the groups revolutionizing the generation of multiparametric and multiplexed data.

  • Multiparametric: Recursion Pharma is powering the collection of dozens of cellular phenotypes simultaneously by leveraging high content imaging and machine learning-driven informatic pipelines.
  • Multiparametric: Freenome integrates assays for cell-free DNA, methylation, and proteins with machine learning techniques to understand additive signatures for early cancer detection.
  • Multiplexed: Octant Bio is evaluating the impact of single molecules across thousands of GPCR targets in unison, in an effort to find the molecules that may best treat multi-factorial conditions such as neurodegeneration and obesity.
  • Multiplexed: Tango Therapeutics is performing high-powered pooled CRISPR screens to evaluate the phenotypic effects of perturbing thousands of genes simultaneously.

#3: “Intelligent automation” is enabling reproducible and optimized biology

While the spotlight has recently fallen on various robotic laboratory companies touting automated platforms, automation is not a novel tool in biotech. Traditionally though, automation has been applied to relatively simple experimentation — think DNA sequencing and synthesis, or small molecule compound screens in immortalized cell lines.

Today, there is a movement towards intelligent automation of experiments, which can be attributed to progress in two areas —

  1. Integration of sensors, readouts, and longitudinal data collection as part of automated workflows.
  2. Algorithmic optimization of automated workflows based on collected data.

Firstly, such progress allows relevant biological data to be collected in a standardized manner, at scale. Given that over $28B/year is spent on non-reproducible biomedical research in the US alone, this standardization is critical. Secondly, by continuously optimizing experimental parameters, the best possible protocols for generating data with minimal noise can be found.

Most interestingly though, intelligent automation enables the rapid iteration of biological tools, technologies, and products.

Intelligent scale of experimentation can identify the key factors to modify in a given biotechnology, and then optimize those factors themselves.

Consider an example — the engineering of an allogeneic, gene-edited cell therapy. While the end goal is to engineer a cell capable of destroying cancer cells, the first step would be to identify the “tool” with which you would introduce genetic modifications into the cell. Such tools could include CRISPR/Cas9, TALENs, ZFNs, etc. To optimize a chosen tool, intelligent, automated experimentation would enable you to both identify the most important variables to optimize (e.g. transfection conditions, gene editing components, editing-enhancing reagents, etc.), and then to optimize those variables themselves. The resultant optimized toolkit could be used to perform complex edits such as site-specific gene knock-ins, multiplex genetic edits, and more, enabling the optimal designed therapy.
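
As a cartoon of what “identify the key variables, then optimize them” can look like in software, here is a minimal Python sketch. The search space, the run_edit_experiment stand-in, and its toy response surface are all hypothetical; a real platform would dispatch each candidate protocol to automated instruments and would typically use design-of-experiments or Bayesian optimization rather than the naive random search shown here.

```python
import random

# Hypothetical search space for a gene-editing transfection protocol.
SEARCH_SPACE = {
    "cas9_rnp_pmol":    [10, 20, 40, 80],
    "voltage_v":        [900, 1100, 1300, 1600],
    "pulse_width_ms":   [10, 20, 30],
    "enhancer_conc_uM": [0.0, 0.5, 1.0, 2.0],
}

def run_edit_experiment(params):
    """Stand-in for an automated experiment; in practice this would dispatch a
    plate to the robotic workcell and return the measured editing efficiency."""
    # Toy response surface with a noisy optimum, purely for illustration.
    score = (
        1.0
        - abs(params["cas9_rnp_pmol"] - 40) / 80
        - abs(params["voltage_v"] - 1300) / 2600
        - abs(params["pulse_width_ms"] - 20) / 60
        + 0.1 * params["enhancer_conc_uM"]
    )
    return max(0.0, min(1.0, score + random.gauss(0, 0.02)))

best_params, best_score = None, -1.0
for _ in range(25):  # each iteration = one automated experiment
    candidate = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    score = run_edit_experiment(candidate)
    if score > best_score:
        best_params, best_score = candidate, score

print(f"best editing efficiency {best_score:.2f} at {best_params}")
```

The point of the sketch is the loop itself: each cycle’s measurement informs which conditions get tested next, which is what lets intelligent automation converge on an optimized protocol quickly.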

Intelligent automation can optimize genome editing tools to achieve optimally edited cellular therapies Source: Satpathy et al

Such an optimization approach is relevant to many biological applications — viral vector design, enzyme engineering of designer nucleases, fermentation bioreactor processes, nanoparticle delivery formulations, and more. Thus, faster development cycle times attributable to intelligent automation will contribute to better biology tools, products, technologies, and therapies.

#4: “Full-stack” biotech platforms are modularizing data generation

The concept of “full-stack” comes from the software world, describing a system or team that spans from the back end (databases and architecture) to the front end (customer interface), with the connecting software in between.

The concept is relatively novel in the life sciences though, and full-stack approaches there have two key components —

  1. Vertical integration of experimental workflows and reagents. Piecemeal workflows lead to inconsistent outcomes, and even biological components such as enzymes can introduce artifacts and variability. Full-stack biotech platforms are now realizing the value of integrated experimentation from design → performance → analysis. By modularizing each of these steps, full-stack biotechs are able to combine the particular modules needed to achieve reproducible experimental outcomes of interest, at scale.
  2. Feedback loops enabling troubleshooting, continuous improvement, and “data flywheels”. Full-stack hardware, and the integrated software “thread” that flows through it, enable the collection of data along the entire pathway of experimentation. Such data collection supports both troubleshooting and continuous improvement of quality, throughput, and signal. Further, full-stack biology approaches enable “data flywheels”: each additional data point generated by the platform makes the subsequent data point easier to generate (a simple sketch of such a loop follows this list).
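
As an illustration of the loop described in these two points (not any particular company’s architecture), here is a minimal Python sketch. The functions design_constructs, run_assay, and score_results are hypothetical stand-ins for the design, performance, and analysis modules, and the toy scoring exists only to show how each cycle feeds the next.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """Single record in the data 'thread' that follows an experiment end-to-end."""
    design: dict
    raw_readout: dict
    score: float
    provenance: dict = field(default_factory=dict)

def design_constructs(prior_records):
    """Design module: propose the next construct, informed by everything run so far."""
    best_gc = max((r.design["gc_content"] for r in prior_records), default=0.5)
    return {"gc_content": min(best_gc + 0.02, 0.65), "promoter": "pJ23119"}  # hypothetical

def run_assay(design):
    """Performance module: stand-in for the automated wet-lab step."""
    return {"fluorescence_au": 1000 * design["gc_content"]}  # toy readout

def score_results(readout):
    """Analysis module: turn the raw readout into a comparable score."""
    return readout["fluorescence_au"] / 1000

records = []
for cycle in range(5):                       # the feedback loop / data flywheel
    design = design_constructs(records)      # each cycle is informed by prior data
    readout = run_assay(design)
    records.append(ExperimentRecord(design, readout, score_results(readout),
                                    provenance={"cycle": cycle}))

print(f"best score after {len(records)} cycles: {max(r.score for r in records):.2f}")
```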

One field that has benefited significantly from full-stack approaches is that of synthetic biology. Here, biotech groups have integrated aspects of reagent engineering, experimental design and execution, and output application. Example synthetic biology companies leveraging full-stack platforms to perform experiments at scale include Synthego, Asimov, and Ginkgo.

An example of a full-stack biotech approach to improve throughput/reliability, as applied to synthetic biology Source: Jessop-Fabre et al

#5: Unbiased data and computational advances are converging to create predictive models of biology

The scientific method (hypothesis → test hypothesis → analyze) has yielded many discoveries over time. Key word, discoveries. As such we often uncover biological insights through a combination of ingenuity and good old luck. Now, a fundamental shift is occurring in the way biologists, engineers, and computer scientists are deriving insight from biology at scale.

In this new paradigm, the intended consumer of experimental data is not a scientist; it is an algorithm.

Computational techniques have sometimes been applied in haphazard ways to biology. Despite this, many clever applications have risen from the scramble. Machine learning, for example, has been effectively applied to diverse challenges such as classification of high content cellular images, predictive diagnosis from multi-omic datasets, and virtual compound screens for de novo designed drugs. Such computational approaches are well-suited for deriving meaning from complex, multi-dimensional datasets, and have advanced in tandem with progress in compute power and access.
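
For a flavor of the first of those applications, here is a minimal, hypothetical sketch: training a classifier on features extracted from high-content cellular images. Real pipelines would start from actual image-derived morphology measurements (and increasingly from the raw images themselves); synthetic data stands in here so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for image-derived features (e.g. CellProfiler-style morphology
# measurements): 2,000 cells x 50 features, labeled by perturbation class.
X = rng.normal(size=(2000, 50))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A random forest is a common first-pass model for tabular morphology features;
# production pipelines often apply deep learning directly to the images instead.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```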

Understanding complex biology at scale would be fruitless without advances in computation. Source: Goff

In the new era of scale, more efficient and relevant experimentation is enabling us to generate datasets perfectly suited for ML-based interpretation.

Such data sets are tagged with rich descriptors, in multiple layers, and include both positive and negative experimental outcomes (unbiased). These data sets are accompanied by contextualizing metadata that provides valuable insight into the journey of the data itself (from creation to processing to curation). These data sets are of immense size, and are being produced at unprecedented cycle times, further strengthening the algorithm’s capability to make predictions.
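
What “contextualizing metadata” can mean in practice: a small, purely hypothetical record of the kind that might travel with each data point from creation through processing to curation. The field names are illustrative, not any specific platform’s schema.

```python
# Hypothetical provenance record attached to a single data point; the field names
# are illustrative, not any specific platform's schema.
sample_metadata = {
    "sample_id": "plate042_B07",
    "assay": "pooled CRISPR screen",
    "cell_model": "iPSC-derived cardiomyocyte",
    "instrument": {"liquid_handler": "LH-03", "sequencer": "SEQ-01"},
    "protocol_version": "v2.3",
    "processing_steps": ["demultiplex", "umi_collapse", "normalize_tpm"],
    "outcome_recorded": "negative",  # negative results are kept, not discarded
    "created_utc": "2020-08-01T14:32:00Z",
}
print(sample_metadata["outcome_recorded"])
```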

We as humans are notoriously poor at understanding causation. With the right datasets, enabled by advances in scale, and the right application of computation, we will make massive advances in uncovering the relationships hidden within complex data.

What will scale in biology actually achieve?

Fundamentally, relevant, intelligent, and validated scale will provide biology researchers and biotech companies with two tangible advantages —

  1. Generation of confidence-inspiring and novel data packages. Economic value in biotechnology lies squarely with the clinical assets that are generated, and scale will never be a substitute. Even in the early days of genomics, companies that pioneered scale (e.g. Celera) were outlasted by more asset-driven companies such as Plexxikon and Exelixis. Scale can, however, socialize concepts such as faster target validation, more descriptive data packages, and the use of data from genetics and other relatively novel spaces. Overall, scale will influence a biology researcher’s or drug developer’s conviction in a particular hypothesis.
  2. Optimization of experiments, technologies, and platforms. Scale will bring forth a new era in our ability to develop biotechnologies themselves. Intelligent iteration will be immensely enabling for companies pursuing outcomes ranging from calibrating a functional readout to designing a novel viral vector.

In a commercial sense, biotech groups effectively leveraging scale will have the opportunity to capture both upstream and downstream value in the biotechnology value chain (from early research → clinical assets). As data accumulates from validated experimentation at scale, a competitive moat may form around the resulting intellectual property, allowing such a group to generate further economic value.

We are starting to see the first real impact of relevant, intelligent, and validated scale on the success rate of biological discovery. Despite this, the long-term impact is yet to be understood.

We must identify the tools, technologies, and platforms that usher in a new era of scale. We must also identify the areas of biology and the applications in biotechnology that are most amenable to disruption.

Biological discovery rarely happens fast, but as we build up these proofs we may see a true snowball effect in the pace at which we are able to understand, derive insight from, and act on biology for the betterment of human health.


Shoman Kasbekar

Investor @Foresite Capital, Alliances @Rare Genomics Institute. Previously: Strategy @Synthego. https://www.linkedin.com/in/shoman-kasbekar/