How LabGenius is revolutionising drug discovery with machine learning

Harry Rickerby
Apr 29, 2020

Machine learning has the potential to provide huge value to drug discovery, in particular by reducing its failure rate and reversing the 60-year trend of decreasing R&D efficiency, as outlined in my previous post.

But let’s start with the basics of how drugs have been made until now. Given the current COVID-19 pandemic, you will probably have heard more about how drugs are developed than ever before, as well as some of the issues associated with it. Here, we’ll give an overview of drug development and also discuss what we need to consider in order to change — and massively improve — how this process is done.

Drug discovery is actually just one aspect of a much larger development pipeline that contributes to the huge and growing cost of taking a drug to market. Most large pharma companies are vertically integrated and do the lot, but they also partner with smaller enterprises that devote their time and energy to particular verticals.

Here at LabGenius, we’re focusing our expertise on drug discovery — more specifically new biological drugs, also known as biologics.

What makes an effective drug?

Developing a biological drug means cross-optimising for a host of characteristics. While some of these factors may not appear critical at first, it is vital that your drug doesn’t fall down on any of them, otherwise the chances of failure increase. Here are some of the most important characteristics that we consider:


Potency

Potency is a measure of how well your drug does the thing you want it to do. Many biological drugs are designed to block an interaction between two molecules in the body, while others may seek to activate the immune response. A high potency also allows doctors to prescribe lower doses, decreasing the chances of side effects.


Specificity

The specificity of a drug is the extent to which it affects only its intended target. A drug with low specificity is likely to have unintended, potentially toxic, effects by sticking to other molecules in the body. Generally speaking, when discovering new drugs, we aim to increase the specificity as much as possible.

Expression levels

Cells are used as tiny factories to produce biological drugs. Right now, it’s really difficult to predict which proteins (more on these below) will be expressed at high yields, which will fail to express at all, and which fall somewhere between. Expression levels are important since they define whether a protein can be produced at a cost that will allow it to be economically viable.


Stability

All proteins have a shelf-life: over time, they will degrade and lose their potency. This degradation rate depends on both the protein itself and the environment it is in. Some proteins may remain potent for years when stored frozen, but once in the body, they may degrade in minutes. It is all well and good having an incredibly potent molecule, but if it only survives for a few minutes, it is unlikely to be an effective drug.


Immunogenicity

Immunogenicity is the potential of a molecule to provoke an immune response. When a protein is injected into your body, your immune system can recognise that it isn’t one of your own and respond by eradicating it. This means the drug won’t last long in the body, and will therefore have much less of an effect. In some cases, the provoked immune response can actually cause the patient harm.


Aggregation

Sometimes, proteins stick to one another. This can happen immediately after the proteins are expressed, or it can happen gradually over time, while the proteins are being stored or transported. Aggregation is a huge problem since it negatively impacts all of the characteristics we talked about above. A protein that aggregates is likely to be less potent, less stable, less specific and more immunogenic.

The beautiful simplicity and infinite complexity of proteins

So when we talk about ‘proteins’, what do we mean? Every single organism on planet Earth uses proteins to get stuff done. Organisms are made up of cells, and proteins do most of the work within them, influencing their structure, function, regulation and replication.

Proteins come in all sorts of shapes and sizes, contorting themselves into highly complex and specific configurations. Understanding the underlying physics that governs how and why proteins fold in the ways that they do is an immensely difficult task and has been the subject of significant research.

The beauty of proteins is that, once you unravel their complex 3D structures, they are actually very simple. A protein can be thought of as a linear string of coloured beads. At each position on the string, a bead could be one of 20 different colours. On average, a protein is around 400 ‘beads’ in length.

In this metaphor, our coloured beads represent amino acids, the chemical building blocks that make up proteins. These 20 different amino acids each have different chemical properties. It is the combination of these different amino acids that gives proteins their structure, and it is that structure that provides their function.

So to engineer a new protein, the key is to find the right combinations of amino acids. This may sound like a simple task, but unfortunately it is actually one of the most complex and difficult search problems you can imagine!

Take the average length of a protein: 400 amino acids. At each position within that protein, you have 20 different amino acids to choose from. That works out to 20^400, or roughly 10^520, different combinations of amino acids. That’s vastly more than the total number of atoms on Earth!

The largest protein discovered in the natural world runs at around 33,000 amino acids in length. The number of combinations of amino acids for a protein of this length explodes to well beyond the number of atoms in the universe.

This combinatorial space is often called ‘sequence space’. It captures all of the potential protein sequences that have existed, do exist or could possibly exist.
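The size of sequence space follows directly from the arithmetic above: 20 choices at each position, multiplied across every position in the protein. As a quick check, here is a minimal Python sketch. The function name is made up for illustration; the only inputs are the figures quoted in the text (20 amino acids, lengths of 400 and 33,000).

```python
import math

NUM_AMINO_ACIDS = 20  # the 20 "bead colours"


def log10_sequence_count(length: int) -> float:
    """Return log10 of the number of possible sequences of the given length."""
    return length * math.log10(NUM_AMINO_ACIDS)


# Python integers have arbitrary precision, so the exact count can be computed too.
average_protein = NUM_AMINO_ACIDS ** 400

print(len(str(average_protein)))            # 521 decimal digits, i.e. about 10^520
print(round(log10_sequence_count(400), 1))  # 520.4
print(round(log10_sequence_count(33_000)))  # 42934, i.e. about 10^42934
```

Working in logarithms keeps the numbers manageable: the count for a 33,000-residue protein has tens of thousands of digits.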

Protein engineering thus far

Given this incredible complexity, how have scientists managed to engineer new proteins so far? The traditional approach is to copy evolution.

Let’s take an example related to drug discovery. We start with a known point in sequence space: antibodies. Antibodies are a class of protein used by your immune system to combat nasties in your body. As a result, they make a great starting point: your body already recognises them, so they are not likely to be highly immunogenic; they are capable of being extremely specific for their target; and they have a ‘built-in’ immune effector function, meaning that antibodies alert other parts of our immune system to the presence of an invader.

Once we have a starting point, we want to start making changes to its amino acid sequence to shift its function to something that is useful. In the case of antibodies, the first port of call is to change their binding so that they recognise the target that we’re interested in.

To find these new binders, regions of the antibody are randomly mutated, creating large ‘libraries’ of protein molecules, each with different binding capabilities. This library is then filtered — binders to the target of interest are selected, and those that do not bind are thrown away. This process is repeated, using the best variants of previous rounds as a new ‘starting point’ and creating new mutations around these.

This process is called ‘directed evolution’. It apes natural evolution: we create a pool of genetic variation, then apply a ‘selective pressure’ under which only the fittest proteins survive and propagate, and these new proteins go on to seed the next generation.
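The mutate, select, repeat loop described above can be sketched in a few lines of Python. This is a toy illustration only: in the lab, selection is a physical binding experiment, whereas here `toy_binding_score` is a made-up stand-in that counts matches to an arbitrary hidden sequence. All names, sequences and parameters are assumptions for the sake of the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 amino acids


def toy_binding_score(seq: str, target: str) -> int:
    """Hypothetical stand-in for a binding assay: count positions matching a hidden sequence."""
    return sum(a == b for a, b in zip(seq, target))


def mutate(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a randomly chosen amino acid."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def directed_evolution(start, target, rounds=30, library_size=200, keep=10, seed=0):
    rng = random.Random(seed)
    parents = [start]
    for _ in range(rounds):
        # 1. Create variation: a library of random mutants around the current parents.
        library = [mutate(rng.choice(parents), rng) for _ in range(library_size)]
        # 2. Apply selective pressure: keep only the best binders.
        #    Parents stay in the pool, so the best score never goes down.
        pool = parents + library
        pool.sort(key=lambda s: toy_binding_score(s, target), reverse=True)
        # 3. The survivors seed the next round.
        parents = pool[:keep]
    return parents[0]


start = "AAAAAAAAAAAA"
target = "MKTWYVLDEQGH"  # arbitrary made-up sequence, not a real binder
best = directed_evolution(start, target)
print(best, toy_binding_score(best, target))
```

Note that every mutant is only one change away from a current parent, which is exactly the limitation discussed in the next section.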

In drug discovery, once new binders have been discovered using directed evolution, each new protein is tested against the other important characteristics we talked about above: potency, specificity, aggregation, expression and so on. Every time a molecule fails one of these tests, it is discarded.

You can think of this process a bit like a funnel — the further down the funnel you go, the fewer molecules are left. If you’re lucky, you end up with a few molecules that you can take forward to test in animals. If you aren’t, you’ll have to go back and engineer some new binders, and hope that these make it through.
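The funnel can be pictured as a chain of pass/fail filters applied one after another. The sketch below uses invented candidates and invented thresholds purely to show the shape of the process; real cut-offs depend on the assay and the project.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    potency: float      # higher is better
    specificity: float  # higher is better
    aggregation: float  # lower is better
    expression: float   # higher is better


# Hypothetical pass/fail thresholds, chosen only for illustration.
FILTERS = [
    ("potency", lambda c: c.potency >= 0.7),
    ("specificity", lambda c: c.specificity >= 0.8),
    ("aggregation", lambda c: c.aggregation <= 0.2),
    ("expression", lambda c: c.expression >= 0.5),
]


def run_funnel(candidates):
    """Apply each test in turn, discarding failures, and record how many survive each stage."""
    survivors = list(candidates)
    counts = [("start", len(survivors))]
    for name, passes in FILTERS:
        survivors = [c for c in survivors if passes(c)]
        counts.append((name, len(survivors)))
    return survivors, counts


candidates = [
    Candidate("A", 0.90, 0.90, 0.10, 0.8),
    Candidate("B", 0.90, 0.60, 0.10, 0.9),
    Candidate("C", 0.50, 0.90, 0.10, 0.9),
    Candidate("D", 0.80, 0.90, 0.40, 0.7),
    Candidate("E", 0.80, 0.85, 0.15, 0.3),
]
survivors, counts = run_funnel(candidates)
print(counts)                       # 5 -> 4 -> 3 -> 2 -> 1 survivors
print([c.name for c in survivors])  # ['A']
```

Each candidate here is strong on most properties, yet four out of five are lost because they fall down on just one.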

Climbing a mountain in the fog

But there is a major failing to this otherwise quite clever approach of directed evolution.

Imagine you’re hiking in the mountains. Your goal is to find and to scale the highest peak in the mountain range. To make things interesting, the clouds have descended, and you can only see 2 metres ahead of you. What’s your approach?

Really the only thing you can do is start walking uphill. You’ll eventually reach a peak, look around, and the only way you can see is down. You made it! The trouble is that, if those clouds lifted, you’d realise that across a valley to your right there is a peak 10 times taller. Bummer.

With the clouds so low, there is no way for you to know that mountain even exists.

This is just like navigating sequence space with directed evolution. Because we introduce random mutations into our starting molecule, we only get to see the small space around where we started.

We might find a molecule that is an improvement on our starting point, but once we reach a peak, we have no idea whether it really is the best molecule for our requirements. There might be one that’s far, far better, but we’d have to traverse entire valleys of poor molecules to get there.
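In optimisation terms, the hiker in the fog is doing greedy hill climbing and getting stuck at a local optimum. The sketch below shows this on a made-up one-dimensional landscape with a small peak and a peak 10 times taller. The landscape and the function names are invented for illustration and stand in for the real, vastly higher-dimensional sequence space.

```python
import math


def landscape(x: int) -> float:
    """Made-up fitness landscape: a small peak at x=20 and a 10x taller peak at x=70."""
    small_peak = 1.0 * math.exp(-((x - 20) ** 2) / 50)
    tall_peak = 10.0 * math.exp(-((x - 70) ** 2) / 50)
    return small_peak + tall_peak


def hill_climb(start: int, lo: int = 0, hi: int = 100) -> int:
    """Greedy ascent: look one step left and right, and move only if that is strictly higher."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=landscape)
        if landscape(best) <= landscape(x):
            return x  # every visible direction is downhill or flat, so we stop
        x = best


local_peak = hill_climb(start=10)
global_peak = max(range(0, 101), key=landscape)  # only possible with a full "map"

print(local_peak, round(landscape(local_peak), 2))    # 20 1.0
print(global_peak, round(landscape(global_peak), 2))  # 70 10.0
```

Starting at x=10, the climber stops at the small peak and never learns that the taller one exists, because reaching it would mean walking downhill first.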

In an ideal world, we’d have a map that describes the contours of our mountain range. With that information, we could safely traverse those valleys in the knowledge that there’s a mountain to climb on the other side.

Photo by Simon Matzinger on Unsplash

The LabGenius drug discovery engine

And that’s where we come in.

As in the foggy mountains, the problem with pure directed evolution is that you’re very likely to end up trapped at a local peak — there is just no way to know whether there is a taller one on the other side of the valley.

We mitigate this issue by constructing a map of a region of sequence space. With this map, we are able to effectively explore — parachuting down onto mountains and then exploring them more thoroughly to find their peaks.

But how do we do this?

Proprietary sequencing data + advanced computation

First we need to collect the data that allows us to construct our map. Because it’s so difficult to predict the structure and function of proteins based purely on their sequence, we rely on our own empirical experimentation to produce this data.

Using our own advanced wet labs and automated robotic systems, we are able to synthesise large protein libraries and, as in standard directed evolution, apply a ‘selective pressure’, separating different proteins based on how good they are at the function that we’re looking for.

Once we’ve separated the proteins by their function, we read them using Next Generation Sequencing technologies. This allows us to read tens of millions of sequences in just a couple of days. It is this data, from our selection experiments, that allows us to build up our map.
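To give a feel for how sequencing reads might become data points on the map, here is one common and simple approach: compare how frequent each sequence is before and after selection. This is an assumption for illustration, not a description of the actual LabGenius pipeline; the function name, the pseudocount and the read data are all invented.

```python
import math
from collections import Counter


def enrichment_scores(pre_reads, post_reads, pseudocount=1.0):
    """Log2 ratio of each sequence's frequency after selection versus before.

    The pseudocount (an assumed smoothing value) avoids dividing by zero
    for sequences that appear in only one of the two pools.
    """
    pre, post = Counter(pre_reads), Counter(post_reads)
    variants = set(pre) | set(post)
    pre_total = sum(pre.values()) + pseudocount * len(variants)
    post_total = sum(post.values()) + pseudocount * len(variants)
    scores = {}
    for seq in variants:
        pre_freq = (pre[seq] + pseudocount) / pre_total
        post_freq = (post[seq] + pseudocount) / post_total
        scores[seq] = math.log2(post_freq / pre_freq)
    return scores


# Invented reads: three variants start out equally common, then selection is applied.
pre_selection = ["ACDE"] * 50 + ["ACDF"] * 50 + ["GCDE"] * 50
post_selection = ["ACDE"] * 120 + ["ACDF"] * 25 + ["GCDE"] * 5

scores = enrichment_scores(pre_selection, post_selection)
for seq, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(seq, round(score, 2))
```

A positive score means the sequence became more common under selection; a negative score means it was depleted.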

Machine learning models + flywheel effects

While tens of millions of empirical data points certainly seems like a lot, compared with the magnitude of sequence space it is in fact an infinitesimally small fraction.

Even if we were to just focus on the regions of an antibody molecule that bind a target, there are some 300,000,000,000,000,000,000,000,000,000 different combinations of amino acids to choose from. How could so few data points give us an accurate, complete map of such a huge space?

This is where machine learning comes in. By training machine learning models on this data, we can build a model of the space, extrapolating and making predictions about sequences that we’ve never empirically tested before.
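As a minimal sketch of what predicting untested sequences means, the example below fits a very simple additive model: one learned weight per (position, amino acid) pair, trained by stochastic gradient descent. The article does not say which models are used, so this model is purely an illustrative assumption, and the sequences and scores are invented.

```python
def features(seq):
    """One feature per (position, amino acid) pair: the 1s of a one-hot encoding."""
    return list(enumerate(seq))


def predict(model, seq):
    weights, bias = model
    # Pairs never seen in training contribute 0: the model knows nothing about them.
    return bias + sum(weights.get(f, 0.0) for f in features(seq))


def fit_additive_model(measured, epochs=1000, learning_rate=0.1):
    """Fit score ~ bias + sum of per-(position, amino acid) weights by gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for seq, score in measured.items():
            error = predict((weights, bias), seq) - score
            bias -= learning_rate * error
            for f in features(seq):
                weights[f] = weights.get(f, 0.0) - learning_rate * error
    return weights, bias


# Invented measurements: a starting sequence plus four single mutants.
measured = {
    "AAAA": 1.0,  # starting point
    "WAAA": 2.0,  # helpful change at position 0
    "AWAA": 2.0,  # helpful change at position 1
    "AAPA": 0.0,  # harmful change at position 2
    "AAAP": 0.0,  # harmful change at position 3
}
model = fit_additive_model(measured)

# Neither of these sequences was ever measured.
print(round(predict(model, "WWAA"), 2))  # combines both helpful changes
print(round(predict(model, "AAPP"), 2))  # combines both harmful changes
```

The model ranks the untested double mutant `WWAA` above everything it was trained on. A purely additive model like this ignores interactions between positions, which matter a great deal in real proteins, so treat it as a picture of the idea rather than a usable predictor.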

The more we test, the more data we accumulate, the more accurate and complete our maps of sequence space become — a true flywheel effect. This approach will enable us to reach higher peaks than a traditional directed evolution approach ever could.

Multi-parametric optimisation + higher dimension perception

And what about all of those characteristics we covered that are required to make an effective drug? The beauty of using models to inform discovery is that we are not restricted to only optimising for one feature at a time. In fact, we can build multiple models of many different features, and can optimise across all of these at once. We call this multi-parametric optimisation.

Continuing with the metaphor of building maps of sequence space, imagine overlaying a number of these maps on top of one another, then looking for peaks that appear at the same location across all of them.
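One simple way to look for peaks that appear across all of the maps at once is to score each candidate by its weakest predicted property, so that a molecule only ranks highly if it does well everywhere. This is just one possible scoring rule, chosen here for clarity; the variants and predicted values below are invented.

```python
# Hypothetical model predictions, scaled so that 1.0 is best for every property.
predicted = {
    "variant_1": {"potency": 0.95, "specificity": 0.40, "expression": 0.90},
    "variant_2": {"potency": 0.80, "specificity": 0.85, "expression": 0.75},
    "variant_3": {"potency": 0.60, "specificity": 0.90, "expression": 0.95},
    "variant_4": {"potency": 0.99, "specificity": 0.95, "expression": 0.10},
}


def weakest_property_score(scores):
    """A candidate is only as good as its worst predicted property."""
    return min(scores.values())


def rank_candidates(predictions):
    """Order candidates from best to worst by their weakest property."""
    return sorted(
        predictions,
        key=lambda name: weakest_property_score(predictions[name]),
        reverse=True,
    )


ranking = rank_candidates(predicted)
best_on_potency = max(predicted, key=lambda name: predicted[name]["potency"])

print(ranking)          # ['variant_2', 'variant_3', 'variant_1', 'variant_4']
print(best_on_potency)  # variant_4
```

Optimising for potency alone would pick `variant_4`, which is predicted to express very poorly. Looking across all properties together puts the well-rounded `variant_2` first instead.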

This higher dimension perception allows us to build in those key considerations that are usually left to chance. Not only does this reduce the likelihood of problems arising downstream, but it also opens up possibilities for developing proteins with functionality that would previously have been extremely challenging.

Transforming the future of medicine

There is one final benefit of using a machine-learning driven approach to biologics drug discovery.

Over time, by accumulating more and more data linking protein sequence to function, we will begin to build ‘generalisable’ models. These models allow us to use learnings produced for one project to inform the designs for an entirely different one. With greater and greater amounts of data, the accuracy of these models will continually improve.

This is a huge opportunity: a future in which models can accurately distinguish between a drug that will succeed and a drug that will fail would slash the risk associated with developing new biological drugs.

Lowering the risk profile (and cost) of drug discovery may not be intuitively exciting, but the opportunities for human health are vast. It makes it possible to develop drugs for diseases where there isn’t yet potential for huge economic return, such as rare diseases and diseases that predominantly affect the developing world.

Accurately predicting successful drugs will not only reduce the risk of drug discovery but will accelerate it too. We live in a time when the need to rapidly discover new drugs to treat emerging diseases couldn’t be clearer. Think of the human and economic toll the COVID-19 pandemic has taken. The need for a faster and more efficient way to develop new drugs is obvious.

And let’s not forget that most diseases are not homogeneous, as the rise of research in personalised medicine has demonstrated. However, without a step change in the costs of drug discovery, we have absolutely no way of capitalising on this understanding.

With our approach, we can finally begin to imagine a world in which treatments are developed for individual patients.

Reaching these goals will not be easy — it will require significant investment of time and capital, and the bridging of deeply technical and disparate fields. Get it right though, and there is an opportunity to take the next giant leap for humanity.

