Where is generative design in drug discovery today?

Leo Wossnig
23 min read · Jul 3, 2023


Disclaimers

  • The opinions expressed within this article are solely mine and do not reflect the opinions and beliefs of my employer or any affiliate.
  • In the following, I will talk about both generative protein design and generative small molecule design. It is important to remember that these two approaches are fundamentally different in many ways, for example in the way we represent the molecules (e.g., SMILES strings vs amino acid sequences). I will try to only make statements that are credible for both areas; however, the degree to which this applies may vary.

Introduction

In this blog post, I share the current state of generative molecular design, and offer my perspective on its progress to date. I will endeavour to explain why some past criticisms are no longer relevant and highlight the remaining challenges that need to be overcome to further improve the impact that generative design can have on the drug discovery pipeline.

Neoleukin, originally established in 2021 from David Baker’s Lab, and arguably the first company to administer a de novo designed protein therapeutic to a patient, is currently exploring strategic alternatives such as a sale or merger. This follows the failure of their de novo designed IL-2/IL-15 agonist, NL-201, in phase 1 clinical trials. Failures like this remind us that generative design won’t immediately revolutionise drug discovery and result in better drugs overnight (i.e., ones that end up in patients). However, despite such setbacks, it is important to highlight the promise that this field offers and not forget the advances that have been made over the past few years. There is still substantial opportunity to discover hit and lead molecules more quickly and affordably by adopting principled approaches in molecular design, or to find molecules that can address targets that would once have been classified as ‘undruggable.’

The field of generative molecular design is currently experiencing a huge surge in interest, particularly in the area of generative protein design. Despite the criticism, I believe that some of this interest is justified as these methods have the potential to accelerate the early stages of drug discovery.

Generative or de novo molecular design for both small molecules and proteins has been around for a few years now and, following the success of AlphaFold, activity has recently picked up again. A lot of attention is on so-called diffusion models, with the most recent versions being able to co-generate protein sequence and structure. Protein diffusion models include RFDiffusion from the Baker lab and Chroma by Generate. The main difference? Chroma has more ML acrobatics (chain co-variation, classifier guidance-based conditioning, a novel low-temperature sampling method for diffusion models that increases sample quality whilst maintaining high diversity, efficient GNNs for modelling long-range interactions, etc.), but RFDiffusion has better experimental validation. By combining all of these ML innovations, Chroma could be trained from scratch, whereas this did not work for RFDiffusion, which instead had to start from a pre-trained RoseTTAFold.
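To give a flavour of what ‘low-temperature sampling’ means in practice, below is a minimal, generic sketch of ancestral diffusion (DDPM) sampling with a temperature knob that simply rescales the injected noise. This is an illustration only and not Chroma’s actual scheme (which rescales the reverse-process distribution in a more principled way); the `denoiser` is a hypothetical noise-prediction network.

```python
# Minimal sketch of ancestral DDPM sampling with a "temperature" knob that
# rescales the injected noise. Generic illustration only; Chroma's actual
# low-temperature sampler is more sophisticated than this.
import torch

def sample(denoiser, shape, betas, temperature=0.5):
    """denoiser(x_t, t) is assumed to predict the noise added at step t;
    betas is a 1-D tensor holding the noise schedule."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t)
        # Posterior mean of x_{t-1} given x_t (standard DDPM update).
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # Lower temperature -> less injected noise -> higher-likelihood,
            # lower-diversity samples.
            noise = torch.randn_like(x)
            x = mean + torch.sqrt(betas[t]) * (temperature ** 0.5) * noise
        else:
            x = mean
    return x
```

In this naive form, lowering the temperature trades diversity for sample quality; softening exactly that trade-off is what the Chroma authors aim for with their more elaborate scheme.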

These diffusion methods have been used to design antibodies that bind to desired antigens and other proteins. Whilst early results look promising, a lot more validation is necessary to understand how they fit into the wider protein engineering toolbox. From conversations with multiple therapeutic antibody discovery teams, these methods generally seem to work well and result in hit rates of 2–10% when designing binders targeting a specific antigen, with the rates varying between targets. A recent paper by researchers from NYU and Prescient Design (Genentech), which applied these approaches to the optimisation of antibodies for higher expression yield and binding affinity, even found a 25% binding rate (log10(KD) values ranging from 6 to 8) against the target and 97% expression yields in in vitro assays (based on 68 designs).

While more and more models are becoming available (including many open-source ones such as REINVENT, LS-MolGen, or ChemTSv2 just to name a few), my personal take is that more data will be required to build large pre-trained models (for example, a model for antibody binding to any target). Once that data is available, these models could be particularly useful for the antibody lead discovery stage where binding is the main objective. Whether this will translate into functionality is still unclear. For example, T-cell activation and CD3 binding are not the same thing: Not every T-cell engager that binds CD3 will result in T-cell activation, even if cross-linking is achieved. Testing such functional properties also requires more complex functional assays, which by definition require a lot more time and resources when compared to simple binding assays. However, finding more binders will likely increase the chance that one of them achieves the desired function.

Beyond function, I think there will still be a need for other methods and more data generation to overcome some of the key challenges associated with the current generation of therapeutics. Take for example CD3-targeting bispecifics: We need new therapeutic approaches that can overcome the dose limiting toxicity that appears to be one of the main challenges for their ability to address solid tumours — see here or here. For example, in the absence of cell-level data (e.g., receptor densities on the cell), the use of generative models for multi-specific T-cell engagers or ADCs is still a long way off.

Generative design (machine learning for chemistry) has been regularly applied in the small molecule space, most notably to one of the biggest cancer target classes: kinase inhibitors. A lot of the early papers and claims have been rightfully scrutinised. See, for example, the criticisms by Pat Walters, Derek Lowe, or Andreas Bender on the early DDR1 paper. This resulted in a number of good recommendations with regard to how we should evaluate such methods (e.g., here and here).

Since then, however, many of the initial challenges, such as generating ‘crazy’ structures or the exploitation of the scoring functions, have been addressed by the community. While some papers still make elementary mistakes (see for example figure 3 in this paper), more recent papers address many if not most of the initial shortcomings, such as the lack of prospective validation and of testing key properties like selectivity for kinase inhibitors. (It’s important to note that there are a lot of challenges with retrospective studies, even using data sets like the Therapeutics Data Commons; for a discussion see here and here. The CACHE challenge is worth highlighting as a new option to test a generative model in a prospective setting.)

Molecules that are generated today can satisfy relevant ranges for physicochemical properties like molecular weight, logP, or logD, and pass through MedChem and PAINS filters. They are selective (where needed), and progressively more examples also include experimental validation. They might not universally be viewed as perfect molecules, but this paper by GSK performed a molecular Turing test and showed that all generated ideas were liked by at least one chemist. This is quite good in my opinion, since if you ask two chemists which molecules are better, they would likely disagree anyway.

Figure 1: Turing test for molecular generators. The number of likes per molecule, coloured according to the origin of design: by chemist (yellow), computer (dark green), or both chemist and computer (light green). All ideas were liked by at least one chemist, and most were liked by more than half of the chemists. Source: A Turing Test for Molecular Generators
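To make the filtering step mentioned above concrete, here is a minimal sketch of how generated molecules might be checked against property ranges and RDKit’s built-in PAINS catalogue. The property ranges and example SMILES are purely illustrative, and real workflows typically add many more MedChem rules.

```python
# Minimal sketch of post-generation filtering: physicochemical property
# ranges plus RDKit's built-in PAINS filter catalogue.
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = FilterCatalog(params)

def passes_filters(smiles, mw_range=(150, 500), logp_range=(-1, 5)):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # unparsable structure
        return False
    mw = Descriptors.MolWt(mol)
    logp = Crippen.MolLogP(mol)
    if not (mw_range[0] <= mw <= mw_range[1] and logp_range[0] <= logp <= logp_range[1]):
        return False
    return not pains_catalog.HasMatch(mol)   # reject PAINS substructures

generated = ["CC(=O)Nc1ccc(O)cc1", "c1ccccc1N=Nc1ccccc1"]  # placeholder outputs
kept = [s for s in generated if passes_filters(s)]
```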

One paper that demonstrates many of the above points is by Li et al., which identified a selective RIPK1 inhibitor with a previously unreported scaffold. They also demonstrated selectivity (experimentally) using a panel of 406 kinases (at 10 µM), including key off-targets. To do this, they generated a targeted virtual library and then screened this library with conventional virtual screening methods to prioritise candidates for testing. Notably, they did not actively optimise for selectivity or build this into their design process. Although this is a great example of real validation, it is still ‘level 1 automatic chemical design’ (thanks to Morgan Thomas for pointing this out). Automatic chemical design level 1 refers to a degree of ideation automation in chemical design where the machine generates ideas for molecular structures, while the chemist still provides varying amounts of guidance and retains the responsibility for final selection and assessment of the synthesisability of the molecules.

Figure 2: Some off-targets for the RIPK1 inhibitor tested by Li et al.

Over the past few years, I have seen first hand, in my own work and in that of other research groups and companies (for example the collaboration between Microsoft and Novartis, highlighted on pages 15/16 of this report), how these approaches have been successfully applied in internal pipelines. Many new generative approaches have been developed for various applications, such as linker design, scaffold decoration, and combinatorial library design, or for specific types of drugs, such as covalent kinase inhibitors. Generative design can hence be used for scaffold hopping, fragment linking (including PROTAC design) and fragment growing, R-group optimisation, and library design.

Improvements have been made on the efficiency of the methods (here using curriculum learning or here through directly optimising the learning steps) and the way in which they are integrated into conventional drug discovery programs.
It is worth noting that RNN-based generative models are the most prevalent generative small molecule design models in the literature (cf. Fig. 10 in this article). More recently, there has also been a surge of structure-based ligand design papers, which directly ‘diffuse’ (denoise) a ligand into the 3D protein pocket. For a good introduction (for chemists) to how these models work, I recommend this recent article. Models such as DiffSBDD can be used for a variety of tasks, including fragment linking, scaffold hopping, or scaffold elaboration. However, most structure-based papers do not have experimental validation, and generated molecules have significantly more physical violations and fewer key interactions compared to baselines, so it’s unclear how much value they will provide. Look out for a paper by Charlie Harris and others which investigates this in detail and will become available soon.
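For readers less familiar with what an ‘RNN-based generative model’ looks like under the hood, below is a minimal sketch of a character-level LSTM SMILES generator and its sampling loop. Architecture sizes and tokenisation are deliberately simplified assumptions; production models such as REINVENT use proper tokenisers (treating ‘Cl’, ‘Br’, ring closures, etc. as single tokens) and are pre-trained on large SMILES corpora.

```python
# Minimal sketch of a character-level RNN SMILES generator; training loop and
# vocabulary construction are omitted for brevity.
import torch
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, emb=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)
        out, state = self.lstm(x, state)
        return self.head(out), state

@torch.no_grad()
def sample(model, start_idx, end_idx, max_len=100):
    """Autoregressively sample one token sequence (one SMILES string)."""
    token = torch.tensor([[start_idx]])
    state, out = None, []
    for _ in range(max_len):
        logits, state = model(token, state)
        probs = torch.softmax(logits[:, -1], dim=-1)
        token = torch.multinomial(probs, num_samples=1)  # sample next character
        if token.item() == end_idx:
            break
        out.append(token.item())
    return out  # token indices, to be mapped back to characters
```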

Similarly, generative protein design has moved from hallucination of ‘random’ protein structures to models that generate proteins with well defined properties, for example antibodies that bind a specific antigen or CDR regions targeting a specific epitope.

While the challenges for generative small molecule design and protein design are somewhat different, some of the open problems are highly related.

In the next section I will address some of the criticism of de novo small molecule design and then go on to talk more about the real-world challenges in both generative protein and small molecule design.

Common criticisms of de novo (small molecule) design are outdated

While I agree with a lot of the initial criticism, some of the more recent comments (for example Pat Walters’ blog post or Derek Lowe’s commentary on generative chemistry) no longer feel appropriate.

There are three arguments against generative (small molecule) design that are still frequently raised:

  1. These methods do not generate drugs
  2. Generative approaches are mainly used to design binders (hits). This means that the molecules are optimised for protein binding, a set of physicochemical properties, and for passing certain MedChem and PAINS filters. This is not as hard as optimising for properties that are truly relevant in drug discovery (e.g., selectivity for kinase inhibitors)
  3. Generative models do not come up with truly novel molecules (i.e., something entirely unseen) and instead they just design molecules within the known chemical space of existing drugs

I will try to address these separately:

  1. An argument that is often made is that these methods do not generate drugs, which is absolutely correct. However, these methods do still add value in helping us find good starting points (e.g., selective hits) or early leads for our programs as fast as possible. If a generative approach can provide us with a large and diverse set of lead-like molecules that fulfil our target molecular profile, then we have a higher chance of progressing into the next stage, due to having a sufficient number of candidates to test. They are ultimately another tool in the large drug discovery toolbox that we should lean on in combination with all of our other methods, either explicitly, or implicitly.
  2. While I agree that the generation of molecules with more complex profiles (for example, selective, non-promiscuous Kinase inhibitors) has rarely been demonstrated in the literature, the above mentioned publication is a good example of such a result, and there are increasingly more papers attempting to perform more rigorous experimental validation. I am also aware of multiple industry run programs that have achieved similar results.
    It should be noted that generative models will not solve all our problems. In fact, the reinforcement learning based approaches, such as REINVENT (which continues to be one of the best performing methods — see here and here), can only create molecules for properties that we understand or can predict well (a minimal sketch of such a multi-property reward follows Figure 3 below). For example, selectivity for small molecule kinase inhibitors is inherently hard to predict, and neither conventional methods (docking, FEP, etc.) nor machine learning can reliably predict it off-the-shelf. For machine learning, this is largely due to a lack of reliable selectivity data in the right quantities.
  3. I am not a big fan of the third argument for two reasons.
    Firstly, a recent paper by Brown and Boström (AstraZeneca) that analysed the origin of clinical compounds concluded that the largest share of clinical candidates (43%) was derived from known starting points (e.g., literature, patents, previous corporate knowledge, etc.), followed by random screening at only 29% (see figure 3 below).
    Another recent publication by Roche and Genentech analysed lead-finding trends and the origins of lead series from their organisations between 2009 and 2020 and found that public information made up the largest share of development candidate origins for both organisations (31% and 35% for Genentech and Roche, respectively). This was more recently confirmed by another study by Brown and Boström, which highlighted that the most frequent lead generation strategies resulting in clinical candidates were starting from known compounds (59%) and random screening approaches (21%).
    In addition to this, more recent methods all incorporate mechanisms to introduce more diversity and novelty (example here). An even more extreme example is that 400 CNS drugs come from just 20 natural product scaffolds.
    Also, from a drug discovery perspective, it’s irrelevant whether the molecule is known or not as long as (a) it can potentially help a patient (i.e., it offers a benefit over the current standard of care), and (b) it is novel from a patent perspective, which is typically a requirement for companies to progress their molecule. Beyond this, using a known molecule will typically reduce the development risks because something that has already been in a human is much less likely to result in unexpected side effects.
Figure 3: Lead generation strategies that resulted in a clinical candidate. The majority of clinical candidates rely on known starting points from the literature, patents, and other sources. Source: Brown and Boström.
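As referenced in point 2 above, here is a minimal sketch of the kind of multi-property reward a REINVENT-style reinforcement learning agent is optimised against. The `activity_model` and `off_target_model` are hypothetical placeholder predictors, and the weights are illustrative; the point is that the reward, and therefore the generated chemistry, can only be as good as the underlying property predictors.

```python
# Minimal sketch of a multi-property reward for a generative RL agent.
from rdkit import Chem
from rdkit.Chem import QED

def reward(smiles, activity_model, off_target_model):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0                       # invalid structures get zero reward
    p_active = activity_model(mol)       # predicted probability of on-target activity
    p_off = off_target_model(mol)        # predicted probability of off-target activity
    drug_likeness = QED.qed(mol)         # 0..1 drug-likeness score
    selectivity = max(0.0, p_active - p_off)
    # Weighted sum of components, each in [0, 1]; weights are illustrative.
    return 0.5 * p_active + 0.3 * selectivity + 0.2 * drug_likeness
```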

I do, however, agree with Pat and others, when they say that there is a tendency to overstate the impact of the methods. After all, it’s still unclear whether this will translate to a higher probability of success in the clinical trials of these molecules. While the clinical impact of generative methods is still to be seen, it should be noted that so far the biggest impact of ‘AI’ in the clinic has come from patient stratification: By using machine learning (or bioinformatics), it is possible to double the clinical probability of success (see table 3 here), which is huge.

Putting these criticisms aside, what are the challenges we’re facing in the field of generative design today?

Where are the actual challenges of generative molecular design?

I see three main limitations for these approaches at the moment:

  1. Little, and low quality data and slow data generation (e.g., molecular synthesis)
  2. No data for new targets: A lack of structure, binding sites, and binding data
  3. Cultural challenges for the adoption of new methods

Little, and low quality data and slow data generation

Despite advances in the methods/models, data availability and collection is still the biggest bottleneck in ML/AI for drug discovery. This also applies to generative design approaches. For small molecules, the synthesis step is usually the rate limiting factor. While public data or the use of off-the-shelf libraries can improve the speed of the initial iterations and get you to a hit more rapidly, it usually doesn’t allow you to get to an optimised lead.

Good quality data is usually sparse, in particular for novel targets, and the quality of public data is rarely good enough to use (see also this recent rdkit blog post). Using such data helps find hits, but due to the noise, it has limited use for the optimisation of more complex properties, such as selectivity.

So the first challenge is the availability of good quality data in sufficient amounts for it to be usable by machine learning systems. The underlying causes are often a lack of resources and the time required to obtain new data. Data generation in the case of small molecules means synthesising the molecule and then testing it. This bespoke synthesis can take several months and yield as few as tens of molecules. Such limited amounts of data are very challenging for machine learning models. For protein engineering this problem is less severe, as there are ways of expediting the build and test steps. For example, using IDT eBlocks (300 to 1,500 base pairs in length) combined with cell-free protein synthesis, the build and test steps can be completed in as little as two weeks. While these timelines increase for more complex molecules (e.g., ADCs or multi-specific and multivalent antibodies) or functional read-outs (such as T-cell activation), the cycle times are still much faster. The most advanced companies in this space take less than six weeks for a full cycle and create hundreds of readouts from disease-relevant, cell-based assays. For most companies this impressive turnaround time is unachievable, and even data generation in the protein design space remains comparatively slow. Many companies require 4–6 months to test a few hundred candidates, and this challenge only increases with the complexity of the biology being investigated.

Generating high-quality, program/target-specific data at speed can be approached in two different ways:

  1. One approach is to use pre-existing building blocks (fragments) to create molecules. This imposes restrictions on the design space (i.e., the total number of different molecules defined by all the combinations) that we can explore, but can enable significantly shorter timelines from design to experimental testing. While typically not done in a generative fashion, this strategy is also possible for designing antibodies, for example by combining pre-existing building blocks such as antigen binders and linkers to design more complex multispecific and multivalent antibodies. This is successfully done by companies like Harmonic Discovery in the small molecule space (for kinase inhibitors, a target class with enough data to do so), or by LabGenius for multispecific and multivalent antibodies.
  2. A second approach is based on modern retrosynthesis tools, which have made significant leaps forward and have found their way into many applications including generative design (see e.g., here or here). They are, however, still limited in their scope. While there are many ongoing efforts to improve both models and data sets, it appears that the availability of high-quality data might also limit our ability to perform retrosynthesis and on-demand synthesis. Another challenge is that retrosynthetically proposed synthesis steps might not match the chemistry capabilities available in-house. This can further slow down the process or limit its applicability.
    Related to this, we can use known synthesis rules to generate molecules that can be synthesised in a small number of steps. For example, a range of relevant scaffolds can be identified using off-the-shelf libraries and generative methods then applied to decorate these scaffolds to design large libraries. This reduces the time needed to perform the synthesis and can help explore the space much more efficiently, albeit constrained to these known scaffolds (a minimal sketch of such reaction-based enumeration follows this list).
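As mentioned in the second point above, the reaction-based idea can be illustrated with a short RDKit sketch that enumerates a tiny amide library from pre-defined building blocks via a reaction SMARTS. The building blocks and reaction are placeholders; real libraries use validated in-house reactions and much larger reagent sets.

```python
# Minimal sketch of reaction-based library enumeration: combine building
# blocks via a reaction SMARTS (here a simple amide coupling), so every
# product maps onto a known synthetic step.
from rdkit import Chem
from rdkit.Chem import AllChem

amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
)

acids = [Chem.MolFromSmiles(s) for s in ["OC(=O)c1ccccc1", "OC(=O)C1CC1"]]
amines = [Chem.MolFromSmiles(s) for s in ["NCc1ccncc1", "NC1CCOCC1"]]

library = []
for acid in acids:
    for amine in amines:
        for (product,) in amide_coupling.RunReactants((acid, amine)):
            Chem.SanitizeMol(product)          # check the enumerated product
            library.append(Chem.MolToSmiles(product))

print(sorted(set(library)))  # small, synthetically tractable virtual library
```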

This overview by Meyers et al. provides more information on the above two approaches. Note that the paper classifies the different design approaches into atom-based, fragment-based, and reaction-based, where the fragment- and reaction-based approaches correspond to the two approaches discussed here, while the atom-based generative approach corresponds to the unconstrained (but slow to synthesise) design case.

On the other hand, there is also a need for more structural data. This is challenging because resolving crystal structures can be slow and expensive. We, however, see some companies heavily investing in this space. For example, Generate has just announced the opening of their new Cryo-EM facilities.

An entirely different approach is taken by Lee Cronin and his team at Chemify. He is attempting to build a universal synthesis machine — the chemputer (see here or here for more of the science). While potentially a long shot, it is definitely one of the most exciting ideas out there!

Beyond the synthesis step, the other key bottleneck for all types of machine learning (generative/unsupervised as well as supervised) is the quality and relevance of the data.

While an immense amount of data is generated in the field of chemical and biological sciences, many existing databases are noisy, imbalanced, biased, or contain incomplete annotations. This lack of high-quality data can result in underperforming models, and can even lead to common errors such as impossible valences in chemical structures. Notably, the quality of the data affects small and large models alike (see for example the Falcon large language model). However, the smaller the data set, the more important this becomes. In drug discovery, where we mainly deal with small amounts of data, the problem is therefore much more severe.
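A minimal sketch of what a basic curation pass over such a dataset might look like is shown below: structures that fail RDKit sanitisation (e.g., impossible valences) are dropped, structures are canonicalised, and duplicate measurements that disagree beyond a tolerance are discarded. The tolerance and aggregation rule are illustrative assumptions; real pipelines also standardise salts, tautomers, assay units, and much more.

```python
# Minimal sketch of a basic data curation pass with RDKit.
from collections import defaultdict
from rdkit import Chem

def curate(records):
    """records: iterable of (smiles, value) pairs; returns {canonical_smiles: value}."""
    by_structure = defaultdict(list)
    for smiles, value in records:
        mol = Chem.MolFromSmiles(smiles)    # returns None on sanitisation failure
        if mol is None:
            continue                         # e.g., impossible valence, bad syntax
        by_structure[Chem.MolToSmiles(mol)].append(value)

    cleaned = {}
    for canon, values in by_structure.items():
        spread = max(values) - min(values)
        if spread > 1.0:                     # e.g., >1 log unit disagreement
            continue                          # conflicting replicates: flag or drop
        cleaned[canon] = sum(values) / len(values)
    return cleaned
```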

Similarly, for newly collected data, data curation and quality checks are crucial but often underappreciated tasks. Left unaddressed, non-curated datasets can significantly affect the harmonisation of information and, consequently, model quality and utility. Many companies don’t pay enough attention to these sorts of issues.

To address these challenges, I think that one of the most important areas in which to make progress is high-throughput (and high-quality) experimentation, the building of end-to-end experimental and data pipelines, and all the work required at their intersections. To minimise noise and errors, such systems should rely as much as possible on robotic automation that promotes consistent experimental and computational workflows, right down to the storage of controls. Companies like XtalPi have, for example, massively invested in robotic synthesis platforms for these reasons.

Despite the advantages of automated data creation and collection, accessibility to such platforms is still limited. This is due to the high costs, lack of consistency across providers, and lack of experience in the executing teams. Future data-driven research may be increasingly automated as these technologies become more widely available and teams acquire the know-how, allowing for high-quality, consistent information generation with minimal human intervention.

Beyond experimental and process accuracy, conducting highly informative and diverse experiments without human biases (e.g., creating negative examples) is hugely important. Methods such as active learning can drastically improve the performance of machine learning by selecting the optimal candidates for training predictive or generative models. See for example here for a recent blog post by Winston Haynes and myself which touches on this. Active learning has been successfully applied to the exploration of large chemical spaces by using deep learning in combination with FEP and docking. There is a long list of examples, and I recommend having a look at some of these [Ref1, Ref2, Ref3, Ref4, Ref5, Ref6, Ref7, and most recently Ref 8]. Active learning has also been successfully used by companies such as LabGenius to perform fast antibody optimisation by selecting the optimal candidates to test experimentally. Notably, fast cycle times provide the continuous feedback that is crucial to making such approaches work.
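For readers unfamiliar with the mechanics, below is a minimal sketch of one round of pool-based active learning using a random forest ensemble for uncertainty. The featurisation is assumed to have happened upstream, and the acquisition function and batch size are illustrative assumptions; real campaigns use a variety of surrogate models and acquisition strategies.

```python
# Minimal sketch of one round of pool-based active learning: train an
# ensemble, score the unlabelled pool, and pick the next batch to test.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def active_learning_round(labelled_X, labelled_y, pool_X, batch_size=10):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(labelled_X, labelled_y)
    # Per-tree predictions give a cheap uncertainty estimate.
    per_tree = np.stack([tree.predict(pool_X) for tree in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
    # Upper-confidence-bound style acquisition: exploit (mean) + explore (std).
    acquisition = mean + std
    return np.argsort(acquisition)[-batch_size:]   # pool indices to test next

# Each round: picked = active_learning_round(X, y, pool_X); run the assay on
# those candidates and append the new labels before the next round.
```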

Most fundamentally, however, we will need to increase the predictive validity of the methods we use to generate the data. Better translation into the clinic is only possible by using more complex and more disease-relevant biology. Examples here are organs-on-a-chip or artificial organoids. Many of these fields are still nascent, but hold a lot of promise for the future.

No data for new targets: A lack of structure, binding sites, and binding data

The second challenge is that many generative models, such as REINVENT, rely on scoring functions. These can be purely ML-based, but the highest performance is usually achieved by additionally relying on structure-based methods such as docking. For these, we need an input structure for the target, whether it is a small molecule de novo design algorithm taking in the protein structure (e.g., for docking and binding affinity predictions) or a generative protein design model taking in the structure of the target antigen.

A lot of progress has been made in predicting protein structures, but it is still not considered a reliable method for the design of new drugs.

While AlphaFold2-generated protein structures appear to offer a degree of efficacy when used in Free Energy Perturbation (FEP) calculations, a result primarily attributed to molecular dynamics simulations facilitating adjustments in sidechain positioning, they appear to be less useful for molecular docking (see for example here and here for docking into protein targets and here for antibiotic discovery).

For example, the study conducted at the Scripps Institute observed that the resolution of side chains in AlphaFold2 structures is not precise enough to perform accurate docking when compared to protein crystal structures. Consequently, the docking success rate using AlphaFold2 structures was a modest 17%, which is significantly less than the 41% success rate achieved with holo crystal structures. Despite these suboptimal results, it’s worth noting that the docking performance with AlphaFold2 structures was markedly superior to the 10% success rate documented with apo x-ray structures.

While structural data is still essential for the targeted design of small molecules, in particular to leverage differences between targets and off-targets, it is also used by many generative protein design methods. For example, DiffAb, among other methods, relies on the co-complex of the antibody (the framework) bound to the antigen to fill in (‘graft’) CDR regions that bind. If the co-complex is not available or not of good quality, the performance is likely to drop.

The availability of high-quality, relevant (e.g., holo) crystal structures will therefore remain a challenge if docking and other methods are used as scoring functions. For novel targets there is a further challenge arising from the need to identify cryptic pockets, allosteric sites, or binding sites more broadly speaking. Methods such as CryptoSite or PocketMiner had some success, and combining these with more conventional approaches, such as mixed-solvent molecular dynamics simulations can improve the overall performance. It has also been shown that structure predictions for multimers can be useful in identifying protein-protein interactions and hence potential binding sites.

However, while all these methods are promising and have to an extent entered into the toolbox of early drug discovery, they are still not reliable enough to pose a simple solution for all problems. Further progress is needed.

The same holds true for machine learning based binding affinity predictions, which can be done using a range of approaches, including proteochemometric or multitask models. Most models, however, show poor predictive ability for understudied targets or under shifts in chemical space, and limited ability to assess ligand selectivity and promiscuity. In many instances, they still do not outperform baseline models such as Random Forest. The underlying reason here is the lack of high-quality (low noise and consistent) binding data for many targets across a wide range of chemistry.
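As a point of reference, the Random Forest baseline mentioned above is straightforward to set up, which is part of why it remains a useful sanity check. Below is a minimal sketch using Morgan (ECFP-like) fingerprints; the SMILES and labels are placeholders, and with real data a scaffold- or time-based split gives a far more honest performance estimate than a random split.

```python
# Minimal sketch of the Random Forest + Morgan fingerprint affinity baseline.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def ecfp(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]   # placeholder compounds
pKd = [4.2, 5.1, 5.8]                                  # placeholder labels

X = np.array([ecfp(s) for s in smiles])
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X, pKd)
# With real data, evaluate with scaffold- or time-based splits rather than
# random splits, which flatter the model on familiar chemistry.
```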

Cultural challenges for the adoption of new methods

The field often imposes stricter standards on computational methods than it does on human-driven processes. Confidence in these methods can be hard-won but easily lost with just a few failures. Unfortunately, implementing new techniques involves more than simply persuading seasoned medicinal chemists to change their standards.

Crucially, both sides, computational and experimental scientists, must adapt. They need to assimilate the language of their peers, gain a deeper understanding of the respective topics, and grasp the intricacies of the computational and experimental processes (see also this article).

To make machine learning truly effective in real-world programs, it’s also necessary to have a complete technology stack where each component seamlessly interacts with the others. This harmony allows for the generation of high-quality, adequate volumes of data, which can ultimately empower machine learning methods to perform optimally and make significant contributions to discovery campaigns. Achieving this synergy necessitates extensive communication among all participating teams. This spans from assay development teams (to ensure proper controls and normalisation) and automation teams (to guarantee process consistency, high throughput, and fast cycles) to software engineers (for the construction of robust data pipelines). The most successful teams today have mastered the tight integration of the entire technology stack, fostering open dialogue and collaboration among experts from diverse backgrounds.

To achieve this, it is important to have leadership teams that have a fundamental understanding of both the experimental and computational work, that understand the drug discovery process, and that see the value of combining these fields. It also requires the teams to spend additional time building a good understanding of the processes and methods their respective counterparts are using. For example, a machine learning scientist needs to understand the possible errors and sources of noise in the data they are dealing with, the ways the data is obtained or processed by the experimental device, and many other details to optimally use that data. At the same time, experimental scientists need to understand the fundamentals of algorithms and the requirements on the data (e.g., regarding consistency) in order to build their pipelines in a way that generates robust data, and to communicate potential challenges and limitations of their experiments and how these could affect the machine learning models.

Future directions for generative design: New models and better integration

While I believe that progress in the above-mentioned areas will have the biggest impact on the field, there are also interesting developments on the model side: novel approaches such as diffusion-based generative design and co-folding (see for example here or here) are being pursued by a number of academic groups and companies such as Entos or Charm Therapeutics (ligand-protein co-folding) and Latent Labs (a UK-based generative protein design company).

Co-folding could improve on the current generation of small molecule de novo design methods because it is less dependent on the protein structure as a starting point. As mentioned above, many methods rely on scoring functions such as docking, which are inherently limited by the crystal structure that is used and by the lack of conformational flexibility. There are more expensive approaches that do take this into account, but they become infeasible in combination with de novo approaches where hundreds of thousands of molecules need to be screened virtually. Alternatively, one might aim for better scoring functions that incorporate more of the dynamics/flexibility. For example, this would be possible by relying on methods like DiffDock (or DiffDock-PP) instead of conventional docking approaches. While these methods already work really well for identifying novel binding sites (e.g., if we want to find a cryptic pocket) and where lots of similar data is available (e.g., for kinases), they are yet to outperform conventional approaches when used for docking into established/known pockets (i.e., they don’t seem to work as well here as conventional docking approaches like Glide). I anticipate that in the near future people will likely use combinations of DiffDock (to identify a pocket) and Glide or similar methods (to dock into it).

A comment here: the fact that there are barely any results for co-folding about three years after the release of AlphaFold makes me believe that it might be a harder problem than first thought. I also hope that the introduction of the protein-ligand complex prediction category in CASP15 will spark more interest and advances in this area, and that the CACHE challenge will do the same for generative models for small molecules.
It would be interesting to see companies submitting their results, as this would also lead to more transparency on how well these models really work.

Another direction that has already gained a lot of traction, is the integration of a variety of different approaches into the design process.

While pure ML-based scoring functions can work in isolated cases, it has been shown repeatedly that integrating generative models with conventional physics-based approaches (for example docking, pharmacophore, or shape-based models) outperforms (see e.g., here or here) generative design based on simpler scoring functions or machine learning based scoring functions alone. In particular, ensembles of different methods, for example machine learning models combined with physics-based approaches, can result in dramatically improved performance. I have myself observed ensembles easily outperforming other methods in multiple instances. This is unsurprising, since ensemble models have been known to perform better for a while, for example in the context of QSAR methods (see here or here).
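To illustrate the simplest form of such an ensemble, below is a minimal sketch of a consensus score that blends a physics-based docking score with an ML activity prediction. The `dock` and `ml_model` callables, the normalisation range, and the weights are all illustrative assumptions; in practice the components would be calibrated against benchmark or program data.

```python
# Minimal sketch of a consensus score blending docking with an ML prediction.
def consensus_score(mol, dock, ml_model, w_dock=0.5, w_ml=0.5):
    docking_score = dock(mol)            # e.g., kcal/mol, more negative = better
    ml_prob = ml_model(mol)              # e.g., predicted probability of activity
    # Map the docking score onto [0, 1]; clip to a plausible range first.
    d = min(max(-docking_score, 0.0), 14.0) / 14.0
    return w_dock * d + w_ml * ml_prob

# Ranking a set of candidates by the blended score:
# ranked = sorted(candidates, key=lambda m: consensus_score(m, dock, ml_model),
#                 reverse=True)
```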

Note that building machine learning models for structure-property relationships will, however, remain challenging. This is because compounds, even those only slightly modified from a parent molecule, can still possess significantly different properties. For compound activity this is commonly referred to as an ‘activity cliff’, where similar structures exhibit different activity (see examples here, here, or here), which results in optimisation landscapes that are very hard to model.
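Activity cliffs are easy to surface in a dataset, even if they remain hard to model. Below is a minimal sketch that flags pairs of molecules with high fingerprint similarity but a large difference in measured activity; the similarity and activity thresholds are illustrative assumptions.

```python
# Minimal sketch of activity cliff detection: similar structures (Tanimoto on
# Morgan fingerprints) with strongly different measured activity.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def find_activity_cliffs(smiles_to_pki, sim_cutoff=0.8, act_cutoff=2.0):
    fps = {
        s: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in smiles_to_pki
    }
    cliffs = []
    for a, b in combinations(smiles_to_pki, 2):
        sim = DataStructs.TanimotoSimilarity(fps[a], fps[b])
        delta = abs(smiles_to_pki[a] - smiles_to_pki[b])
        if sim >= sim_cutoff and delta >= act_cutoff:   # similar structure, big activity jump
            cliffs.append((a, b, sim, delta))
    return cliffs
```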

A final point to make is that a lot can be learned from the closer integration of the models with experiments. This is increasingly being done and the learnings from these programs will inevitably help make the methods more useful in the future. A big part of this will be lab automation, and close integration of data generation, collection, and analysis. I will write more in detail about the best approaches for this in the future.

Conclusion

The key challenges in de novo molecular design revolve around data availability and quality, as well as the application and development of innovative methods. Data scarcity and the slow pace of molecule synthesis present formidable hurdles for machine learning in drug discovery. High throughput experimentation, end-to-end experimental and data (collection and management) pipelines, automated systems, and more accessible platforms can help overcome these challenges. Through better selection of compounds we can further enhance our ability to gather insights or improve machine learning model performance.

Emerging methodologies such as active learning driven experimentation, diffusion-based generative design, and co-folding represent promising avenues for progress. However, further advancements are needed, in particular for novel targets.

Creating more collaborative teams, with a shared language and understanding of the technologies and requirements (e.g., compound synthesis, compound management, restrictions around logistics such as shipping to and from CROs, assay technology etc.) will ultimately result in the biggest successes.

In summary, the future of de novo molecular design will be shaped by improvements in data generation and quality, advancements in novel methodologies, and a harmonised integration of machine learning and experimental strategies. These strides will undoubtedly lead to more efficient drug discovery and pave the way for innovative therapeutic interventions.

Acknowledgements
I’d like to thank Andreas Bender (PangeAI and Cambridge University), in particular for the great examples of CNS drugs and the impact of patient stratification on PoS, Charlie Harris (Cambridge Univ.), Morgan Thomas (Cambridge Univ.), and Justin Grace (LabGenius) for helpful discussions and comments on this post. I’d also like to thank Nathan Brown (HealX), Laksh Aithani (Charm), and Dylan Reid (Zetta Ventures) for additional feedback on the views shared in this post.
