Exploring the Challenges of Protein-Ligand Binding Predictions in AI Through Leash Bio’s BELKA Challenge

Freedom Preetham
Published in Meta Multiomics

I’ve been following developments in the field of protein-binding affinity for some time (even though my focus is Genomics), driven by a curiosity about how advances in machine learning might impact drug discovery. Recently, I came across Leash Bio’s Kaggle challenge, “NeurIPS 2024 — Predict New Medicines with BELKA,” which invites participants to develop machine learning (ML) models to predict the binding affinity of small molecules to specific protein targets. This is a pivotal task in and a critical piece in drug development, with the potential to significantly refine how pharmaceutical companies identify promising therapeutic candidates.

The BELKA challenge is underpinned by a remarkable dataset. Leash Bio tested approximately 133 million small molecules for their ability to interact with one of three protein targets using DNA-encoded chemical library (DEL) technology. This data collection effort, yielding the Big Encoded Library for Chemical Assessment (BELKA), represents an ambitious attempt to harness the power of big data in drug discovery. The competition uses average precision across protein and split groups as its evaluation metric, providing a standardized way to assess the performance of various models.
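For concreteness, here is a minimal sketch of how a grouped metric like this can be computed: average precision within each (protein, split) group, then averaged across groups. The column names (protein_name, split_group, binds, pred) are my own assumptions for illustration, not the competition’s exact schema.

```python
# Minimal sketch of a grouped average-precision metric (illustrative only).
# Assumes a DataFrame with hypothetical columns: protein_name, split_group,
# binds (0/1 ground truth) and pred (model score).
import pandas as pd
from sklearn.metrics import average_precision_score

def grouped_average_precision(df: pd.DataFrame) -> float:
    scores = []
    for _, group in df.groupby(["protein_name", "split_group"]):
        # Average precision is undefined if a group contains only one class.
        if group["binds"].nunique() < 2:
            continue
        scores.append(average_precision_score(group["binds"], group["pred"]))
    # Final score: unweighted mean of per-group average precision.
    return sum(scores) / len(scores)
```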

I have a deep respect for the effort and ingenuity that Leash Bio has put into this challenge. However, I found the conclusions drawn in a related editorial to be particularly interesting. The editorial, titled “In a Reality Check for the Field, AI Underwhelms in Leash Bio’s Binding Contest: ‘No One Did Well’,” touches on a fundamental issue, stating: “Artificial intelligence models may be doing a lot more memorizing and a lot less reasoning when it comes to predicting biology, results from a new competition suggest.” This observation is not only accurate but also symptomatic of a broader challenge facing AI in drug discovery.

Ian Quigley, Leash Bio’s CEO, succinctly captures the essence of this problem: “They are pretty good at memorizing and pretty bad at extrapolating into novel chemical space.” This statement emphasizes a key limitation in contemporary AI models — specifically, their struggle with generalization. In the context of drug discovery, where novel chemical entities often hold the key to breakthroughs, this is a significant shortcoming.

However, the editorial also presents a seemingly contradictory stance. Quigley suggests that the primary obstacle to more accurate predictions is not the architecture of AI models or the availability of computational resources, but rather the scarcity of data. He posits that with more data, these models could overcome their current limitations. This perspective raises important questions about the relationship between data volume and model performance in AI.

These are two completely different stances.

  1. If we believe that AI lacks reasoning, then concluding in the same breath that more data on the same class of models will enable reasoning (indirectly inferred) is a bold assumption.
  2. If we believe memorization is a problem, then feeding more data to a class of models that is prone to memorizing is also a problem. The more-data stance amounts to saying that, since the model cannot predict on OOD data anyway, we should ‘boil the ocean’ with data so that the model sees most of the data, memorizes it, and prediction is thereby fixed.

Reasoning and planning are not, and will not be, emergent behaviors for the class of AI models that are prone to memorization. More data will not solve reasoning!

These positions, however, are not only distinct but fundamentally at odds with each other. On one hand, if we accept that AI models currently lack the ability to reason — a cognitive function critical for making inferences beyond the data they have been trained on — then the assumption that more data will somehow imbue these models with reasoning capabilities is unfounded. This assumption overlooks the fact that reasoning in AI is not simply a function of data volume but of the underlying modeling approach and its governing functions, and of their ability to process and infer from data in a manner that mirrors human cognitive processes.

On the other hand, if the issue is that AI models are prone to memorization, then increasing the amount of data they are trained on could exacerbate this problem rather than solve it. The notion that “boiling the ocean” with data will eventually lead the model to encounter all possible scenarios, thus improving its performance, is problematic. This approach fails to address the critical need for models that can generalize beyond their training data — particularly when faced with out-of-distribution (OOD) examples, which are common in drug discovery as researchers venture into uncharted chemical spaces.
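To make the OOD point concrete: in cheminformatics, generalization is often probed with scaffold-based splits, where entire chemical scaffolds are held out of training. The sketch below, assuming RDKit’s Bemis-Murcko scaffolds, illustrates that idea; it is not a description of how BELKA’s splits were actually constructed.

```python
# Illustrative scaffold split (not Leash Bio's actual procedure): molecules are
# grouped by Bemis-Murcko scaffold so train and test share no scaffolds at all,
# forcing the model to extrapolate to genuinely novel chemistry.
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    by_scaffold = {}
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        by_scaffold.setdefault(scaffold, []).append(smi)

    # Assign whole scaffold groups to the test set until the target size is met.
    train, test = [], []
    target = test_fraction * len(smiles_list)
    for group in sorted(by_scaffold.values(), key=len):
        (test if len(test) < target else train).extend(group)
    return train, test
```

A model that merely memorizes training molecules can look strong on a random split and fall apart on a split like this, which is exactly the gap being discussed here.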

The editorial also references Rich Sutton’s 2019 article, “The Bitter Lesson,” which is misinterpreted to advocate for the use of massive datasets as the solution to AI’s challenges. On the contrary, Sutton’s article delivers a more nuanced message. It highlights the importance of leveraging increased computational power through general methods, such as search and learning, over human-engineered solutions. Sutton argues that the most significant advancements in AI have come from methods that scale with increased computation rather than from finely tuned, domain-specific algorithms. While Sutton does acknowledge the role of large datasets in the context of deep learning, particularly in speech recognition (old school, as the article was written in 2019), his emphasis is on the broader lesson that generality and scalability are the true drivers of progress in AI — not data alone.

The Kaggle competition format itself introduces additional complexities. As Quigley himself admits, the requirement that winning submissions be licensed under an MIT Open Source license, which permits unrestricted commercial use of the code or models, creates a disincentive for participation among teams working on proprietary or highly valuable intellectual property. Moreover, the relatively modest prize of $12,000 is unlikely to attract serious contenders from the domain of protein-ligand binding prediction, a problem known for its intrinsic difficulty and the demand for models capable of sophisticated reasoning. Solving this at higher accuracy is, by the way, worth MoU contracts in the hundreds of millions!

Quigley is transparent about these limitations and acknowledges the possibility that the competition may not attract the best solutions, as many potential participants might choose to withhold their methods. His invitation to others to demonstrate their superior approaches, even outside the formal competition, reflects an openness to learning and an understanding of the limitations inherent in this kind of public challenge.

I can see why Leash believes that current models are bad at OOD (of course, the Kaggle results are stark). In another article, “BELKA results suggest computers can memorize, but not create, drugs,” they surmise: “The team at Leash believes that by leveraging modern machine learning methods — much like those employed in text generation models such as GPT — they can accelerate the identification and design of drug-like compounds.” This is where my discomfort lies. I have categorically stated that generative models like LLMs are not the right models for drug discovery! They are good at learning a single instance, not the underlying operator. Most such models have weak inductive bias, will fail significantly on OOD predictions, and will need to see “all data” to infer better. This is a losing game.

In reflecting on the BELKA challenge and its broader implications, it is clear that the AI field is at a critical juncture. The current focus on data accumulation as a panacea for AI’s shortcomings is, at best, a partial solution. As I argued in a previous blog post, “Hey VCs, Your Outdated AI Investment Strategy Will Cost You and the Ecosystem Dearly,” the future of AI lies not in the relentless pursuit of more data (alone) but in the development of models that can reason, plan, and generalize in ways that learn from and mimic human cognition. Models that learn the governing principles and operators, not instances. Models that learn from multi-grid and multi-resolution datasets and retain the highest variability, with active and in-context learning. In that article, I wrote: “While this approach might yield short-term gains, it’s a dead end in the long run. The future of AI lies in developing reasoning and planning capabilities, a paradigm shift that many startups have yet to embrace.”

Without a shift towards these capabilities, the field risks stagnation, with AI models delivering diminishing returns despite ever-larger datasets.

In the following article, I cover the sheer scale of data required to make OOD errors rare: “The Scale and Complexity of Protein-Ligand Binding: A Mathematical Perspective on OOD Errors.”

Leash Bio’s BELKA challenge, while an admirable and ambitious project, ultimately highlights the limitations of current AI approaches in drug discovery. The key lesson is not merely the need for more data but the urgent need for more sophisticated models — models that can think, infer, and reason in a manner that transcends the data they are given. As the AI community moves forward, it will be critical to balance the pursuit of data with the development of models that are truly capable of revolutionizing fields like drug discovery.

That said, big win and respect to Leash for the effective use of guerrilla marketing. They got game ;)

As a segue to Rich Sutton’s article, AlphaGo, developed by DeepMind, did not primarily rely on massive datasets in the traditional sense, as seen in applications like natural language processing or image recognition. Instead, AlphaGo utilized a combination of supervised learning, reinforcement learning, and Monte Carlo Tree Search (MCTS).

Here’s a breakdown of how AlphaGo was trained:

  1. Supervised Learning: AlphaGo was initially trained using a dataset of around 30 million moves from games played by human experts. This dataset was used to train a neural network to predict the moves that a human expert would make in a given board position. However, this dataset, while substantial, is relatively small compared to the datasets used in other domains like image recognition.
  2. Reinforcement Learning: After the supervised learning phase, AlphaGo was further trained through reinforcement learning, where it played millions of games against versions of itself. This self-play allowed AlphaGo to explore a vast number of possible game states and improve its decision-making over time. The reinforcement learning phase did not depend on an external dataset but on the generation of new data through self-play.
  3. Monte Carlo Tree Search (MCTS): During actual gameplay, AlphaGo used MCTS to evaluate possible moves. This method involves simulating many possible future game states to choose the most promising moves. MCTS, combined with the deep neural networks trained through supervised and reinforcement learning, enabled AlphaGo to achieve superhuman performance (a minimal sketch of the search loop follows this list).
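For readers unfamiliar with the search component, here is a minimal, generic UCT-style MCTS sketch. It is a didactic simplification that assumes a hypothetical game-state API (legal_moves, apply, is_terminal, reward) and uses random rollouts; AlphaGo’s actual implementation guided selection and evaluation with its learned policy and value networks.

```python
# Generic UCT-style Monte Carlo Tree Search (didactic sketch, not AlphaGo's code).
# The state object is assumed to provide: legal_moves(), apply(move) -> new state,
# is_terminal() -> bool, and reward() -> float on terminal states.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated reward from simulations through this node

    def ucb1(self, c=1.4):
        # Unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root_state, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: descend by UCB1 until reaching an unexpanded node.
        node = root
        while node.children and not node.state.is_terminal():
            node = max(node.children, key=lambda n: n.ucb1())
        # 2. Expansion: add one child per legal move, then pick one to simulate.
        if not node.state.is_terminal() and not node.children:
            node.children = [Node(node.state.apply(m), parent=node)
                             for m in node.state.legal_moves()]
            node = random.choice(node.children)
        # 3. Simulation: random rollout to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_moves()))
        reward = state.reward()
        # 4. Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Play the most-visited move from the root.
    return max(root.children, key=lambda n: n.visits).state
```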

Disclaimer

This is not merely external commentary. I am deeply engaged in research on alternative classes of models, with proprietary intellectual property closely tied to my startup (not intended for public release). However, I am transparent about the underlying arguments, thoughts, mathematics, and methodologies, offering insights that you may find useful. Everything I write related to my research is spread across at least three publications listed on my profile. For further reading, here are some articles:

From Autonomous Agents:

You can also check Mathematical Musings and Meta Multiomics.
