The Scale and Complexity of Protein-Ligand Binding: A Mathematical Perspective on OOD Errors

Freedom Preetham · Published in Autonomous Agents · Aug 16, 2024

The problem of protein-ligand binding lies at the intersection of biology, chemistry, and mathematics, presenting a challenge of staggering proportions. It is a confluence where the structural diversity of proteins, the vast chemical space of potential ligands, and the intricate configurations of binding sites collide with the changing landscape of conformational dynamics. Each of these elements contributes to a combinatorial explosion that defies simple analysis, demanding a more profound exploration into the heart of this biological phenomenon.

In the previous article “Exploring the Challenges of Protein-Ligand Binding Predictions in AI Through Leash Bio’s BELKA Challenge” I covered why I’m skeptical of the belief that more data will resolve AI’s current limitations in drug discovery, particularly regarding reasoning and generalization. While AI models like those in Leash Bio’s BELKA challenge demonstrate proficiency in memorization, they fall short when extrapolating to novel chemical spaces. The focus should shift from merely accumulating data to developing models that can reason, plan, and generalize — capabilities essential for real breakthroughs in drug discovery. More data alone won’t solve the underlying issue; we need more sophisticated models that learn governing principles, not just instances.

In this article, I want to show the sheer scale of the dataset needed to reduce OOD errors.

(For non-biologists) Proteins are composed of sequences of amino acids. These sequences, when considered in their entirety, form an expansive space where every possible combination could represent a functional or non-functional entity. For a protein of length L, this space contains 20^L sequences, since each position can hold any of the 20 standard amino acids. However, not all sequences yield viable proteins. Evolution has sculpted only a fraction of this space into functional forms, narrowing it down to what is known as the functional protein landscape. Structural classification systems, like SCOP and CATH, have categorized these functional proteins into approximately 10³ to 10⁴ distinct folds, reflecting the deep evolutionary relationships and structural similarities that exist among them. Yet, even within these folds, the potential for variation is immense, leading to millions of unique sequences capable of binding to a vast array of ligands.
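To get a feel for how quickly 20^L outruns any conceivable dataset, here is a minimal sketch in plain Python (the protein lengths chosen are illustrative, not from the article):

```python
import math

# Sequence space for a protein of length L: 20^L possible amino-acid chains.
# Working in log10 keeps the numbers readable.
for L in (50, 100, 300):
    exponent = L * math.log10(20)  # log10(20^L) = L * log10(20)
    print(f"L={L}: 20^{L} ≈ 10^{exponent:.0f} sequences")
```

Even a short 100-residue protein lives in a space of roughly 10¹³⁰ sequences, dwarfing the 10³ to 10⁴ folds that evolution has actually explored.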

The ligands themselves, small molecules that interact with proteins, occupy a chemical space that is almost incomprehensible in its vastness. The number of synthetically accessible small organic molecules ranges between 10⁶⁰ and 10¹⁰⁰, depending on the breadth of chemical diversity considered. Within this vast expanse lies a subset of drug-like molecules, still numbering between 10²³ and 10⁶⁰, a testament to the boundless possibilities that chemistry offers. The interaction between these molecules and proteins forms the basis for many of life’s essential processes, and understanding this interaction is key to advancing fields like drug discovery.

But the challenge does not end with the sheer number of proteins and ligands. The binding sites on proteins, the regions where ligands attach, add another layer of complexity. These sites are not static; they are dynamic, capable of shifting and changing in response to the presence of a ligand. This conformational flexibility means that a single binding site might present multiple configurations to different ligands or even the same ligand under different conditions. This variability requires a combinatorial exploration of binding site configurations, further complicating the task of predicting protein-ligand interactions.

The conformational dynamics of protein-ligand complexes introduce yet another dimension to this problem. Each complex can adopt a multitude of conformations, depending on the flexibility and structural characteristics of both the protein and the ligand. The potential conformational space, governed by torsional angles, rotatable bonds, and other structural features, leads to an astronomical number of possible binding poses for each protein-ligand pair. This diversity must be accounted for in any attempt to model these interactions, adding another layer to the already complex problem.
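As a rough illustration of why conformational space explodes, here is a sketch using the common heuristic that each rotatable bond contributes about three low-energy torsional states; the heuristic and the bond counts below are assumptions for illustration, not figures from the article:

```python
# Heuristic (assumption): each rotatable bond has ~3 favored torsional states,
# so a ligand with n rotatable bonds has on the order of 3^n candidate
# conformations before any energy-based pruning.
def conformer_estimate(n_rotatable_bonds: int, states_per_bond: int = 3) -> int:
    return states_per_bond ** n_rotatable_bonds

for n in (5, 8, 12):
    print(f"{n} rotatable bonds -> ~{conformer_estimate(n)} conformations")
```

Twelve rotatable bonds, unremarkable for a drug-like molecule, already yield over half a million candidate poses before the protein's own flexibility is considered.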

Given these considerations, estimating the size of a universal dataset for protein-ligand binding becomes an exercise in confronting the limits of what is knowable. The size of such a dataset can be represented by the product of the number of proteins, ligands, binding site configurations, and conformations:

N_universal = N_proteins × N_ligands × N_binding × N_conformations

Where:

  • N_proteins​ could range from 10⁶ to 10⁹, reflecting the vast diversity of protein sequences and structures.
  • N_ligands​, representing the chemical space of small molecules, might range from 10²³ to 10⁶⁰.
  • N_binding​ could be 10³ to 10⁶, depending on the level of structural variability considered.
  • N_conformations​ might be 10² to 10⁴ per interaction, accounting for the conformational flexibility of both proteins and ligands.
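Multiplying powers of ten is just adding exponents, so the bounds implied by the ranges above can be checked with a few lines of Python:

```python
# Exponents (powers of 10) for each factor, taken from the ranges above.
lower = {"proteins": 6, "ligands": 23, "binding": 3, "conformations": 2}
upper = {"proteins": 9, "ligands": 60, "binding": 6, "conformations": 4}

# The product of the factors is 10 raised to the sum of the exponents.
lower_total = sum(lower.values())  # 6 + 23 + 3 + 2 = 34
upper_total = sum(upper.values())  # 9 + 60 + 6 + 4 = 79
print(f"lower bound ≈ 10^{lower_total}, upper bound ≈ 10^{upper_total}")
```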

Even at the lower bound, the size of the universal dataset for protein-ligand binding could be as large as:

N_universal ≈ 10⁶ × 10²³ × 10³ × 10² = 10³⁴

At the upper bound, it could reach:

N_universal ≈ 10⁹ × 10⁶⁰ × 10⁶ × 10⁴ = 10⁷⁹

These numbers are staggering, almost beyond comprehension, and they emphasize the sheer enormity of the task at hand. It becomes clear that capturing the entire space of protein-ligand interactions within a single dataset is not only impractical but perhaps impossible. This realization leads to a deeper understanding of the limitations inherent in any data-driven approach to protein-ligand binding.

The Limits of Generalization in AI Models

Even if we could generate such a vast dataset, AI models trained on it would still face significant challenges in generalization, particularly to out-of-distribution (OOD) data. The generalization error — the difference between a model’s performance on training data and its performance on new, unseen data — is central to this discussion.

In statistical learning theory, the generalization error can be bounded using the concept of VC dimension, d_VC​, a measure of the capacity of the hypothesis space H — the set of all possible models the AI could learn. As the number of training samples n increases, the gap between the training error and the expected error decreases according to the following bound, which holds with probability at least 1 − δ:

E_out ≤ E_in + √( ( d_VC (ln(2n / d_VC) + 1) + ln(4/δ) ) / n )
However, this bound assumes that the training and test data are drawn from the same distribution. When this assumption does not hold — as is often the case with OOD data — the bound becomes less meaningful, and an additional generalization error is introduced. This error arises from the distributional shift between the training and test data, a shift that is independent of the size of the training dataset.
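To see how slowly this gap closes in practice, the bound can be evaluated numerically; the capacity d_VC = 1000 and confidence δ = 0.05 below are illustrative assumptions, not values from the article:

```python
import math

def vc_bound(n: int, d_vc: int, delta: float = 0.05) -> float:
    """VC-style generalization gap: with probability >= 1 - delta,
    E_out <= E_in + sqrt((d_vc * (ln(2n/d_vc) + 1) + ln(4/delta)) / n)."""
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1)
                      + math.log(4 / delta)) / n)

for n in (10**4, 10**6, 10**8):
    print(f"n = {n:.0e}: gap <= {vc_bound(n, d_vc=1000):.4f}")
```

The gap shrinks roughly as √(d_VC/n), so driving it down costs exponentially more samples per decimal place — and even then the bound says nothing about data drawn from a shifted distribution.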

The OOD error can be understood in terms of the divergence between the training and test distributions, commonly measured by the Jensen-Shannon divergence or the Wasserstein distance. These measures quantify the difference between the distributions. For a training distribution P and a test distribution Q, the Jensen-Shannon divergence is:

JS(P ∥ Q) = ½ KL(P ∥ M) + ½ KL(Q ∥ M)

where M = ½(P + Q) is the mixture of the two distributions and KL denotes the Kullback–Leibler divergence.
A higher divergence indicates a greater difference between the training and test distributions, resulting in larger OOD errors. No matter how large the dataset becomes, this OOD error does not vanish because the distributional shift is an inherent feature of the problem space. The model’s ability to generalize is fundamentally limited by this shift, not by the volume of data alone.
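For discrete distributions, the Jensen-Shannon divergence can be computed directly from its definition; the two three-bin distributions below are hypothetical stand-ins for a training distribution and a shifted test distribution:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence for discrete distributions (natural log).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M the mixture.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

train = [0.7, 0.2, 0.1]  # hypothetical "training" distribution over 3 bins
test = [0.1, 0.2, 0.7]   # shifted "test" distribution

print(js_divergence(train, test))   # large shift -> divergence well above 0
print(js_divergence(train, train))  # identical distributions -> 0.0
```

No amount of extra samples from `train` shrinks this quantity; only changing which regions of the space are covered does.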

Considering the vastness of the universal dataset space for protein-ligand interactions, making OOD events rare would require capturing on the order of 90% of the universal dataset. Ninety percent of the lower bound is still roughly 10³⁴ examples, and ninety percent of the upper bound remains roughly 10⁷⁹.

Even if we settle for a dataset of size 10³⁰, that number is comparable to the mass of our Sun measured in kilograms (the Sun weighs about 2 × 10³⁰ kg). It is no hyperbole to call this an ‘astronomical’ number!

Practical Implications for Protein-Ligand Binding

These mathematical insights have profound implications for the field of protein-ligand binding and, more broadly, for drug discovery. Even with the most extensive datasets, models will continue to struggle with generalization unless they are designed to address the fundamental issues posed by distributional shifts. The practical reality is that more data cannot fully compensate for the intrinsic complexity of the problem. Instead, advancing this field will require models that are capable of reasoning beyond the specific instances they have been trained on, models that can infer the underlying principles governing protein-ligand interactions and apply them to new, unseen cases.

In the end, the scale of the universal dataset for protein-ligand binding, coupled with the mathematical limits of generalization, reveals the boundaries of what current approaches can achieve. It challenges us to think more deeply about the nature of the models we build and the strategies we employ to push the frontiers of biological understanding. Only by acknowledging and addressing these fundamental limitations can we hope to make real progress in the application of AI to one of the most complex problems in biology.
