LLMs in Genomics are Misguided, Inefficient, and Scientifically Wrong

Freedom Preetham
Published in Meta Multiomics
8 min read · Dec 24, 2024

Building large language models (LLMs) for human language processing leverages a fundamental truth about the structure of language: the vast majority of words are essential and closely related to those immediately surrounding them. This basic feature explains why attention mechanisms in architectures like the Transformer are so effective. By focusing on the relationships between essential words, LLMs excel at understanding and generating human language.

However, attempting to apply these models to genomics reveals the stark and fundamental mismatch between the two systems. Genomes are built entirely differently, and this makes LLMs one of the most inefficient and misguided tools for functional genomics.

I continue to address this topic because the bio community has not dialed down the echo-chamber noise surrounding LLMs. This issue concerns me because founders, bioinformaticians, machine learning engineers, and venture capitalists are pouring vast resources into an expensive path that diverges from building true foundational models capable of simulating genomics.

I can say this with confidence and authority, having spent 25 years in AI research and mathematics building foundational AI models. (I recognize this may seem like a departure from humility, but I find it necessary to state, given that previous discussions have devolved into ‘argument by authority’ rather than engaging with the merits of the content.)

Also, nearly all the eggs in genomics are in the LLM basket! True innovations are neither being heard nor getting funded! This bothers me.

The “Flying Horse Project” of Genomics

“Let’s build a faster horse,” said the bioinformatician. “I’ve got it!” exclaimed the other. Armed with boundless confidence and zero mathematical or physics context, they googled ‘what makes things fast’ and promptly launched the groundbreaking ‘Flying Horse Project’, because why not?

And, the VCs were salivating all over it!!

Language vs Genomes

Here is a sharp high-level comparison between the content composition of human language and genomes.

[Table: high-level comparison of content composition in human language vs. genomes]

The Structure of Human Language

As shown in the table above, human language is remarkably efficient in its use of words. Over 75% of words in a coherent text directly contribute to the context or meaning of the communication. Whether it is nouns, verbs, or function words like prepositions and articles, every word typically adds to the understanding of a sentence. Furthermore, most essential words are located next to or near each other in a sequence. This proximity is critical because it means that meaning can often be extracted from local relationships, a property that attention mechanisms exploit with incredible computational efficiency.

Example: Human Language Sentence and Transformer Processing

Consider the sentence: “The cat sat on the mat.”

Tokenization: The sentence is split into tokens: [“The”, “cat”, “sat”, “on”, “the”, “mat”].

Self-Attention Mechanism:

  • The Transformer calculates relationships between every word (e.g., “cat” is the subject, “sat” is the verb).
  • Attention weights focus heavily on nearby words, such as “cat” and “sat” or “on” and “mat.”

Output: The model generates context-rich embeddings for each word, which are used for tasks like translation or summarization.

This works well because the proximity of words ensures that relationships are local and semantically clear, making the Transformer’s self-attention mechanism efficient and effective.
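
To make this concrete, here is a minimal sketch of scaled dot-product self-attention over the six tokens above, written in NumPy. The embeddings and projection matrices are random stand-ins (a trained model would learn them); this illustrates the mechanism only and is not any particular model's implementation.

```python
import numpy as np

# Toy example: six tokens, model dimension 4.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), 4))   # stand-in embeddings, shape (seq_len, d_model)

# Query/key/value projections (learned in a real Transformer; random here).
W_q = rng.normal(size=(4, 4))
W_k = rng.normal(size=(4, 4))
W_v = rng.normal(size=(4, 4))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: every token scores every other token.
scores = Q @ K.T / np.sqrt(K.shape[-1])            # (seq_len, seq_len)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
context = weights @ V                               # context-rich embedding per token

# Row i shows how much token i attends to each of the six tokens.
print(np.round(weights, 2))
```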

The Biological Design of Genomes

In stark contrast, genomes are not structured like human language. Only about 1% of the genome codes for proteins. The remaining 99% comprises regulatory elements, introns, repetitive sequences, transposable elements, and other non-coding regions. While much of this non-coding DNA plays critical roles in regulating gene expression, maintaining chromosomal integrity, or evolutionary innovation, it is far less structured and far more sparse in essential content than human language.

Gene expression is not solely determined by the sequence of DNA. It is controlled by a complex interplay of transcription factors (TFs), promoters, enhancers, signaling pathways, epigenetic modifications, nutrient levels, and signal strength. Promoters initiate transcription, while enhancers can amplify gene expression even when located millions of nucleotides away. Transcription factors bind to specific DNA motifs to recruit or block the transcription machinery, depending on the cellular context.

Epigenetic markers such as methylation or histone modifications further modulate accessibility to DNA, influencing gene expression without altering the sequence itself. Signaling pathways, activated by extracellular stimuli, integrate nutrient availability and signal strength to dynamically adjust transcriptional outputs. This dynamic and multi-layered regulatory landscape makes the functional meaning of gene expression profoundly stochastic and context-dependent.
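
To make the multi-factor nature of this regulation concrete, here is a deliberately simplified toy model (my illustration, not a published model): transcriptional output as a Hill-type function of transcription factor concentration, scaled by enhancer-promoter contact probability and attenuated by promoter methylation. Every parameter name and value below is hypothetical.

```python
def toy_transcription_rate(tf_conc, enhancer_contact, methylation,
                           k_half=1.0, hill_n=2.0, basal=0.05, v_max=1.0):
    """Illustrative toy only: expression as a Hill function of TF concentration,
    scaled by enhancer-promoter contact probability (0..1) and attenuated by
    promoter methylation level (0..1). All parameters are hypothetical."""
    tf_occupancy = tf_conc ** hill_n / (k_half ** hill_n + tf_conc ** hill_n)
    accessibility = 1.0 - methylation
    return accessibility * (basal + v_max * tf_occupancy * enhancer_contact)

# Identical DNA sequence, different cellular context -> very different output.
print(toy_transcription_rate(tf_conc=2.0, enhancer_contact=0.9, methylation=0.1))  # active context
print(toy_transcription_rate(tf_conc=2.0, enhancer_contact=0.1, methylation=0.8))  # repressive context
```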

Example: Genomic Sequence and Challenges for Transformers

Consider a genomic region: “ACGTGCCGTA… [1,000,000 nucleotides] …TACGGAACGT”

  • Sparse Functional Elements: Only a small portion, such as a promoter (“TATA box”), or an exon within a protein-coding gene, has functional meaning.
  • Long-Range Dependencies: An enhancer located 1,000,000 nucleotides upstream may regulate a gene downstream. These relationships are not local and cannot be efficiently captured by the Transformer’s attention mechanism at a character level.
  • Three-Dimensional Context: The spatial folding of DNA brings distant elements into proximity, a factor entirely absent in human language and not accounted for by sequence-based models like Transformers.

This lack of proximity and sparse distribution of essential content makes genomic sequences fundamentally different from human language, rendering Transformers inefficient and inadequate.
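
A quick back-of-the-envelope calculation shows why: with naive full self-attention over single-nucleotide tokens, the pairwise score matrix grows quadratically with sequence length. A minimal sketch, assuming float32 entries and no sparse or linear-attention approximations:

```python
def attention_matrix_gigabytes(seq_len, bytes_per_entry=4):
    """Memory for one dense (seq_len x seq_len) attention matrix, per head per layer."""
    return seq_len ** 2 * bytes_per_entry / 1e9

for L in (512, 4_096, 100_000, 1_000_000):
    print(f"{L:>9,} tokens -> {attention_matrix_gigabytes(L):,.3f} GB per attention matrix")

# 512 tokens (a typical NLP context)      -> ~0.001 GB
# 1,000,000 nucleotides at 1 token each   -> ~4,000 GB for a single head in a single layer,
# before multiplying by heads, layers, and the batch dimension.
```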

What Human Language Would Look Like if Similar to a Genome

If human language were structured like a genome, the sentence might look something like this: “TxxCxxxxSxxxxxxxxxxMx… [1,000 characters of irrelevant text] …AxCATxxxTxxMAT”

  • Sparse Functional Elements: Only a few characters, such as “CAT” and “MAT,” carry functional meaning. The rest are non-functional, serving as spacers or remnants of evolutionary redundancy.
  • Long-Range Dependencies: The word “MAT” might depend on a regulatory element like “Txx” located hundreds of characters away, making the relationship non-local.
  • Context Complexity: Relationships between meaningful elements are obscured by vast stretches of irrelevant or redundant content, requiring models to parse enormous amounts of data to find functional connections.

This illustrates the inefficiency of applying tools designed for dense, semantically rich human language to sparse and contextually complex genomic data.

Why LLMs Fail at Functional Genomics

LLMs (or even wrappers and frameworks built on top of LLMs with other hybrid scaffoldings) are ill-suited for genomics because they are optimized for the properties of human language, not the intricacies of genomic architecture. Here are the key reasons:

Sparse Essential Content:

  • LLMs are designed to process sequences where most elements contribute to meaning. Genomes, with only 1% of their sequence being protein-coding, offer almost no such density of essential content. This leads to computational bottlenecks and to hacks, alterations, and contortions of the framework, while the model wastes resources analyzing non-essential regions.

Long-Range Dependencies:

  • Enhancers, promoters, and other regulatory elements often act on genes located millions of bases away. Transformers are computationally inefficient and conceptually inadequate for modeling these extreme long-range interactions, which are critical in genomics.

Multi-Scale Complexity:

  • Genomes operate across multiple scales, from nucleotide sequences to three-dimensional chromatin folding, with additional layers of regulation imposed by signaling networks, transcription factors, nutrient states, and stochastic biological processes. Transformers lack the capacity to integrate these spatial and hierarchical contexts effectively.

High Sensitivity to Small Changes:

  • Single-nucleotide polymorphisms (SNPs) and other mutations can drastically alter function, but LLMs lack the granularity and precision to capture these subtle effects. Their design focuses on probabilistic patterns over large datasets, which makes them ill-equipped to interpret critical, small-scale variations.
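
As a toy illustration of this sensitivity (the motif and log-odds weights below are hypothetical, chosen only to make the point), scoring a site against a position weight matrix shows how a single-base substitution can collapse a predicted binding score:

```python
# Hypothetical log-odds position weight matrix for a 4-bp motif (illustrative values only).
PWM = [
    {"A": 1.2, "C": -2.0, "G": -2.0, "T": -1.5},
    {"A": -1.5, "C": -2.0, "G": -2.0, "T": 1.3},
    {"A": 1.1, "C": -1.8, "G": -2.0, "T": -1.5},
    {"A": 1.0, "C": -2.0, "G": -1.9, "T": -1.6},
]

def pwm_score(site):
    """Sum of per-position log-odds scores for a candidate binding site."""
    return sum(column[base] for column, base in zip(PWM, site))

print(pwm_score("ATAA"))   # strong match: 4.6
print(pwm_score("ACAA"))   # one substitution (T->C at position 2): 1.3, a large drop in predicted binding
```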

Data Requirements:

  • Training LLMs requires immense amounts of data. While textual data is abundant and easy to augment, functional genomic data is orders of magnitude scarcer. Generating meaningful training sets for genomics is a monumental challenge, requiring experimental validation that cannot match the scale of the text corpora used to train LLMs.

Wasted Computation:

  • LLMs are resource-intensive, requiring enormous computational power for both training and inference. Applying such models to genomics results in astronomical inefficiencies, as the majority of computational effort is spent processing non-functional regions with no useful output.
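
The scale of the waste follows from simple arithmetic. A rough sketch, taking the article's framing that only about 1% of positions carry dense functional signal (this understates regulatory content, but matches the protein-coding figure cited above):

```python
functional_fraction = 0.01   # rough protein-coding fraction cited above

# In full self-attention, every pair of positions is scored.
both_functional = functional_fraction ** 2
print(f"Pairs where both positions are functional:            {both_functional:.2%}")      # 0.01%
print(f"Pairs involving at least one non-functional position: {1 - both_functional:.2%}")  # 99.99%
```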

The Need for Advanced Mathematical Frameworks

Instead of blindly applying LLMs, the field must adopt a mathematically advanced and biologically informed approach. Foundational models must be capable of integrating genomic sequence information with its three-dimensional spatial and regulatory context.

I state this unequivocally in every public discourse: Biology CANNOT be reduced to finite vector embeddings or vector-to-vector mappings as seen in most current AI models. Also, unlike even quantum physics (yes, with a QFT background, I can speak of quantum physics as well), where the Schrödinger equation is linear and operates within Hilbert spaces, biology demands far greater mathematical sophistication. The complexity of biological systems requires modeling within Sobolev spaces, emphasizing function-to-function mappings that capture the true multi-scale, non-linear, and stochastic nature of life.
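
To make the contrast concrete, here is the distinction in standard operator-learning notation (my shorthand, not a formal result from any specific paper):

```latex
% Vector-to-vector mapping learned by a typical sequence model:
\[
  f_\theta : \mathbb{R}^{d} \to \mathbb{R}^{k}
\]
% Function-to-function (operator) mapping between Sobolev spaces,
% the kind of object neural operators approximate:
\[
  \mathcal{G}_\theta : H^{s}(D) \to H^{s'}(D'), \qquad a(\cdot) \mapsto u(\cdot)
\]
```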

Advanced mathematical frameworks such as:

  • Stochastic differential equations (SDEs): To model the inherent randomness and noise in gene expression and regulatory dynamics (a toy sketch follows below).
  • Variational partial differential equations (PDEs): To describe dynamic regulatory processes and chromatin remodeling.
  • Higher-order graph theory: To capture the hierarchical and non-linear interactions between genomic elements.
  • Algebraic topology: To characterize the spatial organization of chromatin and the looping interactions critical for regulation.
  • Spectral methods and Bio-Informed Neural Operators: To process both spatial and sequence-based features in a computationally efficient manner.
  • Stochastic process modeling: To account for variability and noise in gene expression, integrating both deterministic and probabilistic dynamics.
  • Geometric Deep Learning: Leveraging the structure of manifolds to model the spatial relationships in chromatin interactions and three-dimensional genomic folding.
  • Dynamic Bayesian Networks: Capturing temporal and probabilistic dependencies in signaling pathways and gene regulation.
  • Tensor Networks: Representing high-dimensional genomic interactions in a compressed yet expressive form.
  • Homology Groups in Topological Data Analysis: Analyzing the structural persistence of chromatin loops and regulatory interactions.

These frameworks represent the future of computational genomics, allowing for the unification of sequence-based modeling with higher-order representations and multi-scale dynamics. Bio-Informed Neural Operators, for instance, are designed specifically to incorporate the unique characteristics of genomic data, combining sequence data with spatial and regulatory contexts in a way that Transformers cannot.
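
As one concrete example of the first item in the list above, here is a minimal sketch of stochastic gene expression modeled as an SDE and simulated with the Euler-Maruyama scheme. All rates and noise levels are illustrative, not calibrated to any dataset:

```python
import numpy as np

def simulate_expression_sde(k=10.0, gamma=0.5, sigma=0.4,
                            x0=5.0, dt=0.01, steps=5_000, seed=0):
    """Euler-Maruyama integration of dX = (k - gamma*X) dt + sigma*sqrt(X) dW:
    constant production, first-order degradation, multiplicative noise.
    A toy model of noisy mRNA abundance; every parameter is made up."""
    rng = np.random.default_rng(seed)
    x = np.empty(steps + 1)
    x[0] = x0
    for t in range(steps):
        drift = k - gamma * x[t]
        diffusion = sigma * np.sqrt(max(x[t], 0.0))
        x[t + 1] = max(x[t] + drift * dt + diffusion * np.sqrt(dt) * rng.normal(), 0.0)
    return x

trajectory = simulate_expression_sde()
# Deterministic steady state is k / gamma = 20; the trajectory fluctuates around it.
print(f"mean over the last half of the run: {trajectory[2500:].mean():.1f}")
```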

The Provocation

Using LLMs for functional genomics is not just inefficient. It is emblematic of a larger issue in computational biology: the tendency to misapply popular methods without critically assessing their relevance. Functional genomics demands tools that respect the sparsity, complexity, and multi-dimensionality of biological data. Treating genomes as sequences of text wastes computational resources and yields limited insight.

Biology is both a systems and a modeling challenge. The solution lies in creating many smaller models that work together within a smarter, more intricate system, rather than relying on a single large model fueled by massive datasets. Big data and big models are effective only when dealing with simpler systems, but this approach will fail for the stochasticity and complexity inherent in biological systems.

Functional genomics is not like human language, and treating it as such is a profound misstep. It is time to move past the hype and embrace models that truly respect the complexity of biology.

You might say, but what about Enformer or Borzoi or Avocado, etc.? Nope. Just sticking a CNN in front to extract features does not cut it for simulating single-cell, functional genomics. (This is a deeper technical discussion for other blogs; I have written enough about them already.)
