Genomics Is Not Language — LLMs Won’t Work
Stochasticity Defines Life Beyond Patterns
Genomics and language are often compared for their structural similarities. Both involve sequences of symbolic elements, nucleotides in the former and words or characters in the latter. Both rely on patterns, context, and the interplay of local and global dependencies to encode information. At a glance, these parallels make it tempting to view genomics as a biological counterpart to language, a natural domain for the same tools and methods that have revolutionized natural language processing, including large language models. However, this comparison obscures the profound differences between these domains, both in the nature of the data and the mechanisms at work.
Genomics is not a problem of syntax or semantics. It is a problem of stochastic systems, shaped by the intricate dynamics of evolution, biochemical constraints, and cellular machinery. Unlike language, where meaning arises from human-defined rules and cultural conventions, the meaning of a genomic sequence emerges from its interaction with molecular systems and the physical world. The DNA sequence of a promoter does not “mean” anything in isolation. Its function depends on the probabilistic binding of transcription factors, the folding of chromatin, and the cascade of stochastic events that govern gene expression.
This stochasticity is not noise to be filtered out. It is the core of biological systems. In genomics, randomness plays an essential role at every level. The binding of transcription factors to DNA, for instance, is not deterministic but governed by equilibrium constants, energy landscapes, and competition among thousands of molecules. Even gene expression itself is stochastic, with individual genes turning on and off in bursts that vary from cell to cell. Yet, this stochasticity operates alongside deterministic mechanisms such as Watson-Crick base pairing, Mendelian inheritance, and energy minimization within cellular systems. It is the interplay of these factors, randomness and determinism, that defines the genomic landscape.
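The bursty, cell-to-cell-variable expression described above is commonly formalized as a two-state "telegraph" promoter, which can be simulated exactly with Gillespie's stochastic simulation algorithm. Below is a minimal sketch; the function name and all rate constants are illustrative choices, not fitted to any real gene:

```python
import math
import random

def gillespie_telegraph(k_on, k_off, k_tx, k_deg, t_max, seed=0):
    """Two-state ('telegraph') promoter: the gene toggles ON/OFF stochastically
    and transcribes only while ON, producing bursts of mRNA."""
    rng = random.Random(seed)
    t, on, mrna = 0.0, 0, 0
    trace = [(t, mrna)]
    while t < t_max:
        # Propensities of the four possible reactions in the current state.
        rates = [
            k_on if not on else 0.0,  # promoter switches ON
            k_off if on else 0.0,     # promoter switches OFF
            k_tx if on else 0.0,      # transcription (only while ON)
            k_deg * mrna,             # first-order mRNA degradation
        ]
        total = sum(rates)
        if total == 0.0:
            break
        # Waiting time to the next event is exponentially distributed.
        t += -math.log(rng.random()) / total
        # Pick which reaction fires, weighted by its propensity.
        r = rng.random() * total
        acc = 0.0
        for i, rate in enumerate(rates):
            acc += rate
            if r < acc:
                break
        if i == 0:
            on = 1
        elif i == 1:
            on = 0
        elif i == 2:
            mrna += 1
        else:
            mrna -= 1
        trace.append((t, mrna))
    return trace

# Slow switching with fast transcription yields pronounced bursts.
trace = gillespie_telegraph(k_on=0.1, k_off=0.5, k_tx=10.0, k_deg=1.0,
                            t_max=100.0)
```

Because every trajectory is one random realization, two cells simulated with different seeds produce different burst timings from identical parameters, which is exactly the cell-to-cell variability the text describes.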
In language, rules and structure dominate. Even in creative or ambiguous contexts, the constraints of grammar and human cognition provide boundaries within which meaning arises. Language models thrive because of this structure. They predict the next word or phrase by learning patterns across vast corpora, where each word is influenced by a relatively well-defined context. Even in poetic ambiguity, language remains anchored to an overarching framework of shared meaning.
Genomics does not conform to such rules. The same sequence can serve entirely different roles depending on context, such as its position within the genome, its chromatin state, or the cell type in which it resides. Enhancers and silencers, regulatory regions critical to gene expression, act in ways that are highly dependent on the three-dimensional folding of the genome, which itself is influenced by stochastic interactions among proteins and other macromolecules. For instance, an enhancer might regulate a promoter tens of thousands of base pairs away, looping through space to make physical contact with its target. These spatial dependencies arise from molecular dynamics and are far more intricate than the sequential context in language.
The tools used in large language models, such as transformers, assume a world where context is sequential and meanings are hierarchical. In genomics, context is spatial, temporal, and multi-dimensional. Regulatory regions influence transcription through dynamic and probabilistic relationships governed by molecular collisions, energy landscapes, and chromatin folding. No word in language operates within a constantly shifting landscape of physical forces; genomic elements do, their functions and interactions determined by energy minimization and molecular dynamics within a biophysical framework.
Evolution further sets genomics apart. Language evolves over decades or centuries, shaped by culture and human creativity. Genomic sequences evolve over millions of years, constrained by natural selection, genetic drift, and the need to balance innovation with robustness. Mutations in coding regions are not mere “typos.” They are evolutionary experiments subject to selective pressures. Neutral changes accumulate in silent regions, while functional mutations are pruned or amplified based on their impact on fitness. These evolutionary dynamics embed layers of redundancy, noise, and hidden functionality into the genome, none of which have an equivalent in language.
While large language models excel at learning patterns and context from text, they lack the capacity to encode or reason about the physical and stochastic processes that drive genomic systems. A neural network can predict transcription factor binding sites or RNA splicing events, but these are statistical approximations, not mechanistic truths. For example, transcription factor binding is probabilistic, influenced by factors like binding affinities, chromatin accessibility, and molecular crowding, none of which can be reduced to a simple sequence context.
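The gap between a sequence score and a mechanistic quantity can be made concrete with the simplest equilibrium description of binding: site occupancy follows the law of mass action and depends on transcription factor concentration, a cellular variable that no purely sequence-based score carries. A minimal sketch; the concentrations and dissociation constant are hypothetical:

```python
def occupancy(tf_conc_nM, kd_nM):
    """Equilibrium probability that a site is bound, from the law of mass
    action: p = [TF] / ([TF] + Kd). The same site (same Kd, same sequence)
    has different occupancy in different cellular contexts."""
    return tf_conc_nM / (tf_conc_nM + kd_nM)

p_low = occupancy(tf_conc_nM=1.0, kd_nM=10.0)     # scarce TF: mostly unbound
p_high = occupancy(tf_conc_nM=100.0, kd_nM=10.0)  # abundant TF: mostly bound
```

A sequence model sees the identical motif in both cases; the mechanistic picture assigns it very different functional states.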
Genomics demands models that integrate stochastic dynamics with deterministic rules, capturing the interplay of randomness and structure. Mechanistic frameworks rooted in mathematical and physical principles provide a robust foundation for modeling genomic systems. These approaches encode biophysical priors directly into their governing equations, leveraging Hamiltonian systems to describe thermodynamic constraints and employing Langevin dynamics to account for thermal noise in molecular environments. Such models capture the intricate balance of deterministic and stochastic influences within cellular processes, extending beyond steady-state analyses to transient dynamics essential for understanding phenomena like bursty gene expression, epigenetic state transitions, and chromatin remodeling.
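An overdamped Langevin equation in a double-well potential is the textbook caricature of such noise-driven transitions between two stable states, a loose analogy for switching between epigenetic states. Below is a minimal Euler-Maruyama sketch; the potential, temperature, friction, and step size are illustrative, not calibrated to any biological system:

```python
import math
import random

def langevin_double_well(x0=1.0, dt=1e-3, steps=200_000, kT=0.4,
                         gamma=1.0, seed=1):
    """Overdamped Langevin dynamics in the double-well potential
    U(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1.
    Euler-Maruyama update: dx = -U'(x)/gamma dt + sqrt(2 kT dt / gamma) N(0,1).
    Thermal noise occasionally kicks the system over the barrier at x = 0."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    noise_amp = math.sqrt(2.0 * kT * dt / gamma)
    for _ in range(steps):
        force = -4.0 * x * (x * x - 1.0)  # -dU/dx, deterministic drift
        x += (force / gamma) * dt + noise_amp * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = langevin_double_well()
# Sign changes mark barrier crossings between the two wells.
crossings = sum(1 for a, b in zip(path, path[1:]) if a * b < 0)
```

The deterministic drift alone would trap the system in one well forever; it is the stochastic term that produces the transitions, mirroring the essay's point that randomness and determinism act together.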
Integral and differential operators are intrinsic to these frameworks, enabling the description of spatial and temporal dependencies within the genome. For instance, Green’s functions and non-linear partial differential operators can model the propagation of molecular interactions across three-dimensional genomic structures. Fourier and Koopman operators further enrich these approaches by enabling the decomposition of genomic dependencies into functionally interpretable spaces, capturing long-range regulatory effects such as enhancer-promoter looping or chromatin domain boundaries. These operator-based methods integrate seamlessly with stochastic formulations, such as those described by Itô calculus or stochastic differential equations, to capture the probabilistic nature of molecular collisions and binding kinetics.
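The role of a Green's function as a propagator can be illustrated with the 1D diffusion equation, whose Green's function is a spreading Gaussian: the field at a later time is the superposition of the kernel's response to every source point. A toy sketch in arbitrary units, far simpler than any genomic geometry:

```python
import math

def heat_kernel(x, t, D=1.0):
    """Green's function of the 1D diffusion equation: the response at
    position x and time t to a unit point source at the origin at t = 0."""
    return math.exp(-x * x / (4.0 * D * t)) / math.sqrt(4.0 * math.pi * D * t)

def propagate(profile, dx, t, D=1.0):
    """Evolve an initial concentration profile by convolving it with the
    Green's function: superpose the spread from each point source."""
    n = len(profile)
    out = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            total += profile[j] * heat_kernel((i - j) * dx, t, D) * dx
        out.append(total)
    return out

# A point-like burst of signal at the centre of the domain.
dx = 0.1
profile = [0.0] * 201
profile[100] = 1.0 / dx  # discrete approximation of a unit-mass delta
later = propagate(profile, dx, t=0.5)  # Gaussian spread around the centre
```

The same superposition principle, with far more intricate kernels and boundary conditions, is what operator-based models use to describe how a regulatory signal at one locus propagates through folded chromatin to another.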
By respecting the multi-scale dependencies and boundary conditions dictated by biophysical constraints, these frameworks bridge the gap between empirical data and the physical mechanisms driving genomic systems. Coupling such mechanistic insights with mathematical rigor and innovation provides a pathway toward models that not only predict but also explain the stochastic, dynamic, and spatial behaviors that define genomic regulation.
The genomic context extends beyond sequence patterns. Spatial dependencies, chromatin architecture, and evolutionary constraints demand models that go beyond learning static correlations. Successful models must capture the stochastic and multi-scale nature of genomic systems, respecting the constraints imposed by physics, biochemistry, and evolution. Without these foundations, computational approaches risk misinterpreting the complexity of life as a mere sequence of symbols, failing to grasp the true essence of the biological processes they aim to model.