Part 1 — A Rigorous Mathematical Exposition on N-Dimensional Genomic Grammar vs One-Dimensional Linguistic Grammar

Freedom Preetham
Mathematical Musings
6 min read · Nov 2, 2023

People ask why we can’t just point an LLM architecture at a DNA sequence and decode the language of life, just as we did for human languages. The challenge is that the language of life, with its n-dimensional grammar, is far more complex than human language. Current LLMs are NOT built for that. Hence Cognit.AI is crafting a foundational model called the “Large Genomic Model” from the ground up.

But this is NOT a story of how we employ artificial intelligence to solve for this n-dimensional language in genomics. It is a story of the governing representations in mathematics.

If you are interested in how LLMs decode the human language, you should read “The Sorcery behind GPT — Comprehensive Deconstruction of LLMs!”

In the next parts of the series, I intend to cover the following:

Part 2 — Tensor Representation

Part 3 — Algebraic Topology: Charting the Topological Landscape

Part 4 — Differential Geometry: Unveiling the Geometric Structure

Part 5 — Statistical Mechanics: Probing the Dynamic Behavior

Part 6 — Tensor Algebra: Navigating Through Multidimensional Interactions

The realm of genomics presents a complex landscape, filled with intricate interactions and regulatory mechanisms. The conventional one-dimensional grammar, characteristic of human language, falls short in capturing the essence of this complex domain.

The notion of n-dimensional grammar emerges as a robust mathematical framework, capable of navigating through the genomic intricacies. In this post, I explore the mathematical underpinnings of one-dimensional language grammar, contrast it with the n-dimensional grammar in genomics, and delve into the advanced mathematics required to navigate genomic complexity.

One-Dimensional Grammar of Human Language

Markov Chains and N-Gram Models

Let’s attempt the simplest representation of human language through a mathematical lens. In language modeling, Markov chains provide a probabilistic framework where the probability of each word depends only on a few preceding words. The n-gram models are a manifestation of this principle, conditioning each word on the n−1 words before it:
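In its standard maximum-likelihood form, with w_i denoting the i-th word of the sequence (a notation assumed here), the n-gram estimate reads:

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
```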

where C denotes the count of occurrences.
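As a minimal sketch of this count-based estimate, the snippet below computes bigram (n = 2) probabilities from a toy corpus. The corpus and the function name `bigram_probabilities` are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Maximum-likelihood bigram estimates: C(prev, word) / C(prev)."""
    # Count every adjacent word pair in the sequence.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word in its role as a context (every token except the last).
    context_counts = Counter(tokens[:-1])
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in pair_counts.items()
    }

# Toy corpus, made up purely for illustration.
corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_probabilities(corpus)
print(probs[("the", "cat")])  # "the" occurs 3 times as a context, twice followed by "cat"
```

Real language models smooth these raw ratios so that unseen word pairs do not receive zero probability; the sketch leaves smoothing out to stay close to the formula.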

Entropy and Information Theory

The entropy H quantifies the uncertainty or the average rate of information in a language model:
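In Shannon's standard formulation, taking X to be the random variable over the model's symbols (words or characters) and P(x) its probability distribution, both notational assumptions here, the entropy in bits is:

```latex
H(X) = -\sum_{x} P(x) \, \log_2 P(x)
```

A uniform distribution over four symbols, for example, gives H = 2 bits, while a distribution concentrated on a single symbol gives H = 0.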

Chomsky Hierarchy

The Chomsky hierarchy categorizes grammars into types 0 through 3, from unrestricted grammars (Type 0) down to regular grammars (Type 3), with each level representing a different degree of generative power. The one-dimensional grammar of human language is typically modeled with the regular (Type 3) and context-free (Type 2) grammars in this hierarchy.

Parse Trees and Formal Grammars

Parse trees provide a hierarchical structure to sentences, while formal grammars define the syntactic rules of a language.
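The textbook example of such a rule is the top-level production of a context-free grammar for English sentences:

```latex
S \rightarrow NP \;\, VP
```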

where S is a sentence, NP is a noun phrase, and VP is a verb phrase.
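To illustrate how production rules generate one-dimensional strings, here is a small sketch that enumerates every sentence of a toy grammar. The grammar, its vocabulary, and the helper `expand` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy grammar: each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N": [["cat"], ["dog"]],
    "V": [["sleeps"], ["sees"]],
}

def expand(symbol):
    """Return every word sequence derivable from `symbol`.

    The toy grammar is non-recursive, so the enumeration terminates.
    """
    if symbol not in GRAMMAR:  # a terminal word expands to itself
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine, in order, every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[0])
```

Every derivation ends in a flat, left-to-right string of words, which is the sense in which this grammar is one-dimensional.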

Unfolding the N-Dimensional Genomic Grammar

1. Multidimensional Space

The multidimensional genomic space, Γ, encapsulates various genomic dimensions, each corresponding to a distinct aspect of genomic functionality and regulation.

These dimensions are over and above the one-dimensional grammar of human language, whose genomic analogue is only the sequence structure of the DNA vector (the long string of A, T, C, and G bases). Unfortunately, the DNA vector by itself is not the complete language!

Over billions of years of evolution, the sequence structure’s grammar was extended and influenced by other dimensions to fully encode the language of life.

Mathematically, this space can be represented as a function of several genomic variables:
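Using the variable names listed below, and with f standing for an unspecified function (an assumption, since its form is not given), the space can be written as:

```latex
\Gamma = f(G, E, T, TF, S, M, H, V)
```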

where:

G: Gene Interactions,

E: Epigenetic Changes,

T: Transcription,

TF: Transcription Factors,

S: Splicing,

M: Methylation,

H: Histone Modifications,

V: Variants.

1.1 Multidimensional Genomic Vectors: We can define genomic vectors where each component corresponds to a specific genomic dimension:
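One plausible way to write such a vector, assuming one real-valued component per dimension of Γ (the component names are illustrative), is:

```latex
\mathbf{v} = \left( v_G,\; v_E,\; v_T,\; v_{TF},\; v_S,\; v_M,\; v_H,\; v_V \right)^{\top} \in \mathbb{R}^{8}
```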

1.2 Multivariate Functions and Partial Derivatives: To understand the interactions between different genomic dimensions, we employ multivariate functions and partial derivatives:
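As a sketch, assuming f is differentiable in its arguments, a first-order partial derivative measures how the genomic state responds to one dimension while the others are held fixed, and a mixed second-order partial captures how two dimensions interact. Methylation M and histone modifications H are used here only as an example pair:

```latex
\frac{\partial f}{\partial M}
\qquad \text{and} \qquad
\frac{\partial^{2} f}{\partial M \, \partial H}
```

A mixed partial that is identically zero would indicate that, under this model, the two dimensions contribute independently.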

1.3 Genomic Tensors: Tensors allow representation of multidimensional interactions in the genomic space. A genomic tensor T of rank n can be represented as a multidimensional array with elements dependent on n indices:
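In index notation, assuming the k-th index ranges over d_k values (the sizes d_k are an assumption introduced here), such a tensor can be written as:

```latex
T = \left( T_{i_1 i_2 \cdots i_n} \right),
\qquad i_k \in \{1, \dots, d_k\}, \quad k = 1, \dots, n
```

A rank-1 tensor is a vector, a rank-2 tensor is a matrix, and higher ranks record interactions among three or more dimensions at once.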

1.4 Metric Tensor and Distance Metrics: To measure distances and angles between genomic vectors in this multidimensional space, a metric tensor g can be employed:
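In standard differential-geometric notation, with x^i as local coordinates on the genomic space and summation over repeated indices (both conventions assumed here), the metric defines the line element and the inner product of two vectors u and v:

```latex
ds^{2} = g_{ij} \, dx^{i} \, dx^{j},
\qquad
\langle u, v \rangle = g_{ij} \, u^{i} v^{j},
\qquad
\cos\theta = \frac{\langle u, v \rangle}{\sqrt{\langle u, u \rangle \, \langle v, v \rangle}}
```

When g is the identity matrix these reduce to the ordinary Euclidean distance and angle.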

1.5 Covariant and Contravariant Tensors: The representation of genomic tensors can change under transformation of coordinates, leading to the concepts of covariant and contravariant tensors:
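For a change of coordinates from x to x', the standard transformation laws for a contravariant vector V and a covariant vector W are as follows (the symbols V and W are introduced here for illustration):

```latex
V'^{\,i} = \frac{\partial x'^{\,i}}{\partial x^{j}} \, V^{j}
\quad \text{(contravariant)},
\qquad
W'_{i} = \frac{\partial x^{j}}{\partial x'^{\,i}} \, W_{j}
\quad \text{(covariant)}
```

Higher-rank tensors pick up one such factor for each upper index and one for each lower index.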

1.6 Tensor Fields and Differential Operators: Exploring the differential properties of the genomic space necessitates the introduction of tensor fields and differential operators. For instance, the exterior derivative of an antisymmetric covariant tensor field T (a differential form) is given by:
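Treating T as a differential k-form with components T_{i_1 ... i_k} in local coordinates x^i (an assumption, since the exterior derivative is defined on antisymmetric covariant tensors), the standard expression is:

```latex
T = \frac{1}{k!} \, T_{i_1 \cdots i_k} \, dx^{i_1} \wedge \cdots \wedge dx^{i_k},
\qquad
dT = \frac{1}{k!} \, \frac{\partial T_{i_1 \cdots i_k}}{\partial x^{j}} \, dx^{j} \wedge dx^{i_1} \wedge \cdots \wedge dx^{i_k}
```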

where ∧ denotes the wedge product.

1.7 Genomic Manifolds and Curvature: The concept of manifolds provides a topological framework to explore the genomic space, and the curvature tensor R captures the local curvature properties of the genomic manifold:
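Assuming the genomic manifold carries the metric g from 1.4, the Riemann curvature tensor takes its standard form in terms of the Christoffel symbols. Note that Γ with indices denotes those connection coefficients here, not the genomic space Γ defined earlier:

```latex
R^{\rho}{}_{\sigma\mu\nu}
  = \partial_{\mu} \Gamma^{\rho}_{\nu\sigma}
  - \partial_{\nu} \Gamma^{\rho}_{\mu\sigma}
  + \Gamma^{\rho}_{\mu\lambda} \Gamma^{\lambda}_{\nu\sigma}
  - \Gamma^{\rho}_{\nu\lambda} \Gamma^{\lambda}_{\mu\sigma},
\qquad
\Gamma^{\rho}_{\mu\nu}
  = \tfrac{1}{2} \, g^{\rho\lambda}
    \left( \partial_{\mu} g_{\lambda\nu} + \partial_{\nu} g_{\lambda\mu} - \partial_{\lambda} g_{\mu\nu} \right)
```

A curvature tensor that vanishes everywhere corresponds to a flat space, where the metric can be brought to the Euclidean form by a change of coordinates.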

These mathematical constructs provide a deep, enriched framework to explore the n-dimensional genomic grammar, paving the way for a more nuanced understanding of the complex genomic landscape.
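To make a few of these constructs concrete, the following NumPy sketch builds two genomic vectors, a metric tensor, and a rank-3 interaction tensor. All numerical values, the coupling between M and H, and the helper names `inner` and `distance` are illustrative assumptions with no biological meaning.

```python
import numpy as np

# The eight genomic dimensions named in the text, in a fixed order.
DIMENSIONS = ["G", "E", "T", "TF", "S", "M", "H", "V"]
m, h, t = (DIMENSIONS.index(name) for name in ("M", "H", "T"))

# Metric tensor g: a symmetric positive-definite 8x8 matrix.
# Assumption: identity plus a made-up coupling between methylation and histone marks.
g = np.eye(len(DIMENSIONS))
g[m, h] = g[h, m] = 0.5

def inner(a, b, metric):
    """Inner product g_ij a^i b^j."""
    return float(a @ metric @ b)

def distance(a, b, metric):
    """Distance between two genomic vectors under the given metric."""
    d = a - b
    return float(np.sqrt(inner(d, d, metric)))

# Two hypothetical genomic vectors that differ only in M and H.
u = np.zeros(len(DIMENSIONS))
v = np.zeros(len(DIMENSIONS))
v[m] = v[h] = 1.0

# Rank-3 tensor: one entry per triple of dimensions (here a single made-up M-H-T interaction).
T = np.zeros((len(DIMENSIONS),) * 3)
T[m, h, t] = 1.0

print(distance(u, v, np.eye(len(DIMENSIONS))))  # Euclidean distance
print(distance(u, v, g))                        # distance under the coupled metric
print(T.ndim, T.shape)
```

The off-diagonal entry changes the distance between u and v, a small illustration of how a metric can encode that two genomic dimensions are not independent.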

Part-1 Musings

This part of the long series synthesizes the comparison between one-dimensional linguistic grammar and n-dimensional genomic grammar. While the former is well-established and modeled through probabilistic frameworks such as Markov Chains and encompasses concepts like entropy and formal grammars, it is fundamentally linear and lacks the capacity to encapsulate the multi-faceted nature of genomic data.

On the other hand, the n-dimensional genomic grammar represents a paradigm shift, introducing a comprehensive mathematical framework that delves into the intricacies of genomic functionality and regulation.

It leverages multidimensional spaces, vectors, tensors, and manifolds, employing advanced mathematical tools to unravel the complexity of genomic interactions and regulatory mechanisms.
