The Mathematical Foundation to Understand Human Genomic Variation

Freedom Preetham
Mathematical Musings
4 min read · Apr 13, 2024


Despite 99.9% genetic similarity among humans, the phenotypic diversity we observe — ranging from physical appearance to disease susceptibility — is profound.

I recently shared these ideas with a group of sophomore pure math grad students from Berkeley who were very curious about our work at Cognit.AI. If you are interested, this publication has a number of math blogs you can follow, and there are many biology-related blogs in a publication named “Meta Multiomics”.

In this blog, I delve into the mathematical principles and models that provide a basic mathematical way of thinking about how genetic and epigenetic factors translate into such diverse outcomes. It is tailored for an audience well-versed in mathematics.

Also, treat this blog as a mathematical way of thinking about how genomes work. The actual mathematical modeling is far more complex and diverse.

Preamble for Biology

Humans share approximately 99.9% of their DNA sequence, yet the resulting phenotypic variation is extensive. The 0.1% genetic difference, representing millions of genetic variations, significantly influences individual traits. We explore this paradox using sophisticated mathematical and statistical models to understand the mechanisms underlying genetic expression and their impact on phenotype.
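As a rough check on the scale involved, assume a haploid genome of about 3.1 billion base pairs (a commonly quoted figure, not one stated in this blog). A 0.1% difference then works out to roughly three million positions:

```latex
0.001 \times 3.1 \times 10^{9} \approx 3.1 \times 10^{6} \ \text{base pairs}
```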

Genetic Variation and Combinatorial Complexity

Genetic variation arises from sources such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). Consider the set of possible human genomes as a sequence space S, with each point representing one possible genome. Equipping S with a distance function d that measures genetic difference gives a metric space (S,d) in which genetic diversity can be quantified.

SNP Variation:

$$d_{\mathrm{SNP}}(x, y) = \sum_{i=1}^{n} \left(1 - \delta_{x_i y_i}\right)$$

where δ is the Kronecker delta, and x_i, y_i represent the alleles at position i in genomes x and y, respectively.

This is the distance function d_SNP for single nucleotide polymorphisms (SNPs) between two genomic sequences x and y. The Kronecker delta δ equals 1 if x_i = y_i and 0 otherwise, so each position contributes 1 − δ: 1 where the alleles differ and 0 where they match. Summing over all positions i from 1 to n, where n is the total number of positions considered, counts the positions at which the two genomes differ. This is the Hamming distance between the sequences.
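Read as a Hamming distance, d_SNP simply counts the positions at which two aligned sequences carry different alleles. A minimal Python sketch, assuming the genomes are given as equal-length strings of allele symbols (the function name and the toy sequences are illustrative, not from the original):

```python
def snp_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length allele sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # (a != b) plays the role of 1 - delta(a, b): 1 on a mismatch, 0 on a match.
    return sum(a != b for a, b in zip(x, y))


# Toy sequences (hypothetical data): they differ at positions 3 and 8.
genome_x = "ACGTACGT"
genome_y = "ACCTACGA"
distance = snp_distance(genome_x, genome_y)
print(distance)
```

Counting mismatches rather than matches is what makes this a distance: it is zero exactly when the two sequences are identical, it is symmetric, and it satisfies the triangle inequality.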

The Role of Gene Interaction Networks

Genes interact within complex networks, modeled using graph theory with nodes representing genes and edges representing interactions. The dynamics of gene regulatory networks are governed by nonlinear differential equations:

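Gene Regulatory Network Dynamics:

$$\frac{dx}{dt} = f(x;\, A, b)$$

where x represents gene expression levels, A is a matrix representing the interaction strengths between genes, b encapsulates external regulatory influences, and f is a (generally nonlinear) function describing how these quantities set each gene's rate of change.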
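To make the dynamics concrete, here is a small sketch that integrates one possible nonlinear system with forward Euler. The text does not specify the nonlinearity, so the form dx/dt = sigmoid(Ax + b) − x (saturating activation plus linear decay), the two-gene mutual-repression matrix, and all numeric values are assumptions chosen only for illustration:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def grn_rhs(x, A, b):
    """Right-hand side of the ASSUMED model dx/dt = sigmoid(A x + b) - x."""
    n = len(x)
    return [
        sigmoid(sum(A[i][j] * x[j] for j in range(n)) + b[i]) - x[i]
        for i in range(n)
    ]


def simulate(x0, A, b, dt=0.01, steps=5000):
    """Forward-Euler integration of the assumed dynamics."""
    x = list(x0)
    for _ in range(steps):
        dx = grn_rhs(x, A, b)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


# Hypothetical two-gene network in which each gene represses the other.
A = [[0.0, -2.0],
     [-2.0, 0.0]]
b = [1.0, 1.0]
x_final = simulate([0.9, 0.1], A, b)
print(x_final)
```

With these particular values the repression is weak enough that the system has a single stable state, so both genes settle at the same expression level regardless of the starting point. Stronger interactions in A can produce multiple stable states, which is one way such models represent distinct cell fates.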

Epigenetic Modifications

Epigenetic changes affect gene expression without altering the DNA sequence. Modeling this, we introduce an epigenetic modification matrix E influencing the gene expression vector x.

Epigenetic Influence on Gene Expression

$$x' = E\,x$$

where x' is the modified gene expression vector and E modifies gene expression states, reflecting epigenetic states’ impact on the phenotype.
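As a toy illustration, assume E acts linearly on the expression vector, x' = Ex, and take E diagonal so that each entry scales one gene: 1 leaves it unchanged, 0 silences it, and values in between attenuate it. The matrix and expression values below are hypothetical:

```python
def apply_epigenetic_state(E, x):
    """Return x' = E x for a square matrix E given as a list of rows."""
    return [sum(e_ij * x_j for e_ij, x_j in zip(row, x)) for row in E]


# Hypothetical baseline expression levels for three genes.
x = [2.0, 1.0, 4.0]

# Diagonal E (an assumption): gene 1 untouched, gene 2 silenced, gene 3 halved.
E = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5]]

x_modified = apply_epigenetic_state(E, x)
print(x_modified)
```

Off-diagonal entries in E would let the epigenetic state of one locus affect the expression of another, which a purely diagonal matrix cannot express.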

Stochasticity in Gene Expression

Biological systems exhibit stochasticity. The stochastic nature of gene expression can be modeled using probability distributions:

Stochastic Gene Expression Model

$$X_t = \mu + \sigma W_t + \int_0^t \beta(s, X_s)\, ds$$

where X_t is the gene expression level at time t, W_t a Wiener process representing molecular noise, and β a function representing regulatory dynamics.

  • μ represents the mean or drift component,
  • σW_t denotes the stochastic component driven by a Wiener process W_t scaled by σ,
  • the integral of β(s, X_s) ds accumulates the regulatory function β, which depends on both the time s and the state of the process X_s, up to time t.
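The three components listed above can be read as X_t = μ + σW_t + ∫ β(s, X_s) ds, which in increment form is dX = β(t, X) dt + σ dW with X_0 = μ. A minimal Euler-Maruyama sketch under that reading follows. The mean-reverting choice of β, the parameter values, and the function names are assumptions for illustration only:

```python
import math
import random


def euler_maruyama(mu, sigma, beta, t_end=10.0, n_steps=1000, seed=0):
    """Simulate dX = beta(t, X) dt + sigma dW with X_0 = mu."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    x = mu
    path = [x]
    for k in range(n_steps):
        t = k * dt
        # Wiener increments are Gaussian with mean 0 and variance dt.
        dW = rng.gauss(0.0, math.sqrt(dt))
        x = x + beta(t, x) * dt + sigma * dW
        path.append(x)
    return path


def mean_reverting_beta(t, x, target=2.0, rate=1.0):
    """ASSUMED regulatory term: pulls expression back toward a target level."""
    return rate * (target - x)


noisy_path = euler_maruyama(mu=1.0, sigma=0.3, beta=mean_reverting_beta)
quiet_path = euler_maruyama(mu=1.0, sigma=0.0, beta=mean_reverting_beta)
print(noisy_path[-1], quiet_path[-1])
```

Setting σ to zero recovers the deterministic trajectory, which relaxes to the target level. With σ > 0 the path fluctuates around that level, a simple picture of expression noise around a regulated set point.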

Environmental Interactions

Genotype-environment interactions are modeled as modifications of the function mapping genotype to phenotype.

Genotype-Environment Interaction

$$\Phi(g, e) = \int_{\Omega} g(\omega)\, e(\omega)\, dP(\omega)$$

where Φ represents the phenotypic outcome, g the genetic makeup, e environmental factors, and Ω the sample space with probability measure P.

The function Φ(g,e) models the phenotypic outcome as an integral over the sample space Ω. Here, g(ω) represents the genetic makeup at point ω, e(ω) represents environmental factors at point ω, and P is the probability measure on Ω. In other words, Φ is the expected value of the combined genetic and environmental contribution across a population, weighted by how likely each combination is.
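An integral against a probability measure is an expectation, so it can be estimated by averaging over sampled individuals. The sketch below assumes the genetic and environmental terms combine multiplicatively, g(ω)·e(ω), and uses made-up sampling distributions (an allele count in {0, 1, 2} and an exposure drawn uniformly from [0, 1]). None of these choices come from the original:

```python
import random


def estimate_phi(samples):
    """Monte Carlo estimate of the integral of g(w) * e(w) dP(w) from (g, e) draws."""
    return sum(g * e for g, e in samples) / len(samples)


# Hypothetical population: each draw is one individual, omega.
rng = random.Random(42)
population = [
    (rng.choice([0, 1, 2]),   # g(omega): assumed allele count
     rng.uniform(0.0, 1.0))   # e(omega): assumed environmental exposure
    for _ in range(10_000)
]
phi_hat = estimate_phi(population)
print(phi_hat)
```

Because g and e are sampled independently here, the estimate lands near E[g]·E[e] = 1 × 0.5. Correlating the two in the sampler would shift the result, which is a simple way to see genotype-environment dependence.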

Further Discussion

This exploration of the 0.1% genetic difference using mathematical models reveals a complex landscape of genetic regulation, interaction, and expression, providing insights into the vast potential and variability of human genetics.

The mathematical models presented offer a framework for further exploration and discussion among researchers interested in genetic diversity, phenotypic expression, and the underlying mathematical structures. They not only enhance our understanding of genetics but also challenge us to refine them for better predictive power and insight.