Using Hidden Markov Models to Infer Locations of CpG Islands and Promoter Regions

Karthik Mittal · Published in Analytics Vidhya · Apr 19, 2021 · 15 min read
A small snippet of color-encoded nitrogenous bases in the human genome. Credit: Pinterest

The immensity of the human genome presents a pressing issue for computational researchers who wish to draw sound conclusions from the information it encodes. There are trillions of cells inside the human body, each containing genetic information that is vital to the processes the body carries out.

One current area of interest in particular is finding conserved subsequences, or sections of nucleotides that resemble promoter regions or CpG islands. Because of how much information is in the human genome, it is difficult to infer which regions have been conserved through evolution or which regions drive important processes in the body, such as DNA methylation.

Computational biologists in the field are striving to solve this issue through machine learning techniques like Hidden Markov models. With these models, we can find the underlying structure of these nucleotide sequences and build a heightened understanding of the processes occurring in the human body.

Framework for Modeling DNA

The complexities of DNA leave a lot to be discovered; however, for the purposes of this paper, we’ll mainly just be focusing on the nitrogenous bases and how they can be used to infer function.

There are three common approaches researchers follow when given a nucleotide sequence whose function they want to determine (e.g. whether it is a promoter, associated with protein-coding regions, etc.): aligning, visualizing, and modeling.

  • Aligning — Running a database search against well-known parts of the human genome (using the BLAST algorithm to find local alignments) or utilizing clustering algorithms like k-means or hierarchical clustering to group unknown pieces of DNA from the human body
  • Visualizing — Checking for disparities and nonstandard compositions inside the nucleotide sequences, usually by analyzing open reading frames, stop codons, and k-mer frequencies.
  • Modeling — Making a hypothesis through generative models to infer the function of the sequence through past knowledge.

Through HMMs, we will tackle the last strategy, modeling, while reducing computational runtime and delivering output faster. We will not only look at what these sequences portray but also recognize the hidden state that generated each observation.

Bayes’ Rule and Probabilistic Modeling

Hidden Markov models rely heavily on Bayes’ Rule and probability formulas to run their calculations and infer the correct sequence. Therefore, it is pertinent to know the fundamentals before exploring this topic more deeply.

The Bayes’ Rule formula: P(A|B) = P(B|A) P(A) / P(B). Credit: Medium

The formula shown above is vital for understanding how the probability of an event changes given new information. The notation P(A|B) denotes the probability of event A occurring given that event B has already occurred.

The exact derivation of the formula is not needed for this paper; however, understanding the rule and how it works is necessary, as it plays a large part in Hidden Markov models. Note that Bayes’ Rule can be applied repeatedly as new information arrives.
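To make this concrete, here is a minimal Python sketch of Bayes’ Rule with made-up numbers (the prior and likelihoods below are purely illustrative):

```python
# Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B), with illustrative numbers.
p_a = 0.3              # prior probability of event A
p_b_given_a = 0.8      # likelihood of observing B when A holds
p_b_given_not_a = 0.2  # likelihood of observing B when A does not hold

# Marginal probability of B (law of total probability).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior probability of A after observing B.
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # about 0.632
```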

Intuition Behind Markov Chains and HMMs

Goal of our HMMs — predicting functions off nucleotide sequences.

HMMs can be described through two kinds of states: hidden states and observed states. The hidden state is what generates the emissions and leads to the observations mentioned above.

An example can be seen in how the seasons influence the weather. In this case, the observed states would be whether it is sunny or raining, while the hidden states would be the seasons themselves. Note that each season has a predefined probability of rain; seasons like fall would have a larger probability of rain than seasons like summer.

In the example above, define the emissions/observations to be the set of sun, rain, clouds, and snow, while the hidden states are the seasons (summer, fall, winter, and spring). As shown below, these can be represented by the vectors x and pi respectively. The transition probabilities (the probabilities of moving from one hidden state to another) are represented as well.

A simplistic Markov chain for the weather prediction model example. Source: MIT

These transition probabilities are often specified up front by the user as well. Each one is defined as the probability of entering a given hidden state given that the previous hidden state was state k.

These Markov chains are a building block of the overall HMM, and they are often defined as a triplet (Q, p, A), where Q represents the set of states, A is the transition probability matrix, and p is the vector of initial probabilities. We’ll go more into initial probabilities later.
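To make the triplet (Q, p, A) concrete, here is a small Python sketch of the seasons Markov chain; the specific initial and transition probabilities are placeholder values chosen for illustration, not taken from the figure above:

```python
# A Markov chain as a triplet (Q, p, A):
#   Q - set of states, p - initial probabilities, A - transition matrix.
# The probabilities below are illustrative placeholders.
Q = ["summer", "fall", "winter", "spring"]

p = {"summer": 0.25, "fall": 0.25, "winter": 0.25, "spring": 0.25}

# A[i][j] = probability of moving from season i to season j.
A = {
    "summer": {"summer": 0.9, "fall": 0.1, "winter": 0.0, "spring": 0.0},
    "fall":   {"summer": 0.0, "fall": 0.9, "winter": 0.1, "spring": 0.0},
    "winter": {"summer": 0.0, "fall": 0.0, "winter": 0.9, "spring": 0.1},
    "spring": {"summer": 0.1, "fall": 0.0, "winter": 0.0, "spring": 0.9},
}

# Probability of the chain visiting summer -> summer -> fall.
path = ["summer", "summer", "fall"]
prob = p[path[0]]
for prev, curr in zip(path, path[1:]):
    prob *= A[prev][curr]
print(prob)  # 0.25 * 0.9 * 0.1 = 0.0225
```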

What makes Markov chains so popular with computational biologists is their memoryless property: the next state depends only on the current state, which leads to an extremely small space complexity compared to other traditional algorithms.

Difference Between Markov Chains and HMMs

Example of a Hidden Markov model. Credit: Medium

In the weather example, the so-called “hidden” states were actually already known to the researchers. Everyone knows the probability of a season producing a certain weather pattern, since meteorological data has given us that information over time.

In Hidden Markov models, these observations or emissions result from a particular set of probabilities. The states are hidden from the observer, and each state generates observations with certain probabilities (often called emission probabilities).

Simply put, HMMs are used when the observations/emissions come from the states of a system that we are unable to observe directly. This is different from Markov chains, whose states are directly observed.

Mathematical Notation of HMMs

Hidden Markov models are formally defined as a 5-tuple (Q, A, p, V, E) compared to the 3-tuple Markov chain where V is the set of observation symbols (usually the set of nucleotide bases or amino acids) and E is the matrix of emission probabilities, often defined as:

e_s(k) = P(x_i = k | pi_i = s) — the emission probability of observation symbol k given state s (stored in the matrix as row s, column k).
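As a concrete (and hypothetical) sketch of the 5-tuple, here is the self/virus HMM used in the next section written out in Python. The 0.5 initial probabilities, the 0.95/0.05 transitions, and the uniform 0.25 self-state emissions come from the worked example below; the virus-state emissions are assumptions for illustration only:

```python
# An HMM as a 5-tuple (Q, A, p, V, E) for the self/virus example.
Q = ["self", "virus"]                 # hidden states
V = ["A", "C", "G", "T"]              # observation symbols (nucleotide bases)
p = {"self": 0.5, "virus": 0.5}       # initial probabilities

# A[s][t]: transition probability from hidden state s to hidden state t.
A = {
    "self":  {"self": 0.95, "virus": 0.05},
    "virus": {"self": 0.05, "virus": 0.95},
}

# E[s][k]: emission probability of symbol k in state s (row s, column k).
# The self state is uniform; the virus values are illustrative only.
E = {
    "self":  {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "virus": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}

# Sanity check: every emission row should sum to one.
for s in Q:
    assert abs(sum(E[s].values()) - 1.0) < 1e-9
```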

Example of HMMs

Pictorial representation of a Hidden Markov Model. Credit: MIT

A predefined HMM model will commonly look like something seen above. For each nitrogenous base, there’s a correlating probability relation that shows how likely the hidden state is to emit that certain nitrogenous base.

With certain sequences (e.g. ACCCATAC), we can determine the probabilities for both the self and virus hidden states and see which hidden state is most likely to occur given the emission sequence.

Note that there are three possibilities: the hidden state stays in the self state throughout, the hidden state stays in the virus state throughout, or the hidden state switches between the self and virus states.

This last possibility, however, is usually less likely, as the transition probabilities between different states are often lower than the transition probabilities that simply circle back to the same state (e.g. the Self → Virus transition probability is 0.05 while the Self → Self transition probability is 0.95).

To calculate the aforementioned probability of these states and observed emissions, we need to take the product of three terms:

  • The probability of starting with the self/virus hidden state
  • The probability of emitting the observed symbols given the particular hidden state at each position
  • The transition probabilities (either staying in the same state or switching)

It is often defined as P(x, pi) = P(x|pi)P(pi), as given by Bayes’ Rule, where x is the emission vector and pi is the hidden state vector.

In short, the joint probability of an emission sequence and a hidden state path (i.e. P(x, pi)) is the product of the initial state probability and the emission and transition probabilities along the path.

Example of Probability Calculation

Taking an example of pi = {S, S, S, S, S} and x = {A, T, C, G, A}, where S indicates that the self state is being used, we can calculate the joint probability of having both pi and x to be:

P (x, pi) = (0.5) x (0.25)⁵ x (0.95)⁴ = 3.977 x 10^(-4)

Note that this was simply the multiplication of the aforementioned initial probability, transition probabilities, and emission probabilities.

These probability calculations are usually small because so many numbers less than one are being multiplied together. We simply look at the path with the highest probability and consider that our most likely state sequence.
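A minimal sketch that reproduces the calculation above, multiplying the initial probability with the emission and transition probabilities along the path (reusing the illustrative self/virus parameters from the earlier sketch):

```python
def joint_probability(x, pi, p, A, E):
    """P(x, pi): initial probability times emission and transition probabilities."""
    prob = p[pi[0]] * E[pi[0]][x[0]]
    for i in range(1, len(x)):
        prob *= A[pi[i - 1]][pi[i]] * E[pi[i]][x[i]]
    return prob

# Illustrative self/virus parameters (virus emissions are assumed values).
p = {"self": 0.5, "virus": 0.5}
A = {"self": {"self": 0.95, "virus": 0.05},
     "virus": {"self": 0.05, "virus": 0.95}}
E = {"self": {b: 0.25 for b in "ACGT"},
     "virus": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}}

x = ["A", "T", "C", "G", "A"]
pi = ["self"] * 5
print(joint_probability(x, pi, p, A, E))  # ~3.977e-04
```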

Another common example is the loaded versus fair die, where we calculate the probability of a pi vector of {F, F, F, F, L, L, L, L, F, F} and an emission vector of {1, 2, 1, 5, 6, 2, 1, 6, 2, 4}.

Interpreting HMM Probabilities

Looking at the example above, for a sequence of five nucleotides, there are 2⁵ possible hidden state paths over the self and virus states.

Therefore, the HMM runs through all 32 combinations in order to find the path that is most likely (i.e. the one with the highest joint probability). Note that these 32 combinations include paths that transition between the two states.

In general, 2^n hidden state paths need to be evaluated for an emission sequence of length n. As seen before, this becomes computationally taxing even for sequences of just 100 base pairs, as the brute-force sketch below illustrates. Therefore, stronger methods of analysis are necessary.

Graph of time complexities. Instead of exponential, we want polynomial or logarithmic runtime.
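To see this exponential blow-up directly, here is a brute-force sketch that enumerates every hidden path for a short sequence and keeps the most likely one (same illustrative self/virus parameters as before); it is fine for five bases but quickly becomes infeasible as the sequence grows:

```python
from itertools import product

def brute_force_decode(x, states, p, A, E):
    """Enumerate all |states|^n hidden paths and return the most likely one."""
    best_path, best_prob = None, 0.0
    for pi in product(states, repeat=len(x)):
        prob = p[pi[0]] * E[pi[0]][x[0]]
        for i in range(1, len(x)):
            prob *= A[pi[i - 1]][pi[i]] * E[pi[i]][x[i]]
        if prob > best_prob:
            best_path, best_prob = pi, prob
    return best_path, best_prob

# Same illustrative parameters as in the joint probability sketch.
p = {"self": 0.5, "virus": 0.5}
A = {"self": {"self": 0.95, "virus": 0.05},
     "virus": {"self": 0.05, "virus": 0.95}}
E = {"self": {b: 0.25 for b in "ACGT"},
     "virus": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}}

path, prob = brute_force_decode(["A", "T", "C", "G", "A"],
                                ["self", "virus"], p, A, E)
print(path, prob)
```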

GC-rich regions with HMMs

Now that the overall intuition behind HMMs is understood, we can discuss how they can be used to solve biological problems.

HMMs are revolutionary because they can predict potential introns, exons, transcription factor binding sites, etc. from just a single piece of DNA. Specifically, for our purposes we’ll be looking at GC-rich regions to model these sequences.

GC-rich regions and CpG islands

Picture of a CpG site. Credit: Wikipedia

These regions are often denoted GC-rich if they are filled with cytosine and guanine (two complementary nitrogenous bases).

CpG islands are often defined as regions of more than 200 base pairs with a GC percentage higher than 50% and an observed-to-expected CpG ratio greater than 60%. Note that these thresholds can be built into our HMM to check for specific CpG islands.
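Those thresholds translate into a simple filter. Below is a minimal sketch, assuming the input is a plain string of A/C/G/T characters; the helper name and the toy sequences are made up for illustration:

```python
def looks_like_cpg_island(seq):
    """Check the classic CpG-island criteria on a DNA string (A/C/G/T)."""
    n = len(seq)
    if n < 200:
        return False

    g, c = seq.count("G"), seq.count("C")
    gc_fraction = (g + c) / n
    if gc_fraction <= 0.5:
        return False

    # Observed CpG dinucleotides vs. the count expected from base composition.
    observed_cpg = seq.count("CG")
    expected_cpg = (c * g) / n
    if expected_cpg == 0:
        return False
    return observed_cpg / expected_cpg > 0.6

# Toy usage: a GC-rich repeat passes, an AT-only stretch does not.
print(looks_like_cpg_island("CG" * 150))    # True
print(looks_like_cpg_island("ATTA" * 100))  # False
```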

Using our understanding of HMMs, we can say that these GC-rich regions will have higher emission probabilities for cytosine and guanine than for adenine and thymine. Often, these promoter regions are modeled with probabilities of around 30% and 42% for G and C and only 15% and 13% for A and T.

Based on the emission probabilities, we now want to find the optimal hidden state path (which moves between promoter/CpG-island states and non-promoter/background states) that maximizes the overall joint probability of the hidden states and the emissions.

This can be done through dynamic programming, which will be discussed later in this paper.

Probability of an arbitrary sequence path with all background states.
Probability of an arbitrary sequence path with a mixture of background and promoter states.

Returning to the issue discussed beforehand, the maximal joint probability gets increasingly difficult to calculate as the sequence length increases, so it’s necessary to use dynamic programming to reduce the computational runtime of the HMM.

Decoding the Most Likely Path — Viterbi Algorithm

A brute-force approach for finding this joint probability is impractical and time-consuming due to the exponential number of paths that have to be checked. However, through dynamic programming, we can reduce this time complexity and arrive at a better solution.

Dynamic programming breaks the problem up into smaller subproblems, much like recursion. It is usually used for optimization problems, exploiting the fact that the optimal solution to the overall problem can be built from the optimal solutions to its subproblems.

Algorithms like the Needleman-Wunsch algorithm use dynamic programming to find similar nucleotide sequences (accounting for insertions, deletions and mutations); for more information, check out this article.

Our main goal for this problem is to find the most likely sequence of hidden states based on the observations and the emission probabilities. Note that in this problem, the emission probabilities (E), the transition probabilities (a), and the sequence of emissions (x) are all given. We are trying to find the sequence of hidden states (pi) that maximizes the joint probability given these emissions, or find:

pi* = argmax over all paths pi of P(x, pi)

Note that dynamic programming can be used for this problem because the best path ending at a position can be built from the best paths ending at the previous position; we simply need to form a recurrence relation to find the optimal path. Let v_k(i) be the probability of the most likely path that ends at position i in state k. The recurrence is defined as:

v_k(i) = e_k(x_i) × max over all previous states j of [ v_j(i−1) × a_jk ]

This says that the probability of the most likely path ending at the current position in a given state is the emission probability of the current symbol in that state, multiplied by the maximum over all previous states of their path probability times the transition probability into the current state.

Essentially, the quantity being maximized is the probability of transitioning from each possible previous state into the current state, multiplied by the emission probability of the current symbol. Each maximization is recorded with a backward pointer, which can later be followed to recover the optimal path.

Here are the four main steps of the Viterbi algorithm (mathematically):

The first step, initialization, assigns the start state a probability of one (since no emission or transition probabilities have been multiplied onto it yet), while all other states k in the first column receive probabilities of zero.

The second step, recursion, runs through the recurrence relation discussed above and stores, for each cell, a pointer to the previous state that achieved the maximum; these pointers are used later when backtracking through the dynamic programming matrix.

The third step, termination, identifies the end of the best path by picking the state in the final column with the highest probability.

The last step, traceback, follows the stored pointers backwards from that final state all the way to the first column, recovering the most likely sequence of hidden states.

Visual representation of the dynamic Viterbi algorithm.

With K hidden states and a sequence of N emissions, the algorithm has a space complexity of O(KN) and a time complexity of O(K²N), much better than the exponential runtime of the brute-force approach.
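For concreteness, here is a compact Viterbi sketch in Python. It works directly with probabilities for readability (a production implementation would use log probabilities to avoid underflow); the promoter emission values follow the 42/30/15/13 percentages quoted earlier, while the background emissions, initial probabilities, and 0.9/0.1 transitions are assumptions for illustration:

```python
def viterbi(x, states, p, A, E):
    """Most likely hidden path for emission sequence x (probabilities, not logs)."""
    n = len(x)
    v = [{s: 0.0 for s in states} for _ in range(n)]      # best path probabilities
    back = [{s: None for s in states} for _ in range(n)]  # backpointers

    # Initialization: first column uses initial and emission probabilities.
    for s in states:
        v[0][s] = p[s] * E[s][x[0]]

    # Recursion: v_k(i) = e_k(x_i) * max_j ( v_j(i-1) * a_jk ).
    for i in range(1, n):
        for s in states:
            prev, prob = max(((j, v[i - 1][j] * A[j][s]) for j in states),
                             key=lambda t: t[1])
            v[i][s] = E[s][x[i]] * prob
            back[i][s] = prev

    # Termination: best final state, then traceback through the pointers.
    last = max(states, key=lambda s: v[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), v[n - 1][last]

# Illustrative promoter/background model (background emissions assumed uniform).
states = ["background", "promoter"]
p = {"background": 0.5, "promoter": 0.5}
A = {"background": {"background": 0.9, "promoter": 0.1},
     "promoter":   {"background": 0.1, "promoter": 0.9}}
E = {"background": {b: 0.25 for b in "ACGT"},
     "promoter":   {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

print(viterbi("ATGCGCGCGCATAT", states, p, A, E)[0])
```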

Decoding Over All Paths

Further research has been done on decoding over all paths, where we find the path of states pi* that contains the most likely state at each position. This means the state with the highest posterior probability at that position, accounting for all paths through it (their previous probabilities as well as the transition and emission probabilities).

This is done through a process called posterior decoding, which seeks to find probabilities for the underlying states at individual positions rather than for whole sequences of positions. The main formula for posterior decoding, computed through dynamic programming with the forward probabilities f and backward probabilities b, is:

P(pi_i = k | x) = f_k(i) × b_k(i) / P(x)

Note: Posterior decoding can give an invalid sequence of states because it does not consider the likelihood of the transitions between the chosen states; it selects the most likely state at each position independently.

Spatial representation of the posterior decoding algorithm.

Posterior decoding and the Viterbi algorithm both have their separate strengths and weaknesses; however, the Viterbi algorithm is most commonly used for inferring functions based on nucleotide sequences.

Posterior decoding is not relevant for the purposes of our paper, so it won’t be discussed; however, more information can be found here.

Scoring Over Multiple Paths

We already know how to calculate the probability of one path using the joint probability calculations that were explained above, exploiting the understanding of the initial, emission, and transition probabilities.

For scoring over multiple paths, a similar approach can be taken. We use this score when we are particularly interested in the likelihood of a particular observed sequence given a Hidden Markov Model.

Instead of focusing on a single hidden path, it focuses on the observed sequence by summing over all 2^n hidden state paths. The probability of the observed sequence over all hidden paths pi is given by the formula:

P(x) = Σ over all paths pi of P(x, pi)

Using a brute force approach, this is an exponential runtime, but dynamic programming can be used to solve this issue through the Forward Algorithm.

This follows an initialization and iteration process similar to the Viterbi algorithm; however, instead of taking the maximum over the previous probabilities multiplied by the transition probabilities, it takes their sum.

This iterative process accounts for all paths by multiplying the transition and emission probabilities at each step and summing over the previous states. Lastly, it terminates by summing the probabilities in the final column to arrive at the total probability. It has the same complexity as the Viterbi algorithm: O(KN) space and O(K²N) time.
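A matching sketch of the Forward Algorithm: the structure mirrors the Viterbi code above, but the maximum is replaced by a sum and termination sums over the final column (same illustrative promoter/background parameters):

```python
def forward(x, states, p, A, E):
    """Total probability P(x) summed over all hidden paths."""
    n = len(x)
    f = [{s: 0.0 for s in states} for _ in range(n)]

    # Initialization.
    for s in states:
        f[0][s] = p[s] * E[s][x[0]]

    # Recursion: f_k(i) = e_k(x_i) * sum_j ( f_j(i-1) * a_jk ).
    for i in range(1, n):
        for s in states:
            f[i][s] = E[s][x[i]] * sum(f[i - 1][j] * A[j][s] for j in states)

    # Termination: sum over the last column.
    return sum(f[n - 1][s] for s in states)

# Reusing the illustrative promoter/background parameters from the Viterbi sketch.
states = ["background", "promoter"]
p = {"background": 0.5, "promoter": 0.5}
A = {"background": {"background": 0.9, "promoter": 0.1},
     "promoter":   {"background": 0.1, "promoter": 0.9}}
E = {"background": {b: 0.25 for b in "ACGT"},
     "promoter":   {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

print(forward("ATGCGCGCGCATAT", states, p, A, E))
```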

Algorithmic Comparisons

A comprehensive overview of the different algorithms discussed.

These algorithms play a significant role in scoring and decoding with Hidden Markov models. Supervised and unsupervised learning can further be used to estimate the model parameters from data rather than predefining them; however, this won’t be discussed for the purposes of this paper.

For more information about the supervised and unsupervised learning tactics to infer functions of nucleotide sequences, refer to the Baum-Welch algorithm and Viterbi training. More information can be found here.

Genomic Applications

Hidden Markov models can be used in a variety of different scenarios, including the detection of conserved protein-coding regions, the detection of chromatin states, and the detection of protein-coding gene structures.

Table showing the different genomic applications of Markov models.

CpG Islands and HMMs

CpG islands, or regions dense in CG dinucleotides (a cytosine followed by a guanine on the same strand), are the main focus of this paper. The cytosines in these dinucleotides often become methylated, and deamination of the methylated cytosine produces thymine (causing a C-to-T mutation).

Because of this methylation, CpG dinucleotides deplete over evolutionary time, and CpG islands are rare because of the high mutation frequency of methylated cytosines. However, when a CpG island lies next to an active promoter, methylation is suppressed and the island usually survives.

Evolutionary process of CpG islands and how they degrade over time. Credit: Wikipedia

Therefore, detecting these CpG islands can highlight specific promoter regions inside the human genome. Traditional sliding-window methods for detecting high levels of Cs and Gs fall short here: a window that is too small may not span an entire island, while a window that is too large can dilute the signal and miss shorter islands. Therefore, HMMs are the key to modeling this.

One issue that often arises with these Markov models is overfitting as the number of parameters grows, which reduces accuracy on unseen data. Therefore, reducing the number of parameters is necessary, either by tying all transition probabilities to the same value or by only allowing transitions between the states we are focused on.

Summary

HMMs are becoming a powerful method in the field of bioinformatics for inferring the functions of nucleotide sequences through Bayes’ rule and probability calculations. HMMs have hidden states and observed states, where a specific hidden state has a probability of emitting a certain observed state, often called the emission probability.

The joint probability of an entire observed sequence and a hidden state path is calculated by multiplying the initial probability with the emission and transition probabilities (the probabilities of one hidden state moving to another). To score over all hidden paths of a certain sequence, the sum of these joint probabilities can be calculated through dynamic programming with the Forward Algorithm.

Going the opposite way, decoding hidden states from nucleotide sequences, two approaches can be used. The Viterbi Algorithm finds the most likely hidden path for a nucleotide sequence, while posterior decoding finds the path containing the most likely state (not the most likely path) at each time point.

Further machine learning techniques, such as Viterbi training and the Baum-Welch algorithm, can make this approach more data-driven and less reliant on predefined parameters.

Additional Resources

Hi! I am a 16-year-old currently interested in the fields of machine learning and biotechnology. If you are interested in seeing more of my content and what I publish, consider subscribing to my newsletter! Check out my March newsletter here! Also, check out my LinkedIn and Github pages. If you’re interested in personal mindsets or just stuff in general, sign up for a chat using my Calendly.
