Chromatin Identification and Genome Annotation using Computational Methods

Aditya Mittal
Analytics Vidhya
Published in
16 min readApr 19, 2021


A model of a person’s epigenome with relation to their DNA. (Image Source)

The Epigenome and Its Function

What is the Epigenome?

The human body has over 210 cell types, even though each one of those cells stores the same information inside of their DNA. How is this possible? The epigenome stores the information about the cell type and the state of the cell: it creates chemical modifications that influence gene expression.

These modifications can encoded in a variety of different ways:

  • Methylation: Adding a methyl group to certain nucleotides to downregulate gene expression
  • Nucleosomes: Positioning nucleosomes to determine which parts of the DNA are available for transcription
  • Histones: a protein that helps comprise the structure of chromatin

Storing Information in Nucleosomes

The structure of the epigenome. (Image Source)

DNA is tightly packed inside of the body through chromatin and nucleosomes. Nucleosomes are the unit of this packaging, and consist of histone proteins for which the DNA can wrap around. There are four major histone proteins: H2A, H2B, H3, and H4, which all serve various functions in the epigenome.

Nucleosomes encode information about the epigenome in two ways:

  • Chromatin accessibility: As mentioned above, positioning nucleosomes in specific places on the DNA will alter transcription rate of certain parts of the DNA. For transcription, the RNA polymerase has to bind to promoter regions, and the presence of nucleosomes could hinder RNA polymerase’s ability to do so.
  • Histone modification: Histones contain tails that protrude from their central structures. These tails can participate in modification through methylation, acetylation, and other chemical binding. This means that histones can allow the same genome to be transcribed in different ways: there are over 100 distinct histone modifications.

Epigenetic Modifications Influencing Biological Function

Difference between acetylation vs. methylation. (Image Source)

One example of this can be seen in enhancer regions; these enhancers can come into contact with a specific promoter through acetylation or methylation and thus activate the repressed promoter.

This will allow for increased activation of the promoter region, and also provides insight into when DNA transcription will occur by monitoring the movement of enhancer regions in the body.

Overall pipeline for epigenomic identification through computational methods. (Image Source)

ChIP (Chromatin Immunoprecipitation) Technology


ChIP technology helps split DNA into certain fragments whose location on the genome denotes the positions of a transcription factor or histone modification. This is incredibly useful as the first step in identifying chromatin and epigenomic markings in an efficient way (without the need to analyze the entire DNA).

A flowchart of how ChIP works. (Image Source)

ChIP works in the following way:

  • Cells are exposed to a particular cross-linking agent, which allows for the formation of covalent bonds between DNA and histone proteins.
  • Genomic DNA is then isolated from the cell nucleus.
  • This isolated DNA is sheared (preparing DNA for analysis by shearing to a desired fragment range) by enzymes.
  • Antibodies are grown to recognize a histone protein.
  • Antibodies are then added to the solution to immunoprecipitate (isolating a particular protein) and purify the complex.
  • Cross-linking is reversed and DNA fragments are purified for analysis.


Various methods of analyzing after initial chromatin immunoprecipitation. (Image Source)

The result of ChIP is short sequences of DNA, which correspond to specific places in the DNA where histones were bound.

To match these fragments to the genome, fluorescent marks can be used in a process called ChIP-chip. Another approach is to use ChIP-seq to do next-generation sequencing of these fragments; this is used more frequently because it has a much wider range of detection and works quicker.

To infer where a specific histone protein is enriched after ChIP-seq, the DNA fragments must be mapped to the DNA (read mapping) and then regions of statistically significant interest must be found (peak calling). These will be gone into further depth later in the paper.

Alternative — Bisulfite Sequencing

How bisulfite sequencing works. (Image Source)

Bisulfite sequencing is a method for determining the specific regions where DNA is methylated, which allows for the discovery of the histones whose tails the DNA is methylated to.

Bisulfite treatment works by converting unmethylated cytosine to uracil, but doesn’t affect methylated cytosine. Then, genomic DNA with and without bisulfite sequencing can be compared to determine cites where cytosine had been methylated.

Note that false positives can be created with an incomplete conversion of unmethylated cytosine to uracil in bisulfite sequencing.

Quality Control Metrics

Because of the experimental biases associated with ChIP sequencing, the output may be of varying quality. Therefore, quality control metrics create thresholds on the data set as a whole to determine whether the data is of enough quality for further analysis.

Using Input DNA as a Control

Using Input DNA as a control for peak analysis. (Image Source)

Accessible regions of the genome are fragmented more easily, which leads to nonuniform fragmentation. To control this bias, run the ChIP experiment without using an antibody.

This “input DNA” can then be fragmented and mapped to create a sort of background signal track that the immunoprecipitated DNA can be mapped onto and compared against.

Read-Level Sequencing Quality Score

Example of a quality score chart. (Image Source)

Each base pair is associated with a quality score, and the reads given in next-generation sequencing also contain quality scores. A lower quality score means a great probability of mismappings, which means that researchers can set a threshold for rejecting base pairs against.

Fraction of Short Reads Mapped — Multiple

Note each read in the genome can map to one, none, or multiple locations. For multiple locations, there are methods to approach this:

  • Conservative: Not assigning reads to any location (con — can lose read)
  • Probabilistic: Fractionally assigning reads to all locations (con — can create unreal peaks where reads are placed)
  • Sampling: Only selecting one location at random for a read (con — can create unreal peaks where reads are placed)
  • Expectation-Maximization: Map reads based on density of unambiguous reads, where can calculate prior probability that a read maps to that specific region given the densities are constant within each region
  • Paired-End: Determine mapping by sequencing both ends of DNA

Fraction of Short Reads Mapped — None

There will likely be no reads that do not map to the genome; however, those that do not can be checked with a quality control metric. Calculate for each region the fraction of regions that map and for those that don’t meet the threshold, disregard them or rerun the experiment.

This is because these regions likely have lack of assembly coverage.

Cross-Correlation and Library Analysis

Cross-correlation is a complex quality control metric, but essentially, it finds the correlation between the number of reads mapping to the forward strand pass and the number that maps to the reverse strand pass as a function of distance between the two reads.

Library calling is the fraction of reads that are non-redundant. This can be calculated by the NRF or non-redundant fraction number.

Low values for NRF indicate low complexity or that there was insufficient DNA. A general target is 0.8 for around 10 million uniquely mapped reads.

Read Mapping

Read mapping determines the best way to assign a given read to its corresponding place in the genome. Due to the large scale of the human genome, any algorithm that attempts to map read must be incredibly efficient in both space and time complexity and note mismatches from sequencing errors or SNP modifications between different samples.

Past mapping of reads was done in multiple methods:

A brief overview of the Needleman-Wunsch algorithm. (Image Source)
  • Sequence Alignment and the Needleman-Wunsch Algorithm (explained here): Sequence alignment is “the process of comparing two nucleotide strands to find the series of mutations that created one from the other.” A common algorithm used is the Needleman-Wunsch algorithm, which uses dynamic programming to determine the longest common subsequence between two strings and then infer mutations.
How BLAST local/global alignment works with the Karp-Robin algorithm. (Image Source)
  • BLAST and the Karp-Robin Algorithm (explained here): BLAST is used for “matching exact nucleotide sequences to the human genome in sets of billions of nucleotides.” The Karp-Robin algorithm uses hashing for enhanced string matching.

Although both of these algorithms have fast computational runtime, they lack in memory. Sequence alignment has O(mn) runtime, linear time string matching or BLAST has O(m + n) runtime, and suffix arrays have O(m) runtime. However, all of these algorithms have a huge memory requirement of O(mn) time, which greatly hinders the algorithm.

Burrows-Wheeler Transform — Encoding

The Burrows-Wheeler transform runs in O(m) time and requires just O(m) space. It allows for the compression of strings and rearranges the string such that it has adjacent repeating letters next to it.

Here are the steps for the Burrows-Wheeler transform:

  • For a given reference genome, add special characters at the beginning and the end of the string. Generate all possible rotations of this string.
  • Sort these rotations lexicographically with the sorted characters last.
  • Keep the last column of the sorted rotations.

A sample example can be shown with HAPPY.

  • HAPPY becomes ^HAPPY@. All possible rotations of ^HAPPY@ are ^HAPPY@, @^HAPPY, Y@^HAPP, PY@^HAP, PPY@^HA, APPY@^H, and HAPPY@^.
  • The list sorted lexicographically is the following, APPY@^H, HAPPY@^, PPY@^HA, PY@^HAP, Y@^HAPP, ^HAPPY@, @^HAPPY.
  • The last column of the sorted rotations is H^APP@Y.

Another example below shows a Burrows-Wheeler transformation with the word BANANA as its input.

Encoding BANANA into BNN^AA@A using Burrows-Wheeler transform. (Image Source)

Burrows-Wheeler Transform — Decoding

This transformation can also be reversed to compute the original string. The method is as follows:

  • Given the transformed string, sort the strings in alphabetical order. This gives the first column of the transform.
  • Combine the last column with the first to get pairs of characters.
  • Sort the pairs and repeat Steps #1 and #2.

This decoding process allows for a transformation of the genome using a space that is linear in size.

The example below shows how to decode the Burrows-Wheeler transformed BNN^AA@A back into the original string ^BANANA@.

Decoding BNN^AA@A into ^BANANA@ using Burrows-Wheeler transform. (Image Source)

Significance of Burrows-Wheeler Transform

  • All occurrences of the same letter are next to each other rather than being scattered across the genome.
  • The ith occurrence of a character in the first column corresponds with the ith occurrence in the last column of a character. This allows for faster substring search of the Burrows-Wheeled transformed word.
  • Each read is found as a prefix of a rotation in the reference genome, which helps give the position of the read in the genome.

Peak Calling

Example of the peak calling histogram. (Image Source)

Once these reads are created, then signal tracks can be generated. These tracks are ordered into the form of histograms that span the length of the genome, where each point corresponds to the number of reads from the ChIP that were found at that specific point in the genome. More of these reads means that there is a strong presence of the specific epigenetic marker that is being looked for.

Peak calling is finding statistically significant peaks from this histogram to determine which places have the highest expression of specific histone proteins.

Signal Tracks

A sample signal track. (Image Source)

The first step is to generate signal tracks that transform the read counts into an intensity signal that can be placed onto a histogram. Using the cross-correlation analysis from the quality control metric, each read can be sequenced to discover information about the segment for which the read is a part of. Comparing this to the signal tracks for the control data will allow for comparative analysis to determine peaks.

Peak Calling — p-value and eFDR

Comparing different epigenomic samples given by peak calling. (Image Source)

Many peak calling programs, such as MACS and PeakSeq, use p-value and eFDR estimates to determine where the statistically significant peaks lie. Each program uses a different statistical program: MACS uses a local Poission distribution while PeakSeq uses a conditional binomial model.

To model the read count distribution, a Poisson distribution can be used to estimate the expected read count from the control data. Then, the probability of that read count not producing a peak is determined, and if it is below a certain threshold p-value, such as 0.0001, then the genomic region is considered a peak.

For a Poisson distribution, the above calculation is used to determine the p-value. This p-value can then be turned into an empirical false discovery rate (eFDR) by swapping the ChIP experiment data with the input data, similar to what was done in the quality control mapping step.

By analyzing which regions already had increased read counts in the input data, it is possible to filter out the control peaks from the true experimental peaks. For each p-value, the eFDR is simply the number of peaks from the input data divided by the number of peaks from the control.

Irreducible Discovery Rate (IDR)

Using IDR in practice. (Image Source)

One major problem with this is that there is no universal eFDR or p-value that can be used for all calculations, and these values depend heavily on a variety of factors. Additionally, the eFDR is sensitive to small changes in its threshold value.

The Irreducible Discovery Rate or IDR can avoid these related issues, and remove the assumptions regarding the relationships between enrichment and significance in the data. It allows us to find peaks without setting a threshold for significance and uses biological replicates instead.

They rely on the idea that a real signal can be reproducible among replicates, while noise cannot. Therefore, the top N peaks in each replicates are compared against one another, and based on the other replicates, the peaks are chosen for the entire set of testing samples.

The IDR is simply just the fraction of peaks present in the top N peaks in the replicate that are not present in other replicates.

Mathematical Analysis of IDR

Note: This will go heavily into statistics and analyze various distributions.

Since the IDR uses ranks for its peaks, that means that the marginal distributions are uniform, while the joint distributions vary and encode most of the information. These joint distributions can be modeled through a copula model, or a multivariate distribution where the marginal probability of each variable remains uniform.

If C_x is the copula function and F(x) is the cumulative distribution for a certain variable x, then the function F(x) is defined as follows:

We can analyze a Bernoulli distribution K_i ~ Bern(pi_i) given this information that will tell us whether the ith peak is relevant for further analysis or whether it is part of the spurious set.

Let z_{0,i} mean that it’s from the spurious set in biological replicate i, and let z_{1,i} mean that its from the consistent set in biological replicate i.

Then, z_1 = (z_{1,1}, z_{1,2}) if K_i = 1, and z_0 = (z_{0,1}, z_{0,2}) if K_i = 0. Therefore, we can model z_{1,1} and z_{0,1} as the following, noting that each of these points follow a normal distribution and that their values are based on the genomic values given in the read mapping stage.

Note that, in the spurious set, the mean is 0, and the squared deviation is 1. Thus, the z-score is simply equal to the data point from the spurious point in biological replicate i. In the real set, the mean is greater than 0 and its correlative factor is between 0 and 1, which means that its z-score will follow the trivial formula of z = (x-mean)/(standard deviation).

A variable can be modeled with the following formulas to determine what the the weighted mean of a peak being in the spurious set using a normal cumulative distribution function as the probability distribution.

We can then discover for a particular signal in the signal track, what the probability of it being a peak is based on the marginal distributions and the values calculated by the aforementioned equations.

Lastly, we can infer the irreproducible discovery rate using an expectation-maximization algorithm that infers the mean, correlation, standard deviation, and transition probabilities of the consistent set, and then find the probability that P(K_i = 1 | (x_{i,1}, x_{i,2}; theta).

Lastly, to rank each of the various top N peaks from each biological replicate, we can take the maximum values using the following equation:

For a more detailed analysis, reference this article for more information.

IDR Significance and Advantages

  • IDR analysis can be performed with a varying number of peaks N for each biological replicate. Based on this N, a desired IDR can be reached.
  • IDR allows for easier reproducibility between experiments.
  • Other methods, such as the union of two biological replicates (will take both peaks and noise in each dataset) and the intersection of two biological replicates (will miss many genuine peaks), don’t accurately find all peaks within the dataset.
  • IDR combines both of the above approaches by accepting both peaks and determining the irreproducibility rate.

Inferring Chromatin Signatures

Note: This section assumes that you already have previous knowledge of what HMMs are and how they function.

Now, using the read mapping and peak calling algorithms, it is possible to interpret chromatin signatures and derive epigenomic features. This will allow us to do two things:

  • Aggregate known chromatin signals (e.g. H3K4me3)
  • Infer feature types (promoters, enhancers, non-coding regions, etc.)

In a simpler way, DNA can take on a series of hidden states that each emit a specific combination of epigenetic modifications. Using the observed epigenetic modifications, we want to use computational methods to infer feature types about a region of DNA.

HMMs for Chromatin Annotation

Final result of chromatin annotation using HMMs. (Image Source)

We can use HMMs to identify combinations of chromatin marks that are most probabilistically likely to occur due to the given epigenomic modifications. Note that with the hidden states, we must capture the functional ordering of states and the chromatin domain differences across the genome.

Since the transition and emission probabilities are not immediately known, an expectation-maximization algorithm, similar to the Baum-Welch algorithm can be used to find the maximum likelihood values for these parameters.

Emission States — HMMs

Emission states for an HMM with 51 chromatin states. (Image Source)

Rather than emitting a singular value like traditional HMMs do, each state emits a combination of epigenetic marks that is represented as a n-dimensional vector with binary values for each epigenetic mark.

Mathematically, with n input marks, each state k emits a vector (p_k1, p_k2, …, p_kn) of probabilities for marks 1 to n. The probability of observing a set of these marks is the the product of the probabilities of observing individual marks, since each of these are assumed to be independent.

Transition States — HMMs

Transition states must be inferred by an expectation-maximization algorithm. One such result of an algorithm can be seen here.

Transition states for an HMM with 51 chromatin states. (Image Source)

Number of States — HMMs

Choosing the number of states is also vital, as increasing the complexity of the model by too large of an amount can lead to overfitting of the data, which means that the model will not generalize to the population.

Bayesian Information Criterion (BIC) can be used to optimize for the complexity of the model. Note that with BIC, there is a tradeoff between model complexity and interpretability that BIC will not help with. Thus, BIC will always produce a model with more states than the ideal model.

Several steps can be done to account for this:

  • Start with a model with more hidden states than is necessary (e.g. use the BIC amount of states).
  • Prune hidden states until only the states of interest have been captured.
  • Try many random initializations and cross-reference samples to determine whether significance of a state is constant over all samples.
  • Iterate several times until the optimal amount of states is produced.
Final result of the model — chromatin mark combinations associated with state. (Image Source)


  • The epigenome is the set of post-translational chemical modifications or marks that influence gene expression within the human body. Chemical modifications include methylation and acetylation.
  • ChIP, a method to determine where histone proteins bind to DNA, or bisulfite sequencing, a method to determine where methylation occurs, can be used as epigenomic assays to determine places of transcription factors and histone proteins.
  • Quality control mapping seeks to remove biases created by ChIP sequencing to create the most accurate analysis from the epigenomic data.
  • Read mapping seeks to assign a given read to the best matching location inside of the genome. Rather than using traditional algorithms such as the BLAST or Needleman-Wunsch algorithm, the Burrows-Wheeler transform was used due to optimal memory allocation.
  • Peak calling uses specific signal tracks to find regions that produce heightened levels of enrichment. P-value testing and eFDR can be used to make a statistical analysis of significance for peaks, but this can become difficult to use as eFDR values change depending on the experiment.
  • IDR compares which fractions of peaks are not reproducible between different biological replicates rather than using tests of significance, which bypasses the problem mentioned with eFDR.
  • HMMs can be used for interpreting chromatin expression. Bayesian Information Criterion can be used for optimizing the number of states needed for the model. Expectation-maximization algorithms can be used to experimentally determine emission and transition probabilities.

Additional Resources

If you want to talk more, schedule a meeting: Calendly! For information about projects that I am currently working on, consider subscribing to my newsletter! Here’s the link to subscribe. If you’re interested in connecting, follow me on Linkedin, Github, and Medium.