Chapter 1: Finding Replication Origins in Bacterial Genomes

Phillip Compeau
Programming for Lovers
36 min readOct 27, 2019

If you enjoy this work, please join my ongoing programming class for charity: https://canvas.instructure.com/enroll/RKFKKP

The most beautiful experiment in biology

Before a cell can divide, it must first replicate its genome, or the sum total of its DNA, so that each of the two daughter cells inherits its own copy. In 1953, James Watson and Francis Crick completed their landmark paper on the DNA double helix with what would become a famous prediction about genome replication:

It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.

They conjectured that the two strands of the parent DNA molecule unwind during replication, and then each “parent” strand acts as a template strand for the synthesis of a new complementary strand of “daughter” DNA (see figure below). Their hypothesis is an interesting one, but is it true?

Watson and Crick’s hypothesized view of DNA replication, with template strands shown in green and complementary strands shown in yellow. Nucleotides adenine (A) and thymine (T) are complements of each other, as are cytosine (C) and guanine (G). Complementary nucleotides on opposing strands of DNA bind to each other.

During the 1950s, scientists debated three proposed models for DNA replication, illustrated in the figure below. The semiconservative hypothesis was Watson and Crick’s suggestion that each parent strand acts as its own template for the synthesis of a complementary strand. The conservative hypothesis proposed that the entire double-stranded parent DNA molecule serves as a template for the synthesis of a new DNA molecule, resulting in one molecule with two parent strands and another with two daughter strands. The dispersive hypothesis proposed that some mechanism breaks the DNA backbone into pieces and splices intervals of synthesized DNA, so that each of the four resulting strands is a patchwork of parent and daughter DNA.

Semiconservative, conservative, and dispersive models of DNA replication make different predictions about the distribution of DNA strands after replication. Yellow strands indicate 15N (heavy) segments of DNA, and black strands indicate ¹⁴N (light) segments. The Meselson-Stahl experiment began with DNA consisting of 100% ¹⁵N.

The debate would eventually be resolved five years later by Matthew Meselson and Franklin Stahl, in what has become known as “the most beautiful experiment in biology”. Their insight was that one isotope of nitrogen, Nitrogen-14 (¹⁴N), is lighter and more abundant than Nitrogen-15 (¹⁵N). Knowing that DNA naturally contains ¹⁴N, Meselson and Stahl grew E. coli for many rounds of replication in a ¹⁵N medium, which caused the bacteria to gain weight as they absorbed the heavier isotope into their DNA. When they were confident that the bacterial DNA was saturated with ¹⁵N, they transferred the heavy E. coli cells to a less dense ¹⁴N medium.

STOP: What do you think happened when the “heavy” E. coli replicated in the “light” ¹⁴N medium for one generation?

After one round of replication, the three hypotheses agreed that half of the nitrogen in the DNA would be the ¹⁴N isotope, but they disagreed on how this ¹⁴N would be distributed over the DNA molecules. The conservative model predicted that half of the double-stranded molecules of DNA would still have only ¹⁵N and therefore be heavier, whereas the other half would have only ¹⁴N and be lighter. Yet when they used a centrifuge to separate the E. coli DNA according to weight after one round of replication, all of the DNA had the same density! Meselson and Stahl had just refuted the conservative hypothesis.

STOP: Note that both the dispersive and semiconservative hypotheses predicted that all of the DNA after one round of replication would have the same density. What would these models predict about the density of E. coli DNA after two rounds of replication?

The dispersive model predicts that after two replication cycles, every strand of DNA should contain about 25% ¹⁵N and about 75% ¹⁴N. In other words, all the DNA should still have the same density.

And yet when Meselson and Stahl spun the centrifuge after two rounds of E. coli replication, they observed two different densities of DNA. This is exactly what the semiconservative model predicted. After one cycle, every cell should possess one ¹⁴N strand and one ¹⁵N strand; after two cycles, half of the DNA molecules should have one ¹⁴N strand and one ¹⁵N strand, while the other half should have two ¹⁴N strands, producing the two different densities that they noticed.

STOP: What does the semi-conservative model predict about the density of E. coli DNA after three rounds of replication?

Meselson and Stahl had refuted the conservative and dispersive hypotheses of replication, and yet they wanted to make sure that the semiconservative hypothesis was confirmed by further E. coli replication. This model predicted that after three rounds of replication, one-quarter of the DNA molecules should still have a strand consisting of 15N, so that 25% of the DNA would be heavier, whereas the remaining 75% would have only 14N and be lighter. This is indeed what Meselson and Stahl witnessed in the lab.

We must point out that Meselson and Stahl’s experiment does not offer the concrete finality of the mathematical proofs that we saw in the Prologue with the work of the Ancient Greeks. After all, they only established the semiconservative hypothesis for one organism of E. coli; this hypothesis may be violated for other E. coli organisms, or other bacteria, or in turn to eukaryotic organisms like you and me. Nevertheless, the semiconservative hypothesis has been witnessed in all living things for six decades.

Formulating a computational problem to find a replication origin

But where is the computation? When a computer scientist looks at a strand of DNA, they infer that it can be represented by the order of its nucleotides. To represent a strand of DNA letters, we will use a string, or a collection of symbols joined into a contiguous “word”. Symbols and strings are built-in variable types in most programming languages, just like integers, decimal numbers, and boolean variables.

We also note that if we know one strand of DNA, then we will automatically know the complementary strand because of base pairing: adenine always pairs with thymine, and cytosine always pairs with guanine. As a result, we only need one string to represent a double-stranded DNA molecule. To be precise, we use the term DNA string to refer to a string of nucleotides from the four-letter alphabet {A,C,G,T}.

The problem of determining the DNA string making up an organism’s genome, or genome sequencing, is its own very challenging problem, and one that I have discussed at length elsewhere. Sequencing the genome of an organism like E. coli (whose genome consists of 4.6 million nucleotides) was a substantial achievement in 1997, and three years later came the first draft of a human genome (3 billion nucleotides), an achievement that cost $3 billion and that led to a boom in vastly cheaper sequencing technologies.

Still, a computer scientist might not imagine that DNA replication has any computational interest — we only need to take a string corresponding to a DNA strand and return two copies of it! Yet if we take the time to review the underlying biological process, we will be amazed at the complex symphony coordinating genome replication, as well as how computation can help us answer biological questions about replication.

Throughout this chapter, we will consider bacterial genomes, which consist of a single circular chromosome. Bacterial genome replication begins in a single genomic region called the replication origin (denoted ori) and is performed by molecular copy machines called DNA polymerases that attach free-floating nucleotides to the growing strand of DNA, in keeping with the semiconservative hypothesis.

Our goal, then, is to determine where ori is hiding in the genome of a bacterium like E. coli, which consists of around 3 million nucleotides. As we have done in the Prologue, we could state this as a problem in terms of input and output.

Origin of Replication Problem

Input: A DNA string genome.

Output: The location of ori in genome.

To a laboratory biologist, the Origin of Replication Problem has a straightforward solution: hack out one short segment from the genome at a time until we find a region where replication is disrupted. However, these types of “knockout” experiments take time to design and implement, and they are not guaranteed to be accurate. What happens, for instance, if we find several regions of the genome whose deletion disrupts replication?

Yet a computer scientist shakes their head and points out that we don’t have a clearly defined computational problem because we haven’t defined precisely what we are looking for to qualify as ori. In other words, what precisely characterizes the replication origin that will help us train a computer to find it?

A second biological problem

Our plan for finding ori in bacterial genomes is to begin with a bacterium in which the location of ori has been found experimentally, and then to determine what makes this genomic region special to design a computational approach for finding ori in other bacteria. The species that we will used with an experimentally verified ori is Vibrio cholerae, the bacterium that causes cholera. The nucleotide sequence appearing in its ori is shown below:

atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

How does the bacterial cell know to begin replication exactly in this short region within the much larger Vibrio cholerae genome, which consists of 1,108,250 nucleotides? There must be some “hidden message” in the ori region telling the cell, “Begin replication here!” The question is how to find this hidden message without knowing what it looks like in advance.

Hidden Message Problem

Input: A string text (representing the replication origin of a genome).

Output: A hidden message in text.

Unfortunately, the Hidden Message Problem is still not a computational problem because the notion of a “hidden message” may make sense in terms of human language, but it has not been defined precisely in terms of concepts that we can use to program a computer.

We now have two biological problems to solve.

  1. Given a shorter ori within a longer genome, what is the hidden message indicating that replication should start in this region?
  2. Given a bacterial genome, where is ori?

Our goal in this chapter is to formulate these two biological problems as computational problems and to develop algorithms that can quickly and accurately find where ori is lurking in the genomes of thousands of bacterial species, along with identifying the hidden message making each ori special. In so doing, we will provide you a glimpse into the rapidly growing field of computational biology, in which computers are answering questions about biology that we could have only dreamed of answering experimentally.

Hidden Messages in the Replication Origin

Counting words

We will begin with our first problem and look for the hidden message that makes an ori region special.

Various biological processes involving DNA require proteins to bind to the DNA. For example, a transcription factor binds to a specific DNA sequence and initiates the process of transcribing the DNA template into RNA, which is then sent out of the cell nucleus and used to produce other proteins. Many DNA-binding proteins “read” the DNA and will only bind to DNA when they detect a specific “keyword”, i.e., DNA string.

STOP: For a given DNA-binding protein, would it make sense for an organism to have multiple occurrences of the DNA keyword, or just one?

The more occurrences of the nucleotide string in the desired location of the genome, the more likely that binding will successfully occur (and the less likely that a mutation will disrupt the binding process). Therefore, a bacterium will most likely have evolved to have multiple occurrences of the nucleotide string in the region of the genome where a given protein binds with DNA. This is but one more illustration of the famous Theodore Dobzhansky quote, “Nothing in biology makes sense except in the light of evolution.

In the case of replication, we know that replication initiation is mediated by DnaA, a protein that binds to a short segment within the ori known as a DnaA box. Let’s see, then, if we can find any surprisingly frequent “words” within the ori of Vibrio cholerae that might be DnaA boxes. First, we will consider the problem of counting the number of occurrences of a given pattern in a text. If a string appears within a longer text, we say that it is a substring of the text. This brings us to the following computational problem.

Substring Counting Problem

Input: A string pattern and a longer string text.

Output: The number of times that pattern occurs as a substring of text.

We will account for overlapping occurrences of pattern in text. For example, we will say that ATA occurs three times in CGATATATCCATAG, not twice.

Our plan is to “slide a window” down text, checking whether each length-k substring of text matches pattern, and adding one to a count variable every time we encounter a match (see figure below). It is just a matter of converting this idea into a pseudocode function PatternCount solving the Substring Counting Problem.

Sliding a window to compute PatternCount(text, pattern) = 3 for pattern = “ATA” and text = “CGATATATCCATAG”. We initialize count to zero and then increment it each time that pattern appears in text (shown in green).

Continuing our 0-based indexing from the Prologue, we might think of a string as similar to an array of symbols, so that text begins at position 0 and ends at position len(text)−1; the “length” function len for counting the number of symbols in a string (or the number of elements in an array) is typically built-in to programming languages.

The notation that many programming languages use for the length-k substring of text starting at position i of text is text[i, i + k]. For example, if text is "GACCATACTG", then text[2, 8] is "CCATAC", and text[4, 6] is "AT". This notation is convenient for two reasons. First, note that the length of the substring text[i, j] is always equal to j - i, and so we will immediately know that text[2, 8] has length equal to 8–2 = 6. Second, we can always infer the final index of the substring in text by subtracting 1 from the higher index. That is, the string text[2, 8] is the substring of text that begins at position 2 and ends at position 7.

Before continuing, we also note that the same notation applies to subarrays (contiguous arrays inside arrays). If a is an array, then a[i, j] is the subarray of a of length j - i that starts at index i and continues up to and not including index j.

STOP: First, what is the notation for the three substrings of text in the figure above that are equal to "ATA"?

Second, how many starting positions are there for substrings of length k in a string of length n?

On the heels of the preceding exercise, note that the starting positions of length-k substrings of text range from 0 up to and including len(text) − k. For example, the last 3-mer of "GACCATACTG" starts at position 10 − 3 = 7. This discussion results in our desired pseudocode function solving the Substring Counting Problem.

PatternCount(pattern, text)
count ← 0
n len(text)
k len(pattern)
for every integer i between 0 and n k
if text[i, i + k] = pattern
count
count + 1
return count

The Frequent Words Problem

Now that we can count the number of times that a given pattern appears in a longer string, we will return to our original problem of finding patterns that occur frequently.

We will apply the term k-mer as shorthand to refer to a string of length k. We say that pattern is a most frequent k-mer in text if it has the largest value of PatternCount(text, pattern) among all k-mers. You can see that ACTAT is a most frequent 5-mer for ACAACTATGCATACTATCGGGAACTATCCT, and ATA is a most frequent 3-mer for CGATATATCCATAG.

STOP: Can a string have multiple most frequent k-mers?

We now have a rigorously defined computational problem. Before continuing, you might like to brainstorm how we can solve it using an array. What subroutines might be useful?

Frequent Words Problem

Input: A string text and an integer k.

Output: All most frequent k-mers in text.

Many algorithms will solve the frequent words problem. For example, we might try to generate all possible k-mers, and generate an array whose i-th value is the number of occurrences of the i-th k-mer. This approach is likely inefficient unless k is very small because the number of possible k-mers grows very quickly in terms of k (how many k-mers can you form from the DNA alphabet {A, C, G, T})?

Another array-based approach proceeds as follows.

  1. Create an array count of length len(text)-k+1.
  2. For each i, set count[i] equal to the number of times text[i, i+k] appears in text. That is, count[i] should be equal to PatternCount(pattern, text), where pattern is text[i, i+k].
  3. Consider the maximum values of count[i]. For any i achieving this maximum, the substring text[i, i + k] is a frequent k-mer.

For example, the array count for text ="ACGTTTCACGTTTTACGG" and k = 3 is shown in the figure below. Note that the maximum value is achieved six times at the indices 0, 3, 7, 10, 11, and 14. The indices 0, 7, and 14 correspond to the three starting positions of "ACG", and the indices 3, 10, and 11 correspond to the three starting positions of "TTT".

The count array for text = ACGTTTCACGTTTTACGG and k = 3. For example, count[0] = 3 because the 3-mer starting at position 0 (ACG) appears three times in text (at positions 0, 7, and 14). Accordingly, count[7] and count[14] are both equal to 3 as well.

Once we have generated the count array of text, we know that the most frequent k-mers in text will be those whose corresponding entries in count are the largest. We should first write a function to find the maximum value of an array.

STOP: Write a pseudocode function MaxArray that takes an array of integers as input and returns the maximum integer value of the array.

We are nearly ready to write pseudocode for a function FrequentWords solving the Frequent Words Problem. This function will leverage the idea from ListPrimes in the Prologue of forming an empty array freqPatterns to which we will add any frequent words that we find. However, note that when we range through count, we will encounter multiple indices in the array corresponding to the same substring (recall that indices 0, 7, and 14 all correspond to "ACG" in the above example). As a result, we should only append a string text[i, i+k] to freqPatterns if it is not already present in freqPatterns. This discussion motivates the following exercise.

STOP: Write a pseudocode function Contains that takes an array of strings strings and a string pattern as input; your function should return true if pattern occurs as an element of strings , and false otherwise.

We are now ready to present the function FrequentWords that implements our array-based idea for finding frequent k-mers in a string.

FrequentWords(text, k)
freqPatterns ← an array of strings of length 0
n len(text)
count ← an array of integers of length n k + 1
for every integer i between 0 and n k
pattern
text[i, i + k]
count[i] ← PatternCount(text, pattern)
max MaxArray(count)
for every integer i between 0 and n k
if count[i] = max
pattern
text[i, i + k]
if Contains(freqPatterns, pattern) = false
freqPatterns append(freqPatterns, pattern)
return frequentPatterns

STOP: The FrequentWords algorithm is inefficient; why? How could we improve it?

A Faster Frequent Words Approach

If you were to solve the Frequent Words Problem by hand for a small example, you would probably form a table like the one in the figure below for text equal to"ACGTTTCACGTTTTACGG" and k equal to 3. You would slide a length-k window text, and if the current k-mer substring of text does not occur in the table, then you would create a new entry for it. Otherwise, you would add 1 to the entry corresponding to the current k-mer substring of text. We call this table the frequency table for text and k.

A table corresponding to counting the number of occurrences of every 3-mer in text = ACGTTTCACGTTTTACGG.

In the previous FrequentWords algorithm, we also make a single pass down text, but each time we encounter a k-mer window, we call the PatternCount subroutine, which requires its own pass down the entire length of text. But when we build a frequency table, we make one pass down text, and every time we encounter a k-mer, we simply add 1 to the k-mer’s count.

We know that an array of length n is an ordered table of values, where we access the values using the integer indices 0 through n-1. The frequency table is a generalized version of an array called a map or dictionary for which the indices are allowed to be arbitrary values (in this case, they are strings). More precisely, the indices of a map are called keys.

Given a map dict, we can access the value associated with a key key using the notation dict[key]. In the case of a frequency table called freq, we can access the value associated with some key string pattern using the notation freq[pattern]. The following pseudocode function takes a string text and an integer k as input and returns their frequency table as a map of string keys to integer values.

FrequencyTable(text, k)
freqMap ← empty map
n len(text)
for every integer i between 0 and n k
pattern
text[i, i + k]
if freqMap[pattern] doesn't exist
freqMap[pattern] = 1
else
freqMap[pattern]++
return freqMap

Once we have built the frequency table for a given text and k, we can find all frequent k-mers if we determine the maximum value in the table, and then identify the keys of the frequency table achieving this value, appending each one that we find to a growing list. We are now ready to write a function BetterFrequentWords to solve the Frequent Words Problem.

BetterFrequentWords(text, k)
frequentPatterns ← an array of strings of length 0
freqMap FrequencyTable(text, k)
max MaxMap(freqMap)
for all strings pattern in freqMap
if freqMap[pattern] = max
frequentPatterns
append(frequentPatterns, pattern)
return frequentPatterns

STOP: Why do we not need the Contains subroutine in BetterFrequentWords?

A remark on MaxMap is in order. It would be very easy for us to imagine this as a straightforward function ranging over all the keys in the dictionary and updating a maximum value m every time we find a bigger element. The following function implements this idea.

MaxMap(dict)
m ← 0
for every key pattern in dict
if dict[pattern] > m
m
dict[pattern]
return m

Yet we could imagine a map with string keys whose integer values are all negative. For such a map, we would set m equal to 0, and this value would never get updated because dict[pattern] would never be larger than m. As a result, we would erroneously return 0, rather than the true maximum value of the map.

STOP: Can you think of a modification to MaxMap that will find the maximum value of any map of string keys to integer values?

Unlike with arrays, there is no particular order to how the keys of a map are ordered. We can nevertheless resolve the issue if we introduce a Boolean variable firstTime that will be true if we have not yet considered any of the keys of dict and will be set to false as soon as we consider the first key in the map.

MaxMap(dict)
m ← 0
firstTime = true
for every key pattern in dict
if firstTime = true or dict[pattern] > m
firstTime
= false
m dict[pattern]
return m

Frequent words in Vibrio cholerae

The figure below reveals the most frequent k-mers in the ori region from Vibrio cholerae. Do any of the counts seem surprisingly large?

The most frequent k-mers in the ori region of Vibrio cholerae for k from 3 to 9, along with the number of times that each k-mer occurs.

For example, the 9-mer "ATGATCAAG" appears three times in the ori region of Vibrio cholerae — is it surprising?

   atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctcttgatcatcgtttc

We highlight a most frequent 9-mer instead of using some other value of k be- cause experiments have revealed that bacterial DnaA boxes are usually nine nucleotides long. The probability that there exists a 9-mer appearing three or more times in a randomly generated DNA string of length 500 is approximately 1/1300. In fact, there are four different 9-mers repeated three or more times in this region: "ATGATCAAG", "CTTGATCAT", "TCTTGATCA", and "CTCTTGATC".

The low likelihood of witnessing even one repeated 9-mer in the ori region of Vibrio cholerae leads us to the working hypothesis that one of these four 9-mers may represent a potential DnaA box that, when appearing multiple times in a short region, jump-starts replication. But which one?

STOP: Is any one of the four most frequent 9-mers in the ori of Vibrio cholerae “more surprising” than the others?

Some Hidden Messages are More Surprising than Others

Recall that nucleotides A and T are complements of each other, as are C and G. The figure below shows a template strand "AGTCGCATAGT" and its complementary strand "ACTATGCGACT".

Complementary strands run in opposite directions.

At this point, you may think that we have made a mistake, since the comple- mentary strand in this figure reads out "TCAGCGTATCAT from left to right rather than "ACTATGCGACT". We have not: each DNA strand has a direction (strands are read in the direction from 5' to 3'), and the complementary strand runs in the opposite direction to the template strand, as shown by the arrows in the figure above.

The reverse complement of a string pattern is the string formed by taking the complement of each nucleotide in pattern, then reversing the resulting string. The following problem is fundamental in computational biology research.

Reverse Complement Problem

Input: A DNA string pattern.

Output: The reverse complement of pattern.

We could write a single function solving the Reverse Complement Problem, but we can instead pass our work to two subroutines. Reverse will return the reverse of an input string, whereas Complement will take the string formed by complementing each nucleotide in a DNA string (but not reversing the string).

ReverseComplement(pattern)
pattern Reverse(pattern)
pattern Complement(pattern)
return pattern

This function offers an example of how modularity, or dividing code into small functions that call each other, is a powerful programming practice. Modular code will be not only easier to read but also easier to debug, since we can test functions independently of each other in order to diagnose problems. In this case, we will first note the following shorter version of the ReverseComplement function.

ReverseComplement(pattern)
return Reverse(Complement(pattern))

STOP: Write pseudocode for the Reverse and Complement functions.

Interestingly, among the four most frequent 9-mers in the ori region of Vibrio cholerae, "ATGATCAAG" and "CTTGATCAT" are reverse complements of each other, resulting in the following six occurrences of these strings.

   atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac 
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc

It should be noted that we don’t know whether "ATGATCAAG" or "CTTGATCAT" is the true “message” to which DnaA binds. If the message is "ATGATCAAG", then every occurrence of "CTTGATCAT" above is an occurrence of "ATGATCAAG" on the complementary strand of DNA. If the message is "CTTGATCAT", then every occurrence of "ATGATCAAG" above will appear as "CTTGATCAT" in the complementary strand. DnaA is not able to tell the difference between the two strands of DNA, and so it will see six occurrences of its hidden message.

Finding a 9-mer that appears six times (either as itself or as its reverse complement) in a DNA string of length 500 is far more surprising than finding a 9-mer that appears three times (as itself). This observation leads us to the working hypothesis that "ATGATCAAG" and its reverse complement "CTTGATCAT" represent a DnaA box in Vibrio cholerae.

However, before concluding that we have found the DnaA box of Vibrio cholerae, the careful bioinformatician should check if there are other short regions in the Vibrio cholerae genome exhibiting multiple occurrences of "ATGATCAAG" or "CTTGATCAT". After all, maybe these strings occur as repeats throughout the entire Vibrio cholerae genome, rather than just in the ori region. To this end, we should solve the following computational problem.

Pattern Matching Problem

Input: Strings pattern and genome.

Output: All starting positions in genome where pattern appears as a substring.

Note how similar this problem is to the Counting Words Problem. Here, rather than counting the number of occurrences of a pattern within a longer string, we are finding all the starting positions of this pattern within the string. Our function, which we call StartingIndices, should therefore range over all the k-mers of text and append any k-mers that match pattern to a growing list, which we call positions.

StartingIndices(pattern, text)
positions ← array of integers of length 0
n len(text)
k len(pattern)
for every integer i between 0 and n k
if text[i, i + k] = pattern
positions
append(positions, i)
return positions

There is a general programming principle time we write code that is very similar to what we have already written, it indicates that we very well may be able to use a subroutine instead. In this case, note that once we have the array positions storing the starting indices of all occurrences of pattern in text, we can obtain the number of pattern matches just by accessing the length of positions. As a result, we can rewrite the PatternCount function using StartingIndices as a subroutine and obtain a shorter function leveraging modularity.

PatternCount(pattern, text)
positions StartingIndices(pattern, text)
return len(positions)

After implementing the Pattern Matching Problem, we discover that "ATGATCAAG" appears 17 times in the following positions of the Vibrio cholerae genome:

116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527, 307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368

With the exception of the highlighted three occurrences of "ATGATCAAG" in ori at starting positions 151913, 152013, and 152394, no other instances of "ATGATCAAG" form clumps, i.e., appear close to each other in a small region of the genome. You may check that the same conclusion is reached when searching for "CTTGATCAT". We now have strong statistical evidence that "ATGATCAAG"/"CTTGATCAT" may represent the hidden message to DnaA to start replication.

STOP: Can we conclude that "ATGATCAAG"/"CTTGATCAT" also represents a DnaA box in other bacterial genomes?

An Explosion of Hidden Messages

Looking for hidden messages in multiple genomes

We should not jump to the conclusion that "ATGATCAAG"/"CTTGATCAT" is a hidden message for all bacterial genomes without first checking whether it even appears in known ori regions from other bacteria. After all, maybe the clumping effect of "ATGATCAAG"/"CTTGATCAT" in the ori region of Vibrio cholerae is simply a statistical fluke that has nothing to do with replication. Or maybe different bacteria have different DnaA boxes . . .

Let’s check the proposed ori region of Thermotoga petrophila, a bacterium that thrives in extremely hot environments; its name derives from its discovery in the water beneath oil reservoirs, where temperatures can exceed 80◦ Celsius.

   aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
cctaccacctgcgtcccctattatttactactactaataatagcagtataattgatctga

This region does not contain a single occurrence of "ATGATCAAG" or "CTTGATCAT"! Thus, different bacteria may use different DnaA boxes as “hidden messages” to the DnaA protein.

Application of the Frequent Words Problem to the ori region above reveals that the following six 9-mers appear in this region three or more times:

          "AACCTACCA"        "AAACCTACC"        "ACCTACCAC"
"CCTACCACC" "GGTAGGTTT" "TGGTAGGTT"

Something peculiar must be happening because it is extremely unlikely that six different 9-mers will occur so frequently within a short region in a random string. We will cheat a little and consult with Ori-Finder, a software tool for finding replication origins in DNA sequences. This software chooses "CCTACCACC" (along with its reverse complement "GGTGGTAGG") as a working hypothesis for the DnaA box in Thermotoga petrophila. Together, these two complementary 9-mers appear five times in the replication origin:

    aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttGGTGGTAGGttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaaCCTACCACCaaactctgtattgaccattttaggacaacttcagGGTGGTAGGttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttaCCTACCACCcgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
CCTACCACCtgcgtcccctattatttactactactaataatagcagtataattgatctga

The Clump Finding Problem

We now have a good sense of how to solve our first big scientific question of identifying the hidden message in a given replication origin indicating that replication should begin. Yet it remains to answer our second, larger question, which is where this replication origin lives within a much longer bacterial genome consisting of millions of nucleotides.

Imagine that you are trying to find ori in a newly sequenced bacterial genome. Searching for “clumps” of "ATGATCAAG"/"CTTGATCAT" or "CCTACCACC"/"GGTGGTAGG" is unlikely to help, since this new genome may use a completely different hidden message! Before we lose all hope, let’s change our computational focus: instead of finding clumps of a specific k-mer, let’s try to find every k-mer that forms a clump in the genome. Hopefully, the locations of these clumps will shed light on the location of ori.

Our plan is to slide a window of fixed length L along the genome, looking for a region where a k-mer appears several times in short succession. The parameter value L = 500 reflects the typical length of ori in bacterial genomes.

More formally, given integers L and t, a k-mer pattern forms an (L, t)- clump inside a (longer) string genome if there is an interval of genome of length L in which this k-mer appears at least t times. (This definition assumes that the k-mer completely fits within the interval.) For example, "TGCA" forms a (25,3)-clump in the following string:

       gatcagcataagggtccCTGCAATGCATGACAAGCCTGCAGTtgttttac

From our previous examples of ori regions, we say that "ATGATCAAG" forms a (500,3)-clump in the Vibrio cholerae genome, and that "CCTACCACC" forms a (500,3)-clump in the Thermotoga petrophila genome. We are now ready to formulate the following problem.

Clump Finding Problem

Input: A string text, and integers k, L, and t.

Output: All distinct k-mers forming (L, t)-clumps in text.

The Clump Finding Problem is a more complex problem than we have encountered thus far, and writing a function solving it from scratch would be difficult. However, this is where modularity in writing programs is so helpful. We already have a FrequencyTable function that will produce a frequency table for a given window of a string of length L. If we apply it to a given window, then we simply need to check if there are any string keys in the table whose values are at least equal to t. We will append any such keys that we have not already seen in some other window of text to a growing list of strings. In the end, this list of strings will contain the (L, t)-clumps of text. This is handled by the following FindClumps function.

FindClumps(text, k, L, t)
patterns ← an array of strings of length 0
n len(text)
for every integer i between 0 and n L
window
text[i, i + L]
freqMap FrequencyTable(window, k)
for every key s in freqMap
if freqMap[s] ≥ t and Contains(patterns, s) = false
patterns
append(patterns, s)
return patterns

Let’s look for clumps in the Escherichia coli (E. coli) genome, the workhorse of bacterial genomics. We find hundreds of different 9-mers forming (500, 3)-clumps in the E. coli genome, and it is absolutely unclear which of these 9-mers might represent a DnaA box in the bacterium’s ori region.

At this point, an unseasoned researcher might give up, since it appears that we do not have enough information to locate ori in E. coli. But a veteran computational biologist would dig into the biological details of replication in the hope that they provide new algorithmic insights into finding ori.

A Surprising Pattern, and Some Gritty Details about Replication

Lurking biological phenomenon or statistical fluke?

The figure below reveals a surprising pattern. We have partitioned the E. coli genome into 46 equally sized windows of approximately 100,000 nucleotides, starting at the experimentally verified terminus of replication, and then computed the frequency of cytosine in each window. For the first half of the genome, most windows have a high cytosine frequency (significantly above 25%), whereas most windows after we reach ori have a lower cytosine frequency (significantly below 25%).

The frequency of cytosine in each of 46 equal-length disjoint windows (of approximately 100,000 nucleotides each) covering the E. coli genome. The replication terminus is located at position 0, whereas the replication origin site is located approximately 2.3 million nucleotides away, halfway down the genome.

In contrast, as shown in the figure below, the picture is reversed for guanine frequency. After we pass ori, most windows have high guanine frequency.

The frequency of guanine in the same 46 windows in the E. coli genome. Half of the genome has windows that tend to have low guanine frequency, and half of the genome has windows with high frequency. However, the picture is reversed from the preceding figure and the consideration of cytosine.

We obtain an even more striking visualization if we take the difference of the frequencies of guanine and cytosine in each window, as shown in the figure below.

The difference between the frequencies of guanine and cytosine across the 46 windows of the E. coli genome, assuming that the genome begins at the experimentally verified replication terminus of E. coli.

Even if we assume that we do not know the location of ori in advance, the pattern still presents itself when starting at an arbitrary position of the E. coli genome, as shown in the figure below.

The difference between the frequencies of guanine and cytosine across the 46 windows of the E. coli genome, starting at an arbitrary location of the genome. We can still infer the location of the replication origin as the point at which the guanine-cytosine difference passes from negative to positive, which in the figure above is approximately 3.9 million nucleotides into the genome file.

It seems that we have uncovered a hint about how to find ori — we can simply walk along the genome and check where the difference between the frequency of guanine and cytosine switches from negative to positive! But why in the world would such a simple test allow us to find the replication origin of a bacterium?

The Simplest Way to Replicate a Genome

We are now ready to discuss the replication process in more detail. As illustrated in the figure below, the two complementary DNA strands running in opposite directions around a circular chromosome unravel, starting at ori. As the strands unwind, they create two replication forks, which expand in both directions around the chromosome until the strands completely separate at the replication terminus (denoted ter). The replication terminus is located roughly opposite to ori in the chromosome.

Four imaginary DNA polymerases at work replicating a chromosome as the replication forks extend from ori to ter. The blue strand is directed clockwise, and the green strand is directed counterclockwise.

An important thing to know about replication is that a DNA polymerase does not wait for the two parent strands to completely separate before initiating replication; instead, it starts copying while the strands are unraveling. Thus, just four DNA polymerases, each responsible for one half-strand, can all start at ori and replicate the entire chromosome. To start replication, a DNA polymerase needs a primer, a short complementary segment (shown in red in the figure above) that binds to the parent strand and jump starts the DNA polymerase. After the strands start separating, each of the four DNA polymerases starts replication by adding nucleotides, beginning with the primer and proceeding around the chromosome from ori to ter in either the clockwise or counterclockwise direction. When all four DNA polymerases have reached ter, the chromosome’s DNA will have been completely replicated, resulting in two pairs of complementary strands as shown in the figure below, and the cell is ready to divide.

Replication is complete.

Yet while you were reading the description above, biology professors were writing a petition to have me sent back to Biology 101. And they would be right, because our exposition suffers from a major flaw.

The problem with our current description is that it assumes that DNA polymerases can copy DNA in either direction along a strand of DNA (i.e., both 5′ → 3′ and 3′ → 5′). However, nature has not yet equipped DNA polymerases with this ability, as they are unidirectional, meaning that they can only traverse a template strand of DNA in the 3′ → 5′ direction. Notice that this is opposite from the 5′ → 3′ direction in which DNA is read.

STOP: If you were a unidirectional DNA polymerase, how would you replicate DNA? How many DNA polymerases would be needed to complete this task?

The unidirectionality of DNA polymerase requires a major revision to our naive model of replication. Imagine that you decided to walk along DNA from ori to ter. There are four different half-strands of parent DNA connecting ori to ter, as highlighted in the figure below. Two of these half-strands are traversed from ori to ter in the 5′ → 3′ direction and are called lagging half-strands (represented by thin blue and green lines in the figure below). The other two half-strands are traversed from ori to ter in the 3′ → 5′ direction and are called leading half-strands (represented by thick blue and green lines in the figure below).

Complementary DNA strands with lagging and leading half-strands shown as thin and thick lines, respectively.

Asymmetry of Replication

While biologists will feel at home with the following description of DNA replication, computer scientists may find it overloaded with new terms. If it seems too biologically complex, then feel free to skim this section, as long as you believe us that the replication process is asymmetric, i.e., that leading and lagging half-strands have very different fates with respect to replication.

Since a DNA polymerase can only move in the reverse (3′ → 5′) direction, it can copy nucleotides non-stop from ori to ter along leading half-strands. However, replication on lagging half-strands is very different because a DNA polymerase cannot move in the forward (5′ → 3′) direction; on these half-strands, a DNA polymerase must replicate backwards toward ori. Take a look at the figure below to see why this must be the case.

Replication begins at ori (primers shown in red) with the synthesis of fragments on the leading half-strands (shown by thick lines). A DNA polymerase must wait until the replication fork has opened some (small) distance before it starts copying the lagging half-strands (shown by thin lines) back toward ori.

On a lagging half-strand, in order to replicate DNA, a DNA polymerase must wait for the replication fork to open a little (approximately 2,000 nucleotides) until a new primer is formed at the end of the replication fork; afterwards, the DNA polymerase starts replicating a small chunk of DNA starting from this primer and moving backward in the direction of ori. When the two DNA polymerases on lagging half-strands reach ori, we have the situation shown in the figure below. Note the contrast between this figure and our original (incorrect) picture of the replication process.

The daughter fragments are now synthesized (after some delay) on the lagging half-strands (shown by thin lines).

After this point, replication on each leading half-strand progresses continuously; however, a DNA polymerase on a lagging half-strand has no choice but to wait again until the replication fork has opened another 2,000 nucleotides or so. It then requires a new primer to begin synthesizing another fragment back toward ori. On the whole, replication on a lagging half-strand requires occasional stopping and restarting, which results in the synthesis of short Okazaki fragments that are complementary to intervals on the lagging half-strand. You can see these fragments forming in the figure below.

The replication fork continues growing. Only one primer is needed to replicate the leading parent half-strands (shown by thick lines), while the lagging half-strands (shown by thin lines) require multiple primers in order to synthesize Okazaki fragments. Two of these primers are shown in red on each lagging half-strand.

When the replication fork reaches ter, the replication process is almost complete, but gaps still remain between the disconnected Okazaki fragments, as shown in the figure below.

Replication is nearly complete, as all daughter DNA is synthesized. However, half of each daughter chromosome contains disconnected Okazaki fragments.

Finally, consecutive Okazaki fragments are sewn together by an enzyme called DNA ligase, resulting in two intact daughter chromosomes, each consisting of one parent strand and one newly synthesized daughter strand, as shown in the figure below. In reality, DNA ligase does not wait until after all the Okazaki fragments have been replicated to start sewing them together.

Okazaki fragments have been sewn together, resulting in two intact chromosomes.

Deamination

We have seen that as the replication fork expands, DNA polymerase synthesizes DNA quickly on the leading half-strand but suffers delays on the lagging half-strand. Notice that since the replication of a leading half-strand proceeds quickly, it lives double-stranded for most of its life. Conversely, a lagging half-strand spends a much larger amount of its life single-stranded, waiting to be used as a template for replication. This discrepancy between the leading and lagging half-strands is important because single-stranded DNA has a much higher mutation rate than double-stranded DNA. In particular, if one of the four nucleotides in single-stranded DNA has a greater tendency than other nucleotides to mutate in single-stranded DNA, then we should observe a shortage of this nucleotide on the lagging half-strand. It might also explain the strange phenomenon that we saw earlier in this section with respect to guanine and cytosine frequencies.

Following up on this thought, let’s examine the nucleotide counts of the leading and lagging half-strands. The nucleotide counts for Thermotoga petrophila are shown in the figure below. Although the frequencies of A and T are practically identical on the two half-strands, C is more frequent on the leading half-strand than on the lagging half-strand, resulting in a difference of 219518 − 207901 = +11617. Its complementary nucleotide G is less frequent on the leading half-strand than on the lagging half-strand, resulting in a difference of 201634 − 211607 = −9973.

Counting nucleotides in the Thermotoga petrophila genome on the leading and lagging half-strands.

It turns out that we observe these discrepancies because cytosine(C) has a tendency to mutate into thymine (T) through a process called deamination. Deamination rates rise 100-fold when DNA is single-stranded, which leads to a decrease in cytosine on the lagging half-strand, thus forming mismatched base pairs T-G as shown in the figure below. These mismatched pairs can further mutate into T-A pairs when the bond is repaired in the next round of replication, which accounts for the observed decrease in guanine (G) on the leading half-strand (recall that a lagging parent half-strand synthesizes a leading daughter half-strand, and vice-versa).

A mutation of a cytosine nucleotide to a thymine nucleotide on a lagging half-strand will cause DNA polymerase to pair the new thymine with an adenine, thus reducing the amount of guanine present on the leading half-strand as well in the daughter chromosome.

STOP: If deamination changes cytosine to thymine, why do you think that the lagging half-strand still has some cytosine?

Half of the daughter DNA will therefore have a shortage of C on the lagging half-strand and a shortage of G on the leading half-strand. Over time, mutations like this will build up in the population, and produce the disparities that we saw previously in the counts of cytosine and guanine on opposing half-strands.

The Skew Diagram

Let’s see if we can take advantage of the peculiar statistics caused by deamination to locate ori even more accurately than the approach that we illustrated previously. As the figure below shows, the difference between the total amount of guanine and the total amount of cytosine is negative on the leading half-strand (201634 − 219518 = −17884) and positive on the lagging half-strand (211607 − 207901 = 3706). Thus, our idea is to traverse the genome, keeping a running total of the difference between the counts of G and C. If this difference starts increasing, then we guess that we are on the lagging half-strand; on the other hand, if this difference starts decreasing, then we guess that we are on the leading half-strand. See the figure below.

Because of deamination, each lagging half-strand has a shortage of cytosine compared to guanine, and each leading half-strand has a shortage of guanine compared to cytosine. The dashed blue line illustrates an imaginary walk along the outer strand of the genome counting the difference between the counts of guanine and cytosine. We assume that the difference between these counts is positive on the lagging half-strand and negative on the leading half-strand.

STOP: Imagine that you are reading through the genome (in the 5′ → 3′ direction) and notice that the difference between the guanine and cytosine counts just switched its behavior from decreasing to increasing. Where in the genome are you?

Since we don’t know a priori the location of ori in a circular genome, we will have a linear string genome where the position of ori is not indicated. For a genome of length n, define the skew array of genome as the array of length n+1 whose i-th value is the difference between the total number of occurrences of G and the total number of occurrences of C in the first i nucleotides of genome. This leads us to the following computational problem. You may like to practice your pseudocode skills by writing a function to solve this problem.

Skew Array Problem

Input: A DNA string genome.

Output: The skew array of genome.

We can visualize a skew array with a chart called a skew diagram. The skew diagram is a line graph that plots the number of nucleotides k encountered in genome on the x-axis, and the k-th value of the skew array on the y-axis. A skew diagram for a short DNA string is shown in the figure below.

The skew diagram for genome = CATGGGCATCGGCCATACGCC.

The above skew diagram provides an insight, which is that the difference between the number of G and the number of C encountered in the first i nucleotides of genome differs by at most 1 from this difference when considering the first i–1 nucleotides of genome. This insight leads to a solution to the Skew Array Problem below, called SkewArray. This function employs a subroutine Skew that takes a symbol (A, C, G, or T) as input and returns 1 if the symbol is G, -1 if the symbol is C, and 0 otherwise.

SkewArray(genome)
nlen(genome)
array ← array of length n+1
array[0] ← 0
for every integer i between 1 and n
array[i] = array[i-1] + Skew(genome[i-1])
return array
Skew(symbol)
if symbol = 'G'
return 1
else if symbol = 'C'
return -1
return 0

The figure below depicts the skew diagram for a linearized E. coli genome. Notice the very clear V-shaped pattern! It turns out that the skew diagram for many bacterial genomes has a similar characteristic shape.

The skew diagram for E. coli achieves a maximum and minimum at positions 1550413 and 3923620, respectively.

STOP: After looking at the skew diagram in the figure above, where do you think that ori is located in E. coli?

Let’s follow the 5′ → 3′ direction of DNA and walk along the chromosome from ter to ori (along a leading half-strand), then continue on from ori to ter (along a lagging half-strand). We have already reasoned that the skew should generally be decreasing along the leading half-strand and increasing along the lagging half-strand. Thus, the skew should achieve a minimum at the position where the reverse half-strand ends and the forward half-strand begins, which is exactly the location of ori! We have just developed an algorithm for locating ori: it should be found where the skew attains a minimum. To be precise, we can state this as a computational problem. Again, you may want to try solving this problem before looking at our pseudocode.

Minimum Skew Problem

Input: A DNA string genome.

Output: All indices i of the skew array of genome achieving the minimum value in the skew array.

To solve the Minimum Skew Problem, we employ SkewArray as a subroutine along with a subroutine MinValue that finds the minimum value in an array of integers, which we leave to you.

MinimumSkew(genome)
indices ← array of integers of length 0
nlen(genome)
array SkewArray(genome)
mMinValue(array)
for every integer i between 0 and n
if array[i] = m
indices = append(indices, i)
return indices

STOP: Note that the skew diagram changes depending on where we start our walk along the circular chromosome. Do you think that the minimum of the skew diagram points to the same position in the genome regardless of where we begin walking to generate the skew diagram?

Some Hidden Messages are More Elusive than Others

Solving the Minimum Skew Problem now provides us with an approximate location of ori at position 3923620 in E. coli. It is remarkable that such a simple analysis of the nucleotide frequencies in a genome would lead us to such a precise biological hypothesis.

In an attempt to confirm this hypothesis, let’s look for a hidden message representing a potential DnaA box near this location. Solving the Frequent Words Problem in a window of length 500 starting at position 3923620 (shown below) reveals no 9-mers (along with their reverse complements) that appear three or more times! Even if we have located ori in E. coli, it appears that we still have not found the DnaA boxes that jump-start replication in this bacterium …

   aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggt
atgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaaga
cctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgat
ctcttattaggatcgcactgccctgtggataacaaggatccggcttttaagatcaacaac
ctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggttatacacaactcaaaaactgaacaacagttgttctttggataactaccgg
ttgatccaagcttcctgacagagttatccacagtagatcgcacgatctgtatacttattt
gagtaaattaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatc
aagaatgttgatcttcagtg

Before we give up, let’s examine the ori of Vibrio cholerae one more time to see if it provides us with any insights on how to alter our algorithm to find DnaA boxes in E. coli. You may have noticed that in addition to the three occurrences of "ATGATCAAG" and three occurrences of its reverse complement "CTTGATCAT", the Vibrio cholerae ori contains additional occurrences of "ATGATCAAC" and "CATGATCAT", which differ from "ATGATCAAG" and "CTTGATCAT" in only a single nucleotide:

   atcaATGATCAACgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagCATGATCATggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc

Finding eight approximate occurrences of our target 9-mer and its reverse complement in a short region is even more statistically surprising than finding the six exact occurrences of "ATGATCAAG" and its reverse complement "CTTGATCAT" that we stumbled upon in the beginning of our investigation. Furthermore, the discovery of these approximate 9-mers makes sense biologically, since DnaA can bind not only to “perfect” DnaA boxes but to their slight variations as well.

Let’s cross our fingers and identify the most frequent 9-mers (with 1 mismatch and reverse complements) within a window of length 500 starting at position 3923620 of the E. coli genome. Bingo! The experimentally confirmed DnaA box in E. coli ("TTATCCACA") is a most frequent 9-mer with 1 mismatch, along with its reverse complement "TGTGGATAA":

   aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggt
atgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaaga
cctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgat
ctcttattaggatcgcactgcccTGTGGATAAcaaggatccggcttttaagatcaacaac
ctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggTTATACACAactcaaaaactgaacaacagttgttcTTTGGATAActaccgg
ttgatccaagcttcctgacagagTTATCCACAgtagatcgcacgatctgtatacttattt
gagtaaattaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatc
aagaatgttgatcttcagtg

We were fortunate that the DnaA boxes of E. coli are captured in the window that we chose. Moreover, while "TTATCCACA" represents a most frequent 9-mer with 1 mismatch and reverse complements in this 500-nucleotide window, it is not the only one: "GGATCCTGG", "GATCCCAGC", "GTTATCCAC", "AGCTGGGAT", and "CTGGGATCA" also appear four times with 1 mismatch and reverse complements.

STOP: In this chapter, every time we find ori, we seem to find some other surprisingly frequent 9-mers. Why do you think this is?

We do not know what purpose — if any — these other 9-mers serve in the E. coli genome, but we do know that there are many different types of hidden messages in genomes. These hidden messages have a tendency to cluster within a genome, and most of them have nothing to do with replication. However, even providing biologists with a small collection of 9-mers as candidate DnaA boxes is a great aid as long as one of these 9-mers is correct.

The moral is that existing approaches to ori prediction remain imperfect and sometimes inconclusive. Furthermore, even though computational predictions can be powerful, computational biologists should still collaborate with experimental biologists where possible. Yet at the same time, it is clear that a revolution has struck biology. In an era in which we are bombarded by large-scale biological datasets, a computational revolution is very much in progress and is already answering the big unresolved questions of this field.

Epilogue

Thank you for reading this far! You can find more information about Programming for Lovers at the course homepage on Canvas: https://canvas.instructure.com/enroll/RKFKKP.

Below, I will provide a lecture demonstrating how we can implement the major algorithms from this section in Go while learning about how Go implements arrays, slices, strings, and maps.

Finally, I must acknowledge that this chapter grew from previous work with Pavel Pevzner, with whom I am the co-author of Bioinformatics Algorithms: An Active Learning Approach (http://bioinformaticsalgorithms.com). If you enjoyed this chapter, I think you would love Pavel and my textbook or the online courses that it accompanies.

--

--

Phillip Compeau
Programming for Lovers

Associate Teaching Professor and Asst Dept Head for Education in the Computational Biology Department in Carnegie Mellon University’s School of Computer Science