De Bruijn Graph: a Genome

Note that this is part of course note taken from this coursera course on DNA Algorithm.

The problem: In genome assembly, we are given a set of reads that are substrings of original sequence, and our goal is to re-construct the original sequence. E.g. given ATCG, TCGC, GCAA => ATCGCGCAA

One way to go about it is to assume that we have every single k-mer (reads of length k) taken from a read. E.g. taking all 4-mers of ATCGCGCAA yields ATCG, TCGC, CGCG, GCGC, CGCA, GCAA. Of course, in real world, there is no guarantee that the assumption holds, but this is a good starting point.

The De Bruijn Graph is a convenient mean of reconstructing the original sequence. To construct such a graph we have the following steps:

Step1: take all (k-1)-mers from the set of k-mers, e.g. ATCG, TCGC -> ATC, TCG, TCG, CGC. We should have double the size of k-mer reads.

Step2: construct a multi-graph with nodes being k-1-mers; draw an edge between two k-1 mers only if the two k-1 mers are taken from the same read. E.g. ATCG, TCGC gives

De Bruijn Graph built from 2 k-mer: ATCG, TCGC

Step3: the graph constructed this way is guaranteed to have a Eulerian trail, we follow the trail and connect the nodes to form our original sequence.

De Bruijn Graph built from 4 mers of ATCGCGCAA, we can follow an Eulerian trail to reconstruct the original sequence

More details: why does the algorithm work? The second k-1 mer of every k-mer is the start of the k-1 mer in the next k-mer, continuing this way to the end of sequence.

When would the algorithm fail? when we have repeats in the sequence. For example, the following is the graph generated by taking all 3-mers of ATGATATG.

De Bruijn Graph from 3mers of sequence ATGATATG. Note the many AT repeats

In this case, we could construct TGATGATG out of the graph by taking an Eulerian tour (circuit to be more accurate), which is different from the original sequence. To overcome the ambiguity from short repeats, we may increase our read length so that we have more distinct k-1 mers.

Towards a real assembly algorithm

We have seen an algorithm to assemble k-mer reads. We can loosen the condition to accept reads above a given length, and to break each k-mer to (k-n)-mers to account for reads that have k-n overlap, instead of k-1 overlap.

Also, we can try reconstructing parts that are easier to assemble (contigs) and leave out parts that are ambiguous. For example, ATGGCCGAAGCAACGGGTTTGTGACGCGCGCGCGCGACCAGG has a long CG repeats in the middle, but the two ends are fairly easy to assemble.

CGG->GGG->GGT->.. are easy to assemble, we can leave out the ambiguous parts

Code Snippet here:

# coding: utf-8
# In[2]:
def debruijnize(reads):
nodes = set()
not_starts = set()
edges = []
for r in reads:
r1 = r[:-1]
r2 = r[1:]
nodes.add(r1)
nodes.add(r2)
edges.append((r1,r2))
not_starts.add(r2)
return (nodes,edges,list(nodes-not_starts))
# In[3]:
def build_k_mer(str,k):
return [str[i:k+i] for i in range(0,len(str)-k+1)]
# In[4]:
reads = build_k_mer("ATCGTTGCGCGACCG",4)
print(reads)
# In[5]:
G = debruijnize(reads)
print(G)
# In[6]:
def make_node_edge_map(edges):
node_edge_map = {}
for e in edges:
n = e[0]
if n in node_edge_map:
node_edge_map[n].append(e[1])
else:
node_edge_map[n] = [e[1]]
return node_edge_map
m = make_node_edge_map(G[1])
print(m)
# In[7]:
def eulerian_trail(m,v):
nemap = m
result_trail = []
start = v
result_trail.append(start)
while(True):
trail = []
previous = start
while(True):

if(previous not in nemap):
break
next = nemap[previous].pop()
if(len(nemap[previous]) == 0):
nemap.pop(previous,None)
trail.append(next)
if(next == start):
break;
previous = next
# completed one trail
print(trail)
index = result_trail.index(start)
result_trail = result_trail[0:index+1] + trail + result_trail[index+1:len(result_trail)]
# choose new start
if(len(nemap)==0):
break
found_new_start = False
for n in result_trail:
if n in nemap:
start = n
found_new_start = True
break # from for loop
if not found_new_start:
print("error")
print("result_trail",result_trail)
print(nemap)
break
return result_trail
start = G[2][0] if (len(G[2]) > 0) else G[0][0]
print m
t = eulerian_trail(m,start)
print(t)
# In[8]:
def visualize_debruijn(G):
nodes = G[0]
edges = G[1]
dot_str= 'digraph "DeBruijn graph" {\n '
for node in nodes:
dot_str += ' %s [label="%s"] ;\n' %(node,node)
for src,dst in edges:
dot_str += ' %s->%s;\n' %(src,dst)
return dot_str + '}\n'
# In[9]:
get_ipython().magic(u'load_ext gvmagic')
# In[10]:
get_ipython().magic(u'dotstr visualize_debruijn(G)')
# In[11]:
def assemble_trail(trail):
if len(trail) == 0:
return ""
result = trail[0][:-1]
for node in trail:
result += node[-1]
return result
assemble_trail(t)
# In[12]:
def test_assembly_debruijn(t,k):
reads = build_k_mer(t,k)
G = debruijnize(reads)
v = visualize_debruijn(G)
nemap = make_node_edge_map(G[1])
print(G)
print(v)
start = next(iter(G[2])) if (len(G[2]) > 0) else next(iter(G[0]))
trail = eulerian_trail(nemap,start)
return assemble_trail(trail)
# In[17]:
test_assembly_debruijn("ATGGCCGAAGCAACGGGTTTGTGACGCGCGCGCGCGACCAGG",4)
Like what you read? Give hanchen a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.