# De Bruijn Graph: a Genome

Note that this is part of course note taken from this coursera course on DNA Algorithm.

**The problem:** In genome assembly, we are given a set of reads that are substrings of original sequence, and our goal is to re-construct the original sequence. E.g. given ATCG, TCGC, GCAA => ATCGCGCAA

One way to go about it is to assume that we have every single *k-mer* (reads of length k) taken from a read. E.g. taking all 4-mers of ATCGCGCAA yields ATCG, TCGC, CGCG, GCGC, CGCA, GCAA. Of course, in real world, there is no guarantee that the assumption holds, but this is a good starting point.

The **De Bruijn Graph** is a convenient mean of reconstructing the original sequence. To construct such a graph we have the following steps:

*Step1*: take all (k-1)-mers from the set of k-mers, e.g. ATCG, TCGC -> ATC, TCG, TCG, CGC. We should have double the size of k-mer reads.

*Step2*: construct a multi-graph with nodes being k-1-mers; draw an edge between two k-1 mers only if the two k-1 mers are taken from the same read. E.g. ATCG, TCGC gives

Step3: the graph constructed this way is guaranteed to have a Eulerian trail, we follow the trail and connect the nodes to form our original sequence.

**More details**: why does the algorithm work? The second k-1 mer of every k-mer is the start of the k-1 mer in the next k-mer, continuing this way to the end of sequence.

When would the algorithm fail? when we have repeats in the sequence. For example, the following is the graph generated by taking all 3-mers of ATGATATG.

In this case, we could construct TGATGATG out of the graph by taking an Eulerian tour (circuit to be more accurate), which is different from the original sequence. To overcome the ambiguity from short repeats, we may increase our read length so that we have more distinct k-1 mers.

**Towards a real assembly algorithm**

We have seen an algorithm to assemble k-mer reads. We can loosen the condition to accept reads above a given length, and to break each k-mer to (k-n)-mers to account for reads that have k-n overlap, instead of k-1 overlap.

Also, we can try reconstructing parts that are easier to assemble (contigs) and leave out parts that are ambiguous. For example, ATGGCCGAAGCAACGGGTTTGTGACGCGCGCGCGCGACCAGG has a long CG repeats in the middle, but the two ends are fairly easy to assemble.

**Code Snippet here:**

# coding: utf-8

# In[2]:

def debruijnize(reads):

nodes = set()

not_starts = set()

edges = []

for r in reads:

r1 = r[:-1]

r2 = r[1:]

nodes.add(r1)

nodes.add(r2)

edges.append((r1,r2))

not_starts.add(r2)

return (nodes,edges,list(nodes-not_starts))

# In[3]:

def build_k_mer(str,k):

return [str[i:k+i] for i in range(0,len(str)-k+1)]

# In[4]:

reads = build_k_mer("ATCGTTGCGCGACCG",4)

print(reads)

# In[5]:

G = debruijnize(reads)

print(G)

# In[6]:

def make_node_edge_map(edges):

node_edge_map = {}

for e in edges:

n = e[0]

if n in node_edge_map:

node_edge_map[n].append(e[1])

else:

node_edge_map[n] = [e[1]]

return node_edge_map

m = make_node_edge_map(G[1])

print(m)

# In[7]:

def eulerian_trail(m,v):

nemap = m

result_trail = []

start = v

result_trail.append(start)

while(True):

trail = []

previous = start

while(True):

if(previous not in nemap):

break

next = nemap[previous].pop()

if(len(nemap[previous]) == 0):

nemap.pop(previous,None)

trail.append(next)

if(next == start):

break;

previous = next

# completed one trail

print(trail)

index = result_trail.index(start)

result_trail = result_trail[0:index+1] + trail + result_trail[index+1:len(result_trail)]

# choose new start

if(len(nemap)==0):

break

found_new_start = False

for n in result_trail:

if n in nemap:

start = n

found_new_start = True

break # from for loop

if not found_new_start:

print("error")

print("result_trail",result_trail)

print(nemap)

break

return result_trail

start = G[2][0] if (len(G[2]) > 0) else G[0][0]

print m

t = eulerian_trail(m,start)

print(t)

# In[8]:

def visualize_debruijn(G):

nodes = G[0]

edges = G[1]

dot_str= 'digraph "DeBruijn graph" {\n '

for node in nodes:

dot_str += ' %s [label="%s"] ;\n' %(node,node)

for src,dst in edges:

dot_str += ' %s->%s;\n' %(src,dst)

return dot_str + '}\n'

# In[9]:

get_ipython().magic(u'load_ext gvmagic')

# In[10]:

get_ipython().magic(u'dotstr visualize_debruijn(G)')

# In[11]:

def assemble_trail(trail):

if len(trail) == 0:

return ""

result = trail[0][:-1]

for node in trail:

result += node[-1]

return result

assemble_trail(t)

# In[12]:

def test_assembly_debruijn(t,k):

reads = build_k_mer(t,k)

G = debruijnize(reads)

v = visualize_debruijn(G)

nemap = make_node_edge_map(G[1])

print(G)

print(v)

start = next(iter(G[2])) if (len(G[2]) > 0) else next(iter(G[0]))

trail = eulerian_trail(nemap,start)

return assemble_trail(trail)

# In[17]:

test_assembly_debruijn("ATGGCCGAAGCAACGGGTTTGTGACGCGCGCGCGCGACCAGG",4)