Sequence to Sequence Prediction (seq2seq) for DNA, Proteins Analysis

7 min readJun 30, 2017

Sequence to Sequence Prediction (seq2seq) as Deep Learning is making waves in various fields such as Computer Vision, Language Processing. In this article, let me introduce Graph Theory based seq2seq technology created by NaturalText to change the way bio-sequences such as Gene, Protein and Nucleotides are analyzed in Computational Biology.

Why seq2seq for Computational Biology?

The Deoxy Ribonucleic Acid commonly known as the DNA is the foundation on which life exists. With just 4 nucleobases Adenine (A), Guanine (G), Cytosine © and Thymine (T), it constructs billions of unique sequences. The analysis of these sequences has been a subject of fundamental research ever since the discovery of the double helix, with various techniques and tools being developed over these decades.

An innovative step towards analyzing this would be looking at the DNA as a language — alphabets (AGCT) + grammatical rules. This point of view opens the space to using Natural Language Understanding, an area of research as old as DNA research but highly advanced in today’s world in the analysis of DNA sequences.

How hard?

The number of ways in which the four nucleobases ‘ACGT’ arrange themselves are given below.

The four letters, ‘ACGT’ can be arranged in various ways.

a) Two letters sequences can be arranged in 16 ways,AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT

b) Three letters sequences can be arranged in 64 ways,AAA, AAC.. ACA.. AGA.. CAA.. CGA.. CCC.. GCA... TTT

c) Four letter sequences can be arranged in 256 ways, AAAA, AAAC.. ACAA.. AGAA.. CAAC.. CGCA.. CCCC.. GCAC... TTTT

This can be written as 4, 4^2, 4^3, 4^4, 4^5.... 4^n

Thus the total number of possible combinations that can occur in a 100 length sequence can occur? 4^100= 1.6x10^60

For comparison, the observable universe has 10^82 atoms whereas a 200 length sequence of DNA has 10^120 possible combinations. This is equivalent to the number of possible games in Chess. Imagine, the number of possible combinations for a 300 Million length sequence!

This shows how hard to find a match, if we try to use brute-force method. However, we can assume that, there is particular structure hidden in DNA that makes up the entire living world.

How is it approached currently?

Presently popular tools such as BLAST, Clustal Omega handle this challenge by using statistical methods and dynamic programming. For Gene comparison and genome assembly, methods such as de Bruijn graph and Burrows–Wheeler aligner are available. These techniques are computationally complex and thus require high performance computing facilities. The degree of complexity is discussed in the subsequent section.

How to apply de Bruijn graphs to genome assembly
Fast and accurate short read alignment with Burrows–Wheeler transform

Computational complexity

Alignment of multiple sequences is NP-Hard or NP-complete depending upon the task performed. It is discussed in these papers.

Computational complexity of multiple sequence alignment with SP-score.
On the Complexity of Multiple Sequence Alignment.

Gist of the complexity can be found in Dynamic programming and computational complexity

How NaturalText’s seq2seq can solve this?

NaturalText’s seq2seq innovative algorithm handles this challenge by extracting patterns using graphical reasoning. The exact method of solving is proprietary. The algorithm’s effectiveness is based on two main principles, search space reduction and Fail fast.

Search Space Reduction

The basic idea here is to shorten the length of the search string to which a test string is to compared with. For example, comparing a 100 length string in 1000 length string, would be faster than comparing a 100 length string in 300 Million length string. Even by the brute-force method, this will result in 300,000 times increase in performance.

Narrowing down the search to only in regions that differ, would give the same performance increase in comparing large sequences or multiple large sequences. When comparing large sequences such as genomes, this method can do magic. If only small regions of genes differ, then that regions alone can be compared and analyzed.

Fail Fast, Fail Faster, Finish Fastest

Reducing the search space is nice strategy but what about worst case performance? Well, that is where fail fast method comes in. If there is no likelihood of finding motifs, the process will be stopped immediately. This ensures that if and when regions are compared, then it must have 100% chance of finding a match.

The 100% match advantage

Achieving a 100% match using the available tools has been elusive owing to the NP-Hard problem. This leads to problems such as gaps in whole genome sequencing data. NaturalText’s seq2seq can match 100% to the detail of single letter. This advantage comes from not running to NP-Hard problem, ie seq2seq doesn’t face NP-Hard problem. Idea is to bypass NP-Hard problem instead of solving it.

This has potential to change genomics research by reducing the cost and increasing the accuracy.

Seq2seq Patterns no Statistics

Statistical data is the most preferred way of data processing and analysis in science. However, when thrown into densely compacted information of DNA or proteins, the usage of statistical techniques is cumbersome with the accuracy of the output too taking a hit. By using well defined graphical patterns instead of the regular statistical calculation, the seq2seq algorithm betters the former in terms of greater speed in processing data and also 100% accuracy in finding the match in the output.

Graphical Reasoning, not Deep Learning

Even though Deep Learning methods do the sequence to sequence prediction, Natural Text’s methods are not Deep Learning. It is based on Graph Theory. The decision making, discriminator in Deep Learning , is done by reasoning using Graph Theory. Hence, it is Artificial Reasoning instead of Artificial Intelligence. Intelligence is a box term where as reasoning refers to applying logical rules.

Where is Variants, Indels etc?

This is a mathematical solution by treating bio-sequences as strings. Variants and other concepts can be applied using Artificial Reasoning.

Comparing Millions of short reads in bulk for Genome assembly

Using seq2seq to analyse genome assembly, its innovative algorithm that breaks down the search to comparing millions of short reads, thus reducing the analysis time by a factor of 10. For example, if three 150 short reads are 99% similar, there is 100% chance that all three short reads occur in a same region.

Instead of finding that region for three times, that particular motif area can be found once and three reads can be matched at once. For a million reads, reads can be grouped and assembled. Million reads can be categorized into 5000 categories, thus 200 times performance improvement from single comparison.

If NGS results are in the order of millions, instead of comparing short reads one by one, it can be compared in bulk thus saving time and costs.

Instead of following the familiar path of having a reference genome to match, short reads can be compared, eliminating another round of verification and validation.

Storing in Graph Database

Storing the bio-sequences in graph database or any database is hard because the distribution is highly dense. Just 4 nodes and millions of edges. However, Storing extracted motifs is easy. A gene can be reduced to 30 million motifs which can stored in a graph database and analyzed.

However, if 30 million patterns for one gene, with 23 genes, 690 million patterns for all the genes, this just for one set. If there is small differences then this count can go up for millions, if we consider sequencing millions of people. Would any database can handle this load? Given billion patterns, how to find and compare patterns faster?

High Throughput Database

A high throughput database handling half billion nodes and eight billion edges with response time in the range of milliseconds can handle even trillions of nodes in distributed environment.

This database is developed by NaturalText to handle big data.

Performance Details.

Gene assembly

Gene taken for comparison :gi|568815597|ref|NC_000001.11| Homo sapiens chromosome 1, GRCh38.p7 Primary AssemblyNumber of Nucleotide short reads : 1 Million

System :

Processor: AMD FX(tm)-6300 Six-Core Processor
Memory : 24GB (DDR2 8GB *3)
Time to process ref Gene : 20 Minutes
Time to compare Million Nucleotide : 32 Minutes
Max Memory : 10 GB
Threads : Single Thread

This time is in low cost, low end machine. Time reduces into half if processed in i7 processor.

Processor : Intel(R) Core(TM) i7-4770K CPU
Memory : 16 GB (DDR3 8GB * 2)
Time to process ref Gene : 9 Minutes
Time to compare Million Nucleotide : 14 Minutes
Max Memory : 10 GB
Threads : Single Thread

Check the NaturalText Genome Alignment and Search Demo

If processing ref gene is omitted, it can be stored in database and loaded in less than min, total time taken to process million Nucleotide is 4x faster than current methods which is an hour in high end machine with 256 GB Memory. Only Single Thread is used to make comparisons easy. Open-Source tools such as Burrows Wheeler Aligner takes one hour in a single thread.

For high end machines, running cost is also high apart from the initial investments. This makes not only treatments such as precision medicine costly and not accessible to masses because of high cost.

Nucleotide Comparison

Processor:	AMD FX(tm)-6300 Six-Core Processor
Memory : 24GB (DDR2 8GB *3)
Time to process million Nucleotide : 2 Minutes
Time to compare Million Nucleotide : 10 Minutes
Max Memory : 10 GB
Threads : Single Thread

Check the NaturalText Nucleotide Alignment and Search Demo

API Access and OnDemand Analysis

Another remarkable advantage of this approach is that, it can be accessed via API and on demand analysis is possible. Further, what if there is a new sequence or a need to find patterns in another set of data? The seq2seq algorithm has been developed to handle this additional complexity by having the capability to scale both vertically (more data, more analysis) and horizontally (different datasets, different comparisons)

How that is possible? Because this sequence alignment technology has been created as a general solution for sequence prediction in various areas such as Natural Languages.