Protein Alignment and Search by Graphical Models

Finding similar proteins by aligning their sequences has many applications in biology.

In computational biology, three dynamic programming algorithms are commonly used to align sequences: the Needleman–Wunsch algorithm, the Smith–Waterman algorithm, and Hirschberg’s algorithm.

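All these algorithms score alignments with a substitution matrix and gap penalties, with variations in implementation. Needleman–Wunsch produces global alignments, Smith–Waterman produces local alignments, and Hirschberg’s algorithm computes the same global alignment as Needleman–Wunsch using linear space.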
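To make the difference between global and local alignment concrete, here is a minimal Python sketch of the scoring step of Needleman–Wunsch and Smith–Waterman. It assumes a flat scoring scheme (match +1, mismatch -1, gap -1) in place of a real substitution matrix such as BLOSUM62, and it returns only the score, not the alignment itself. The function names are illustrative and not taken from any library.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a against all of b."""
    # prev[j] is the best score for the previous prefix of a against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural differences are the zero floor and tracking the best cell anywhere in the table, which is what lets Smith–Waterman pick out a well-matching region inside otherwise unrelated sequences.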

Instead of searching with substitution-matrix methods, it is possible to find aligned sequences with a graphical reasoning algorithm built on a graph database.

Word sequences and protein sequences are comparable: both are sequences with structure. Natural languages differ in grammar, but each works as a structured sequence of words.

If a grammar algorithm works for more than one natural language, it can also work on sequences from other fields, such as protein sequences. Because it searches by pattern, it can be used for motif extraction too.
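The custom graph framework behind the demo is not described here, so the following is only a hypothetical illustration of pattern-based search over a graph, not the actual implementation. It assumes a simple bipartite graph in which each k-mer (a short subsequence, treated like a "word") is a node linked to the ids of the sequences that contain it. The names `kmers`, `build_index`, and `search` and the choice of k = 3 are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Return every overlapping substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(sequences, k=3):
    """Build a k-mer -> {sequence ids} adjacency map (a bipartite graph)."""
    graph = defaultdict(set)
    for seq_id, seq in sequences.items():
        for kmer in kmers(seq, k):
            graph[kmer].add(seq_id)
    return graph


def search(graph, query, k=3):
    """Rank sequence ids by the number of distinct query k-mers they share."""
    hits = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in graph.get(kmer, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy data for illustration only; these are not real protein records.
sequences = {"p1": "MKTAYIAK", "p2": "MKTLLVAG", "p3": "GGGGGG"}
index = build_index(sequences)
ranked = search(index, "MKTA")
```

A lookup like this needs no substitution matrix: the query is matched by following edges from its k-mers to the sequences that share them, which is also why the same structure can surface recurring motifs.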

Check out the demo: NaturalText Protein Search

Data used: random FASTA-formatted sequences downloaded from NCBI
Number of protein sequences: 25,000
Database: custom-developed general-purpose database used as a graph database
Graph algorithm: custom-developed graph framework
Hardware details: 2 cores, 2 GB RAM
Execution details: pure Python, single-process execution
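For readers unfamiliar with the input format, here is a minimal pure-Python sketch of reading FASTA records, where each record is a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the sample records are assumptions for illustration; the loader used in the demo is not shown in this post.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Made-up sample text, not real NCBI data.
sample = ">seq1 example\nMKTAYIAK\nQRQISF\n>seq2\nGGG\n".splitlines()
records = list(parse_fasta(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps memory use low on a small machine.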

As this is a proof of concept hosted on a low-configuration machine, it may be slower than existing solutions.


Originally published at naturaltext.com.