Protein sequence classification, also known as protein homology detection, is the task of assigning proteins to their respective superfamilies.
Any given protein sequence:
The ideal protein sequence data would look like,
Both of these protein sequences belong to the kinase superfamily, and their lengths differ.
Early methods relied on sequence alignment, profile alignment and similar techniques. Later, researchers built classification models using machine learning algorithms such as SVMs. For these models, the sequence data had to be represented as an appropriate feature vector.
Recurrent neural networks have attracted the attention of researchers because of their ability to extract sequential information from input data. So, here is a quick summary of how the paper ‘Protein remote homology detection based on bidirectional long short-term memory’ used a bidirectional LSTM to classify the sequence data.
Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinformatics. 2017 Oct 10;18(1):443. doi: 10.1186/s12859-017-1842-2.
To begin with, let’s go through the architecture of the proposed model.
The input is the amino acid sequence, made up of the 20 naturally occurring amino acids. For computation, these sequences are represented using either one-hot encoding or a Position Specific Scoring Matrix (PSSM).
One-hot Encoding: This generates a feature matrix of size 20 x L (where L is the length of the protein sequence) for each sequence entering the model. Each column contains a single 1, in the row of the amino acid found at that position.
Position Specific Scoring Matrix: This also generates a feature matrix of size 20 x L. Each entry x_ij reflects how likely amino acid i is to occur at position j of the given sequence.
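To make the 20 x L shape concrete, here is a minimal sketch of one-hot encoding in NumPy. The row ordering of the amino acids, the function name one_hot_encode and the toy sequence are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

# The 20 standard amino acids in one-letter code. The row order is an
# arbitrary choice for this sketch, not something prescribed by the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a 20 x L matrix with at most one 1 per column."""
    matrix = np.zeros((len(AMINO_ACIDS), len(sequence)), dtype=np.float32)
    for position, residue in enumerate(sequence):
        row = AA_INDEX.get(residue)
        if row is not None:  # non-standard residues (e.g. 'X') stay all-zero
            matrix[row, position] = 1.0
    return matrix


# Hypothetical toy sequence of length 5.
encoded = one_hot_encode("MKVLA")
print(encoded.shape)  # (20, 5)
```

A PSSM has the same shape, but its columns hold real-valued scores instead of a single 1.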
Now, this input is fed into the bidirectional LSTM, which aims to extract the useful information from the given sequence. Each LSTM has a unit called a memory cell that receives two inputs: the subsequence within the sliding window and the output of the LSTM at the previous time step (t-1).
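The sliding window can be pictured with a short sketch. It assumes a window of 3 residues moved one position at a time, with each window flattened into a single vector per time step; the stride, the flattening and the name sliding_windows are assumptions for illustration rather than details confirmed by the paper.

```python
import numpy as np


def sliding_windows(features, window=3):
    """Split a 20 x L feature matrix into L - window + 1 flattened windows.

    Each returned row is the input for one time step of the LSTM.
    """
    n_rows, length = features.shape
    steps = length - window + 1
    if steps < 1:
        raise ValueError("sequence is shorter than the window")
    return np.stack(
        [features[:, t:t + window].reshape(-1) for t in range(steps)]
    )


# Placeholder feature matrix for a sequence of length 10.
features = np.arange(200, dtype=np.float32).reshape(20, 10)
windows = sliding_windows(features, window=3)
print(windows.shape)  # (8, 60)
```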
As you can see in the figure, the memory cell consists of an input gate, a forget gate and an output gate. The data, i.e. the current subsequence (window size 3) and the output of the memory cell at the previous time step, enters through the input gate, which controls how much new information is written into the cell. The forget gate decides how much of the previous cell state is kept, and the output gate controls how much information is sent out as the hidden value of the memory cell.
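The gate mechanics can be written out as one step of a standard LSTM cell. This is a generic textbook formulation in NumPy, not code from the paper; the weight names W, U and b, the gate ordering, the sizes and the random initialisation are all assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    W: (4H, D), U: (4H, H), b: (4H,), with the four blocks stacked in the
    order input gate, forget gate, candidate values, output gate.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information enters
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state is kept
    g = np.tanh(z[2 * H:3 * H])  # candidate values to write into the cell
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much is exposed as h_t
    c_t = f * c_prev + i * g     # updated cell state
    h_t = o * np.tanh(c_t)       # hidden value sent out of the cell
    return h_t, c_t


# Hypothetical sizes: a 60-dimensional window and 8 hidden units.
rng = np.random.default_rng(0)
D, H = 60, 8
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)  # (8,) (8,)
```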
The bidirectional LSTM is made up of two unidirectional LSTMs: one reads the sequence in the forward direction and the other reads it in reverse. The outputs of the two unidirectional LSTMs are concatenated into a single vector representation.
So, at any time step t, the hidden value output by the bidirectional LSTM is a function of both the forward LSTM and the backward LSTM.
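The forward and backward passes and their concatenation can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for the full LSTM cell; that substitution, the function names and the sizes are assumptions for illustration only.

```python
import numpy as np


def run_direction(inputs, W, U, reverse=False):
    """Run a simple tanh recurrence (a stand-in for an LSTM) over the steps.

    inputs: (T, D). Returns (T, H), aligned with the original time order.
    """
    steps = inputs[::-1] if reverse else inputs
    h = np.zeros(U.shape[0])
    outputs = []
    for x_t in steps:
        h = np.tanh(W @ x_t + U @ h)
        outputs.append(h)
    outputs = np.stack(outputs)
    # Flip the backward outputs so that row t again corresponds to step t.
    return outputs[::-1] if reverse else outputs


def bidirectional(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden values at every time step."""
    h_fwd = run_direction(inputs, *fwd_params)
    h_bwd = run_direction(inputs, *bwd_params, reverse=True)
    return np.concatenate([h_fwd, h_bwd], axis=1)


# Hypothetical sizes: 5 time steps, 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
inputs = rng.normal(size=(T, D))
W_f, U_f = rng.normal(size=(H, D)), rng.normal(size=(H, H))
W_b, U_b = rng.normal(size=(H, D)), rng.normal(size=(H, H))
hidden = bidirectional(inputs, (W_f, U_f), (W_b, U_b))
print(hidden.shape)  # (5, 6)
```

Row t of the result holds the forward state, which has seen steps 0 to t, next to the backward state, which has seen steps t to the end.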
The outputs of the different memory blocks contain information that needs to be weighted appropriately. The time distributed layer takes the output of each memory cell as input and generates a single value for the corresponding subsequence. All these values are concatenated into a single vector representing the input sequence.
This vector representation enters the final output layer, a fully connected layer with one node for binary classification. It gives the probability that the input sequence belongs to the given superfamily.
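The last two layers can be sketched in NumPy as well. The sketch assumes a linear time distributed layer and a sigmoid on the single output node; the activation choices, the parameter names and the sizes are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def time_distributed_dense(hidden, w, b):
    """Apply the same weights at every time step: (T, 2H) -> (T,)."""
    return hidden @ w + b


def output_layer(sequence_vector, v, c):
    """Single-node fully connected layer returning a probability."""
    return sigmoid(sequence_vector @ v + c)


# Hypothetical sizes: 8 time steps, 6 values per bidirectional hidden state.
rng = np.random.default_rng(0)
T, two_H = 8, 6
hidden = rng.normal(size=(T, two_H))  # placeholder for the BLSTM outputs

# One value per subsequence, collected into a vector for the whole sequence.
sequence_vector = time_distributed_dense(hidden, rng.normal(size=two_H), 0.0)
probability = output_layer(sequence_vector, rng.normal(size=T), 0.0)
print(sequence_vector.shape)  # (8,)
```

Because the time distributed layer reuses one set of weights for every step, the number of parameters does not grow with the sequence length.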
For the experiments, the paper fixes the protein length at 400.
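One common way to bring variable-length sequences to a fixed length is to truncate longer ones and zero-pad shorter ones. The sketch below shows that approach; it is an assumption about how a fixed length of 400 could be obtained, not a description of the paper's preprocessing, and the name fix_length is hypothetical.

```python
import numpy as np

MAX_LENGTH = 400  # fixed protein length reported for the experiments


def fix_length(features, max_length=MAX_LENGTH):
    """Truncate or zero-pad a 20 x L matrix to 20 x max_length."""
    n_rows, length = features.shape
    fixed = np.zeros((n_rows, max_length), dtype=features.dtype)
    keep = min(length, max_length)
    fixed[:, :keep] = features[:, :keep]
    return fixed


short = fix_length(np.ones((20, 150)))
long_ = fix_length(np.ones((20, 520)))
print(short.shape, long_.shape)  # (20, 400) (20, 400)
```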