Predicting peptide immunogenicity with deep learning

Yuan Tian
4 min read · Aug 26, 2018

Epitope recognition by T cell © Juan Gaertner

My colleague and friend Dr. Sandeep Kumar Dhanda recently published a paper where he used a neural network-based method called NNAlign to predict the ability of peptides to induce human CD4 T cell responses (termed immunogenicity). Notably, he achieved an average area under the ROC curve (AUC) score of 0.7 on 57 independent test sets.

Peptides are short chains of amino acids. Those recognized by the immune system are called epitopes, and they can be derived from infectious pathogens, allergens, cancer and so on. CD4 epitopes are presented by major histocompatibility complex (MHC) class II molecules on antigen-presenting cells (APC) and recognized by CD4 T cells. This recognition can trigger subsequent events in the responding CD4 T cells.

Epitope recognition by CD4 T cells (image source: https://www.thelancet.com/journals/lanonc/article/PIIS1470-2045(03)01044-1/fulltext)

There are 20 standard amino acids, each of which can be represented by a single letter. Peptides are just sequences of amino acids and thus can be represented by a series of letters, just like short texts or strings. This inspired me to use a sequence model to predict peptide immunogenicity and compare its performance to that of Sandeep’s model.

Peptide (image source: https://www.apexbt.com/c-myc-peptide.html)

Long short-term memory (LSTM)

LSTM is a variant of the recurrent neural network (RNN) and is well-suited for this kind of problem. Please read this blog post for a more detailed explanation of RNN and LSTM. In our case, we will read each peptide sequentially, one amino acid at a time, and decide whether the peptide can induce a CD4 T cell response. LSTM can “remember” the information it learns from previous amino acids as it moves to the next one. Therefore, it may be able to learn the interactions between amino acids at different positions, which may influence peptide immunogenicity.

LSTM structure (image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Building an LSTM classifier for immunogenicity prediction

I used the datasets provided in Sandeep’s paper for this project, and the code was built upon PyTorch’s tutorial written by Sean Robertson. For training, each peptide is labeled as either positive or negative based on its immunogenicity. Therefore, this is essentially a binary classification problem. We will first need to convert each peptide to a length-by-1 tensor, where each number represents the amino acid at that position.
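
As a rough illustration of that conversion, here is a minimal sketch. The alphabet ordering, the names AMINO_ACIDS, AA_TO_INDEX and peptide_to_tensor, and the example sequence are my own assumptions for illustration, not necessarily what the project code uses.

```python
import torch

# The 20 standard amino acids in one-letter code.
# The ordering is an arbitrary choice (assumption); any fixed order works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def peptide_to_tensor(peptide: str) -> torch.Tensor:
    """Encode a peptide as a (length, 1) tensor of integer amino acid indices."""
    indices = [AA_TO_INDEX[aa] for aa in peptide.upper()]
    # unsqueeze(1) adds the second dimension, giving shape (length, 1)
    return torch.tensor(indices, dtype=torch.long).unsqueeze(1)


# Example peptide chosen only for illustration (the c-Myc tag sequence)
example = peptide_to_tensor("EQKLISEEDL")
print(example.shape)
```

A peptide containing a non-standard letter would raise a KeyError here; a real pipeline would need to decide how to handle such cases.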

Next, we’ll build an LSTM model using PyTorch’s LSTM module by simply providing the input size and hidden state size. Here we also use embeddings for amino acids before feeding them to the LSTM, in the hope of learning “hidden” information about these amino acids. For more about embeddings, please read my previous blog post.
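
A model along these lines could look like the sketch below. The class name PeptideLSTM, the layer sizes, and the choice to classify from the final hidden state are assumptions made for illustration; the actual project code may differ.

```python
import torch
import torch.nn as nn


class PeptideLSTM(nn.Module):
    """Embedding -> LSTM -> linear layer, for binary classification (sketch)."""

    def __init__(self, vocab_size=20, embedding_dim=8, hidden_size=32, num_classes=2):
        super().__init__()
        # One learned vector per amino acid
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # nn.LSTM takes the input size and the hidden state size
        self.lstm = nn.LSTM(embedding_dim, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, peptide):
        # peptide: (length, 1) tensor of amino acid indices
        embedded = self.embedding(peptide)       # (length, 1, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)     # hidden: (1, 1, hidden_size)
        return self.classifier(hidden[-1])       # (1, num_classes) logits


model = PeptideLSTM()
# A random 15-residue "peptide" just to check the shapes
logits = model(torch.randint(0, 20, (15, 1)))
print(logits.shape)
```

Because the LSTM consumes one position at a time, the same model accepts peptides of any length without padding.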

Since the peptides are of different lengths, I decided to train on one randomly chosen peptide at a time instead of in batches, which is stochastic gradient descent (SGD) in its strict sense. We can see that the loss of the model gradually decreases. Since each update is based on a single peptide, the curve doesn’t look very smooth.

Training loss over iterations
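
The one-peptide-at-a-time training described above can be sketched as follows. Everything specific here is an assumption for illustration: the toy sequences and labels are made up so that the loop runs, and the loss function, learning rate, and number of steps are not taken from the project.

```python
import random

import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(peptide):
    """Peptide string -> (length, 1) tensor of indices."""
    return torch.tensor([AA_TO_INDEX[aa] for aa in peptide]).unsqueeze(1)


class TinyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(20, 8)
        self.lstm = nn.LSTM(8, 16)
        self.out = nn.Linear(16, 2)

    def forward(self, x):
        _, (hidden, _) = self.lstm(self.embedding(x))
        return self.out(hidden[-1])


# Made-up toy data (NOT real immunogenicity labels), only to make the loop runnable
train_set = [("ACDEFGHIK", 1), ("LMNPQRSTVWYAC", 0), ("GHIKLMNPQRS", 1), ("WYACDEFG", 0)]

model = TinyLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(200):
    # Pick one random peptide per update instead of a batch
    peptide, label = random.choice(train_set)
    optimizer.zero_grad()
    logits = model(encode(peptide))               # (1, 2)
    loss = criterion(logits, torch.tensor([label]))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Plotting a running average of losses rather than the raw values is a common way to make the trend easier to see when each update uses a single example.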

We can evaluate the performance of our model on the training set by looking at the confusion matrix. The model performs decently on both positive and negative cases.
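
For readers unfamiliar with it, a binary confusion matrix simply counts how often each true label is paired with each predicted label. The sketch below uses made-up labels and predictions purely for illustration; they are not results from this project.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = negative, 1 = positive)."""
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1  # row = true label, column = predicted label
    return matrix


# Made-up example labels and predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm
print(cm)  # [[2, 1], [1, 2]]
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```

Reading the rows separately shows whether the model is doing reasonably on positives and negatives alike, rather than favoring one class.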

Finally, we tested the model on 57 independent test sets from 57 different papers and calculated the mean AUC score. Our model achieved an average AUC score of 0.62, which is not as good as Sandeep’s 0.7.
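
To make the metric concrete, here is a small sketch of computing AUC per test set and then averaging. It uses the pairwise definition of AUC (the fraction of positive/negative pairs in which the positive peptide gets the higher score, with ties counted as half). The function names and the two tiny test sets are made up for illustration; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_auc(test_sets):
    """Average the per-set AUC over a list of (labels, scores) pairs."""
    return sum(auc(labels, scores) for labels, scores in test_sets) / len(test_sets)


# Two made-up test sets (labels, predicted scores)
test_sets = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
    ([0, 1, 1], [0.1, 0.8, 0.4]),
]
print(mean_auc(test_sets))  # 0.875
```

Averaging per-set AUCs gives each test set equal weight regardless of its size, which is one reasonable reading of "mean AUC" here.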

Conclusion

Here we used a character-level LSTM to classify the immunogenicity of peptides, which are just like strings. Although we weren’t able to outperform the published result, there is ample room for improvement. For example, we could tune the various hyperparameters, try a bidirectional LSTM, use n-grams instead of single amino acids, and so on.
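
As one illustration of the n-gram idea, a peptide could be split into overlapping windows of n amino acids, each of which would then get its own index and embedding. The function name to_ngrams and the choice of overlapping windows are assumptions; this was not part of the experiment above.

```python
def to_ngrams(peptide: str, n: int = 3) -> list[str]:
    """Split a peptide into overlapping n-grams, e.g. 'ACDEF' -> ['ACD', 'CDE', 'DEF']."""
    return [peptide[i:i + n] for i in range(len(peptide) - n + 1)]


print(to_ngrams("ACDEF", 3))  # ['ACD', 'CDE', 'DEF']
```

With 20 amino acids, the vocabulary grows to 20**n possible tokens, so larger n trades longer-range context per token against many more embeddings to learn.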

Source code

The source code and datasets used in this project can be found at: https://github.com/naity/epitope_prediction
