Machine Learning for Protein Classification

Krishna Yerramsetty
Aug 9, 2016

This is a project that I did for a former colleague and friend. Things did not work out due to logistical reasons, but I spent 15 hours on it, and I thought maybe I could at least get a blog post out of it :)

The project is about protein classification using machine learning. Protein classification is an exciting new area where machine learning is starting to make a significant impact. Traditionally, proteins have been classified using distance measures between sequences based on alignment methods like BLAST, or using more sophisticated model-based methods like HMM-HMM comparison. However, for some protein classification and functional-assignment tasks, local sequence-based methods under-perform feature extraction followed by machine-learning (ML) models. I am not an expert on proteins, so I can't really do justice to the state of the art in protein classification. Instead, I will focus on the specific task of classifying thermophilic proteins and talk mostly about the ML models I used to achieve that task.

[Image: thermophiles]

Most proteins are stable at temperatures below 45°C. Thermophilic proteins are stable at higher temperatures and are therefore better suited as biocatalysts in some applications. Extracting features and using ML models can help us understand what makes some proteins stable at higher temperatures, and can accelerate protein engineering and design. Lin and Chen put together a dataset of ~900 thermophiles and ~800 mesophiles (non-thermophilic proteins), and applied support-vector machines (SVMs) to distinguish thermophiles from mesophiles. More recently, Ofer and Linial published a bioinformatics toolkit called ProFET that extracts features from amino-acid sequences and then uses random forests to build a classification model. They investigated the same thermophile/mesophile dataset as Lin and Chen and report similar cross-validation performance. However, neither of these studies investigated model performance on an external hold-out set. Here, I use ProFET to extract features from amino-acid sequences and then use extreme gradient boosting (XGB) and/or Bayesian networks (BNs) to classify thermophiles and mesophiles using these features.

The models I built are evaluated on how well they perform in cross-validation (they must be at least as good as the results reported by Ofer and Linial) and on an external hold-out set. In addition, using BNs, we can understand which features are important for classification, which should help an expert design better thermophiles. Therefore, the success criteria are two-fold:

  1. Achieve comparable performance on the cross-validation and external hold-out data sets.
  2. Be interpretable by an expert in protein design.

In general, any model development involves four distinct stages:

  1. Extract data
  2. Pre-process data
  3. Build & test the model
  4. Interpret the model & results
[Image: the four stages of model development]

I. Extract data: The data (features and targets) were extracted using ProFET, which is conveniently available on GitHub. The feature-extraction pipeline takes the amino-acid (AA) sequences of the proteins and calculates ~1,100 features: biophysical features, letter-based features (e.g., k-mers), local potential features, information-based statistics (e.g., entropy), and AA-scale-based features. Again, I'm not an expert on these things, so please refer to Ofer and Linial for details. The data set has 913 thermophiles and 791 mesophiles, and each protein has 1,170 different features. Therefore, the data matrix has 1,704 rows and 1,170 columns.
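As a quick sanity check on the extracted data, here is a minimal sketch in R; the file name profet_features.csv and the class column are hypothetical stand-ins for however the ProFET output is actually stored:

```r
# Load the ProFET feature matrix and sanity-check its shape.
# "profet_features.csv" and the "class" column are hypothetical names.
data <- read.csv("profet_features.csv", stringsAsFactors = FALSE)

X <- data[, setdiff(names(data), "class")]  # the 1,170 feature columns
y <- factor(data$class)                     # thermophile / mesophile labels

dim(X)    # expect 1704 x 1170
table(y)  # expect 913 thermophiles, 791 mesophiles
```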

II. Pre-process data: First, I removed the near-zero-variance features. This reduced the total number of features from 1,170 to 1,006. I did not, however, remove the highly correlated variables. Next, I wanted to create a representative hold-out set to assess how well my models do on unseen data. To do this, I used the Kennard-Stone algorithm, which iteratively selects new samples that maximize the Euclidean distance (on the important principal components) from the samples already chosen. Using this algorithm, I selected 426 samples (228 thermophiles and 198 mesophiles) for external (hold-out) validation, and 1,278 (685 thermophiles and 593 mesophiles) for calibration (which includes training and cross-validation). The figure below shows the first 5 principal components (PCs) plotted against each other. The green dots represent the external-set samples and the black dots represent the calibration-set samples. There is no region in the PC space made up of only black dots or only green dots, which means the selected external set is representative of the calibration set.

[Figure: kennard.stone — pairwise plots of the first 5 PCs; external-set samples in green, calibration-set samples in black]
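A minimal sketch of this pre-processing step, using caret::nearZeroVar for the variance filter and prospectr::kenStone for the Kennard-Stone split; the pc = 5 setting mirrors the 5 PCs in the figure, but the exact settings are my assumptions:

```r
library(caret)      # nearZeroVar()
library(prospectr)  # kenStone()

# Drop near-zero-variance features: 1,170 -> 1,006 columns.
nzv <- nearZeroVar(X)
X_filtered <- X[, -nzv]

# Kennard-Stone picks the 426 most spread-out samples (in PC space) for the
# external hold-out set; the remaining 1,278 samples form the calibration set.
ks <- kenStone(as.matrix(X_filtered), k = 426, pc = 5)

X_ext <- X_filtered[ks$model, ]; y_ext <- y[ks$model]  # external set
X_cal <- X_filtered[ks$test,  ]; y_cal <- y[ks$test]   # calibration set
```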

III. Building the models: I implemented three different model types:
Model XGB: Extreme gradient boosting (XGB or XGBoost) using all features.
Model XGB2: Select the 50 most important features from the XGB model, then build an XGB model on those 50 features.
Model BN+XGB: Select features using a Bayesian network (BN), then build an XGB model on the selected features.

All three model types were implemented in R using the xgboost and bnlearn libraries. The metric used to build these models is accuracy = (TP + TN) / (TP + TN + FP + FN), where TP and TN are true positives and true negatives, and FP and FN are false positives and false negatives. I chose XGB because of its success in Kaggle competitions, and also based on my personal experience using XGB for a diverse set of applications. Regularization is built into the XGB objective function, and the method is not over-sensitive to its hyper-parameters. Bayesian networks (BNs), on the other hand, are a relatively new technique with the potential to help us move beyond associations and understand causation. An advantage of BNs is that they can be used for feature selection, which can then be followed by, for example, a non-linear model like XGB to capture the non-linearities between the selected features and the target variable. In this work, I used BNs to identify the features that are the immediate neighbors of the target node (the binary node being predicted). These features are then used by XGB to build a model that predicts the binary class of the protein.

[Image: the three model types]
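Here is a hedged sketch of all three model types, continuing from the objects above. The discretization step, the hill-climbing structure search, and the hyper-parameter values are illustrative assumptions, not necessarily what produced the results below:

```r
library(xgboost)
library(bnlearn)

label_cal <- as.numeric(y_cal == "thermophile")

# Model XGB: fit on all features.
bst_all <- xgboost(data = as.matrix(X_cal), label = label_cal,
                   objective = "binary:logistic",
                   max_depth = 6, eta = 0.1, nrounds = 100, verbose = 0)

# Model XGB2: keep the 50 most important features and refit.
imp   <- xgb.importance(model = bst_all)
top50 <- head(imp$Feature, 50)
bst_top50 <- xgboost(data = as.matrix(X_cal[, top50]), label = label_cal,
                     objective = "binary:logistic",
                     max_depth = 6, eta = 0.1, nrounds = 100, verbose = 0)

# Model BN+XGB: BN structure learning needs discrete (or Gaussian) data, so
# discretize first; note that hill-climbing over ~1,000 nodes is slow.
df <- discretize(X_cal, method = "quantile", breaks = 3)
df$class <- y_cal
dag <- hc(df)
selected <- nbr(dag, "class")  # features adjacent to the target node

bst_bn <- xgboost(data = as.matrix(X_cal[, selected]), label = label_cal,
                  objective = "binary:logistic",
                  max_depth = 6, eta = 0.1, nrounds = 100, verbose = 0)
```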

The 1,278 calibration samples were used for both training and cross-validation. Cross-validation performance was used to choose the XGB hyper-parameters: the maximum depth of the trees (the deeper the trees, the higher the chance of over-fitting, while shallower trees may under-fit), the maximum number of boosting rounds (essentially, the number of trees to add), eta (the shrinkage parameter that scales the new weights at each boosting step to reduce over-fitting), and the sub-sampling rates (for both rows and columns; these also reduce over-fitting). There are other parameters too, but I have generally tuned only these, and they get the job done in most cases. One thing to remember is that eta and the number of rounds are inversely related: if you decrease eta, you need to increase the number of rounds, and vice versa. Refer to Tianqi Chen's very helpful slide deck for more information. The final hyper-parameters chosen for each of the three models are tabulated below. XGBoost is fairly insensitive to its hyper-parameters, and most of the time the default values work well.

[Table 1: final hyper-parameters chosen for each of the three model types]
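xgboost's built-in cross-validation helper makes this kind of screening straightforward; the grid below is illustrative, not the grid actually searched:

```r
# Illustrative hyper-parameter screen with 10-fold xgb.cv.
dtrain <- xgb.DMatrix(as.matrix(X_cal), label = label_cal)

for (depth in c(3, 6, 9)) {
  for (eta in c(0.05, 0.1, 0.3)) {
    cv <- xgb.cv(params = list(objective = "binary:logistic",
                               max_depth = depth, eta = eta,
                               subsample = 0.8, colsample_bytree = 0.8,
                               eval_metric = "error"),
                 data = dtrain, nrounds = 200, nfold = 10, verbose = 0)
    # "error" is 1 - accuracy; report the best round for this setting.
    best <- min(cv$evaluation_log$test_error_mean)
    cat(sprintf("depth=%d eta=%.2f -> best CV accuracy=%.3f\n",
                depth, eta, 1 - best))
  }
}
```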

For cross-validation, I chose 10-fold CV, which gives a good bias-variance trade-off compared to leave-one-out CV (lowest bias, but highest variance due to highly correlated error estimates across folds). Repeated k-fold CV is the best option, but I performed only a single round of 10-fold CV to save time. The mean accuracy ((TP + TN) / total samples) for each of the models, along with the standard deviations, is shown below. For each of the three model types (XGB, XGB2, and BN+XGB), I built 10 different models on 10 non-overlapping data splits. The external hold-out set was then evaluated using these 10 models for each model type, and the mean values are tabulated below as well. What is clearly evident is that feature selection based on XGB feature importance under-performs feature selection based on Bayesian networks.

[Table 2: mean cross-validation and external hold-out accuracies (with standard deviations) for each model type]
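A sketch of the per-fold training and external-set scoring, reusing the objects from the earlier snippets (the seed, threshold, and hyper-parameters are arbitrary choices):

```r
set.seed(42)
folds <- createFolds(y_cal, k = 10)  # 10 non-overlapping validation folds

# One model per fold, trained on the other nine folds (BN-selected features).
fold_models <- lapply(folds, function(idx)
  xgboost(data = as.matrix(X_cal[-idx, selected]),
          label = label_cal[-idx],
          objective = "binary:logistic",
          max_depth = 6, eta = 0.1, nrounds = 100, verbose = 0))

# Score the same external hold-out set with each fold's model.
ext_acc <- sapply(fold_models, function(m) {
  pred <- predict(m, as.matrix(X_ext[, selected])) > 0.5
  mean(pred == (y_ext == "thermophile"))
})
c(mean = mean(ext_acc), sd = sd(ext_acc))
```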

The confusion matrices (treating thermophiles as positives and mesophiles as negatives) for each of the three model types on the calibration data and the external data are shown below. Again, the XGB and BN+XGB model types perform well, with BN+XGB performing slightly better on the external hold-out set.

[Table 3: confusion matrices for the three model types on the calibration and external data]
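For reference, a confusion matrix like these can be produced with a plain table() call, e.g., by averaging the fold models' predicted probabilities on the external set (the averaging and the 0.5 cutoff are my assumptions):

```r
# Average the 10 fold models' probabilities on the external set, then tabulate.
probs <- rowMeans(sapply(fold_models, function(m)
  predict(m, as.matrix(X_ext[, selected]))))
pred <- ifelse(probs > 0.5, "thermophile", "mesophile")

table(Predicted = pred, Actual = y_ext)  # external-set confusion matrix
```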

IV. Model interpretation: Once we have a model, we should always try to understand the drawbacks in its predictions. This can be done in several different ways. One of my favorite methods is target shuffling, where you randomly shuffle the output value (the binary class in the present case) a few hundred times and see how well the resulting models compare with the true model. Since this is time-consuming for the present case, I won't treat it here. Also, a method to understand black-box models by local approximations has recently been proposed. I will try to address these in a future post. For now, I will talk about how sensitivity and specificity are not the end of the story when it comes to applying our models in the real world. This is because most things that we try to detect using machine learning occur rarely in nature. In this particular case, it is reasonable to believe that thermophilic proteins are much rarer than mesophilic proteins. Other examples include fraud detection, disease detection, and failure detection, all of which involve rare events. However, the data sets used to build the predictive models are balanced (either by biased sampling or by boot-strapping). This leads to a false sense of predictive power based on the balanced data. For example, consider the BN+XGB model performance on the (almost) balanced external hold-out data: the predictive power for thermophiles (TP / (TP + FP)) is 195/236 ≈ 83%. Predictive power is the percentage of proteins predicted as thermophiles by our model that are true thermophiles. In probability terms we can refer to this as P(The | Pos), if we assume thermophiles are positives and mesophiles are negatives. Using Bayes' theorem, we can re-write predictive power in terms of sensitivity and the prevalence of positives in the population:

$$P(\mathrm{The} \mid \mathrm{Pos}) = \frac{P(\mathrm{Pos} \mid \mathrm{The}) \, P(\mathrm{The})}{P(\mathrm{Pos})}$$

The value of the prevalence, P(The), is ~0.53 based on the external-set distribution. However, if we assume that in the real world this prevalence is roughly ten times lower, i.e., P(The) = 0.05, the predictive power drops to ~18%, even though the sensitivity stays the same. Therefore, it is always good to check the assumptions made during modeling and make sure they are still valid in the real world where the model is deployed.
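These numbers can be checked directly. The sensitivity and false-positive rate below come from the BN+XGB external-set confusion matrix quoted above (195 true positives out of 228 thermophiles, and 236 − 195 = 41 false positives out of 198 mesophiles); expanding P(Pos) over the two classes gives:

```r
# Predictive power (PPV) as a function of prevalence, via Bayes' theorem:
# PPV = sens * prev / (sens * prev + fpr * (1 - prev))
sens <- 195 / 228  # P(Pos | The): true positives / all thermophiles
fpr  <- 41 / 198   # P(Pos | Mes): false positives / all mesophiles

ppv <- function(prev) sens * prev / (sens * prev + fpr * (1 - prev))

ppv(228 / 426)  # balanced external set, prevalence ~0.53 -> ~0.83
ppv(0.05)       # assumed real-world prevalence -> ~0.18
```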

The next thing to do, once we have a model and understand its limitations, is to interpret it using domain knowledge. I have very little knowledge about proteins, but if I were involved in this project, I would work with experts in protein structure and function to create hypotheses about the features that separate thermophiles from mesophiles. This would help the experts design thermally stable proteins in the lab. For example, based on the features selected by the Bayesian networks, I can hypothesize that the amino-acid composition of glutamic acid and glutamine is important in distinguishing thermophiles from mesophiles.

Originally published at Krishna Yerramsetty.
