Mechanism of Action (MoA) Prediction

Saurabh Kumar · Published in Analytics Vidhya · Apr 3, 2021 · 12 min read
source : Kaggle

Introduction and Overview

Mechanism of Action (MoA) is a label attached to a molecule that describes its biological activity, i.e., how a drug produces its pharmacological effect. In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target.

How to determine MoA?

One common approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

Objective

The goal of this project is to make drug development faster and more efficient by predicting the MoA of a drug quickly. What researchers generally do is treat a sample of cells with a drug and then search for similarity in genomic databases, which costs a lot of time and money and makes the process very slow. The goal here is to speed this process up and make it more efficient.

The Data

Link to dataset : https://www.kaggle.com/c/lish-moa/data

The dataset consists of several CSV files:

  • train_features.csv - Features for the training set.
  • train_drug.csv - This file contains an anonymous drug_id for the training set only.
  • train_targets_scored.csv - The binary MoA targets that are scored.
  • train_targets_nonscored.csv - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.
  • test_features.csv - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.

The training and test sets consist of 876 features:

  • The first column is sig_id, the ID of each sample.
  • The second column is cp_type, which indicates whether a sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle).
  • The third column is cp_time, which indicates the treatment duration (24, 48, or 72 hours).
  • The fourth column is cp_dose, which indicates the dose level (high or low).
  • Columns 5 to 776 contain gene expression information, with column names g-0, g-1, g-2, …, g-771.
  • Columns 777 to 876 contain cell viability information, with column names c-0, c-1, c-2, …, c-99.

The target set consists of 207 columns. The first is sig_id and the other 206 are the names of MoAs, e.g., nfkb_inhibitor, proteasome_inhibitor, cyclooxygenase_inhibitor, dopamine_receptor_antagonist, serotonin_receptor_antagonist, etc. There are 23,814 data points to train the model and 3,982 data points to test it.

Challenge to solve

This is a multi-label problem, since there are 206 target labels to predict. The main challenge is to find a way to train a model (or models) over this many targets using machine learning algorithms that train quickly while giving the best accuracy.

I used only classical machine learning algorithms to train the models; I did not touch deep learning for training purposes.

Evaluation Metrics

I used log-loss as the evaluation metric because each label in the target set is highly imbalanced and log-loss takes the predicted probabilities into account when computing the loss, which makes it a reasonable choice for an imbalanced dataset. Other metrics, such as the F1 score, could also be used for evaluation.

log-loss formula
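For reference, the mean column-wise log-loss used by the competition can be written as:

```latex
\text{score} = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}
\left[\, y_{i,m}\log(\hat{y}_{i,m}) + (1 - y_{i,m})\log(1 - \hat{y}_{i,m}) \,\right]
```

where N is the number of samples, M is the number of scored MoA targets, y_{i,m} is the true binary label and ŷ_{i,m} is the predicted probability.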

My first cut approach

Start by reading the dataset. Then:

  • Perform Exploratory Data Analysis (EDA), separating the gene and cell columns and visualizing them.
  • In feature engineering, encode the categorical features. I first planned to apply scaling (StandardScaler or QuantileTransformer) separately on the train and test datasets, but during research I found that scaling them separately decreases accuracy drastically, so instead I apply scaling to the train and test data together.
  • Do feature selection using PCA and VarianceThreshold, then a little data augmentation by adding some new features, and combine all features (cp, gene and cell columns along with the new ones).
  • Split the training data into train and validation sets (using a simple train-test split or stratified sampling) inside a for loop, training and predicting in the same loop so that each iteration uses different train and validation data.
  • After finding the best algorithm, train that model on the whole dataset. Since I mainly use machine learning algorithms rather than deep learning, I try different techniques for predicting these multi-labels, from creating a separate model for each label to using the scikit-multilearn library.

Exploratory data analysis (EDA)

The very first step in this case study is to perform EDA, so that we can analyze and visualize the contents of the dataset and extract valuable information about the data.

Reading datasets

We use the pandas library to read the datasets (a minimal loading sketch is shown below).

Reading the dataset
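A minimal sketch of this step; the file names are those listed on the competition page, while the paths are assumptions (adjust to wherever the Kaggle files are stored):

```python
import pandas as pd

# Load the competition files (assuming they sit in the working directory)
train_features = pd.read_csv("train_features.csv")
train_targets = pd.read_csv("train_targets_scored.csv")
train_drug = pd.read_csv("train_drug.csv")
test_features = pd.read_csv("test_features.csv")

print(train_features.shape, train_targets.shape, test_features.shape)
```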
  • There are no null values present in the training dataset (a sketch of the check follows the caption below).
Code to check null values
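A sketch of the null-value check, assuming the dataframes loaded above:

```python
# Count missing values across all columns and report the totals
print("Nulls in train:", train_features.isnull().sum().sum())
print("Nulls in test:", test_features.isnull().sum().sum())
```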
  • We also have to check for duplicate values in the dataset. There could be fully duplicated rows, there could be a duplicate sig_id with differing feature values, or there could be rows with different sig_ids but identical values for all the other features. We check all three possibilities (a sketch follows the captions below).

There are no duplicate rows and no duplicate IDs present in the dataset.

Code to check for duplicate values
Output of code to check duplicate values
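A sketch of the three duplicate checks described above, assuming the train_features dataframe:

```python
# 1. Fully duplicated rows
print("Duplicate rows:", train_features.duplicated().sum())

# 2. Same sig_id appearing more than once (possibly with different feature values)
print("Duplicate sig_ids:", train_features["sig_id"].duplicated().sum())

# 3. Different sig_id but identical values in all other feature columns
feature_cols = [c for c in train_features.columns if c != "sig_id"]
print("Duplicate feature rows:", train_features.duplicated(subset=feature_cols).sum())
```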
  • While checking for outliers we found high variance in all the gene and cell columns, but we cannot simply declare these values outliers. Box-plotting 50 randomly chosen gene columns showed that their values lie roughly between -10 and 10, and box-plotting 50 randomly chosen cell columns showed values roughly between -10 and 6.
Distribution of 50 random gene columns
Distribution of 50 random cell columns
  • In the dataset, cp_type, cp_time and cp_dose are the only categorical features; all other features are numerical. Visualizing the count of unique values in each of the categorical features cp_type, cp_time and cp_dose (see the sketch below the caption), we found that in cp_type the trt_cp data points heavily dominate the ctl_vehicle data points, which makes cp_type highly imbalanced. The remaining two categorical features, cp_time and cp_dose, are balanced.
Data point distribution in categorical columns
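A sketch of the value counts behind this plot:

```python
# Count the unique values of each categorical feature
for col in ["cp_type", "cp_time", "cp_dose"]:
    print(train_features[col].value_counts(), "\n")
```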
  • Next we check the correlation of the gene columns and the cell columns among themselves. Among the gene columns there are no very large correlations, but the heatmap shows some faint tiny dots. Zooming in on these dots reveals some correlation, but they are so few that we can ignore them. There is no such correlation among the cell columns.
correlation among gene columns
Zooming over tiny light dots
Correlation among cell columns
  • There is also a CSV file (train_drug.csv) containing the drug ID and sample ID, which tells us which sample was treated with which drug. Here we found that the drug ID ‘cacb2b860’ is used the most, followed by ‘87d714366’. As shown in the graph below, the top 9 drugs account for most of the treatments and the rest of the drugs are used far less often (a counting sketch follows the caption).
Uses of drugs
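A sketch of the drug-usage count, assuming the train_drug dataframe loaded earlier (column name as in the competition file):

```python
# Most frequently used drug_ids
print(train_drug["drug_id"].value_counts().head(10))
```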
  • To visualize the number of MoAs in each sample, I first counted the number of active MoA labels in each row. Plotting these counts shows that most samples have 0, 1 or 2 MoAs; samples with 3, 4, 5, 6 or 7 MoAs are rare, with only a handful reaching 7 (a counting sketch is shown below).
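A sketch of the per-sample MoA count, assuming the train_targets dataframe (sig_id plus 206 binary label columns):

```python
# Number of active MoA labels per sample, and how many samples have each count
moa_per_sample = train_targets.drop(columns="sig_id").sum(axis=1)
print(moa_per_sample.value_counts().sort_index())
```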

Feature Engineering

This is one of the most important steps in data preprocessing. In this step we modify the data according to what we found in our Exploratory Data Analysis (EDA) and make it ready for modeling.

  • Starting with encoding the categorical variables: cp_type is nominal, while cp_time and cp_dose are ordinal categorical features. cp_type and cp_dose each have 2 unique values, while cp_time has 3. I encode these features in both the train and test datasets (a sketch follows the caption below).
Encoding categorical values
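A minimal sketch of one possible encoding, using a simple value mapping; the exact encoding in the original code may differ:

```python
# Map the categorical columns to numeric codes on both train and test
encoding = {
    "cp_type": {"trt_cp": 0, "ctl_vehicle": 1},
    "cp_dose": {"D1": 0, "D2": 1},      # low / high dose
    "cp_time": {24: 0, 48: 1, 72: 2},   # treatment duration in hours
}
for df in (train_features, test_features):
    for col, mapping in encoding.items():
        df[col] = df[col].map(mapping)
```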
  • From our EDA we found high variance in the gene and cell features, so we normalize them using sklearn.preprocessing.QuantileTransformer (a sketch follows the caption below).
Code to normalize numerical features
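A sketch of the normalization step with QuantileTransformer, fitting on the train and test rows together as described in the first-cut approach; the parameter values are assumptions:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

gene_cols = [c for c in train_features.columns if c.startswith("g-")]
cell_cols = [c for c in train_features.columns if c.startswith("c-")]
num_cols = gene_cols + cell_cols

# Fit the transformer on train and test rows together, then transform both
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=42)
qt.fit(np.vstack([train_features[num_cols].values, test_features[num_cols].values]))
train_features[num_cols] = qt.transform(train_features[num_cols].values)
test_features[num_cols] = qt.transform(test_features[num_cols].values)
```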

Below are the box plots of 50 random gene and cell columns after normalization. We can see that the high variance in the gene and cell columns is gone and their values are now normalized.

Distribution of random 50 gene columns after normalization
Distribution of random 50 cell columns after normalization
  • Next I generated some new features from the original features with the help of an autoencoder. To do this, I first trained an encoder-decoder model using only the training independent variables, i.e., I passed the training independent variables (875 columns) as both the input and the target.

An encoder-decoder model takes the input features and, as shown in the figure below, reduces their number in the encoder section, then tries to recreate the same features as output in the decoder section. The pink section shown in the figure is called the bottleneck.

source : https://towardsdatascience.com/generating-images-with-autoencoders-77fd3a8dd368

To generate the new features I used only the encoder section, up to the bottleneck. With it I generated 50 new features (I tried various numbers of new features, such as 10, 50, 100 and 200, and 50 worked the best). I then merged the 50 newly generated columns of the train and test datasets with their original columns, which increased the number of columns to 926 (a sketch of the model follows the caption below).

Code for encoder-decoder model
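A minimal sketch of such an encoder-decoder (autoencoder) in Keras. The 50-unit bottleneck follows the description above; the hidden-layer sizes, epochs and the name X_train (a NumPy array of the preprocessed training independent variables) are assumptions:

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_features = X_train.shape[1]  # the 875 independent variables

# Encoder: compress the inputs down to a 50-dimensional bottleneck
inputs = Input(shape=(n_features,))
x = Dense(512, activation="relu")(inputs)
bottleneck = Dense(50, activation="relu", name="bottleneck")(x)

# Decoder: try to reconstruct the original features from the bottleneck
x = Dense(512, activation="relu")(bottleneck)
outputs = Dense(n_features, activation="linear")(x)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# The independent variables serve as both the inputs and the targets
autoencoder.fit(X_train, X_train, epochs=30, batch_size=128, validation_split=0.1)
```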

Encoder-decoder model summary.

Model summary on encoder-decoder model

After obtaining the pretrained encoder model, I saved it for later use (when creating the web app) so that I do not have to retrain the encoder-decoder model every time (a sketch follows the captions below).

Saving only encoder model
Loading encoder model and generating features
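A sketch of extracting and reusing the encoder part, assuming the autoencoder from the previous sketch and X_train/X_test as the preprocessed feature arrays; the file name is an assumption:

```python
from tensorflow.keras.models import Model, load_model

# Keep only the layers up to the bottleneck and save them for later reuse
encoder = Model(autoencoder.input, autoencoder.get_layer("bottleneck").output)
encoder.save("encoder_model.h5")

# Later (e.g. in the web app): load the encoder and generate the 50 new features
encoder = load_model("encoder_model.h5")
train_encoded = encoder.predict(X_train)  # shape: (n_samples, 50)
test_encoded = encoder.predict(X_test)
```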

Model Creation

Since we are done with our exploratory data analysis, data cleaning, feature engineering and all the other preprocessing steps, we are now ready to apply machine learning algorithms to the preprocessed dataset.

Since this is a multi-label problem, I tried various methods to solve it. Here I will discuss only the method that worked best and fastest with the highest accuracy; I will discuss the other methods later in this blog.

The best method to solve this multi-label problem is to use machine learning algorithms with OneVsRestClassifier().

So, what is OneVsRestClassifier?

OneVsRestClassifier is used to fit a machine learning algorithm to multi-class as well as multi-label problems. It is also known as the one-vs-all strategy. To use it for multi-label classification, the target labels must be supplied as a 2D binary matrix.

To learn more about OneVsRestClassifier visit : https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

Machine Learning algorithms that I used with OneVsRestClassifier are:

  • Logistic Regression
  • Naive Bayes Classifier
  • Random Forest Classifier

All other ensemble models, as well as KNN and SVM with an RBF kernel, would take a very long time to train.

Logistic Regression

Of all the algorithms, logistic regression performed best with OneVsRestClassifier. It trained quickly and gave a very low log-loss. I also performed hyperparameter tuning to find the best hyperparameter values (a training sketch follows the caption below).

Logistic Regression code to train model
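A minimal sketch of this setup, assuming X (the preprocessed feature matrix) and Y (the 206-column binary label matrix); the hyperparameters shown are illustrative, not the tuned values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.2, random_state=42)

# One binary logistic regression is fit per MoA label
clf = OneVsRestClassifier(LogisticRegression(C=0.1, max_iter=1000), n_jobs=-1)
clf.fit(X_tr, Y_tr)

# Predicted probabilities for every label, shape (n_samples, 206)
val_probs = clf.predict_proba(X_val)
```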
Logistic Regression cross validation output

Using logistic regression we got a log-loss of approximately 0.016, which is very low, meaning the model performs well.

Since this is a multi-label problem, to compute the log-loss we loop over the labels, compute the log-loss of each label, and take the mean of all the losses as the final loss (see the sketch below).
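A sketch of this per-label computation, reusing Y_val and val_probs from the previous sketch:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.asarray(Y_val)

# Compute the log-loss of each of the 206 labels and average them
losses = [
    log_loss(y_true[:, j], val_probs[:, j], labels=[0, 1])
    for j in range(y_true.shape[1])
]
print("mean log-loss:", np.mean(losses))
```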

Naive Bayes Classifier

The Naive Bayes classifier did not perform well in this case.

Naive Bayes Classifier code to train model
Naive Bayes Classifier cross validation output

Using the naive Bayes classifier we got a log-loss of approximately 7, which is not good; the naive Bayes classifier did not perform well.

Random Forest Classifier

The random forest classifier also performed well, though not as well as logistic regression, and it took more time to train.

Random Forest Classifier code to train model
Random Forest Classifier cross validation output

Using the random forest classifier we got a log-loss of approximately 0.020, which is also quite low, meaning this model performs well too.

Comparison of all models

Of all these models, Logistic Regression with OneVsRestClassifier performed the best.

Below is the score I got on Kaggle after submitting my final model.

Kaggle score

Things that we tried but didn’t worked

  • scikit-multilearn library — I tried this library on this multi-label problem, but because the problem has a large number of labels it took a very long time to train the model and ultimately failed. The scikit-multilearn library works well for small datasets with a small number of labels.
  • Converting multi-label to multi-class — I tried to convert this multi-label problem into a multi-class problem, but it created a huge number of classes, almost every data point became unique, and it became very hard to handle.
  • XGBoost and SVM with an RBF kernel with OneVsRestClassifier — I tried these algorithms to train the model, but they ran for a long time (over 7 hours) without finishing, so I cancelled the training.
  • Oversampling — Since the target labels are highly imbalanced, I tried oversampling before training the model, but it did not improve performance, so I dropped the idea.
  • PCA — For feature selection I used PCA and merged the selected components with the original dataset, but it did not improve model performance. Instead I used the autoencoder, which did improve performance.

Things that we tried that also worked

One more method that I tried, which worked very well and performed as well as logistic regression with OneVsRestClassifier, was to train a separate logistic regression model for each label: 206 models in total, stored for use during prediction. At prediction time I call all the models in a loop to predict the value of each label (a sketch is shown below). The reason I did not pick this method is that it takes more time than the OneVsRestClassifier approach.
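A sketch of this per-label approach, reusing the X_tr/Y_tr/X_val split from the earlier sketch and assuming every label has both classes present in the training split; the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train one logistic regression model per MoA label and store them for prediction
y_train = np.asarray(Y_tr)
models = []
for j in range(y_train.shape[1]):
    model = LogisticRegression(C=0.1, max_iter=1000)
    model.fit(X_tr, y_train[:, j])
    models.append(model)

# Predict each label's probability by looping over the stored models
preds = np.column_stack([m.predict_proba(X_val)[:, 1] for m in models])
```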

Future work

  • I would like to try a 2-step method: first predict whether any MoA is present in the sample, and if so, move to a second step that predicts which MoA it is.
  • I would like to try deep learning techniques to solve this problem.

Conclusion

  • Logistic regression with OneVsRestClassifier performed best, with a log-loss of approximately 0.016, and it is also very fast to train.
  • During EDA, data points that fall beyond the whiskers of a box plot are not necessarily outliers.
  • When a dataset has a large number of features and a large number of target labels, machine learning models with lower time complexity work best; algorithms with high time complexity did not finish training.

Thank you for reading:)

For the full code you can visit my GitHub repository containing the Jupyter notebooks, and for the whole project you can visit here.

References

You can also connect with me on linkedin.
