Table of Contents
- Source of data
- Business Problem
- ML Formulations
- Objective and Metric
- First cut approach
- Feature Engineering
- Comparison of all models
- Conclusions and Future Work
Introduction: What is MoA?
In pharmacology, the term Mechanism of Action (MoA) refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect. A mechanism of action usually includes mention of the specific molecular targets to which the drug binds, such as an enzyme or receptor. Receptor sites have specific affinities for drugs based on the chemical structure of the drug, as well as the specific action that occurs there.
Source of data
The data is taken from the Kaggle competition at the following link: https://www.kaggle.com/c/lish-moa
A brief overview of the dataset:
In this competition, the task is to predict multiple Mechanism of Action (MoA) targets for different samples. Samples are drugs profiled at different time points and doses. The dataset consists of several groups of features, and there are more than two hundred targets covering enzymes and receptors.
- sig_id is the unique sample id
- Features with g- prefix are gene expression features and there are 772 of them (from g-0 to g-771)
- Features with c- prefix are cell viability features and there are 100 of them (from c-0 to c-99)
- cp_type is a binary categorical feature which indicates whether the samples are treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle)
- cp_time is a categorical feature which indicates the treatment duration (24, 48 or 72 hours)
- cp_dose is a binary categorical feature which indicates whether the dose is low (D1) or high (D2)
Business Problem
In the past, scientists derived drugs from natural sources. Paracetamol (known as acetaminophen in the USA) was used as a medicine for decades before its biochemical activity was understood. With modern technology, drug discovery has shifted from these earlier methods to a more targeted, application-based approach built on understanding the biological mechanism of a disease. For this reason, scientists use the mechanism of action to describe how a drug acts on its protein target.
Today, with big data available, ML and AI techniques are used to predict an outcome with a model fitted to data. ML now plays a vital role in biological research. The ML formulation can be divided into the following steps:
· Data pre-processing
· Building the model and training it on the training data
· Using the trained model to predict on the test data
Using ML and AI to predict the biological response to a drug is an area of growing interest. Both supervised and semi-supervised learning can be used for drug-target interaction prediction.
Machine learning, and more recently deep learning, opens many opportunities in drug discovery. With larger datasets becoming available, ML and AI can be expected to drive rapid growth in biological research in the coming years. Computers have also become more powerful, with more RAM, so they can process large amounts of data more efficiently. ML models will continue to improve, and new, intriguing applications are expected to follow.
A predictive model of a drug's biochemical activity can be obtained by applying ML approaches to pre-clinical datasets. It can then be cross-validated using early-stage clinical patient samples. Once its accuracy has been assessed, it can be used to support the clinical development of the drug and to infer its mechanism of action.
Machine Learning Approach
We need to identify the Mechanism of Action (MoA) of a new drug from the available cell viability and gene expression data and the known target MoAs. In drug discovery, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action, or MoA for short. Here the target variable is the MoA, and the features used to predict it are cell viability and gene expression. We are given human cell responses to drugs within a pool of 100 cell types and 772 gene expression measurements, together with MoA annotations for more than 5,000 drugs. Each drug can have more than one MoA, so the task is a multi-label classification problem.
Objective and Metric
- There are two groups of target features: scored target features and non-scored target features. Both groups consist of binary MoA targets, but only the first group is used for scoring.
- This is a multi-label binary classification problem, and the evaluation metric is the mean column-wise log loss. For every row, the probability that the sample had a positive response has to be predicted for each target. For N rows and M targets, there will be N×M predictions. The score is computed as

$$\text{score} = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}\left[\,y_{i,m}\log(\hat{y}_{i,m}) + (1-y_{i,m})\log(1-\hat{y}_{i,m})\,\right]$$

where:
- $N$ is the number of rows ($i=1,\dots,N$)
- $M$ is the number of targets ($m=1,\dots,M$)
- $\hat{y}_{i,m}$ is the predicted probability of the ith row and mth target
- $y_{i,m}$ is the ground truth of the ith row and mth target (1 for a positive response, 0 otherwise)
- $\log()$ is the natural logarithm
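To make the metric concrete, here is a minimal sketch of the mean column-wise log loss in plain Python. The function name, the toy numbers and the clipping constant `eps` are assumptions made for illustration; they are not taken from the competition code.

```python
import math

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average the binary log loss over rows within each target, then over targets.

    y_true and y_pred are lists of N rows, each holding M values.
    eps is an assumed clipping constant that keeps log() finite.
    """
    n_rows = len(y_true)
    n_targets = len(y_true[0])
    total = 0.0
    for m in range(n_targets):
        column_sum = 0.0
        for i in range(n_rows):
            p = min(max(y_pred[i][m], eps), 1.0 - eps)  # clip into (0, 1)
            y = y_true[i][m]
            column_sum += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        total += -column_sum / n_rows  # log loss of target m
    return total / n_targets

# Toy example: N = 2 rows, M = 2 targets (made-up values).
y_true = [[1, 0], [0, 0]]
y_pred = [[0.9, 0.1], [0.2, 0.05]]
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Because most targets are zero for most samples, predicting small probabilities everywhere already gives a low score, so improvements between models show up in the third or fourth decimal place.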
First Cut Approach
For any ML model, the first step is exploratory data analysis (EDA): inspecting the variables of the dataset and plotting their distributions. This is needed before committing to a specific model. After trying techniques ranging from PCA and t-SNE to supervised methods such as SVM and Random Forest, and finally deep learning, we conclude that a deep learning technique such as a CNN (convolutional neural network) would be the best choice for our baseline model.
EDA (Exploratory Data Analysis)
Preliminary observations on the training dataset: it contains 876 columns. One is the sample identifier sig_id, and three are categorical features (cp_type, cp_time and cp_dose). The rest are 772 gene expression features (from g-0 to g-771) and 100 cell viability features (from c-0 to c-99).
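The column groups can be separated by their name prefixes. The sketch below rebuilds the column names from the description above instead of reading the real file, so the column order is an assumption; with pandas, `columns` would come from the header of the training features table.

```python
# Column layout as described in the text (assumed order; only the names matter here).
columns = (
    ["sig_id", "cp_type", "cp_time", "cp_dose"]
    + [f"g-{i}" for i in range(772)]
    + [f"c-{i}" for i in range(100)]
)

# Split the columns into groups by prefix.
gene_cols = [c for c in columns if c.startswith("g-")]
cell_cols = [c for c in columns if c.startswith("c-")]
categorical_cols = [c for c in columns if c.startswith("cp_")]

print(len(columns), len(gene_cols), len(cell_cols), len(categorical_cols))
```

Keeping these lists around makes it easy to scale or reduce the gene and cell groups separately later on.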
Gene Expression Features
Gene expression describes which genes are active in a cell, and how strongly, at a given point in time, and therefore which proteins and other gene products the cell produces.
There are 772 gene expression features and they have the g- prefix (g-0 to g-771). Each gene expression feature represents the expression of one particular gene, so 772 individual genes are monitored in this assay.
Cell Viability Features
Cell viability is a measure of the proportion of live, healthy cells within a population. Cell viability assays are used to determine the overall health of cells and to optimize culture or experimental conditions.
There are 100 cell viability features and they have the c- prefix (c-0 to c-99). Each cell viability feature represents the viability of one particular cell line, and all experiments are based on a set of similar cells.
- There are three categorical features: cp_type, cp_time and cp_dose. Two of them are binary and one has three unique values, so the cardinality of these features is very low. All of the categorical features have almost identical distributions in the training set.
- cp_time is a categorical feature with three unique values: 24, 48 and 72 hours. It indicates the treatment duration of the samples. Sample counts for the different cp_time values are very consistent across targets: they are either extremely close to each other, or the count for 48 hours is slightly higher than the others.
- cp_dose is a binary categorical feature. It indicates whether the dose of the sample is low (D1) or high (D2).
- cp_type is a binary categorical feature. It indicates whether a sample is treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle). Samples treated with control perturbations have no MoAs, so all of their scored and non-scored target labels are zero.
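One way to turn these three features into model inputs is sketched below. The rows, the `encode` helper and the output column names are hypothetical; handling control samples separately is one possible design choice and not necessarily what was done in this project.

```python
# Hypothetical rows mimicking the three categorical columns described above.
rows = [
    {"sig_id": "id_a", "cp_type": "trt_cp", "cp_time": 24, "cp_dose": "D1"},
    {"sig_id": "id_b", "cp_type": "ctl_vehicle", "cp_time": 48, "cp_dose": "D2"},
    {"sig_id": "id_c", "cp_type": "trt_cp", "cp_time": 72, "cp_dose": "D2"},
]

def encode(row):
    """Map the two binary features to 0/1 and one-hot encode cp_time."""
    encoded = {
        "is_control": int(row["cp_type"] == "ctl_vehicle"),
        "high_dose": int(row["cp_dose"] == "D2"),
    }
    for hours in (24, 48, 72):
        encoded[f"cp_time_{hours}"] = int(row["cp_time"] == hours)
    return encoded

encoded_rows = [encode(r) for r in rows]

# Control samples have all-zero targets, so one option is to leave them out
# of training and predict zeros for them directly.
treated = [r for r in rows if r["cp_type"] == "trt_cp"]
print(encoded_rows[1], len(treated))
```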
Understanding Target Variables
Target features are categorized into two groups, scored and non-scored target features, and the features in both groups are binary. The competition score is based on the scored target features, but the non-scored group can still be used for model evaluation, data analysis and feature engineering.
This is a multi-label classification problem, so one sample can be assigned to several targets or to none at all. Most samples have 0 or 1 active target, but a small part of the training set has 2, 3, 4, 5 or 7 active targets at the same time. The distributions of the number of active targets per sample differ between scored and non-scored targets, mainly in the share of samples with 0 or 1 active target.
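The number of active targets per sample is simply the row sum of the binary target matrix. A small sketch with a made-up matrix (the values are illustrative and not taken from the dataset):

```python
from collections import Counter

# Hypothetical binary target matrix: one row per sample, one column per target.
targets = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]

# Row sums give the number of active targets for each sample.
labels_per_sample = [sum(row) for row in targets]
distribution = Counter(labels_per_sample)

for n_labels in sorted(distribution):
    print(f"{n_labels} active targets: {distribution[n_labels]} samples")
```

Running the same count on the scored and the non-scored target tables gives the two distributions compared above.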
Scored Target Features
The most frequent scored targets are nfkb inhibitor, proteasome inhibitor, cyclooxygenase inhibitor, dopamine receptor antagonist, serotonin receptor antagonist and dna_inhibitor, each with more than 400 positive samples. The rarest scored targets are atp-sensitive potassium channel antagonist and erbb2 inhibitor, each with only one positive sample. A similar distribution is expected in the test set.
Many scored targets have exactly the same number of positive samples, which suggests there might be a relationship between them.
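A quick way to find such candidates is to count the positives per target and group targets that share a count. The target names and label matrix below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical target names and label matrix (rows are samples, columns are targets).
target_names = ["target_a", "target_b", "target_c", "target_d"]
targets = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

# Number of positive samples for each target (column sums).
positives = {
    name: sum(row[j] for row in targets) for j, name in enumerate(target_names)
}

# Group targets that share the same positive count.
by_count = defaultdict(list)
for name, count in positives.items():
    by_count[count].append(name)
ties = {count: names for count, names in by_count.items() if len(names) > 1}

print(positives)
print(ties)
```

An equal count is only a hint. To confirm a relationship, check whether the same samples are positive for both targets, as is the case for target_a and target_c in this toy matrix but not for target_b and target_d.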
Feature Engineering
- Two common dimensionality reduction techniques are PCA and auto-encoders. Both are sensitive to scale, so it is important to standardize the data and make it unitless. For this purpose, cell viability and gene expression features are standardized with a standard scaler. To evaluate the information loss of the different techniques, the latent space dimension is set to half the dimension of the cell viability and gene expression feature groups.
- PCA is a linear transformation that projects the data into another space, where the projection directions are the directions of highest variance in the data. PCA results can be evaluated with the reconstruction error and the cumulative percentage of explained variance.
- Auto-encoders are neural networks that compress data into a low-dimensional latent space. The latent space captures the most important features, because the network has to reconstruct the input from it. Auto-encoders are slower and more computationally expensive than PCA, and they are also prone to overfitting.
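The PCA evaluation described above can be sketched with NumPy alone. The data here is synthetic and stands in for one feature group; the variable names and sizes are assumptions, and in practice a library implementation such as scikit-learn's PCA would usually be used instead of a hand-written SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature group: 200 samples, 10 correlated features.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Standardize: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)
cumulative_variance = np.cumsum(explained_ratio)

k = X_std.shape[1] // 2   # latent size = half the input dimension
Z = X_std @ Vt[:k].T      # project onto the first k components
X_rec = Z @ Vt[:k]        # map back to the original feature space
reconstruction_mse = np.mean((X_std - X_rec) ** 2)

print(f"variance kept by {k} components: {cumulative_variance[k - 1]:.3f}")
print(f"reconstruction MSE: {reconstruction_mse:.4f}")
```

On standardized data the two measures are tied together: the reconstruction MSE equals one minus the cumulative explained variance of the kept components. An auto-encoder can be compared on the same footing by computing its reconstruction MSE at the same latent size.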
Several approaches were applied to the dataset: Random Forest Classifier, Gradient Boosting Classifier and LSTM, with PCA and an auto-encoder used for dimensionality reduction.
Comparison of all models
Conclusion & Future Work
Among all the models, the auto-encoder performs best. On the given dataset, the neural network model that uses the auto-encoder gives a log loss of 0.0128 and a val_loss of 0.0192.
The results from the different models can serve as a starting point for future work, such as further hyper-parameter tuning and performance optimization to get more accurate results.
Shiladitya Majumder - Applied AI Course - Kolkata, West Bengal, India | LinkedIn