Mechanism of Action (MoA): The Kaggle Competition

Nitin Vashisth · Published in Analytics Vidhya · 5 min read · Dec 5, 2020

This was the second Kaggle competition I took part in, and frankly speaking, I learned a lot from it. The discussions among the Grandmasters and Experts were mind-blowing, covering every corner of the competition: feature engineering, data augmentation, different modelling approaches, transfer learning and, finally, stacking/blending. So I am going to talk about the different things I learned and applied in this competition, which helped us (Christian & Hasan) grab a bronze medal (top 8%) — my first one :D

The competition started in September, and around 4,373 teams participated. I will now jump directly into how we worked through it.

Exploratory Data Analysis

You can click here and read through the description of the data, which saves me from explaining what mechanism of action actually means. Alright, let us jump directly into the details. We looked at the number of data points available in each category of “cp_time”, “cp_type” and “cp_dose”. From the graph below we can easily observe that “ctrl_vehicle” is the least frequent category. The dataset description also makes it clear that we can drop all records related to “ctrl_vehicle”. Now you might be wondering why I am suddenly dropping those records. If you missed this sentence while reading the description, no problem, I have you covered: “with a control perturbation (ctrl_vehicle); control perturbations have no MoAs”.

train data (gene & cell)
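
For reference, here is a minimal sketch of how those category counts can be inspected; the CSV path is an assumption, while the column names come from the competition data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Path is an assumption; adjust to wherever the competition data lives.
train = pd.read_csv("train_features.csv")

# One bar chart of record counts per category for each cp_* column.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["cp_type", "cp_time", "cp_dose"]):
    train[col].value_counts().plot(kind="bar", ax=ax, title=col)
plt.tight_layout()
plt.show()
```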

As this is a multi-label classification problem, there is one piece of information that always needs to be checked: how skewed is our dataset? We plotted the label distribution, and you can see the classes are really imbalanced.

Top 40 classes by frequency (plotting all of them is impractical, as there are more than 200)
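
A minimal sketch of how the label frequencies can be counted and plotted, assuming the layout of the competition's train_targets_scored.csv (one 0/1 column per target plus a sig_id column):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Path is an assumption; the file name comes from the competition.
targets = pd.read_csv("train_targets_scored.csv")

# Count the positive samples per target, then plot the 40 most frequent.
counts = targets.drop(columns="sig_id").sum().sort_values(ascending=False)
counts.head(40).plot(kind="bar", figsize=(12, 4), title="Top 40 MoA targets")
plt.tight_layout()
plt.show()
```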

Another piece of information we already have: the gene and cell features are totally independent of each other. Hence we can separate the two sets of features and plot the distribution of each. We can easily observe that the data is skewed and has outliers.

Distribution of train data (gene and cell)
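
In the competition data the gene-expression columns are prefixed with “g-” and the cell-viability columns with “c-”, so separating them is a simple prefix filter. A sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train_features.csv")  # path is an assumption

# Gene-expression features are prefixed "g-", cell-viability features "c-".
gene_cols = [c for c in train.columns if c.startswith("g-")]
cell_cols = [c for c in train.columns if c.startswith("c-")]

# Plot a handful of each group to eyeball the skew and outliers.
fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for ax, col in zip(axes[0], gene_cols[:4]):
    train[col].plot(kind="kde", ax=ax, title=col)
for ax, col in zip(axes[1], cell_cols[:4]):
    train[col].plot(kind="kde", ax=ax, title=col)
plt.tight_layout()
plt.show()
```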

Feature Engineering

So now we can see there are a lot of problems with our data, which means we need to knock it into shape before feeding it to our fancy models. Let us note down each problem and, alongside it, the technique we used to deal with it.

  1. “ctrl_vehicle” does not have any mechanism of action.

Action — Remove all “ctrl_vehicle” related records, since every MoA target for them is 0 (a one-line pandas filter, sketched below).
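
A minimal sketch of that filter; the file paths are assumptions, the cp_type column name comes from the dataset.

```python
import pandas as pd

train = pd.read_csv("train_features.csv")          # path is an assumption
targets = pd.read_csv("train_targets_scored.csv")  # path is an assumption

# Keep only real treatments; control perturbations have no MoAs.
mask = train["cp_type"] != "ctrl_vehicle"
train = train[mask].reset_index(drop=True)
targets = targets[mask].reset_index(drop=True)
```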

2. Classes are highly imbalanced.

Action — We needed a way to keep our validation folds representative and prevent over-fitting to the imbalance. We opted for 7-fold multilabel stratified splitting, repeated with different seeds to deal with randomization (see the sketch below).
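
A hedged sketch using the iterative-stratification package, which provides MultilabelStratifiedKFold. The seed list is an assumption, and train/targets are the frames from the sketch above.

```python
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

SEEDS = [0, 1, 2]  # assumed values; runs over several seeds are averaged

for seed in SEEDS:
    mskf = MultilabelStratifiedKFold(n_splits=7, shuffle=True, random_state=seed)
    y = targets.drop(columns="sig_id")
    for fold, (train_idx, valid_idx) in enumerate(mskf.split(train, y)):
        X_tr, X_va = train.iloc[train_idx], train.iloc[valid_idx]
        y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]
        # ... train one model per (seed, fold) and average the predictions
```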

3. The data is not standardized, i.e. not normally distributed.

Action — We transformed the data using scikit-learn's QuantileTransformer, which maps each feature to an approximately normal distribution (a sketch follows the figure below; click on the title to read more about it).

Quantile Transformed data (now the peaks are not so pointy)
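
A minimal sketch of the transform. The n_quantiles value is an assumption; fit on train only and reuse the fitted transformer on test to avoid leakage.

```python
from sklearn.preprocessing import QuantileTransformer

feature_cols = gene_cols + cell_cols  # from the earlier sketch

# Map every gene/cell feature to an approximately normal distribution.
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
train[feature_cols] = qt.fit_transform(train[feature_cols])
```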

4. Adding PCA components as extra features is always a good idea. We looked for the number of components able to capture at least 95% of the variance (for genes and cells separately) and settled on 600 components for genes and 50 for cells (sketched below, after the figure).

Number of components selected for gene & cell features
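
A sketch of both steps: verifying the 95% variance threshold and extracting the chosen numbers of components (gene_cols/cell_cols come from the earlier sketch).

```python
import numpy as np
from sklearn.decomposition import PCA

# How many gene components are needed to reach 95% explained variance?
pca_full = PCA().fit(train[gene_cols])
n_comp = np.argmax(np.cumsum(pca_full.explained_variance_ratio_) >= 0.95) + 1
print(n_comp)  # this kind of check led us to 600 (genes) and 50 (cells)

# Extract the chosen numbers of components and stack them as extra features.
pca_g = PCA(n_components=600, random_state=0).fit_transform(train[gene_cols])
pca_c = PCA(n_components=50, random_state=0).fit_transform(train[cell_cols])
pca_features = np.hstack([pca_g, pca_c])
```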

5. Another feature selection method we used was VarianceThreshold. The idea behind it is to drop all low-variance features, as they contain little information (see the sketch below).
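
A minimal sketch; the threshold value is an assumption and should be tuned against the CV score.

```python
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance falls below the threshold (assumed value).
vt = VarianceThreshold(threshold=0.8)
train_selected = vt.fit_transform(train[feature_cols])
print(train_selected.shape)  # fewer, higher-variance columns remain
```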

Modelling

Now comes the time to prepare the model pipeline, which involves the dataloader, training and inference. Below you can see the first model we tried, which achieved a better score on the leaderboard. We also included label smoothing in our implementation (a quick sketch below); to read more about label smoothing, you can simply visit this blog.
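
As an illustration, a hedged sketch of label smoothing for a multi-label BCE loss; the smoothing value is an assumption, not our exact setting.

```python
import torch
import torch.nn.functional as F

def smooth_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                    smoothing: float = 0.001) -> torch.Tensor:
    # Pull the hard 0/1 targets slightly toward 0.5 before computing BCE.
    targets = targets * (1.0 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy_with_logits(logits, targets)
```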

  1. Pytorch Model — CV Loss — 0.01562

Our PyTorch model had three dense layers, each paired with a batch normalization layer, which helps keep the batch activations centered. We applied weight normalization to keep the gradients from vanishing/exploding, used leaky ReLU as the activation function, and added dropout to prevent over-fitting.
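
A hedged sketch of such an architecture; the hidden size and dropout rate are assumptions, not our exact hyper-parameters.

```python
import torch
import torch.nn as nn

class MoANet(nn.Module):
    """Sketch of the 3-dense-layer net described above; sizes are assumptions."""

    def __init__(self, n_features: int, n_targets: int,
                 hidden: int = 1024, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(n_features),
            nn.Dropout(p_drop),
            nn.utils.weight_norm(nn.Linear(n_features, hidden)),
            nn.LeakyReLU(),
            nn.BatchNorm1d(hidden),
            nn.Dropout(p_drop),
            nn.utils.weight_norm(nn.Linear(hidden, hidden)),
            nn.LeakyReLU(),
            nn.BatchNorm1d(hidden),
            nn.Dropout(p_drop),
            nn.utils.weight_norm(nn.Linear(hidden, n_targets)),  # raw logits out
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```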

2. Keras Model — CV Loss — 0.01569

Here we tried multiple models (a ResNet-style network and a plain DNN) with similar architectures but different permutations and combinations of layers: we increased/decreased the number of dense layers, batch-norm layers and dropout values. We trained all of these models and took the mean of their losses and predictions.
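
A sketch of one such DNN variant in Keras; the layer sizes, dropout and smoothing values are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dnn(n_features: int, n_targets: int,
              hidden: int = 1024, p_drop: float = 0.3) -> tf.keras.Model:
    """Sketch of one DNN variant; hyper-parameters are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.BatchNormalization(),
        layers.Dropout(p_drop),
        layers.Dense(hidden, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(p_drop),
        layers.Dense(hidden, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(p_drop),
        layers.Dense(n_targets, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=0.001),
    )
    return model
```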

3. TabNet — CV Loss — 0.01579

We also added TabNet, an attention-based architecture for tabular data, to the mix. Separately, transfer learning was a technique almost everybody tried in this competition: converting the tabular features into an image format and applying EfficientNet turned out to be one of the best transfer-learning approaches, and it helped one team win the competition.
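
For the TabNet part, a hedged sketch using the pytorch-tabnet package; the hyper-parameters are assumptions, and X_train/y_train/X_valid/y_valid stand for the numpy feature/target arrays from the folds above.

```python
import numpy as np
import torch.nn.functional as F
from pytorch_tabnet.tab_model import TabNetRegressor

# Hyper-parameters are assumptions, not our exact settings.
model = TabNetRegressor(n_d=32, n_a=32, n_steps=1, seed=0)
model.fit(
    X_train, y_train,                            # numpy arrays: features, 0/1 targets
    eval_set=[(X_valid, y_valid)],
    loss_fn=F.binary_cross_entropy_with_logits,  # multilabel BCE on raw outputs
    max_epochs=200,
    patience=20,
)
logits = model.predict(X_valid)
probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid to get probabilities
```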

After that we stacked/blended all of these models together, which got us a public leaderboard rank of 560+ (a minimal blending sketch follows).
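
A minimal blending sketch; the weights are assumptions (in practice they are tuned on the CV score), and preds_* stand for each model's predictions.

```python
import numpy as np

# Weighted average of each model's predicted probabilities (assumed weights).
weights = {"pytorch": 0.4, "keras": 0.35, "tabnet": 0.25}

blend = (
    weights["pytorch"] * preds_pytorch
    + weights["keras"] * preds_keras
    + weights["tabnet"] * preds_tabnet
)
blend = np.clip(blend, 1e-5, 1 - 1e-5)  # clip to keep the log-loss finite
```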

Once the competition ended, our private LB rank came out as 324 (top 8%), which earned us the bronze medal in this competition.

I understand this blog is not complete without the full code. If you get in touch by writing in the comments, I will be happy to share my notebook with the EDA and modelling parts.

If you liked it, then please clap, share and comment.
