Kaggle Competition-Don’t Overfit II

Sahil · Published in Analytics Vidhya · Apr 23, 2020

Hi Guys!

I just started my first Kaggle competition problem, Don't Overfit II. Kaggle is a great place to practice Machine Learning and Deep Learning: many different kinds of problems are available, so you can try different techniques and learn which techniques apply in which scenarios.

Before proceeding, here is the outline of what this blog covers. Let's get started!

  1. Kaggle Problem
  2. Related Work
  3. My Approaches Apart from Related Work
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering
  6. My First Cut Approach
  7. Modeling
  8. Experiment and Results
  9. Conclusion
  10. Future Work
  11. References

1. Kaggle Problem

Don't Overfit! II is a challenging problem where we must prevent our models from overfitting (learning the training data in a crooked way) given a very small number of training samples.

As Kaggle puts it,

” It was a competition that challenged mere mortals to model a 20,000x200 matrix of continuous variables using only 250 training samples… without overfitting. “

Source of Data —

The dataset can be downloaded here:

https://www.kaggle.com/c/dont-overfit-ii/overview

So, with the small amount of training data given, we must approach the task carefully because it is very easy to overfit.

What do we need to predict? We are predicting the binary target value (binary classification) associated with each row, where each row contains 300 continuous feature values, and we must do this without overfitting on the minimal set of training samples given.

Evaluation —

As per the Kaggle problem statement, submissions are evaluated on AUROC (Area Under the Receiver Operating Characteristic curve) between the predicted target and the actual target.

ROC: the curve obtained by measuring the classifier's performance (true positive rate vs. false positive rate) at different threshold values for binary classification. Once we have this curve, we compute the area under it, which is called the Area Under the ROC curve (AUROC).
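For illustration, here is a minimal sketch of computing the metric with scikit-learn (y_true and y_scores are placeholder arrays, not values from the competition):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: actual binary targets, y_scores: predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points on the ROC curve
auc = roc_auc_score(y_true, y_scores)               # area under that curve
print(f"AUROC: {auc:.3f}")
```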

Since the amount of training data is small, classical Machine Learning models are a natural fit for this task.

2. Related Work

  1. Hyperparameter Tuning (Blog: https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624): The author compared Ridge, Lasso, ElasticNet, LassoLars (Lasso fit with Least Angle Regression), Bayesian Ridge, Logistic Regression, and an SGD classifier, evaluating the mean cross-validated score and its standard deviation with the parameters kept at their default values. Logistic Regression and SGD turned out to be the two best models, so a tuning strategy (both GridSearchCV and RandomizedSearchCV) was applied to these two only. The best CV score (not a Kaggle score), 0.789, was obtained with GridSearchCV on the Logistic Regression model.
  2. Just Don't Overfit (Blog: https://medium.com/analytics-vidhya/just-dont-overfit-e2fddd28eb29): The author standardized the training data (using the mean and standard deviation computed on the training set) and applied the same transform to the test data. The model was LassoCV (Least Absolute Shrinkage and Selection Operator with Cross-Validation): Lasso finds the hyperplane that minimizes the residual error plus a shrinkage penalty 𝜆 on the absolute weights to avoid overfitting, and cross-validation over different values of 𝜆 picks the right hyperplane. Test data score: 0.843.
  3. Don't Overfit! — How to prevent Overfitting in your Deep Learning Models (Blog: https://nilsschlueter.de/blog/articles/dont-overfit-%E2%80%8A-%E2%80%8Ahow-to-prevent-overfitting-in-your-deep-learning%C2%A0models/): The baseline was an MLP with two hidden layers (128 and 64 units). Since it is a binary classifier, the loss was binary cross-entropy and the optimizer was Adam; Kaggle score: 59%. The next approach was a simplified model with one hidden layer of 8 units, a Dropout layer with a 0.4 dropout rate, and an early-stopping callback on val_loss with patience 3. This new model achieved a Kaggle score of 80%.
  4. How to not overfit (kernel: https://www.kaggle.com/artgor/how-to-not-overfit): The author performed EDA (plots of a few features, correlation scores among features) and basic modeling with Logistic Regression, reaching a Kaggle score of 0.7226. Using the ELI5 tool, which reports feature weights for the model, they took the top 32 important features and retrained the basic model; Kaggle score: 0.7486. Other feature-selection techniques such as permutation importance, SHAP, and SequentialFeatureSelector did not improve things much. They then tried various models with hyperparameter tuning (Logistic Regression, Gaussian Naïve Bayes, AdaBoost, ExtraTrees, Random Forest, Gaussian Process Classification, Support Vector Classification (SVC), kNN, Bernoulli Naïve Bayes, SGD); a blend of Logistic Regression and SVC reached a Kaggle score of 0.831. They also tried feature engineering (polynomial features, added statistics, and distance features from kNN with k=5: mean, max, and min distance), followed by several sklearn feature-selection methods (SelectPercentile, SelectKBest, RFE) with Logistic Regression and GLM models, but the CV score still remained below 80%.

3. My Approaches apart from Related Work

I made 5 further changes:

  1. Apply Feature Engineering
  2. Apply an Oversampling technique: SMOTE (Synthetic Minority Oversampling Technique)
  3. Select the model's top important features using forward feature selection and retrain
  4. Apply Dimensionality Reduction using PCA and TruncatedSVD
  5. Calibrate the models

4. Exploratory Data Analysis (EDA)

** About Dataset **

About train.csv: 250 samples with 300 features, 1 class label, and 1 Id column, i.e. shape (250, 302)

About test.csv: 19,750 samples with 300 features and 1 Id column, i.e. shape (19750, 301)

Kaggle Source

** Describe training data **

df_train.describe()

Info of the training data (just to check whether there are any missing values)

df_train.info()
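For completeness, a minimal sketch of loading the data before running these checks (assuming train.csv and test.csv from the competition page are in the working directory):

```python
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

print(df_train.shape, df_test.shape)    # expected: (250, 302) and (19750, 301)
print(df_train.isnull().sum().sum())    # 0 means there are no missing values
```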

** Probability Density Function (PDF) **

From Wikipedia

In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value.

Now, let us observe which type of distribution each feature follows.

Let us see how the features overlap with each other.

Let us see how each feature overlaps when split by the target value.

** Cumulative Distribution Function (CDF) **

From Wikipedia

In probability theory and statistics, the cumulative distribution function of a real-valued random variable X (or just the distribution function of X), evaluated at x, is the probability that X will take a value less than or equal to x.

Let us see the CDF of each feature.

Let us overlay the CDFs on each other to see the differences in the plot.

Let us see how each feature overlaps when split by the target value.

** Box Plot **

** Violin Plot **

** Scatter Plot **

** Visualize in 2D (Using TSNE) **

In the 3D plot,
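These projections can be reproduced with a sketch along these lines (2D case shown; the perplexity value and plot styling are my choices, not necessarily the original ones):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# use only the 300 feature columns, not 'id' or 'target'
X = df_train.drop(columns=["id", "target"]).values
y = df_train["target"].values

X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=15)
plt.title("t-SNE (2D) of the 250 training samples, colored by target")
plt.show()
```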

** Is the dataset balanced or imbalanced? **

Very Important to notice!

The dataset is imbalanced, though not highly so.
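A quick way to verify this (a sketch, assuming the target column is named target as on Kaggle):

```python
counts = df_train["target"].value_counts()
print(counts)                    # samples per class
print(counts / counts.sum())     # class proportions: a moderate, not extreme, imbalance
```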

5. Feature Engineering

All the features are continuous. So, I created basic statistics (mean and standard deviation of each sample) along with trigonometric, hyperbolic, and exponential transformations.

** Mean and Standard deviation value of each sample **

** Trigonometric function **

Here I defined visual_fe to plot the actual feature against the feature after feature engineering is applied.

Let us see what the plot looks like after applying a trigonometric function to the actual feature data, compared on the basis of the target value.

For sin(x) function

For cos(x) function

For tan(x) function

After that, we take the ‘mean of each trigonometric function’

** Hyperbolic Function **

Similarly, take a look at how it is transformed.

For sinh(x) function

For cosh(x) function

For tanh(x) function

Similarly, we also take its ‘mean value of hyperbolic function’.

** Exponents Function **

For exp(x) function,

For exp(x)-1

For 2^(x) function

Take a ‘mean of each exponent function’.

** Some polynomial Operation **

Like x², x³, and x⁴, and take the ‘mean of each polynomial operation’.

For visualization

For x²,

For x³,

For x⁴,

** To wrap up into single function **
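A minimal sketch of such a wrap-up function (the name add_engineered_features and the exact set of engineered columns are my assumptions; it appends the per-sample aggregates described above as new columns):

```python
import numpy as np
import pandas as pd

def add_engineered_features(df, feature_cols):
    """Append per-sample aggregate features: mean/std plus means of
    trigonometric, hyperbolic, exponential and polynomial transforms."""
    X = df[feature_cols]
    out = df.copy()
    out["fe_mean"] = X.mean(axis=1)
    out["fe_std"] = X.std(axis=1)
    out["fe_sin_mean"] = np.sin(X).mean(axis=1)
    out["fe_cos_mean"] = np.cos(X).mean(axis=1)
    out["fe_tan_mean"] = np.tan(X).mean(axis=1)
    out["fe_sinh_mean"] = np.sinh(X).mean(axis=1)
    out["fe_cosh_mean"] = np.cosh(X).mean(axis=1)
    out["fe_tanh_mean"] = np.tanh(X).mean(axis=1)
    out["fe_exp_mean"] = np.exp(X).mean(axis=1)
    out["fe_expm1_mean"] = np.expm1(X).mean(axis=1)
    out["fe_exp2_mean"] = np.exp2(X).mean(axis=1)
    out["fe_x2_mean"] = (X ** 2).mean(axis=1)
    out["fe_x3_mean"] = (X ** 3).mean(axis=1)
    out["fe_x4_mean"] = (X ** 4).mean(axis=1)
    return out

feature_cols = [c for c in df_train.columns if c not in ("id", "target")]
df_train_fe = add_engineered_features(df_train, feature_cols)
```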

6. My First Cut Approach

So far, we have explored the data: the number of training samples given, which type of distribution each feature follows, whether the dataset is balanced or imbalanced, and so on. We have also applied some basic feature engineering using simple mathematical transformations.

Now, ‘Since it is an imbalanced dataset, we have to deal with it first’.

My first point (an initial understanding of how to approach the problem):

There are two ways to deal with class imbalance: undersampling and oversampling.

Undersampling simply reduces the number of majority-class samples by resampling. How? We take ’n’ random samples from the majority class, where ’n’ is the number of minority-class samples present.

Illustration and example of an undersampling technique

Disadvantage: we lose some information from the majority class (in the figure above, resampling from 990 samples down to only 10 means losing 980 samples’ worth of information about target class 1, the majority class).
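For illustration only, a minimal undersampling sketch using imbalanced-learn's RandomUnderSampler (X and y stand for the feature matrix and target; this is not the route taken later):

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)   # keep only as many majority samples as minority ones
X_under, y_under = rus.fit_resample(X, y)
print(np.bincount(y_under.astype(int)))     # both classes now have the minority-class count
```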

Oversampling means we either add new synthetic data with similar characteristics or duplicate minority-class samples, until the minority class has as many samples as the majority class.

Illustration and example of an oversampling technique

SMOTE (Synthetic Minority Oversampling Technique) is one of the oversampling techniques.

How it works:

1. First, a random example from the minority class is chosen.

2. Then the ‘k’ nearest neighbors of that example are found (typically k=5).

3. A neighbor is selected at random, and a synthetic example is created at a randomly chosen point between the two examples in feature space.

An example and illustration are given in this blog.
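A minimal SMOTE sketch with imbalanced-learn (k_neighbors=5 matches the typical value above; X and y stand for the raw training features and target). Note that SMOTE should only be fit on the training folds, so that no synthetic information leaks into validation:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)      # synthesizes new minority-class samples
print(np.bincount(y.astype(int)))            # class counts before SMOTE
print(np.bincount(y_res.astype(int)))        # balanced class counts after SMOTE
```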

Second point:

Since we understand that this is an imbalanced dataset, AUROC is an appropriate evaluation metric.

Note:

We should be aware of why we can't use Accuracy as the metric: since the dataset is imbalanced, Accuracy is not a good measure for this kind of problem (a model that always predicts the majority class would already look fairly accurate).

Third point:

The dataset is imbalanced, but not highly so. We don't know in advance which will work best, so we will try both with SMOTE and without SMOTE.

Fourth point:

Since it is an imbalanced dataset, the models' predicted probabilities can be unreliable (poorly calibrated). We need to calibrate our models.

Last point:

Since this is a binary classification problem, we will predict using models designed for classification.

Now, we have to bring the data into a bounded range, which is called feature scaling. There are two common ways to do that: Normalization and Standardization.

Normalization is a squashing technique that brings all values of each feature into the range between 0 and 1, parameterized by the max and min of each feature.

Standardization shifts each feature to zero mean and scales it to unit variance, i.e. each value is expressed as its distance from the mean in units of standard deviation (so most values fall roughly between -1 and 1, although the range is not strictly bounded). Its parameters are the mean and standard deviation.

There is no rule of thumb for which works best, so we will try both Normalization and Standardization, as sketched below.
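A sketch of both options with scikit-learn (X_train and X_test here stand for the raw feature matrices; the scalers are fit on the training data only and then applied to the test data):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: squashes each feature into [0, 1] using its min and max
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train)
X_test_norm = norm.transform(X_test)

# Standardization: zero mean, unit variance per feature
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
```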

7. Modeling

I used 7 models, each with hyperparameter tuning.

  1. kNN (k-nearest neighbor)
  2. Logistic Regression
  3. SVC (Support Vector Classification)
  4. Random Forest
  5. XGBoost (optimized distributed gradient boosting)
  6. Stacking Classifier
  7. Voting Classifier

I tuned hyperparameters using GridSearchCV with cross-validation stratified on the target variable, as provided by the sklearn toolkit.

The difference from the other related work is that, after hyperparameter tuning, each model is followed by a ‘Calibrated Model’. So, after fitting the best hyperparameters, I passed these models into calibration because, as said before, ‘since the dataset is imbalanced, it can make the models’ predictions uncertain’. (See the figure below.)
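A minimal sketch of this tune-then-calibrate pattern for a single model (Logistic Regression here; the parameter grid, CV settings, and variable names like X_train are illustrative, not the exact ones from my notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.calibration import CalibratedClassifierCV

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X_train, y_train)

# wrap the tuned estimator in a calibrated classifier
calibrated = CalibratedClassifierCV(grid.best_estimator_, method="sigmoid", cv=cv)
calibrated.fit(X_train, y_train)
test_probs = calibrated.predict_proba(X_test)[:, 1]   # probabilities used for AUROC
```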

I have also plotted the ROC and each model's feature importances. (It is not feasible to show all the plots here; you can check the code linked at the end of the conclusion section.)

In the code snippet below, I created 4 functions; each function has a docstring explaining what type of variables should be passed and what is returned.

Before proceeding to ‘Experiment and Results’, here are some of the other operations I performed:

Top important features:

To find the top ’n’ important features, I first trained the model, then used that same model to find the features that contribute the most to it. For this, I used forward feature selection.

The idea is simple:

  1. Initialize the desired number of top important features.
  2. Iterate over all the features in the training data.
  3. Stop when the desired condition is fulfilled or when the cross-validation score no longer improves after adding new features.

The illustration is as follows:

<F₁, F₂, …, Fn> is the list of features; in each iteration the model is trained on the currently selected features plus one candidate feature, and then evaluated on held-out data to score that candidate.

Below is code for the case illustrated above.
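(A minimal sketch of the greedy forward-selection idea; the scoring model, CV settings, and stopping rule here are my assumptions rather than the exact ones from the notebook. X is the training DataFrame of features and y the target.)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_feature_selection(X, y, feature_names, max_features=32):
    """Greedily add the feature that most improves the mean CV AUROC."""
    selected, remaining = [], list(feature_names)
    best_score = 0.0
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            cv_auc = cross_val_score(
                LogisticRegression(solver="liblinear"),
                X[cols], y, scoring="roc_auc", cv=5
            ).mean()
            scores.append((cv_auc, f))
        top_score, top_feature = max(scores)
        if top_score <= best_score:    # stop when no candidate improves the CV score
            break
        best_score = top_score
        selected.append(top_feature)
        remaining.remove(top_feature)
    return selected, best_score
```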

Dimension Reduction Technique:

  1. PCA (Principal Component Analysis): a dimensionality reduction technique that finds the directions (principal components) carrying the maximum variance, i.e. the most information. The number of components (the desired dimension d’) can be chosen by plotting the explained variance ratio against d (the number of components).
Plot explained variance ratio vs d to find the number of components in PCA

From the above graph, we can conclude that d’ = 175 is a decent choice, preserving about 90–99% of the variance.

2. TruncatedSVD: It comes from the matrix factorization concept. SVD stands for Singular Value Decomposition, which decomposes a matrix A into a product of matrices, A = U∑Vᵀ,

where U is the left singular matrix, V is the right singular matrix, and ∑ is the diagonal matrix of singular values. Read the pdf for more.

In truncated SVD, we choose k dimensions rather than all d dimensions. See the image below: after keeping the top k dimensions, the rest of the d-dimensional data is discarded.

To find the right k, we plot the explained variance ratio against the number of dimensions.

We choose k = 175 dimensions, which preserves approximately 90–95% of the variance.
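A sketch of how such explained-variance curves can be produced (X_train_std stands for the standardized training matrix; the component counts shown are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, TruncatedSVD

pca = PCA().fit(X_train_std)
cum_var_pca = np.cumsum(pca.explained_variance_ratio_)

svd = TruncatedSVD(n_components=200).fit(X_train_std)   # enough components to read the curve
cum_var_svd = np.cumsum(svd.explained_variance_ratio_)

plt.plot(cum_var_pca, label="PCA")
plt.plot(cum_var_svd, label="TruncatedSVD")
plt.axhline(0.95, linestyle="--", color="grey")
plt.xlabel("number of components d'")
plt.ylabel("cumulative explained variance ratio")
plt.legend()
plt.show()

X_train_pca = PCA(n_components=175).fit_transform(X_train_std)   # d' = 175 as chosen above
```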

8. Experiment and Results

1. SMOTE + FE + Standardization + ML Classification Model

2. SMOTE + FE + Normalization + ML Classification Model

3. SMOTE + Standardization + ML

4. SMOTE + Normalization + ML

5. Standardization + ML Classification Model with/without Top Features

6. Normalization + ML Classification Model with/without Top Features

7. FE + Dimension Reduction + Standardization + ML Classification Model (PCA d’ = 100)

8. FE + Dimension Reduction + Standardization + ML Classification Model (PCA d’ = 175)

9. FE + Dimension Reduction + Standardization + ML Classification Model (TruncatedSVD d’ = 175)
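As an illustration, experiment 1 corresponds roughly to the pipeline below (a sketch only; the actual notebook wires these steps through the helper functions mentioned earlier, and Logistic Regression stands in for any of the seven models). X_train_fe is the feature-engineered training matrix, as in the section 5 sketch:

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE and scaling run inside the CV folds so no information leaks into validation
pipe = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced")),
])
auc = cross_val_score(pipe, X_train_fe, y_train, scoring="roc_auc", cv=5).mean()
print(f"CV AUROC: {auc:.3f}")
```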

My Best Kaggle Score Yet!

The left column is Private Score (0.833) and the right column is Public Score (0.839)

9. Conclusion

  1. We performed EDA and found that most of the features overlap heavily when split by target, so we assumed it would be hard for any model to predict well.
  2. We performed feature engineering using basic statistics, trigonometric, hyperbolic, and exponential functions, and plotted each actual feature against its engineered version, as well as the engineered features split by the target label.
  3. If you compare experiments 1 and 2 with experiments 3 and 4, applying feature engineering gave a further improvement in score over not applying it.
  4. In experiments 5 and 6, we applied forward feature selection to keep only the features that increase the AUROC score, and got an even higher score than when using all features.
  5. In experiments 7, 8, and 9, we tried dimensionality reduction to see whether it gives better results, but no luck: it was even worse than using all features.

After that, I was still not satisfied with the result. I went through the discussion forum and learned that LB-probing strategies could push you into the top 5–10%. However, that kind of strategy is not used in real-world scenarios, and it is not a good way to learn since you cannot apply it to real business problems.

Well, anyway, it doesn't matter. What matters is working on different types of problems and trying different approaches, strategies, and experiments (by strategies I mean ones we can also apply to real business problems, not LB probing). There is no single solution to any kind of problem. To become a domain expert in a particular field, do a lot of projects, practice more, learn new methodologies by reading research papers, read several blogs, and keep experimenting.

10. Future Work

  1. Try Recursive Feature Elimination to obtain feature importances.
  2. Try the same approaches as above but without calibration, and then compare whether the calibrated or the uncalibrated models perform better.
  3. Try combining the undersampling and oversampling strategies: decrease the majority class while increasing the minority class. The final counts do not need to match the original number of minority or majority samples exactly.

11. References

  1. Hyperparameter Tuning (most voted blog): https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624
  2. Just Don't Overfit (blog): https://medium.com/analytics-vidhya/just-dont-overfit-e2fddd28eb29
  3. Don't Overfit! — How to prevent Overfitting in your Deep Learning Models (blog): https://nilsschlueter.de/blog/articles/dont-overfit-%E2%80%8A-%E2%80%8Ahow-to-prevent-overfitting-in-your-deep-learning%C2%A0models/
  4. How to not overfit (most voted kernel): https://www.kaggle.com/artgor/how-to-not-overfit
  5. AppliedAICourse: www.appliedaicourse.com

Thank you for reading it! Here is my Linkedin profile and Code in Github.
