Agile Machine Learning for Classification — Week 4

Shreesha Jagadeesh
7 min read · Dec 9, 2019

--

After introducing Agile Data Science in our introductory article here, we have built:

a) a quick baseline model in week 1 to figure out whether the data has enough predictive power to be worth spending more time on

b) another model in week 2 with cleaner features to see if it improves performance

c) a more complex model in week 3, with synthetic features created through feature engineering followed by feature reduction

This week, we are going to perform hyperparameter tuning using Grid Search to optimize our RandomForest binary classification model.

This is one of the most time-consuming and tedious steps because you don't know the hyperparameter search space a priori. This is also the step where having a powerful computer to run through the computations can make the difference between trying out a lot of different combinations and just a limited set, especially when you don't have business intuition to narrow the space.

To increase our chances of finding the best hyperparam combo, we are going to plot learning curves to figure out if the current model is overfitting or underfitting. This will guide our decisions on which way to tune the model.

Let's get started!

Loading Data

Let's load the cleaned dataset from Week 2 because the feature engineering steps in Week 3 were not effective.
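Something along these lines does the trick (the file name below is a placeholder for wherever you saved the Week 2 output):

```python
import pandas as pd

# Placeholder path: point this at wherever the cleaned Week 2 dataset was saved
df = pd.read_csv('telco_churn_cleaned_week2.csv')
df.head()
```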

Loaded dataframe showing sample rows

Train Test Split

As before, let's do a Train-Test split after separating out the Features and Target.
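A minimal sketch, assuming the target column is named 'Churn' (adjust to the actual column name in your dataframe):

```python
from sklearn.model_selection import train_test_split

# 'Churn' is assumed to be the name of the target column
X = df.drop(columns=['Churn'])
y = df['Churn']

# Stratify so that the class ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```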

Learning Curve

Remember, our objective is to get a final set of hyperparams that reduce overfitting while increasing the classification F1 score. How does our default RandomForestClassifier() perform so far when it comes to overfitting? From our previous week's results, we know that there is a gap between the Training set and the Validation scores. But will having more data reduce this gap, or will changing the hyperparameters to reduce overfitting make the difference? Let's find out by creating Learning Curves.

The intuition behind these curves is that when models are overfitting (i.e. have a lot of variance), their Training Accuracy will be significantly higher than their Validation Accuracy. Models without overfitting will have the Training Accuracy converge towards the Validation Accuracy as the number of examples increases. On the other hand, models with a lot of bias will have both their Training and Validation scores converge to a low value.

Sklearn has an excellent page on the use of Learning Curves to compare algorithm performance here, which I have modified below:
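Here is a simplified sketch of that helper. It plots only the mean scores; sklearn's original example (and possibly the original notebook) also shades the standard-deviation bands:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=5, n_jobs=-1,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """Plot mean training vs. cross-validation scores for increasing training set sizes."""
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

    plt.figure()
    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.grid()
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='r',
             label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', color='g',
             label='Cross-validation score')
    plt.legend(loc='best')
    return plt
```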

Imports and function to generate learning curves

I am going to plot the learning curve with the default hyperparams from our out-of-the-box RandomForestClassifier() vs. another learning curve where the estimator has its max_depth hyperparam set to penalize overfitting.
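The first call might look like this (random_state=42 is just an assumption for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

# Out-of-the-box RandomForest: max_depth=None lets every tree grow until its leaves are pure
rf_default = RandomForestClassifier(random_state=42)
plot_learning_curve(rf_default, 'Learning Curve: default RandomForestClassifier',
                    X_train, y_train, cv=5)
plt.show()
```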

Generating Learning Curves for the out-of-the-box RandomForest
Learning curve for Random Forest estimator on the training data where there is overfitting

Notice how the Training score stays close to 1.0 (implying near 100% accuracy) while the Validation scores stay below 0.80. This large gap indicates that overfitting occurs with the default hyperparams. (The default params can be seen by pressing Shift+Tab after placing the cursor inside the estimator's parentheses.)

In contrast, let's plot the RandomForestClassifier() with a stricter setting than the out-of-the-box default. Instead of the default max_depth=None, I have chosen max_depth=5 and re-plotted the learning curve.
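The only change is the max_depth argument:

```python
# Same estimator, but with shallow trees (max_depth=5) to penalize overfitting
rf_shallow = RandomForestClassifier(max_depth=5, random_state=42)
plot_learning_curve(rf_shallow, 'Learning Curve: RandomForest with max_depth=5',
                    X_train, y_train, cv=5)
plt.show()
```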

Training and Validation scores converging, indicating that there is no overfitting

Why does restricting the max_depth to just 5 levels prevent the RandomForest from overfitting? Because the shallower the tree, the fewer 'cuts' it makes to the data, and hence it learns only the broader patterns rather than the noise.

Is there a more systematic way to pick the right depth instead of randomly testing out a few numbers? Yes, there are many different ways of picking the right set of hyperparameters from a search space, including Random Search, Grid Search and Bayesian Search. Depending on the time available for model development and how accurate you want the performance to be, you can pick one of the above for tuning. You can refer to Will Koehrsen's excellent in-depth analysis of Model Tuning here for more information. The rest of the tutorial focuses on Grid Search to explore the hyperparam space. In later weeks, we will cover the more advanced Bayesian Search.

Grid Search for Hyperparameter Tuning

Out of the many hyperparams available in RandomForest, I am going to pick 3 hyperparams for reducing variance (i.e. overfitting) and 1 hyperparam for reducing the bias towards the majority class. Those are max_depth, min_samples_split and min_samples_leaf for reducing variance, and class_weight for improving the accuracy. The class_weight is especially interesting because we have a slight class imbalance, but not so severe that we need more advanced techniques yet.
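An illustrative grid in that spirit is shown below; the exact values (and hence the roughly 700 combinations mentioned later) come from the original notebook and may differ from this sketch:

```python
# Illustrative search space: the exact values in the original notebook may differ
param_grid = {
    'max_depth': [3, 5, 7, 10],            # keep the trees shallow to curb variance
    'min_samples_split': [2, 5, 10],       # minimum samples required to split a node
    'min_samples_leaf': [1, 2, 5],         # minimum samples required at a leaf
    'class_weight': [None, 'balanced',     # reweight the positive (churn) class
                     {0: 1, 1: 2}, {0: 1, 1: 3}],
}
```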

Hyperparameter space for Grid Search

I have picked relatively low numbers for the max_depth. I am not too worried about min_samples_split & min_samples_leaf because we saw previously how the learning curve converges without a problem even with the defaults. On the other hand, the default class_weight gives a weight of 1 to each class label. Choosing 'balanced' tells sklearn to reweight the classes inversely proportional to their frequencies, effectively rebalancing them to 50:50 during training. I have also included dictionaries with other weight options for the positive class that GridSearchCV will iterate through.

Below is a function that takes in the train set features, labels, hyperparameter search space, the estimator, and whether the search type is random_search or grid_search. At the end of the day, hyperparam tuning is essentially good old trial and error: try many things and keep the ones that work!
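A sketch of such a function is below. The internals here are assumptions: I score on F1 and run an extra cross_val_predict pass with the best params so that metrics_store_function() (defined further down) can build its metrics dictionary; the original notebook may wire this up slightly differently.

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_predict

def hyper_param_search_function(X_train, y_train, param_space, estimator,
                                search_type='grid_search', n_iter=100, cv=5):
    """Search the hyperparam space and return the best params plus their validation metrics."""
    if search_type == 'grid_search':
        search = GridSearchCV(estimator, param_space, scoring='f1',
                              cv=cv, n_jobs=-1, verbose=1)
    else:  # 'random_search'
        search = RandomizedSearchCV(estimator, param_space, n_iter=n_iter,
                                    scoring='f1', cv=cv, n_jobs=-1,
                                    verbose=1, random_state=42)

    search.fit(X_train, y_train)
    best_params = search.best_params_

    # Cross-validated predictions using the best params, fed into the metrics helper below
    best_model = estimator.set_params(**best_params)
    y_val_pred = cross_val_predict(best_model, X_train, y_train, cv=cv)
    val_metrics = metrics_store_function(y_train, y_val_pred)

    return best_params, val_metrics
```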

Hyperparam tuning using either Grid Search or Random Search

The core of the function performs a search through all possible combinations of the parameter space that is passed (700 combinations in this case). Each combination is cross-validated 5 times to average out any quirks in the data, so we have a total of 700 * 5 = 3500 training and validation runs within the GridSearchCV function. The hyperparams with the best validation scores can be obtained from the .best_params_ attribute. I have also optionally stored the validation score results for the model containing the best_params_.

Note that the metrics_store_function() code snippet is given below.
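Roughly along these lines (the exact set of metrics collected in the original may differ):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def metrics_store_function(y_true, y_pred):
    """Collect the usual classification metrics in a dictionary."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        # ROC AUC on hard labels is a rough approximation; probabilities are more precise
        'roc_auc': roc_auc_score(y_true, y_pred),
    }
```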

The above function collects the usual classification metrics in a dictionary and returns it for use inside the hyper_param_search_function().

Do you know why I am not using the best_estimator_ model as my tuned model? Because it was trained on a subset of the training data and not the whole thing. Later, we will use the whole training set with the best_params_ and then predict on the test set.

The function is general enough for the reader to try Random Search instead of Grid Search. The difference is that Grid Search is exhaustive, while Random Search only tries a subset of the combinations to save time. For a larger hyperparam space, you could run a quick Random Search on a smaller subspace first and then do a more rigorous Grid Search on the narrowed-down region.

Here, I am going to try just the Grid Search by calling the function:
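For example:

```python
# Grid Search over the param_grid defined above
rf_estimator = RandomForestClassifier(random_state=42)
best_params, val_metrics = hyper_param_search_function(
    X_train, y_train, param_grid, rf_estimator, search_type='grid_search')
```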

Calling the function to perform hyperparameter tuning using GridSearchCV
Hyperparam search logs

If you want to take a peek at what the tuned hyperparameters look like, the following will generate the output:
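```python
print(best_params)
print(val_metrics)
```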

Tuned RandomForest Hyperparam results

Machine Learning on Tuned Hyperparameters

Model Fitting

As in previous weeks, we are going to instantiate the RandomForestClassifier() but now we have the newly found tuned hyperparameters to override the defaults.
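Something like:

```python
# Override the defaults with the tuned hyperparams and refit on the full training set
tuned_rf = RandomForestClassifier(random_state=42, **best_params)
tuned_rf.fit(X_train, y_train)
```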

Fit the classifier using the previously found hyperparams

When we run the fitted model on the Test Set, we see a noticeable improvement in both the F1 scores and the ROC AUC:
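A sketch of the evaluation step, using classification_report for the per-class scores and the predicted probabilities for ROC AUC:

```python
from sklearn.metrics import classification_report, roc_auc_score

# Evaluate on the held-out test set
y_pred = tuned_rf.predict(X_test)
y_proba = tuned_rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))
```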

Success!

Classification metrics for Telco Churn

Compared to the previous week, when the ROC AUC was around 79%, this week we have improved it to 84%! Moreover, our F1 score for the positive class also improved by almost 10%. Much of this can be attributed to restricting the max_depth and changing the class_weight so that the imbalanced positive class is picked up more easily.

At this stage, you can claim victory and go back to the business to present your results. If you have more time, or the business wants you to try harder still, you can try Ensemble Modelling, which we will discuss in Week 5. Stay tuned!
