Agile Machine Learning for Classification — Week 2
In the previous post https://medium.com/@shreesha265/iterative-machine-learning-for-classification-week-1-31049af8655d, I established a quick baseline RandomForestClassifier model that gave close to 80% accuracy on the Test set. This week, I am going to try to improve the model’s performance by implementing the following:
- Identifying blank spaces and turning them into NaN
- Handling Null values (Missing Indicator + Median Imputation)
- Grouping rare categorical levels into ‘Other’ category
- Discretization for Continuous Variables using K-Means
Then I will do the usual modelling steps and plot the performance metrics for binary classification.
Let’s begin!
Identifying blank spaces and turning them into NaNs
After importing libraries and the same dataset as before, I wrote a custom function to identify the bad columns. I defined ‘bad’ columns as those with NaNs, blanks, constant/quasi-constant values, or duplicated column values. The function is given below.
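Since the original function is not reproduced in this text, here is a minimal sketch of what such a bad-column detector could look like. The name `find_bad_columns` and the 99% quasi-constant threshold are my own assumptions, not the post’s exact code:

```python
import pandas as pd

def find_bad_columns(df, quasi_constant_threshold=0.99):
    """Flag columns with NaNs, blank strings, (quasi-)constant values or duplicates."""
    bad = {}
    for col in df.columns:
        if df[col].isna().any():
            bad.setdefault(col, []).append("has NaNs")
        # blank check only makes sense for string columns
        if df[col].dtype == object and df[col].str.strip().eq("").any():
            bad.setdefault(col, []).append("has blanks")
        top_freq = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_freq >= quasi_constant_threshold:
            bad.setdefault(col, []).append("constant/quasi-constant")
    # duplicated columns: identical values under a different name
    for i, c1 in enumerate(df.columns):
        for c2 in df.columns[i + 1:]:
            if df[c1].equals(df[c2]):
                bad.setdefault(c2, []).append(f"duplicate of {c1}")
    return bad
```

Running this on the Telco frame surfaces the ‘TotalCharges’ blanks discussed next.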
Once you run this, you will notice that the only column we need to deal with is the ‘TotalCharges’ column, which has 11 blank spaces. What do we do with the blank spaces? In the previous week’s iteration, I simply dropped those rows, but now I am going to convert them to NaNs and try a couple of null value imputation methods. This is often how it goes in Machine Learning, or Engineering in general: we bring the problem into familiar territory so that known techniques can be applied.
Notice how the ‘TotalCharges’ column is now treated as a float? This makes it easier to handle later.
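On a toy frame, the blank-to-NaN conversion can be sketched like this (the sample values are made up; a regex replace turns whitespace-only strings into NaN, after which the cast to float succeeds):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", ""]})
# whitespace-only strings become NaN, then the column can be cast to float
df["TotalCharges"] = (
    df["TotalCharges"].replace(r"^\s*$", np.nan, regex=True).astype(float)
)
```

`pd.to_numeric(df["TotalCharges"], errors="coerce")` is an equally valid one-liner for the same job.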
Before we do any imputation or transformation, we should split the data into training and test sets. This ensures that we fit transformers on the training data only and merely apply them to the test data, preventing data leakage.
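A sketch of the split, on a hypothetical stand-in for the Telco frame (the 80/20 ratio and random seed are my own choices; stratifying on the target keeps the churn ratio stable across the splits):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the Telco dataframe
df = pd.DataFrame({
    "tenure": range(10),
    "Churn": ["Yes", "No"] * 5,
})
X = df.drop(columns="Churn")
y = df["Churn"]
# split BEFORE any imputation/transformation so test statistics never leak in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
```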
Now we are ready to do some variable transformation. Let’s deal with the null values first.
Handling Null values (Missing Indicator + Median Imputation)
a. Adding a Missing Indicator
Before we impute any nulls, if we do not know whether the data is Missing At Random, it is better to introduce another column to capture the missingness. This can be accomplished manually or with the AddNaNBinaryImputer() transformer from Feature-engine (renamed AddMissingIndicator in more recent releases).
Notice the extra column called “TotalCharges_na” that I created? This column has a 1 whenever the corresponding row in “TotalCharges” is null.
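The manual equivalent is a one-liner in pandas; here is a sketch on made-up values (this is my own illustration, not the Feature-engine call from the post):

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"TotalCharges": [29.85, np.nan, 1889.5, np.nan]})
# 1 wherever the original value is missing; must be created BEFORE imputing
X_train["TotalCharges_na"] = X_train["TotalCharges"].isna().astype(int)
```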
Make sure to introduce the binary indicator before any imputation; otherwise we lose the information about the missingness. The above code is general enough to handle both categorical and numerical variables. If multiple columns get indicators, a round of feature reduction is needed (removing constant and correlated features), because it is possible that some features have missing data that occur at the same time. For our Telco Customer Churn data, however, we don’t need to worry about feature reduction because there is only one extra column.
b. Median Imputation
Be cautious when imputing nulls, because we may end up distorting the distribution. When the proportion of missing values is below 5% and we believe the data is Missing At Random (MAR), simple statistical imputation is valid. If not, you may have to try more sophisticated methods, such as building a mini Random Forest model on the rest of the features to predict the missing values. In this dataset, I am going to apply median imputation to the continuous variables (only TotalCharges). Note that numeric_col_list contains just the continuous variables in our data and NOT the entire set of numerical features.
Notice that I have added the median-imputed column back into the original dataframe and deleted the old TotalCharges column.
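A minimal train-then-transform sketch of median imputation in plain pandas (toy numbers; the key point is that the median is learned from the training split only):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0, 20.0]})
test = pd.DataFrame({"TotalCharges": [np.nan, 50.0]})

median = train["TotalCharges"].median()  # learned from training data only
train["TotalCharges"] = train["TotalCharges"].fillna(median)
test["TotalCharges"] = test["TotalCharges"].fillna(median)  # reuse the same value
```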
Reducing Cardinality
Grouping rare categorical levels into ‘Other’ category
This step is critical, especially for a code base that goes into Production. In real life, the incoming data may contain column levels that were never seen by the model, and if the pre-processing steps don’t do anything about it, the model will throw an error complaining about unseen levels. To proactively prevent this, one option is to group any infrequently occurring levels into their own category called ‘Rare’. This also has the advantage of reducing overfitting (otherwise our tree can grow really deep).
Reduce the cardinality of the feature space to make it easier for tree-based algorithms.
Q. What would be a good number of levels in the categorical features?
For features with more than 2 levels, use an arbitrary threshold like 5%, below which the remaining levels are merged into the ‘Other’ category.
Notice how I did it in 3 steps: first, I isolated only those columns with more than 2 levels. Then, among those columns, I renamed the lowest-frequency level to ‘Rare’. This takes care of columns whose levels are fairly evenly distributed but where we still need to account for unseen levels in the Test set. Finally, I used the RareLabelCategoricalEncoder from Feature-engine to automatically encode all column levels below 5% frequency as ‘Rare’.
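For reference, the frequency-based grouping can be sketched in plain pandas. The implementation and function name below are my own, not Feature-engine’s; only the 5% threshold comes from the post:

```python
import pandas as pd

def group_rare_levels(train_col, test_col, threshold=0.05, rare_label="Rare"):
    """Learn the frequent levels on the training column; map everything else
    (including levels never seen in training) to rare_label."""
    freqs = train_col.value_counts(normalize=True)
    frequent = set(freqs[freqs >= threshold].index)

    def _map(s):
        return s.where(s.isin(frequent), rare_label)

    return _map(train_col), _map(test_col)
```

Because the mapping keeps only the frequent training levels, an unseen level in the Test set automatically becomes ‘Rare’ instead of crashing the pipeline.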
Encoding of categorical levels
We have one last step before we can fit the ML model on the training data: encoding the text into numbers. This can be accomplished with a simple alphabetical encoding of the categorical levels using LabelEncoder, or something a bit more sophisticated like encoding with the target mean. We will use the former to illustrate the concept. Note that this method will only work for tree-based models and not for linear models, because the latter will misinterpret the encoded levels as actual numbers rather than mere labels.
The above code block shows how to use the LabelEncoder on each of the columns of interest and turn them into numerical variables. Note how I called transform separately on the Test set.
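A sketch of that fit-on-train, transform-on-test pattern (column and level names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X_train = pd.DataFrame({"Contract": ["Month-to-month", "Two year", "One year"]})
X_test = pd.DataFrame({"Contract": ["Two year", "Month-to-month"]})

encoders = {}
for col in X_train.columns:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])  # fit on Train only
    X_test[col] = le.transform(X_test[col])        # reuse the fitted mapping
    encoders[col] = le                             # keep for inverse_transform later
```

Strictly speaking, sklearn intends LabelEncoder for target labels (OrdinalEncoder is the feature-oriented equivalent), but applying it per column works fine for this illustration.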
As expected, the data is all numerical.
Let’s repeat the same with the target column.
I am going to export this intermediate dataset to be used for future weeks of modelling.
(Optional) Discretizing Continuous Variables
This step is optional. The idea is to turn a continuous variable into a discrete one. For tree-based classification models, this has the potential to make class separation easier.
There are several ways of discretizing (both supervised and unsupervised). One option is to arbitrarily bin the continuous features into, say, 10 buckets, with each bucket holding an equal number of observations. While this is easy to implement, the problem with this approach is that information may be lost.
A more robust option is to run an unsupervised algorithm like K-means individually on each of the continuous features to infer the number of clusters and then bin the features accordingly. This is the approach I am adopting here.
Using Elbow method to determine number of K-Means clusters
One of the key challenges in using K-means is that you don’t know a priori how many clusters to choose. So, if you don’t have guidance from the business on how many bins make sense, use the Elbow method. The basic idea is simple: for the columns of interest, iterate through a range of cluster counts and plot the error (in this case, the Sum of Squared Errors, or SSE). The cluster count where the SSE drops off and stabilizes is the one to pick.
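The loop looks roughly like this; I use a made-up 1-D feature with two obvious groups so the elbow is easy to see (in the post, `x` would be one of the real continuous columns):

```python
import numpy as np
from sklearn.cluster import KMeans

# toy 1-D feature with two well-separated groups (stand-in for e.g. tenure)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]).reshape(-1, 1)

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centre

# plotting sse.keys() vs sse.values() reveals the elbow; for this toy data it is at k = 2
```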
Running this code for our continuous variables (tenure, MonthlyCharges and TotalCharges) gives the following graph:
As seen in the above graph, either 3 or 4 clusters give a good enough result, beyond which the improvement tapers off. So let’s pick 4 clusters for our discretization.
The code assigns each row to a particular bin (0, 1, 2 or 3, because we have 4 clusters) and stores the assignments in new pandas columns in the existing dataframe (identified by the ‘_cluster’ suffix I have chosen).
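A sketch of that assignment plus a quick look at what lands in each bin (the toy values and the single-column loop are my own simplification):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"tenure": [1, 2, 3, 20, 21, 22, 50, 51, 52, 70, 71, 72]})
for col in ["tenure"]:  # in the post this also covers MonthlyCharges, TotalCharges
    km = KMeans(n_clusters=4, n_init=10, random_state=0)
    df[col + "_cluster"] = km.fit_predict(df[[col]])

# inspect what goes into each bin
summary = df.groupby("tenure_cluster")["tenure"].agg(["min", "max", "count"])
```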
If you are curious to know what goes into each of the bins, here is the code for that.
Now that we have the new columns, we can delete the tenure, MonthlyCharges and TotalCharges columns.
Now that the data is ready, we are going to repeat some of the ML steps from the previous week.
Machine Learning
Model Training
I am going to refactor some of the code from the previous week into a separate Python file and call the function from the same directory as the notebook. The custom function is given below.
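Since the refactored file itself is not reproduced here, this is a minimal sketch of what such a helper might contain. The function name, signature and metric set are my own assumptions; the actual helper in the post also produces plots:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def train_and_evaluate(X_train, y_train, X_test, y_test, random_state=0):
    """Fit a RandomForest and return the model plus a dict of Test-set metrics."""
    model = RandomForestClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return model, {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, proba),
    }

# quick smoke test on synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, random_state=0)
model, metrics = train_and_evaluate(X[:150], y[:150], X[150:], y[150:])
```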
Calling the function with the code below will give you a set of output metrics and plots.
In the interest of time, I am going to display and discuss just the Test metrics.
The Test set performance is similar to the previous week.
As discussed in the previous week, we want the confusion matrix’s main diagonal to hold all the counts and the off-diagonal elements (False Positives, False Negatives) to be zero. In our case, the model is picking up most of the No’s but less than half of the churned customers. With the Test accuracy, ROC AUC and F1 score not much different from the previous week’s, the steps we adopted haven’t improved the model performance at all.
In the next week, we will try out more advanced Feature Engineering techniques using Featuretools to see if that helps. Until then, stay tuned!
The next week’s article can be found here