Agile Machine Learning for Classification — Week 2
In the previous post https://medium.com/@shreesha265/iterative-machine-learning-for-classification-week-1-31049af8655d, I established a quick baseline RandomForestClassifier model that gave close to 80% accuracy on the Test set. This week, I am going to try to improve the model’s performance by implementing the following:
- Identifying blank spaces and turning them into NaN
- Handling Null values (Missing Indicator + Median Imputation)
- Grouping rare categorical levels into ‘Other’ category
- Discretization for Continuous Variables using K-Means
Then I will do the usual modelling steps and plot the performance metrics for binary classification.
Let’s begin!
Identifying blank spaces and turning them into NaNs
After importing libraries and the same dataset as before, I wrote a custom function to identify the bad columns. I defined ‘bad’ columns as those with NaNs, blanks, constant/quasi-constant values, or duplicated column values. The function is given below.
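Since the original function is not reproduced in this text, here is a minimal sketch of what such a bad-column detector could look like. The name `find_bad_columns` and the 99% quasi-constant threshold are my own assumptions, not the post’s exact code:

```python
import pandas as pd

def find_bad_columns(df, quasi_constant_threshold=0.99):
    """Flag columns with NaNs, blank strings, (quasi-)constant values or duplicates."""
    bad = {}
    for col in df.columns:
        if df[col].isna().any():
            bad.setdefault(col, []).append("has NaNs")
        # blank check only makes sense for string columns
        if df[col].dtype == object and df[col].str.strip().eq("").any():
            bad.setdefault(col, []).append("has blanks")
        top_freq = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_freq >= quasi_constant_threshold:
            bad.setdefault(col, []).append("constant/quasi-constant")
    # duplicated columns: identical values under a different name
    for i, c1 in enumerate(df.columns):
        for c2 in df.columns[i + 1:]:
            if df[c1].equals(df[c2]):
                bad.setdefault(c2, []).append(f"duplicate of {c1}")
    return bad
```

Running this on the Telco frame surfaces the ‘TotalCharges’ blanks discussed next.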
Once you run this, you will notice that the only column we need to deal with is the ‘TotalCharges’ column, which has 11 blank spaces. What do we do with the blank spaces? In the previous week’s iteration, I simply dropped those rows, but now I am going to convert them to NaNs and try a couple of null value imputation methods. This is often how it goes in Machine Learning, or Engineering in general: we bring the problem into familiar territory so that known techniques can be applied.
Notice how the ‘TotalCharges’ column is now treated as a float? This makes it easier to handle later.
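On a toy frame, the blank-to-NaN conversion can be sketched like this (the sample values are made up; a regex replace turns whitespace-only strings into NaN, after which the cast to float succeeds):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", ""]})
# whitespace-only strings become NaN, then the column can be cast to float
df["TotalCharges"] = (
    df["TotalCharges"].replace(r"^\s*$", np.nan, regex=True).astype(float)
)
```

`pd.to_numeric(df["TotalCharges"], errors="coerce")` is an equally valid one-liner for the same job.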
Before we do any imputation or transformation, we should split the data into training and test sets. This ensures that we fit transformers on the training data only and merely apply them to the test data, preventing data leakage.
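A sketch of the split, on a hypothetical stand-in for the Telco frame (the 80/20 ratio and random seed are my own choices; stratifying on the target keeps the churn ratio stable across the splits):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the Telco dataframe
df = pd.DataFrame({
    "tenure": range(10),
    "Churn": ["Yes", "No"] * 5,
})
X = df.drop(columns="Churn")
y = df["Churn"]
# split BEFORE any imputation/transformation so test statistics never leak in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
```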
Now we are ready to do some variable transformation. Let’s deal with the null values first.
Handling Null values (Missing Indicator + Median Imputation)
a. Adding a Missing Indicator
Before we impute any nulls, if we do not know whether the data is Missing At Random, it is better to introduce another column to capture the missingness. This can be accomplished manually or with the AddNaNBinaryImputer() transformer from Feature-engine (renamed AddMissingIndicator in more recent releases).
Notice the extra column called “TotalCharges_na” that I created? This column has a 1 whenever the corresponding row in “TotalCharges” is null.
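The manual equivalent is a one-liner in pandas; here is a sketch on made-up values (this is my own illustration, not the Feature-engine call from the post):

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"TotalCharges": [29.85, np.nan, 1889.5, np.nan]})
# 1 wherever the original value is missing; must be created BEFORE imputing
X_train["TotalCharges_na"] = X_train["TotalCharges"].isna().astype(int)
```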
Make sure to introduce the binary indicator before any imputation; otherwise we lose the information about the missingness. The above code is general enough to handle both categorical and numerical variables. If multiple columns get indicators, a round of feature reduction is needed (removing constant and correlated features), because it is possible that some features have missing data that occur at the same time. For our Telco Customer Churn data, however, we don’t need to worry about feature reduction because there is only one extra column.
b. Median Imputation
Be cautious when imputing nulls, because we may end up distorting the distribution. When the proportion of missing values is below 5% and we believe the data is Missing At Random (MAR), simple statistical imputation is valid. If not, you may have to try more sophisticated methods, such as building a mini Random Forest model on the rest of the features to predict the missing values. In this dataset, I am going to apply median imputation to the continuous variables (only TotalCharges). Note that numeric_col_list contains just the continuous variables in our data and NOT the entire set of numerical features.
Notice that I have added the median-imputed column back into the original dataframe and deleted the old TotalCharges column.
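A minimal train-then-transform sketch of median imputation in plain pandas (toy numbers; the key point is that the median is learned from the training split only):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0, 20.0]})
test = pd.DataFrame({"TotalCharges": [np.nan, 50.0]})

median = train["TotalCharges"].median()  # learned from training data only
train["TotalCharges"] = train["TotalCharges"].fillna(median)
test["TotalCharges"] = test["TotalCharges"].fillna(median)  # reuse the same value
```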
Reducing Cardinality
Grouping rare categorical levels into ‘Other’ category
This step is critical, especially for a code base that goes into Production. In real life, the incoming data may contain column levels that were never seen by the model, and if the pre-processing steps don’t do anything about it, the model will throw an error complaining about unseen levels. To proactively prevent this, one option is to group any infrequently occurring levels into their own category called ‘Rare’. This also has the advantage of reducing overfitting (otherwise our tree can grow really deep).
Reduce the cardinality of the feature space to make it easier for tree-based algorithms.
Q. What would be a good number of levels in the categorical features?
For features with more than 2 levels, use an arbitrary threshold like 5%, below which the remaining levels are merged into the ‘Other’ category.
Notice how I did it in 3 steps: first, I isolated only those columns with more than 2 levels. Then, among those columns, I renamed the lowest-frequency level to ‘Rare’. This takes care of columns whose levels are fairly evenly distributed but where we still need to account for unseen levels in the Test set. Finally, I used the RareLabelCategoricalEncoder from Feature-engine to automatically encode all column levels below 5% frequency as ‘Rare’.
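For reference, the frequency-based grouping can be sketched in plain pandas. The implementation and function name below are my own, not Feature-engine’s; only the 5% threshold comes from the post:

```python
import pandas as pd

def group_rare_levels(train_col, test_col, threshold=0.05, rare_label="Rare"):
    """Learn the frequent levels on the training column; map everything else
    (including levels never seen in training) to rare_label."""
    freqs = train_col.value_counts(normalize=True)
    frequent = set(freqs[freqs >= threshold].index)

    def _map(s):
        return s.where(s.isin(frequent), rare_label)

    return _map(train_col), _map(test_col)
```

Because the mapping keeps only the frequent training levels, an unseen level in the Test set automatically becomes ‘Rare’ instead of crashing the pipeline.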
Encoding of categorical levels
We have one last step before we can fit the ML model on the training data: encoding the text into numbers. This can be accomplished with a simple alphabetical encoding of the categorical levels using LabelEncoder, or something a bit more sophisticated like encoding with the target mean. We will use the former to illustrate the concept. Note that this method will only work for tree-based models and not for linear models, because the latter will misinterpret the encoded levels as actual numbers rather than mere labels.
The above code block shows how to use the LabelEncoder on each of the columns of interest and turn them into numerical variables. Note how I called transform separately on the Test set.
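A sketch of that fit-on-train, transform-on-test pattern (column and level names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X_train = pd.DataFrame({"Contract": ["Month-to-month", "Two year", "One year"]})
X_test = pd.DataFrame({"Contract": ["Two year", "Month-to-month"]})

encoders = {}
for col in X_train.columns:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])  # fit on Train only
    X_test[col] = le.transform(X_test[col])        # reuse the fitted mapping
    encoders[col] = le                             # keep for inverse_transform later
```

Strictly speaking, sklearn intends LabelEncoder for target labels (OrdinalEncoder is the feature-oriented equivalent), but applying it per column works fine for this illustration.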
As expected, the data is all numerical.
Let’s repeat the same with the target column.
I am going to export this intermediate dataset to be used for future weeks of modelling.
(Optional) Discretizing Continuous Variables
This step is optional. The idea is to turn a continuous variable into a discrete one. For tree-based classification models, this has the potential to make class separation easier.
There are several ways of discretizing (both supervised and unsupervised). One option is to arbitrarily bin the continuous features into, say, 10 buckets, with each bucket holding an equal number of observations. While this is easy to implement, the problem with this approach is that information may be lost.
A more robust option is to run an unsupervised algorithm like K-means individually on each of the continuous features to infer the number of clusters and then bin the features accordingly. This is the approach I am adopting here.
Using Elbow method to determine number of K-Means clusters
One of the key challenges in using K-means is that you don’t know a priori how many clusters to choose. So, if you don’t have guidance from the business on how many bins make sense, use the Elbow method. The basic idea is simple: for the columns of interest, iterate through a range of cluster counts and plot the error (in this case, the Sum of Squared Errors, or SSE). The cluster count where the SSE drops off and stabilizes is the one to pick.
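The loop looks roughly like this; I use a made-up 1-D feature with two obvious groups so the elbow is easy to see (in the post, `x` would be one of the real continuous columns):

```python
import numpy as np
from sklearn.cluster import KMeans

# toy 1-D feature with two well-separated groups (stand-in for e.g. tenure)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]).reshape(-1, 1)

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centre

# plotting sse.keys() vs sse.values() reveals the elbow; for this toy data it is at k = 2
```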
Running this code for our continuous variables (tenure, MonthlyCharges and TotalCharges) gives the following graph:
As seen in the above graph, either 3 or 4 clusters give a good enough result, beyond which the improvement tapers off. So let’s pick 4 clusters for our discretization.
The code assigns each row to a particular bin (0, 1, 2 or 3, because we have 4 clusters) and stores the assignments in new pandas columns in the existing dataframe (identified by the ‘_cluster’ suffix I have chosen).
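A sketch of that assignment plus a quick look at what lands in each bin (the toy values and the single-column loop are my own simplification):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"tenure": [1, 2, 3, 20, 21, 22, 50, 51, 52, 70, 71, 72]})
for col in ["tenure"]:  # in the post this also covers MonthlyCharges, TotalCharges
    km = KMeans(n_clusters=4, n_init=10, random_state=0)
    df[col + "_cluster"] = km.fit_predict(df[[col]])

# inspect what goes into each bin
summary = df.groupby("tenure_cluster")["tenure"].agg(["min", "max", "count"])
```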
If you are curious to know what goes into each of the bins, here is the code for that.
Now that we have the new columns, we can delete the tenure, MonthlyCharges and TotalCharges columns.
Now that the data is ready, we are going to repeat some of the ML steps from the previous week.
Machine Learning
Model Training
I am going to refactor some of the code from the previous week into a separate Python file and call the function from the same directory as the notebook. The custom function is given below.
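Since the refactored file itself is not reproduced here, this is a minimal sketch of what such a helper might contain. The function name, signature and metric set are my own assumptions; the actual helper in the post also produces plots:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def train_and_evaluate(X_train, y_train, X_test, y_test, random_state=0):
    """Fit a RandomForest and return the model plus a dict of Test-set metrics."""
    model = RandomForestClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return model, {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, proba),
    }

# quick smoke test on synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, random_state=0)
model, metrics = train_and_evaluate(X[:150], y[:150], X[150:], y[150:])
```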
Calling the function with the code below will give you a set of output metrics and plots.
In the interest of time, I am going to display and discuss just the Test metrics.
The Test set performance is similar to the previous week.
As discussed in the previous week, we want the confusion matrix’s main diagonal to hold all the counts and the off-diagonal elements (False Positives, False Negatives) to be zero. In our case, the model is picking up most of the No’s but less than half of the churned customers. With the Test accuracy, ROC AUC and F1 score not much different from the previous week’s, the steps we adopted haven’t improved the model performance at all.
In the next week, we will try out more advanced Feature Engineering techniques using Featuretools to see if that helps. Until then, stay tuned!
The next week’s article can be found here