Agile Machine Learning for Classification — Week 1

Shreesha Jagadeesh
8 min read · Sep 29, 2019

--


As discussed in the introduction here, let's get started with a churn-prediction dataset from Kaggle to illustrate end-to-end Machine Learning development, starting with a baseline model in the first week.

Imagine you are a Data Scientist at a telecommunications company providing various subscription plans to customers. You have been tasked with identifying which customers are likely to churn. This will enable the Sales & Marketing team to selectively reach out to those customers and persuade them to stay. The dataset is available here if you would like to follow along.

https://www.kaggle.com/blastchar/telco-customer-churn

We are going to do some preliminary data exploration and cleaning, then create a Random Forest model to serve as our baseline performance. I have used Python3 in a Jupyter notebook running on Paperspace to code these.

Let's import the minimum packages to get started. We will import more packages later in the code.
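The original import cell is a screenshot; as a stand-in, a minimal starting set might look like this (the exact list in the notebook is my assumption):

```python
# Minimal packages to get started; sklearn and plotting
# libraries get imported further down in the notebook.
import numpy as np
import pandas as pd
```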

Define Metadata

Now, define the filepaths, parameters and metadata. I have defined the target variable that we are trying to predict as ‘Churn’.
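A sketch of that metadata cell, with hypothetical variable names (the filename below is the one the Kaggle download uses, but adjust the path to wherever you saved it):

```python
# Filepaths, parameters and metadata (names are illustrative).
DATA_PATH = "WA_Fn-UseC_-Telco-Customer-Churn.csv"
TARGET = "Churn"            # the variable we are trying to predict
SEED = 42                   # seed for reproducibility later on
drop_column_list = []       # troublesome columns to delete, if any
```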

Load data

Import the csv file as a pandas DataFrame and then check out random samples from the dataset along with the head and tail (this is to ensure there is nothing unusual at the top or bottom of the input file, like stray headers).
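In the notebook this reads the csv from disk; here is a self-contained sketch using a three-row toy stand-in for the Kaggle file (the sample rows are made up, but the column names match the real dataset):

```python
import io
import pandas as pd

# Toy csv mimicking the Kaggle file so this snippet runs on its own;
# note the blank TotalCharges value, which the real file also contains.
csv_text = """customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn
0001-A,Male,1,29.85,29.85,No
0002-B,Female,34,56.95,1889.5,No
0003-C,Male,2,53.85, ,Yes
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="customerID")

print(df.sample(2))  # random rows
print(df.head(2))    # nothing unusual at the top...
print(df.tail(2))    # ...or at the bottom
```

With the real file you would call `pd.read_csv(DATA_PATH, index_col="customerID")` instead of the `StringIO` wrapper.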

Initially loaded pandas DataFrame

Notice how customerID is the index column. In your datasets, make sure to either delete these ID columns or keep them as indices to prevent data leakage. The column names shown above are fairly self-explanatory (there are a few more that I couldn't show because of the limitations of the screenshot).

If you examine the dataset, you will see that there are 20 columns including the target. There are 3 features that are numerical ('tenure', 'MonthlyCharges' & 'TotalCharges') while the rest are categorical. We will come back to the encoding later in the notebook.

Inspect DataFrame structure

If you run df.info() as shown above, it will print all the features, their associated data types, the total number of non-null values in each feature, the number of rows and the memory usage. It's a quick way to see if there are a) null values or b) columns that have been misinterpreted by pandas.

It turns out that the values within the 'TotalCharges' feature were being interpreted as an object even though we know they have to be a numeric type like int or float. Given that the number of non-null values (7043) matches the number of rows (7043), we can rule out null values as the cause. Could there be strings in the data? Let's print out the unique values.
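A minimal sketch of that check, on a toy column that reproduces the symptom (strings plus a blank, as in the real file):

```python
import pandas as pd

# TotalCharges arrives as text, mimicking the real dataset.
df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "1840.75"]})

print(df["TotalCharges"].dtype)     # object, not float64
print(df["TotalCharges"].unique())  # the values are quoted strings
```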

Aha! The numbers are being read as strings by pandas. At this stage, we can either go back and open the file in Excel to manually change them to numbers, or do it programmatically in Python. I am going to handle this data type conversion later in the notebook because real-life datasets don't tend to be this small and you can't open them in Excel to fix manually.

Checking for Null Values

One of the most crucial things in Data Science is dealing with null values. Check how many features have null values.
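The null check itself is one line; here is a sketch that also shows why it misses the blanks in 'TotalCharges' (toy data, with one blank planted):

```python
import pandas as pd

df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "TotalCharges": ["29.85", "1889.5", " "],  # a blank hides here
})

print(df.isnull().sum())  # reports zero NaNs per column...
# ...but a blank string is not a NaN, so count those separately:
print((df["TotalCharges"].str.strip() == "").sum())
```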

This column has a blank space in it

There are no null values in the dataset; however, there are blanks in the 'TotalCharges' column. In Week 2, I am going to present code that shows you how to effectively impute null values with statistical measures such as the mode. For this week, I will just delete them (later in the notebook).

Checking Unique Values

Figure out how many unique values there are in each column to help guide our decision making on what to encode. For example, if a categorical variable has extremely high cardinality, it would present a problem to tree-based algorithms and we would need to compress the levels.
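A sketch of the cardinality check on a toy frame; pandas' `nunique` does the counting:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "Contract": ["Month-to-month", "Two year", "One year", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Unique values per column, highest cardinality first.
print(df.nunique().sort_values(ascending=False))
```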

Number of unique values in each column

As seen above, the numerical columns such as 'tenure', 'MonthlyCharges' and 'TotalCharges' have a high number of unique values, which makes sense. The categorical columns have low cardinality (2, 3 or 4), which is good for classification algorithms.

Data Reduction

In this example, there are no troublesome columns (which we could delete by filling in the drop_column_list) such as timeseries/latitude/longitude. But as mentioned before, 'TotalCharges' has blanks in it. Just delete those rows for now and we will bring them back in future weeks.
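A sketch of this reduction step, assuming an empty drop_column_list and a blank-string filter on 'TotalCharges' (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "tenure": [1, 2, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})
drop_column_list = []  # nothing troublesome in this dataset

df = df.drop(columns=drop_column_list)
# Delete rows where TotalCharges is blank (imputation comes in Week 2).
df = df[df["TotalCharges"].str.strip() != ""]
print(len(df))  # one fewer row
```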

Data type conversion

It's quite common that what you load may contain columns with prefixes and suffixes such as '$' for dollars or '%' for percentages. In these situations, it makes sense to clean these out.

Remember the ‘TotalCharges’ column that had strings? I have converted it into float above. The numerical columns are now ‘ready’ for machine learning but we still need to encode the categorical variables.
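One way to sketch that conversion, using `pandas.to_numeric` after stripping stray whitespace (the notebook's exact approach is not shown in the screenshots):

```python
import pandas as pd

df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", "1840.75"]})

# Strip stray characters, then cast; errors='raise' (the default)
# surfaces any value that still refuses to parse.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"].str.strip())
print(df["TotalCharges"].dtype)  # float64
```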

Now that we have numerical features, can we feed them directly into a tree-based model like Random Forest? Yes, even without scaling or normalizing. Scaling squeezes the data into a tight range, but that is not required for trees.

Categorical Encoding of Object columns

Most machine learning algorithm implementations require numerical input, which is why we encode categorical variables. For tree-based algorithms, it has been observed that one-hot encoding can actually make performance worse. It's better to just do a simple label encoding.

I have identified what the object columns are and then iterated through them, each time converting each categorical level into a number. A snippet of the output looks like below
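A sketch of that loop, assuming sklearn's LabelEncoder (one plausible implementation; the notebook cell itself is a screenshot):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "tenure": [1, 34, 2],
})

# Identify the object columns, then label-encode each one in place.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.head())  # every column is now numeric
```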

Categorical encoding of the object columns

Notice how all of the columns are now encoded as numbers? This is what we want. In fact, this is true of computer science in general: computers can only understand numbers, so at some level anything else (timestamps, words, etc.) has to be encoded as numbers.

Determining Baseline Accuracy

Before we run our Random Forest or any Machine Learning algorithm, it is important to get an intuitive sense of what the default accuracy will look like. For example, in a perfectly balanced dataset with the distribution of 0s and 1s being 50:50, a randomly guessing model will get 50% accuracy. In general, the baseline accuracy is the percentage occurrence of the majority class. If your model performs worse than random guessing, then go back to the drawing board and get more data!

Histogram of class distributions

In this particular instance, the baseline accuracy to beat will be 73% (because that is the percentage occurrence of the non-churn class).
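The baseline is just the majority-class share, which `value_counts` gives directly; a sketch on a toy target with the article's 73:27 split:

```python
import pandas as pd

y = pd.Series(["No"] * 73 + ["Yes"] * 27)  # toy 73:27 churn distribution

# Share of the majority class = accuracy of always predicting it.
baseline_acc = y.value_counts(normalize=True).max()
print(f"Baseline accuracy to beat: {baseline_acc:.0%}")
```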

Feature-Target & Train-Test Split

Let's split up the cleaned dataframe into X and y.

Sidenote: Have you ever wondered why the feature space is denoted by uppercase 'X' while the target class is always referred to as lowercase 'y'? This comes from linear algebra conventions, where we denote vectors with lowercase letters and multidimensional arrays with uppercase.

I have also split the data into train and test sets with an 80:20 split. Note: it is very important in real life to set the seed for reproducibility. Otherwise, your results may not be replicable by other people, and they may view them with suspicion if they get lower accuracy than you do.
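A sketch of the split with sklearn's `train_test_split`, on a toy feature matrix (the fixed `random_state` is the seed mentioned above; whether the notebook also stratified on y is my assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned feature matrix and target.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80:20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```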

Machine Learning Modelling and Prediction

Finally we get to the exciting part! Let's initialize a Random Forest classifier from sklearn with default parameters and fit it on the train set. Then create predictions on both the train and test sets to see the accuracy scores.
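A self-contained sketch of that fit-and-score cell, on synthetic data (the real notebook uses the encoded churn features, of course):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded churn features.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)  # default hyperparameters
clf.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```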

RandomForest classifier with default params

As seen above, the model performs better than the baseline. The noticeably lower score in the Test set compared to the Train set is because of overfitting which we will rectify in the subsequent weeks after we bring up the accuracy first.

For classification problems, we are rarely interested in the accuracy itself. Instead, what we want are other metrics like the F1 score which indicates how well the model performs on each of the classes.
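Getting that per-class breakdown is one call to sklearn's `classification_report`; a sketch with toy labels:

```python
from sklearn.metrics import classification_report

# Toy true labels vs predictions.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0]

# Precision, recall and F1 for each class, plus support counts.
print(classification_report(y_true, y_pred))
```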

Classification Report showing the F1 Scores

Our model is significantly more accurate in predicting 0 (not a churn) than predicting 1 (churn).

Visualizing Predictions

The above function defines the plotting format of the confusion matrix as a 2x2 grid with the correct labels. Note: you can also print the confusion matrix for the train set, but because of the previously identified overfitting, it would overestimate the model's generalizability.

Printing the confusion matrix will make it easier to visualize how well the model is performing with respect to the True Positives/False Negatives
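The matrix itself comes from sklearn's `confusion_matrix`; a minimal sketch with toy labels (the article's plotting wrapper around it is not reproduced here):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```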

Confusion Matrix for the Test Set

How do you go about interpreting the above matrix? Well, a 'perfect' model would have 0s in the lower left (False Negatives) and the top right (False Positives), with numbers only on the main diagonal (top left and bottom right). But our model has missed 208 churn customers (FN), which it incorrectly predicted as not-churn, while misclassifying 92 non-churn customers (FP) as churn.

This balance between FP and FN has business implications. The fundamental tradeoff is the cost of missing a potential churn customer vs. marketing wrongly to a non-churn customer. For example, if the value of each retained churn customer is $1000, the 166 correctly identified customers (bottom right, True Positives) will result in a gain = True Positives * dollar value per True Positive = $166k.

If the value of the discount offered by Marketing to customers identified by the model is $100/customer, then the lost dollar value from the False Positives = False Positives * dollar cost per False Positive = $9.2k.
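The arithmetic above, spelled out (the dollar values are the illustrative figures from the text, not real business numbers):

```python
# Counts from the article's test-set confusion matrix.
tp, fp = 166, 92
value_per_retained_customer = 1000  # $ per churner the model catches
cost_per_discount = 100             # $ per wrongly targeted customer

gain = tp * value_per_retained_customer  # $166,000
loss = fp * cost_per_discount            # $9,200
print(f"Net savings: ${gain - loss:,}")  # roughly $157k
```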

At this stage, you can go back to the business stakeholders and jointly figure out whether you need to tune the model more or if the results are sufficient to productionalize. There are of course many more factors involved in productionalizing the model than what I illustrated above (I will cover deployment of ML models in a separate series of articles in November). However, knowing that a default model yields positive savings of about $157k/year is reassuring.

Hope you found this workflow insightful. Thanks for reading this article, and please share it with your friends and colleagues who might be interested. If you would like the source code, please drop me a message with your email and I will send it to you. Stay tuned for 6 more weeks of Data Science!

The next week’s article can be found here

--

Shreesha Jagadeesh

I am an AI leader helping tackle high-impact business problems with the help of cutting edge Personalization techniques. You can connect on LinkedIn