The churn prediction problem — classification made simple!

Mélanie Fouesnard
12 min read · Oct 1, 2023

One very classical application of machine learning classification algorithms is to predict whether or not a customer will churn.

Oh no! The red customer left. How could we have predicted it and taken measures to avoid it?

Simply put, given several customer characteristics, we want to know whether a customer will keep buying from our company or not: this is a perfect application for a binary classification algorithm.

That way, the company can take measures, such as sending promotional emails, to keep the customers who present these characteristics from leaving.

During the third week of the Machine Learning Zoomcamp with the excellent series of videos from Alexey Grigorev in DataTalks.Club, we explore the binary classification problem and learn several key points that I will explain in this article:

  • The main principle of binary classification, compared to regression
  • How to prepare our data in a relevant way (you will see what we mean by that)
  • How to determine which features are important, using churn rate, risk ratio, mutual information and correlation between variables!
  • How to handle categorical variables (those that are not numbers) with one-hot encoding
  • How to train and interpret our binary classification model: logistic regression!

That sounds like a lot! But don't worry, we'll keep it simple and easy to read. Feel free to watch the YouTube videos while you're reading.

Ready? Go!

What is binary classification and how does it differ from regression?

Binary classification consists of classifying elements (such as customers in our example) into two classes (hence the term "binary"). In our case, customers are classified as "churn" or "not churn".

The difference between classification and regression resides in the type of the target value: it is continuous for regression and categorical for classification. If you want to discover more about logistic regression itself, which is the model we will use, you can read this article written by James Thorn or watch this video from Josh Starmer.

Prepare our data

The dataset we will use is available on Kaggle: https://www.kaggle.com/blastchar/telco-customer-churn

You can simply download it and place it at the same level as your notebook. Here, I named it "dataset.csv".

A simple preparation consists of homogenizing the column names and the values of the categorical features:

import pandas as pd

# Load the dataset
df = pd.read_csv('dataset.csv')

# Homogenize column names: we lower the case and replace spaces by underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Select the categorical features: they have the "object" dtype
categorical_features = list(df.dtypes[df.dtypes == "object"].index)

# Homogenize column values for categorical features
for c in categorical_features:
    df[c] = df[c].str.lower().str.replace(' ', '_')

Nice! However, we have to be careful here: columns that should be numerical may contain spaces (missing values), so pandas reads them as object columns and their spaces were also replaced by underscores in the step above. This has to be corrected:

# First, we can check the type of each column after data preparation
df.dtypes

Output:

Output of df.dtypes after data preparation.

We can notably see that “totalcharges” is indicated as an object (categorical feature). However, we expect numerical values (float). Let’s inspect this column:

df['totalcharges']

Output:

df[‘totalcharges’]: we expect float values but we have an object type column.

We indeed have an 'object' type column but we expect only float values. We can check whether we replaced spaces (missing values) with underscores:

df['totalcharges'][df['totalcharges'] == "_"]

Output:

The column contains underscores.

Since the column indeed contains the underscores we introduced previously, we can convert these values to NaN and then replace them with 0:

# Force the type of the column, by ignoring the strings
df['totalcharges'] = pd.to_numeric(df['totalcharges'], errors="coerce")

# Replace NaN (underscores here) by zeros
df['totalcharges'] = df['totalcharges'].fillna(0)

Finally, we can transform the churn column values from 'yes'/'no' to 1/0:

df["churn"] = (df["churn"] == "yes").astype(int)

Feature importance

Churn rate

We can calculate the churn rate for different features. For example, we can determine if the churn rate is different between females and males:

# Calculate churn rates for males and females
churn_male = df["churn"][df["gender"] == "male"].mean()
churn_female = df["churn"][df["gender"] == "female"].mean()

# Print the two values rounded to two digits
print(f"The churn rate for males is equal to {round(churn_male,2)}.")
print(f"The churn rate for females is equal to {round(churn_female,2)}.")

Output:

Churn rates for males and females.

We can see that there is only a negligible difference in churn rate between males and females.

This tells us that the gender feature may not be important for our target variable.

We can compare these values with the global churn rate by calculating, for each feature value, the difference between the global and the group-specific churn rate (a small sketch follows the list below).

  • If this difference is positive (greater than 0), the global churn rate is higher than the group-specific churn rate: these customers are less likely to churn than average.
  • On the contrary, if the difference is negative (lower than 0), the group-specific churn rate is higher than the global churn rate: these customers are more likely to churn.
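
Here is that comparison in code, grouping by one feature (gender here) and assuming the dataframe prepared above:

# Global churn rate vs churn rate per group of a categorical feature (gender here)
global_churn = df["churn"].mean()
group_churn = df.groupby("gender")["churn"].mean()

# Positive difference: the group churns less than average; negative: it churns more
print((global_churn - group_churn).round(3))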

Risk ratio

The risk ratio is obtained by dividing the churn rate for a specific feature value by the global churn rate.

For example, we can calculate the risk ratio for customers that have a partner:

# First, calculate the global churn
global_churn = df["churn"].mean()

# Second, calculate the churn rate for customers with and without a partner
churn_with_partner = df["churn"][df["partner"] == "yes"].mean()
churn_without_partner = df["churn"][df["partner"] == "no"].mean()

# Finally, calculate the corresponding risk ratios and print the results
risk_ratio_with_partner = round(churn_with_partner/global_churn,2)
risk_ratio_without_partner = round(churn_without_partner/global_churn,2)

print(f"The risk ratio for customers who have a partner is equal to {risk_ratio_with_partner}.")
print(f"The risk ratio for customers without partner is equal to {risk_ratio_without_partner}.")

Output:

Obtained risk ratios regarding churn for customers with or without a partner.

Since the risk ratio is higher than 1 for customers who do not have a partner, we can say that customers without a partner are more likely to churn than customers who have one.
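
More generally, you can compute the risk ratio for every value of a categorical feature in one go. Here is a minimal sketch for the contract feature, assuming the dataframe prepared above:

# Risk ratio for each value of the contract feature, relative to the global churn rate
global_churn = df["churn"].mean()
risk_ratios = df.groupby("contract")["churn"].mean() / global_churn

# Values above 1: the group is more likely to churn than average; below 1: less likely
print(risk_ratios.round(2))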

Mutual information

Alexey Grigorev proposed using the mutual information between two variables, computed with sklearn's mutual_info_score. Mutual information measures how much information we gain about one variable by observing another.

We can determine how much information each categorical feature gives us about our target variable (churn):

from sklearn.metrics import mutual_info_score

# Mutual information between churn and a given categorical feature:
def mutual_info_churn_score(series):
    return mutual_info_score(series, df["churn"])

# Use "apply" to compute the score for every categorical feature:
mi = df[categorical_features].apply(mutual_info_churn_score)

# Sort it to see the features that bring the most information about churn:
mi.sort_values(ascending=False)

Output:

The ordered mutual information scores between categorical features and churn target variable.

Note that here we could remove customerid and churn before computing the scores (customerid brings no useful information and churn is the target itself, so it leaks the answer), as well as totalcharges, which is actually numerical. This makes it easier to compare the other variables, such as contract or onlinesecurity.
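
Here is a minimal sketch of that cleanup, reusing the objects defined above (the column names come from the preparation step):

# Recompute mutual information without the identifier, the target and the numerical column
clean_categorical = [c for c in categorical_features if c not in ("customerid", "churn", "totalcharges")]
mi_clean = df[clean_categorical].apply(mutual_info_churn_score)
mi_clean.sort_values(ascending=False)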

Correlation

For numerical variables, we can check their correlation coefficient with churn (and even plot the corresponding correlation matrix!). A quick way to compute these coefficients is sketched right after the list below.

  • If the correlation coefficient is negative (down to -1), churn increases when the feature value decreases (they move in opposite directions).
  • If the correlation coefficient is positive (up to +1), churn increases when the feature value increases (they move in the same direction).
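
Here is a minimal sketch of that quick check with pandas' corrwith; the list of numerical column names is an assumption based on this dataset:

# Correlation of each numerical feature with the churn target (already encoded as 0/1)
numerical_features = ["tenure", "monthlycharges", "totalcharges"]
print(df[numerical_features].corrwith(df["churn"]).round(2))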

Here is the Python code I use to plot the correlation matrix:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def plot_correlation_heatmap(dataframe: pd.DataFrame, title: str):
    """
    Plot a triangle correlation heatmap with a color gradient from blue to red, using the data from the
    provided dataframe. The user should specify the plot title.

    Args:
        dataframe: a pandas dataframe containing the values whose correlations we want to represent
        title: string, the plot title we want to display
    """

    # Set figure size: modify it here or add new function arguments
    plt.figure(figsize=(12, 6))

    # Define the mask to set the values in the upper triangle to True
    mask = np.triu(np.ones_like(dataframe.corr(numeric_only=True), dtype=bool))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    heatmap = sns.heatmap(
        dataframe.corr(numeric_only=True),
        mask=mask,
        cmap=cmap,
        vmin=-1,
        vmax=1,
        annot=True,
        linewidths=0.5,
    )
    heatmap.set_title(title, fontdict={"fontsize": 18}, pad=16)
    plt.show()

We can thus plot the corresponding heatmap for our dataset:

plot_correlation_heatmap(df, "Numerical features correlation")

Output:

Correlation heatmap between the numerical variables of our dataset.

We can see a negative correlation (-0.35) between churn and tenure, which means that customers that have a higher tenure churn less than customers that have a lower tenure value.

A more in-depth study of the correlation between numerical features and variables, by selecting numerical subgroups with specific thresholds, is done in the corresponding Machine Learning Zoomcamp video.

Handle categorical variables with one-hot encoding

One-hot encoding transforms each categorical column into one binary column per distinct value: 1 if the value applies to the customer, 0 otherwise.

This step is necessary because our models only work with numbers: it lets them use the information carried by the categorical features.

For example, if a categorical column has 3 distinct values, we create 3 new columns filled with 0s and 1s: a customer gets a 1 in the column matching their value and a 0 in the other two.
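
To make this concrete, here is a minimal sketch on a toy contract column. It uses pandas' get_dummies purely for illustration; in the rest of the article we use the DictVectorizer:

import pandas as pd

# Toy column with 3 distinct values: one binary column is created per value
toy = pd.DataFrame({"contract": ["month-to-month", "one_year", "two_year", "one_year"]})
print(pd.get_dummies(toy, columns=["contract"]).astype(int))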

There are several ways to encode categorical variables; here, we use the sklearn DictVectorizer.

First, we need to split our data into training, validation and test datasets (respective proportions: 60%, 20%, 20%):

from sklearn.model_selection import train_test_split

# First generate the full training dataset and the test dataset (20%)
full_train, test = train_test_split(df, test_size=0.2, random_state=42)

# Then separate the full training dataset into train and validation datasets. We have to adjust the proportions:
# we need 20% of the total number of rows from the 80% of values, which means we are looking for 0.25 proportion for the validation dataset
train, val = train_test_split(full_train, test_size=0.25, random_state=42)
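
As a quick sanity check, you can verify the resulting proportions (a minimal sketch):

# The split should give roughly 60% / 20% / 20% of the rows
n = len(df)
print(round(len(train) / n, 2), round(len(val) / n, 2), round(len(test) / n, 2))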

Then, do not forget to separate the "churn" target from the other variables. I use the function below, which also handles an optional log transformation of the target (not needed here):

from typing import Tuple

import numpy as np
import pandas as pd


def separate_target_from_features(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    test_df: pd.DataFrame,
    target_name: str,
    log_transformation: bool = False,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, np.array, np.array, np.array]:
    """
    From the three datasets (pandas dataframes: train, validation and test) containing all features and the
    target column, separate features and target to return three pandas dataframes containing only features
    and three arrays containing target values. The target is selected according to the provided target_name.

    Args:
        train_df: pandas dataframe containing all training features and target values
        val_df: pandas dataframe containing all validation features and target values
        test_df: pandas dataframe containing all test features and target values
        target_name: name of the target column
        log_transformation: whether to apply np.log1p to the target values

    Returns:
        train_df: pandas dataframe containing only feature values for the training dataset
        val_df: pandas dataframe containing only feature values for the validation dataset
        test_df: pandas dataframe containing only feature values for the test dataset
        y_train: array containing train target values
        y_val: array containing validation target values
        y_test: array containing test target values
    """

    # Select the target values from the original datasets
    y_train = train_df[target_name]
    y_val = val_df[target_name]
    y_test = test_df[target_name]

    # If the target value should be log transformed:
    if log_transformation:
        y_train = np.log1p(y_train)
        y_val = np.log1p(y_val)
        y_test = np.log1p(y_test)

    # Remove the target column from the feature datasets
    del train_df[target_name]
    del val_df[target_name]
    del test_df[target_name]

    return train_df, val_df, test_df, y_train, y_val, y_test


train, val, test, y_train, y_val, y_test = separate_target_from_features(train, val, test, "churn")

Each record is turned into a dictionary. Then, we fit the DictVectorizer on all the training records. Finally, we transform the records with the fitted DictVectorizer, which creates the new columns.

from sklearn.feature_extraction import DictVectorizer

# Here is a small example with the training data
train_dicts = train.to_dict(orient="records")

# Initiate the DictVectorizer and fit it on the train data dictionary
dv = DictVectorizer(sparse=False)
dv.fit(train_dicts)

# Obtain the transformed features using the fitted DictVectorizer
train_ohe = dv.transform(train_dicts)
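
You can inspect the columns that the DictVectorizer created; one-hot encoded columns are named with the pattern feature=value:

# Show the first generated feature names (e.g. contract=month-to-month)
print(dv.get_feature_names_out()[:10])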

Here is the function I use to perform one-hot encoding using DictVectorizer, from three datasets: training, validation and test. These three datasets are obtained right after train/val/test splitting.

def prepare_train_val_test_feature_matrix_onehot(
    train_df: pd.DataFrame, val_df: pd.DataFrame, test_df: pd.DataFrame
) -> Tuple[np.array, np.array, np.array]:
    """
    From the train, validation and test feature dataframes (one feature per column), compute the corresponding
    feature matrices with one-hot encoding of the categorical columns, using the DictVectorizer from sklearn.
    The DictVectorizer is fit on the training data only.

    Args:
        train_df: pandas dataframe containing the training features (one individual per row, one feature per column)
        val_df: pandas dataframe containing the validation features
        test_df: pandas dataframe containing the test features

    Returns:
        train_ohe: numpy 2D array, the one-hot encoded training feature matrix
        val_ohe: numpy 2D array, the one-hot encoded validation feature matrix
        test_ohe: numpy 2D array, the one-hot encoded test feature matrix
    """

    # First, create dictionaries from the input dataframes obtained after splitting the data
    train_dicts = train_df.to_dict(orient="records")
    val_dicts = val_df.to_dict(orient="records")
    test_dicts = test_df.to_dict(orient="records")

    # Initiate the DictVectorizer and fit it on the train data dictionaries
    dv = DictVectorizer(sparse=False)
    dv.fit(train_dicts)

    # Obtain the transformed features using the fitted DictVectorizer
    train_ohe = dv.transform(train_dicts)
    val_ohe = dv.transform(val_dicts)
    test_ohe = dv.transform(test_dicts)

    return train_ohe, val_ohe, test_ohe

The obtained feature matrices can be used directly to train a model. Note that you can give the entire dataframe as input, including the numerical variables: they pass through unchanged.

from sklearn.feature_extraction import DictVectorizer

train_ohe, val_ohe, test_ohe = prepare_train_val_test_feature_matrix_onehot(train, val, test)

Train and interpret our logistic regression model

In our example, we use the sklearn logistic regression model to predict the churn target variable according to the other features.

Here is the function I use to train a logistic regression model and return the accuracy on the training, validation and test data (it takes as inputs the train, validation and test feature matrices obtained after the one-hot encoding above):

from sklearn.linear_model import LogisticRegression


def logistic_regression_accuracy(
    train: np.array,
    val: np.array,
    test: np.array,
    y_train: np.array,
    y_val: np.array,
    y_test: np.array,
) -> Tuple[float, float, float]:
    """
    Computes the accuracy of a logistic regression model trained on the provided training data, for the training,
    validation and test data. The provided matrices should already be encoded for categorical variables.
    The logistic regression model has these parameters (fixed for this exercise): solver="liblinear", C=10,
    max_iter=1000, random_state=42.

    Args:
        train: numpy array containing the training dataset values (features only)
        val: numpy array containing the validation dataset values (features only)
        test: numpy array containing the test dataset values (features only)
        y_train: numpy array containing the target values for the training dataset
        y_val: numpy array containing the target values for the validation dataset
        y_test: numpy array containing the target values for the test dataset

    Returns:
        score_train: float, the accuracy score obtained on the training dataset
        score_val: float, the accuracy score obtained on the validation dataset
        score_test: float, the accuracy score obtained on the test dataset
    """

    # Train the model:
    model = LogisticRegression(solver="liblinear", C=10, max_iter=1000, random_state=42)
    model.fit(train, y_train)

    # Accuracy on the training dataset
    score_train = model.score(train, y_train)
    print("Training Accuracy Score", round(score_train, 2))

    # Accuracy on the validation dataset
    score_val = model.score(val, y_val)
    print("Validation Accuracy Score", round(score_val, 2))

    # Accuracy on the test dataset
    score_test = model.score(test, y_test)
    print("Test Accuracy Score", round(score_test, 2))

    return score_train, score_val, score_test
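
A minimal usage sketch, assuming the feature matrices and target arrays obtained in the previous steps:

score_train, score_val, score_test = logistic_regression_accuracy(
    train_ohe, val_ohe, test_ohe, y_train, y_val, y_test
)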

Here we used the 'liblinear' solver with C=10 and max_iter=1000. These hyperparameters could be optimized according to the model's ability to predict the churn class.

This is called hyperparameter optimization and is not covered here, but we will discover that soon enough.
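
To give you an idea, here is a minimal sketch of such a search, simply trying a few values of C and comparing validation accuracy (an illustration, not the Zoomcamp procedure):

from sklearn.linear_model import LogisticRegression

# Try several values of the regularization parameter C and compare validation accuracy
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(train_ohe, y_train)
    print(C, round(model.score(val_ohe, y_val), 3))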

It is also interesting to look at the model's coefficients.

As with linear regression, each variable has an associated coefficient (weight) that can be interpreted to determine how the feature influences churn.

from sklearn.linear_model import LogisticRegression

# Train the model:
model = LogisticRegression(solver="liblinear", C=10, max_iter=1000, random_state=42)
model.fit(train_ohe, y_train)

# Get the model coefficients
model.coef_[0].round(3)

Output:

Array corresponding to the weights of each feature, including the one-hot encoded ones.

We can build a dictionary that will contain the feature name as key and the corresponding coefficient as value. To get the feature names, we extract them from the DictVectorizer fitted on the original training data.

# Here we use the train obtained after splitting the original data and removing the target column
train_dicts = train.to_dict(orient="records")
dv = DictVectorizer(sparse=False)
dv.fit(train_dicts)

# Get the dictionary with feature name as key and weight as value
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

Output:

First elements of the obtained dictionary. Each categorical variable is one-hot encoded: we can see that there are three types of contracts, and each one has a distinct weight.

How to interpret those values? Let’s take the contract variable as an example:

  • For a month-to-month contract, the weight is equal to 0.563, which indicates that the probability of churning for the customers that have this type of contract is potentially higher.
  • For a one-year contract, the weight is equal to -0.172. The probability of churning for these customers is likely lower than for the customers that have a month-to-month contract.
  • For a two-year contract, the weight is equal to -0.513. The probability of churning for these customers is likely significantly lower than for the other contract types.

One way to verify this is to check the probability returned by the model using these weights. This is clearly explained in the corresponding video.
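
Here is a minimal sketch of that check, assuming the model and the one-hot encoded validation matrix from the earlier steps. For a binary logistic regression, the predicted probability is the sigmoid of the weighted sum of the features plus the intercept:

import numpy as np

# Churn probabilities predicted by the model (second column of predict_proba)
y_val_proba = model.predict_proba(val_ohe)[:, 1]
print(y_val_proba[:5].round(3))

# The same probabilities computed by hand from the weights and the intercept
manual_proba = 1 / (1 + np.exp(-(val_ohe @ model.coef_[0] + model.intercept_[0])))
print(manual_proba[:5].round(3))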

Our binary classification journey with the ML Zoomcamp ends here! I hope you enjoyed it and found some useful information. If you want to go deeper, you can check the GitHub repository and watch the corresponding videos on YouTube.

PS: did you like this article? Check out the one from last week, where we talked about linear regression!
