Churn Prediction with the Snowflake Cortex Classification ML Function

Continuing my series on using Snowflake Cortex ML functions directly from Snowflake worksheets

The latest addition to the Snowflake Cortex family of ML functions is the Classification function. According to the documentation, it uses machine learning algorithms to sort data into different classes using patterns detected in training data. Binary classification (two classes) and multi-class classification (more than two classes) are supported.

The Snowflake Cortex Classification ML function is not to be confused with Snowflake’s data classification functionality under the data governance category. Data classification is a process that associates Snowflake-defined system tags to columns by analyzing the fields and metadata for personal or sensitive data. In this post, we are discussing machine learning with the Classification ML function, not the data governance functionality.

A common use case for binary classification is churn prediction. Churn prediction is applied across industries to identify customers who are likely to cancel their subscription, close their account, or simply stop using a service. A churn prediction model is trained on historical data about customer characteristics and behavior, together with a flag indicating whether the customer churned. The model is binary because the flag can take exactly two values, TRUE or FALSE.

The trained model can then be used to predict how likely an existing customer is to churn. The results serve as an early warning to the organization: customers with a high probability of churning are identified in advance, and the organization can offer them additional benefits that could deter their churn intention.

To demonstrate the churn prediction model using the Snowflake Cortex Classification ML function, I searched for a suitable dataset with historical data on which to train the model. I found the Telco Customer Churn dataset on Kaggle, which I will use to train a model that predicts the probability of churn in the telco industry.

Preparing the environment and sample data

To prepare an environment for working with the functionality, I created a custom role, a warehouse, a database, a schema, a stage, and granted privileges to the newly created role to enable it to create classification models in a schema:

-- create a custom role
use role useradmin;
create role data_scientist;

-- grant the role to myself so I can use it
grant role data_scientist to user <my current user>;

-- create a warehouse, database, schema, and grant privileges
use role sysadmin;
create database ml_classification_db;
grant usage on database ml_classification_db to role data_scientist;
create schema churn;
grant all on schema churn to role data_scientist;
create warehouse data_science_wh with warehouse_size = 'xsmall';
grant usage on warehouse data_science_wh to role data_scientist;

-- grant the classification model privilege
grant create SNOWFLAKE.ML.CLASSIFICATION
on schema ml_classification_db.churn
to role data_scientist;

-- continue working with the newly created role
use role data_scientist;
use warehouse data_science_wh;
use database ml_classification_db;
use schema churn;

-- create an internal stage
create stage churn_stg;
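
The next step is to get the source csv file into this stage. Besides uploading through the Snowsight UI, one option is the PUT command in SnowSQL; a minimal sketch, assuming a hypothetical local file path:

-- upload the file from the local machine (the path is hypothetical)
put file:///tmp/telco_customer_churn.csv @churn_stg;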

I manually uploaded the Telco Customer Churn csv file to the internal stage. Then I created a csv file format, created a target table named CHURN_TBL whose columns are derived with the INFER_SCHEMA function, and copied the data from the file into the table:

-- create a file format
create or replace file format my_csv_format
  type = 'csv'
  parse_header = true;

-- create the target table using the INFER_SCHEMA function
create or replace table churn_tbl
using template (
  select array_agg(object_construct(*))
  from table(
    infer_schema(
      location => '@churn_stg',
      file_format => 'my_csv_format',
      ignore_case => TRUE
    )
  )
);

-- copy the data from the staged files into the target table
copy into churn_tbl
from @churn_stg
match_by_column_name = case_insensitive
file_format = 'my_csv_format';

Let’s view the data we are working with by selecting from the CHURN_TBL table:
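
-- preview a few rows
select * from churn_tbl limit 10;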

Sample data in the CHURN_TBL table (first set of columns)
Sample data in the CHURN_TBL table (second set of columns)

The data contains a row for each customer, identified by CUSTOMERID, with attributes such as gender, a flag indicating whether the customer is a senior citizen, whether the customer has dependents, the tenure in months, whether the customer uses any of the services provided by the telco operator, the type of contract, the payment method, the monthly charges, the total lifetime charges, and so on.

The last column, named CHURN, contains values TRUE or FALSE indicating whether the customer churned. This column, often referred to as a label or a target in machine learning, is crucial for the model training because it distinguishes between the customers who churned and those who didn’t.

Let’s take a quick look into how many examples of each label we have:

select churn, count(*) as cnt
from churn_tbl
group by churn;
Number of customers that didn’t churn (FALSE) and those who did (TRUE)

From these numbers we see that roughly 25% of the customers churned. This is quite a high percentage; in the real world, churn rates are usually much lower. But a high share of churn examples is beneficial when training the model because it gives us sufficient training data for each value of the label.

Additionally, the dataset used in this example is anonymized and simplified for learning. It contains information about customer characteristics but lacks information about customer behavior. In real-world scenarios, customer behavior is usually more indicative of whether a customer is likely to churn. For example, customers often reduce their use of services gradually over time until they finally churn. They rarely churn based on characteristics alone, such as the payment method, their gender, or which telco services they use.

In real-world scenarios, the training data would include many additional columns with behavioral information, such as the volume of use of each service in the previous months, or flags indicating whether the customer recently changed any of their characteristics, such as the payment method.

But for now, to demonstrate how to train the model and use it to predict churn, we can use the sample dataset.

After training the model, we want to test it to see how well it predicts churn. We will split the data into a training and a test dataset in a 70:30 ratio, meaning that 70% of the sample data will be used for training the model and 30% for testing.

Let’s do some data preparation for modeling. We will use the SAMPLE function to create a table named CHURN_TBL_TRAIN that contains an approximately 70% sample of the data from the CHURN_TBL table for training:

create table churn_tbl_train as
select *
from churn_tbl
SAMPLE (70);

Then we will create a table named CHURN_TBL_TEST with the remaining data for testing:

create table churn_tbl_test as
select *
from churn_tbl
where customerid not in (select customerid from churn_tbl_train);
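
Because the SAMPLE function is probabilistic, the split will not be exactly 70:30. A quick check of the actual row counts:

select
(select count(*) from churn_tbl_train) as train_rows,
(select count(*) from churn_tbl_test) as test_rows;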

We want to exclude the column CUSTOMERID from the training data because this column is not relevant for training the model. We will create a view named CHURN_TRAIN that excludes this column:

create view churn_train as 
select * exclude (customerid) from churn_tbl_train;

With the training and test data prepared, we are ready to train the model.

Training the classification model

We will train a model named CHURN_MODEL by calling the SNOWFLAKE.ML.CLASSIFICATION function, providing the CHURN_TRAIN view as the input data and the CHURN column as the target column:

create SNOWFLAKE.ML.CLASSIFICATION churn_model (
input_data => SYSTEM$REFERENCE('view', 'churn_train'),
target_colname => 'churn'
);

Once we have the model created, we can use it to predict the probability of customer churn on the test data and compare the predicted and the actual values to determine the model accuracy.

To predict the probability of customer churn, we can call the PREDICT method of the CHURN_MODEL on the CHURN_TBL_TEST test data, using the OBJECT_CONSTRUCT() function to combine all columns:

select churn_model!PREDICT(
INPUT_DATA => object_construct(*))
as prediction
from churn_tbl_test;

We can use the entire CHURN_TBL_TEST table, without removing the CUSTOMERID or the CHURN columns, because the PREDICT method of the classification model considers only the columns that were used in training the model and ignores the rest.

The prediction results are a series of JSON documents that look something like this:

Sample output of the PREDICT method of the classification model
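
The structure is along these lines (the values here are illustrative, not taken from the actual run):

{
  "class": "False",
  "probability": {
    "False": 0.92,
    "True": 0.08
  }
}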

Each JSON document contains the “class” key with a value of TRUE or FALSE, indicating whether the customer is likely to churn. In addition, the probabilities of churning and of not churning are provided under the “probability” key.

This output is not very helpful on its own because we don’t know which customer received which probability score. Instead of just calling the PREDICT method of the model, we can combine its output with the columns of the CHURN_TBL_TEST table and save the result for further analysis by creating a table named MODEL_OUTPUT, like this:

create table model_output as
select *,
churn_model!PREDICT(
input_data => object_construct(*))
as prediction
from churn_tbl_test;

Since we already know that the model writes its results as JSON into the PREDICTION column, we can parse the column to extract the CLASS, PROBABILITY_FALSE, and PROBABILITY_TRUE values:

select *,
prediction:"class"::boolean as class,
prediction:"probability"."False"::float as probability_false,
prediction:"probability"."True"::float as probability_true
from model_output;

The output from this query gives all the columns from the CHURN_TBL_TEST table and additional columns that the model predicted. The CLASS column contains a value of TRUE or FALSE, indicating whether the customer is likely to churn, as predicted by the model. Here is some sample output:

Sample output from the MODEL_OUTPUT table, comparing the actual (column CHURN) and predicted (column CLASS) churn flags
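
As a side note, the MODEL_OUTPUT table also supports the business use case mentioned at the beginning of the post. Here is a sketch of listing the customers the model considers most at risk of churning:

select customerid,
prediction:"probability"."True"::float as probability_true
from model_output
where prediction:"class"::boolean
order by probability_true desc
limit 10;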

Evaluating the model

We can compare the CHURN column, representing the actual churn flag, with the CLASS column, representing the predicted churn flag, to see how well the model performed.

Most of the actual and predicted churn flags are the same, implying that the model did well. However, there are a few differences, for example in the highlighted row in the sample output, where the actual flag is TRUE but the predicted flag is FALSE. In the same row, we see that the probabilities of TRUE or FALSE are both close to 0.5, meaning that the customer has an approximately equal probability of churning or not churning, and the churn flag could go either way.

To get a better understanding of how well the model predicted churn, we can create a confusion matrix. This is a table often used to summarize and visualize the performance of a classification model in machine learning.

We can do this easily in the worksheet by selecting from the MODEL_OUTPUT table, choosing Chart in the results pane, selecting Heatgrid as the chart type, using the CHURN and CLASS columns as the row and column variables, and COUNT(CUSTOMERID) as the cell value. Here is the result:

Confusion matrix from the MODEL_OUTPUT table
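
If you prefer plain SQL over the chart, the same counts can be computed with a simple aggregation; a minimal sketch:

select churn,
prediction:"class"::boolean as class,
count(*) as cnt
from model_output
group by churn, class
order by churn, class;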

We can see from the confusion matrix that the model correctly predicted that 1410 customers who did not churn would not churn (such results are referred to as true negatives), and that 285 customers who churned would churn (true positives). But it also incorrectly predicted that 171 customers who did not churn would churn (false positives) and that 270 customers who churned would not (false negatives).

How do we know if these results are good enough that we can use the model to entice the customers who are likely to churn with offers to help them change their minds?

We can calculate metrics that help us evaluate the model using formulas of varying complexity (some examples here). But we don’t have to do any manual calculations, because Snowflake provides evaluation methods that can be called after the model is trained. For example, we can call the SHOW_EVALUATION_METRICS() method that returns evaluation metrics for each class (TRUE and FALSE):

call churn_model!SHOW_EVALUATION_METRICS();

The results from this method look like this:

Evaluation metrics of the churn model

The meaning of some of the metrics is:

  • precision: the fraction of positive predictions made by the model that are correct; the formula is (true positives)/(true positives + false positives)
  • recall: the fraction of actual positives that the model correctly identified; the formula is (true positives)/(true positives + false negatives)
  • f1: combines the precision and recall metrics to measure the model’s accuracy. It is a reliable metric only if the dataset is class-balanced, meaning that there are roughly the same number of rows with TRUE and FALSE labels. The formula is 2*(true positives)/(2*(true positives) + false positives + false negatives).
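
As a quick sanity check, we can plug the TRUE-class counts from our confusion matrix (285 true positives, 171 false positives, 270 false negatives) into these formulas; the results should be close to what SHOW_EVALUATION_METRICS() reports for the TRUE class:

select
285/(285+171) as precision_true,      -- ≈ 0.63
285/(285+270) as recall_true,         -- ≈ 0.51
2*285/(2*285+171+270) as f1_true;     -- ≈ 0.56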

You can find further explanations of the metrics in the Snowflake documentation here.

Generally speaking, a higher value of any of these metrics means the model performance is better. As we can see, the values in the FALSE class are higher than the values in the TRUE class, indicating that the model performs better in predicting customers who will not churn as compared to predicting customers who will churn. This makes sense, since we have many more rows for customers who didn’t churn in the training data, allowing the model to train on a larger amount of data.

There are other methods that we can execute on the trained model, such as the SHOW_FEATURE_IMPORTANCE() method that shows the relative importance of each feature in the trained model:
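
call churn_model!SHOW_FEATURE_IMPORTANCE();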

Output from the SHOW_FEATURE_IMPORTANCE() method

From this output we see that the monthly charges and the tenure are the features that most influence whether a customer will churn. Let’s see how these values are represented in the data by calculating the average monthly charges and the average tenure for each group of customers, those who churned and those who didn’t:

select churn, 
avg(monthlycharges) as avg_monthly_charges,
avg(tenure) as avg_tenure
from churn_tbl
group by churn;
Comparison of average monthly charges and average tenure between customers who churned and those who didn’t

We see that customers with a longer tenure and lower monthly charges are less likely to churn.

Another evaluation method is the SHOW_CONFUSION_MATRIX() method, which returns the results in the form of a confusion matrix, so we don’t have to calculate them ourselves as we did earlier:
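
call churn_model!SHOW_CONFUSION_MATRIX();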

The SHOW_THRESHOLD_METRICS() method is also useful in evaluating the model because it provides data that can be used to plot an ROC graph. This graph shows the performance of a classification model by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.
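
The call pattern is the same as for the other model methods; charting the TPR against the FPR from its output in the worksheet produces the graph below:

call churn_model!SHOW_THRESHOLD_METRICS();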

ROC graph of the model for customers who churned

Because we currently have just one model, the ROC graph doesn’t tell us much. It’s more useful when we train several models, as it helps us identify the best one: a model is better when it has a larger area under the ROC curve.

Improving the model

As with any machine learning model, we want to achieve the best possible results, and there are always ways to improve the model so it gives more accurate predictions.

Although the Snowflake Cortex ML Classification function currently doesn’t allow choosing or modifying the classification algorithm, and we can’t adjust any model parameters, there are still ways to improve the model’s accuracy through how we gather and prepare the training data.

Here are some ideas that might help to improve the model:

  • Use more data. This one is obvious, since the more data we have, the better the model can learn.
  • Use more features. As mentioned earlier, we are working with sample data that contains a limited number of features. More behavioral data would probably benefit the model training.
  • Clean the data. Look for outliers and missing data and treat accordingly. You may want to remove obvious outliers that could negatively affect model training as well as remove or impute missing values.
  • Add features calculated from existing features. For example, you could calculate average amounts over the previous time periods or ratios between features.
  • Remove features that are not relevant for the model. For example, we removed the CUSTOMERID column because it has a high cardinality and does not affect a customer’s intention to churn.
  • Balance the classes. When you have unbalanced classes in the training data, you might include only a sample of the rows in the larger class to bring it closer to the size of the smaller class, as sketched after this list.
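
To illustrate the last idea with our data: since roughly 75% of the customers did not churn, keeping about a third of the non-churned rows would bring the two classes close to a 50:50 balance. A minimal sketch (the 35% sampling rate is an assumption to experiment with, not a recommendation):

-- keep all churned customers plus roughly a third of the non-churned ones,
-- materialized as a table so the sample stays fixed
create or replace table churn_train_balanced as
select * exclude (customerid)
from churn_tbl_train
where churn = true
union all
select * exclude (customerid)
from churn_tbl_train sample (35)
where churn = false;

The model would then be trained on this table instead of the CHURN_TRAIN view, passing SYSTEM$REFERENCE('table', 'churn_train_balanced') as the input data.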

With each attempt to improve the model, we must examine the metrics carefully and evaluate if the model is indeed more accurate.

I’m Maja Ferle, Snowflake Data Superhero and a senior consultant at In516ht. You can get in touch with me on LinkedIn.
