Using GCP’s AI Platform to Predict Customer Churn
Developing a classification model to address customer churn — by Jessica Jestings & Venky Bharadwaj
One of the most important tasks a company faces is not just acquiring new customers but retaining existing ones. Customer retention is critical to a company’s growth and success, as the cost of acquiring a new customer is often greater than the cost of retaining an existing one.
Customer churn can be addressed by 1) predicting at-risk customers and intervening with proactive support, and 2) learning what is driving customers to leave and developing strategies to prevent future churn. We used Google Cloud Platform (GCP) to develop a customer churn model and Looker to explore the results to understand the key drivers of churn for a Telecom company. We leveraged the following services in our solution:
- Google Cloud Storage – A scalable object storage service
- Google BigQuery – A fully managed, serverless, pay-as-you-go data warehouse solution
- Google AI Platform – A one-stop shop to build and deploy models within the GCP infrastructure
- Looker – A web-based visualization tool
Telecom Customer Data
For this use case, we used customer data from a Telecom company. The data is a one-month snapshot of customer information, including customers who left in the last month, registered services (speed, contract length), and account information (tenure, monthly payments, support interactions, total spend). The dataset represents three sources: Service (Internet, Phone, Security, etc.), Account, and Customer Demographic information. Using this data, we built a logistic regression model to predict whether a customer would churn (yes/no) in the following month.
AI Platform
The bulk of our solution leverages GCP’s AI Platform, which provides a portal into GCP’s suite of machine learning services. The AI Platform has five main components: AI Hub, Data Labeling, Notebooks, Jobs, and Models. This example utilizes the Notebooks and Models capabilities of GCP’s AI Platform.
- Notebooks — Provides the ability to spin up JupyterLab servers, pre-built with all the general machine learning frameworks needed. Enables scaling up or down on hardware, connecting to a compute cluster, and connecting to other services within the GCP ecosystem.
- Models — Provides a model repository for model version control and monitoring model deployments and availability. Enables endpoint setup to allow models to be called in serverless functions.
GCP Pipeline & Process Overview
Google Cloud Platform makes it easy to navigate between services and thread them together into a pipeline for an analytics project. Our process included five major steps: landing the data in Google Cloud Storage, loading the data into BigQuery, building a model in AI Platform, deploying the model, and visualizing the results in Looker.
Step 1: Clean & Land the Data
We landed the dataset in Google Cloud Storage (GCS). GCS is often used as a data lake that houses raw data; in production, this raw data would be collected from a variety of business workstreams. We used an AI Notebook environment to write a script that merges and cleans the data to meet analytics and data warehousing standards. The cleaned data is then staged back into GCS as a CSV file.
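Below is a minimal sketch of what that merge-and-clean script could look like. The bucket names, file names, and the customerID join key are illustrative placeholders, and reading gs:// paths with Pandas assumes gcsfs is available in the notebook environment.

```python
import pandas as pd
from google.cloud import storage

# Read the three raw extracts (paths and file names are placeholders)
services = pd.read_csv("gs://raw-bucket/services.csv")
accounts = pd.read_csv("gs://raw-bucket/accounts.csv")
demographics = pd.read_csv("gs://raw-bucket/demographics.csv")

# Merge on the shared customer key and apply basic cleaning
df = services.merge(accounts, on="customerID").merge(demographics, on="customerID")
df = df.drop_duplicates(subset="customerID")
df.columns = [c.strip().replace(" ", "_") for c in df.columns]  # warehouse-friendly names

# Stage the cleaned snapshot back into GCS as a CSV
storage.Client().bucket("staging-bucket").blob("churn/clean_churn.csv") \
    .upload_from_string(df.to_csv(index=False), content_type="text/csv")
```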
Step 2: Load Data into BigQuery
BigQuery is GCP’s serverless, petabyte-scale, pay-as-you-go data warehouse. The BigQuery console lets us select our staged data in GCS, and BigQuery can auto-detect the schema of the file, making the load a simple point-and-click operation.
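The same load can also be scripted from the notebook with the BigQuery client library. Here is a rough equivalent, with the project, dataset, and table names as placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load the staged CSV from GCS, letting BigQuery auto-detect the schema
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://staging-bucket/churn/clean_churn.csv",   # staged file from Step 1
    "my-project.telecom.customer_churn",           # placeholder table ID
    job_config=job_config,
).result()  # wait for the load job to finish
```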
Step 3: Building the Model
The AI Notebook environment provides a BigQuery extension as part of the GCP Python SDK. We used this client to query BigQuery and load the data into a Pandas data frame. Querying BigQuery directly keeps a single source of truth for the business, rather than reloading intermediate staged data from GCS. With the data in a Pandas data frame, we can iterate quickly and execute our model experiments locally in the AI Notebook environment.
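A minimal sketch of that query step, assuming the placeholder table created in Step 2:

```python
from google.cloud import bigquery

# Pull the customer snapshot out of BigQuery into a Pandas data frame
client = bigquery.Client()
df = client.query(
    "SELECT * FROM `my-project.telecom.customer_churn`"
).to_dataframe()
```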
We approached customer churn as a binary classification problem: churn or no-churn. We used the SciKit-Learn framework to create a model pipeline object containing a preprocessor and a logistic regression predictor.
The preprocessor performs imputation and dummy encoding of categorical variables, as well as standardizing the continuous numeric features. The preprocessed output feeds into a logistic regression model. We chose logistic regression due to its interpretability to understand the main drivers of customer churn.
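A sketch of that pipeline is below, continuing from the data frame loaded in the previous step. The feature lists and the Churn label encoding are illustrative placeholders; the real column names come from the merged Telecom dataset.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative feature lists; the real ones come from the merged dataset
numeric_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = ["Contract", "InternetService", "PaymentMethod"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Hold out a test set for threshold tuning and evaluation
X = df[numeric_features + categorical_features]
y = (df["Churn"] == "Yes").astype(int)   # placeholder label encoding
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model.fit(X_train, y_train)
```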
The classifier performs best when tuned against a cost function that weighs predicted results against actual outcomes: for example, the benefit of a true positive (a customer predicted to churn who does churn) versus the cost of a false negative (a customer predicted to stay who churns). For logistic regression, this means adjusting the decision threshold and choosing the value that scores best on the cost function. Plotting the confusion matrix with Matplotlib helps guide our intuition on how the model is scoring and where its errors fall.
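One way to perform this threshold search, continuing the sketch above and using made-up relative costs for false negatives and false positives:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Assumed business costs: a missed churner (false negative) hurts more than
# an unnecessary retention offer (false positive)
COST_FN, COST_FP = 10, 1

probs = model.predict_proba(X_test)[:, 1]

def total_cost(threshold):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    return COST_FN * fn + COST_FP * fp

# Pick the threshold that minimizes total cost on the hold-out set
best = min(np.linspace(0.05, 0.95, 19), key=total_cost)

# Plot the confusion matrix at the chosen threshold to inspect the model's bias
ConfusionMatrixDisplay.from_predictions(y_test, (probs >= best).astype(int))
plt.show()
```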
After model exploration and tuning, the SciKit-Learn model pipeline is exported with the joblib module to keep the trained pipeline reproducible.
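For example (bucket and object paths are placeholders; AI Platform’s scikit-learn runtime expects the artifact to be named model.joblib):

```python
import joblib
from google.cloud import storage

# Serialize the fitted pipeline and stage it in GCS for deployment
joblib.dump(model, "model.joblib")
storage.Client().bucket("staging-bucket") \
    .blob("churn_model/model.joblib") \
    .upload_from_filename("model.joblib")
```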
Step 4: Deploying the Model
With the exported model object saved in GCS, we are ready to deploy the model as a serverless function via the Models section of the AI Platform. To deploy, we supply the GCS directory containing the model artifact, the runtime version, any custom setup packages, and the model name. The model is then callable via a REST API, making it accessible as an independent microservice.
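A rough sketch of calling the endpoint with the Google API client library is below. The project, model, and version names are placeholders, and the instances must be formatted the way the exported pipeline expects its input rows.

```python
from googleapiclient import discovery

def predict_churn(instances):
    """Score a batch of customer feature rows against the deployed model."""
    service = discovery.build("ml", "v1")
    name = "projects/my-project/models/churn_model/versions/v1"  # placeholder
    response = service.projects().predict(
        name=name,
        body={"instances": instances},
    ).execute()
    return response["predictions"]
```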
In the AI Platform, a model is a reference to the endpoint that is called for different business scenarios; each model can hold multiple versions, so version control can be managed independently for each one.
After calling the model, we write the inferences back to BigQuery via the BigQuery client library. The results of the customer churn model are then available to business users via a Looker dashboard, where they can review at-risk customers and decide what preventative measures to implement.
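A minimal sketch of that write-back, assuming the predictions returned by the call above and a placeholder results table:

```python
import pandas as pd
from google.cloud import bigquery

# Join the model output back to customer IDs and load it into a results
# table that the Looker dashboard reads (table ID is a placeholder)
results = pd.DataFrame({
    "customerID": customer_ids,        # IDs of the scored customers
    "churn_prediction": predictions,   # output of the deployed model
})
bigquery.Client().load_table_from_dataframe(
    results, "my-project.telecom.churn_predictions"
).result()
```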
Step 5: Tracking Customer Churn
In order to easily track and manage customer churn, we created a Looker dashboard with the model output and source data by connecting directly to BigQuery. The dashboard provides business users insights into the customers at risk of churn along with the customer information.
The Looker-to-BigQuery connection is enabled by creating a Looker-specific service account with access to query BigQuery directly and write temporary tables. With a live connection, the dashboard displays updated results as soon as the model writes new output.
This dashboard highlights customers who are predicted to churn and allows users to dig into detailed account-level information. Overall, this dashboard enables a team to gain insights on at-risk customers and develop strategies to prevent future churn.
Leveraging GCP and Looker, businesses can quickly gain insight into customer churn and develop effective mitigation strategies. GCP provides the tools to develop models that generate churn predictions, and Looker makes it fast and easy to build dashboards that display the results. Surfacing this data helps users understand the drivers of churn, leading to better business decisions and customer retention. Identifying and addressing churn is key to success in many industries, from subscription services to B2B sales to retail, and the same approach can even be leveraged internally to improve employee satisfaction.
Jessica Jestings (L) and Venky Bharadwaj (R) are Consultants in Slalom’s Data & Analytics practice in Atlanta.
Slalom is a modern consulting firm focused on strategy, technology, and business transformation.