How to build an end-to-end propensity to purchase solution using BigQuery ML and Kubeflow Pipelines

Damodar Panigrahi
Google Cloud - Community
13 min read · Sep 8, 2020

The propensity to purchase use case is widely applicable across many industry verticals such as Retail, Finance and more. In this article, we will show you how to build an end-to-end solution using BigQuery ML and Kubeflow Pipelines (KFP, a Machine Learning Operations (MLOps) workflow tool) on a Google Analytics dataset to determine which customers have the propensity to purchase. You could use the solution to reach out to your targeted customers in an offline campaign via email or postal channels. You could also use it in an online campaign via an on-the-spot decision, while the customer is browsing your products on your website, to recommend products or trigger a personalized email for the customer.

The propensity to purchase use case is a subset of the personalization use case and a key driver of how many organizations do marketing today. In today's changing times, you need to ensure that you are targeting the right messages to the right customers at the right time. "Personalization at scale has the potential to create $1.7 trillion to $3 trillion in new value" (McKinsey study). Propensity modeling helps companies identify the "right" customers and prospects that have a high likelihood to purchase a particular product or service.

Photo by rupixen: https://unsplash.com/photos/Q59HmzK38eQ

Propensity models are important because they provide a mechanism for targeting sales outreach with personalized messages, which is key to getting customers' attention. By using a propensity to purchase model, you can more effectively target customers who are most likely to purchase certain products.

A typical end-to-end solution architecture and implementation steps

You will select features, create a label (which tells you if a customer has the propensity to purchase), build a model, and run batch/online predictions in BigQuery using BigQuery ML. BigQuery ML enables you to create and execute machine learning models in BigQuery by using standard SQL queries. This means you don't need to export your data to train and deploy machine learning models: by training, you're also deploying in the same step. Combined with BigQuery's auto-scaling of compute resources, you won't have to worry about spinning up a cluster or building a model training and deployment pipeline. This means you'll be saving time building your machine learning pipeline, enabling your business to focus more on the value of machine learning instead of spending time setting up the infrastructure.

To automate this model-building process, you will orchestrate the pipeline using Kubeflow Pipelines, ‘a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.’

Below is a picture of how the entire solution works:

The solution involves the following steps:

  1. Identify the data source with past customers' purchase history and load the data to BigQuery
  2. Prepare the data for Machine Learning (ML) tasks.
  3. Build an ML model which determines the propensity of a customer to purchase
  4. Join the customer data with a CRM system to gather customer details (e.g. email id, etc.)
  5. Determine which product(s) we should recommend to the customer
  6. Launch a channel campaign with the above data
  7. Manage the lifecycle of the customer communication (e.g. email) in the CRM or equivalent
  8. Refine the Customer Lifetime Value from the result of the campaign
  9. Monitor the models to ensure that they are meeting the expectations
  10. Retrain the model, if necessary, based on either new dataset or rectified dataset

We are now going to discuss the steps in detail below. There is also an accompanying notebook which implements the first 3 steps of the solution.

Identify the data source with past customers' purchase history and load it to BigQuery

Where does your data reside? Determine the best pre-processing techniques to bring the data to BigQuery. You can automate the pre-processing in an MLOps pipeline, which you will see later in the article. The dataset you are going to use is hosted on BigQuery and provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise.

Screenshot from the Google Merchandise Store.

Here’s a sample of some of the raw data from Google Analytics:

Prepare the data for ML tasks

Now that you have identified the dataset, you start to prepare the data for your ML model development. Select the appropriate features and labels if you want to use supervised learning. For this article you will use a couple of features for demonstration purposes.

The query below will create the training data, features (`bounces`, `time_on_site`) and a label (`will_buy_on_return_visit`) that you will use to build your model later:

## follows schema from https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089

# select initial features and label to feed into your model
CREATE OR REPLACE TABLE bqml.rpm_ds.rpm_ds_propensity_training_samples_tbl OPTIONS(
  description = "Google Store curated Data"
) AS
SELECT
  fullVisitorId,
  bounces,
  time_on_site,
  will_buy_on_return_visit  # <--- your label
FROM
  # features
  (
    SELECT
      fullVisitorId,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site
    FROM
      `data-to-insights.ecommerce.web_analytics`
    WHERE
      totals.newVisits = 1
      AND date BETWEEN '20160801' AND '20170430'
  )  # train on first 9 months
  JOIN (
    SELECT
      fullvisitorid,
      IF(
        COUNTIF(
          totals.transactions > 0
          AND totals.newVisits IS NULL
        ) > 0,
        1,
        0
      ) AS will_buy_on_return_visit
    FROM
      `bigquery-public-data.google_analytics_sample.*`
    GROUP BY
      fullvisitorid
  ) USING (fullVisitorId)
ORDER BY
  time_on_site DESC  # order by most time spent first

Below is the partial result of running the above query:

Features

Build an ML model which determines the propensity of a customer to purchase

Which model type should we use? What are all the features to use? What should be the hyperparameter set for the model? These are typical challenges data scientists face.

You need to classify whether a customer has the propensity to purchase, so this is a classification task. One commonly-used classification model is logistic regression. You will build a logistic regression model using BigQuery ML.

The query below will create the model:

CREATE OR REPLACE MODEL `rpm_ds.rpm_bqml_model` OPTIONS(
  MODEL_TYPE = 'logistic_reg',
  labels = [ 'will_buy_on_return_visit' ]
) AS
SELECT
  * EXCEPT (fullVisitorId)
FROM
  `bqml.rpm_ds.rpm_ds_propensity_training_samples_tbl`

Once the model is created, you can check how well the model performs against certain rules. We have used some rules of thumb (e.g. ROC AUC > 0.9), but you can adjust them based on your specific needs.

The query below will evaluate the model to check for expected ROC AUC:

SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'decent'
    WHEN roc_auc > .6 THEN 'not great'
    ELSE 'poor'
  END AS modelquality
FROM
  ML.EVALUATE(MODEL `rpm_ds.rpm_bqml_model`)

Running the query produces the following output:

Do you really think we managed to build a good model? Most likely not: a ROC AUC of 78% is considered only a fair score. It takes a lot of domain knowledge, hyperparameter tuning, feature engineering, etc. to build a good model. This is not the best model you could create, but it gives you a baseline. As the focus of this article is to build an end-to-end pipeline rather than to maximize model performance, fine-tuning the model is outside the scope of this article.
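If you do want to experiment, BigQuery ML exposes several training options for logistic regression. Below is a minimal sketch of a variation of the CREATE MODEL statement; the option values are illustrative and untuned, not recommendations:

# illustrative training options for logistic regression (values are untuned)
CREATE OR REPLACE MODEL `rpm_ds.rpm_bqml_model` OPTIONS(
  MODEL_TYPE = 'logistic_reg',
  labels = [ 'will_buy_on_return_visit' ],
  l2_reg = 0.1,                     # L2 regularization strength
  max_iterations = 20,              # cap on training iterations
  early_stop = TRUE,                # stop when the loss stops improving
  data_split_method = 'AUTO_SPLIT'  # let BigQuery ML hold out an eval split
) AS
SELECT
  * EXCEPT (fullVisitorId)
FROM
  `bqml.rpm_ds.rpm_ds_propensity_training_samples_tbl`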

The trained model can assist you in reaching out to your customers in an offline campaign or in an online campaign. We will use batch prediction for the former and online prediction for the latter.

You can use the trained model for batch prediction of a large dataset in BigQuery. Or you could deploy the model in Google Cloud AI Platform for online prediction.

Use batch prediction for offline campaign

For the offline campaign scenario, you can do asynchronous batch prediction on a large dataset. To do so, you create a table in BigQuery containing all the inputs you want to predict on. In this article you will create a table `rpm_ds_propensity_prediction_input_tbl` in BigQuery, with each row representing one customer with the bounces and time_on_site features, and then use the trained model to predict for all the rows.
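Below is one possible way to populate that input table, reusing the same source as the training data but with a later date range. The date range (and using this source at all) is an assumption for illustration; in practice the inputs would come from your own analytics or CRM data:

# build an illustrative prediction input table from newer visits
CREATE OR REPLACE TABLE bqml.rpm_ds.rpm_ds_propensity_prediction_input_tbl AS
SELECT
  fullVisitorId,
  IFNULL(totals.bounces, 0) AS bounces,
  IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
  `data-to-insights.ecommerce.web_analytics`
WHERE
  totals.newVisits = 1
  AND date BETWEEN '20170501' AND '20170630'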

The query below shows the batch prediction:

# predict the inputs (rows) from the input table
SELECT
  fullVisitorId,
  predicted_will_buy_on_return_visit
FROM
  ML.PREDICT(MODEL rpm_ds.rpm_bqml_model,
    (
      SELECT
        fullVisitorId,
        bounces,
        time_on_site
      FROM
        bqml.rpm_ds.rpm_ds_propensity_prediction_input_tbl
    ))

Below is the partial result of running the above query:

In the output above, the model predicted that these four customers have the propensity to purchase, as the `predicted_will_buy_on_return_visit` column is 1.

Is the model right that each of the above customers has the propensity to purchase? Maybe. To be sure, you need to dig deeper. You might want to check the features, parameters, threshold (which is 0.5 by default in the above ML.PREDICT), etc. to tune the model.
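For example, below is a minimal sketch of the same batch prediction with a custom classification threshold passed to ML.PREDICT; the value 0.6 is illustrative only:

# predict with a custom classification threshold instead of the default 0.5
SELECT
  fullVisitorId,
  predicted_will_buy_on_return_visit
FROM
  ML.PREDICT(MODEL rpm_ds.rpm_bqml_model,
    (
      SELECT
        fullVisitorId,
        bounces,
        time_on_site
      FROM
        bqml.rpm_ds.rpm_ds_propensity_prediction_input_tbl
    ),
    STRUCT(0.6 AS threshold))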

Use online prediction for online campaign

For the online campaign scenario, we need to deploy the model to Cloud AI Platform. It is a two-step process: first you export the model, then you deploy it to Cloud AI Platform, which exposes a REST endpoint to serve online predictions.

Below is the command to export the BigQuery ML model to Google Cloud Storage:

# export the model to a Google Cloud Storage bucket
bq extract -m rpm_ds.rpm_bqml_model gs://bqml_sa/rpm_data_set/bqml/model/export/V_1

Second, you need to deploy the exported model to Cloud AI Platform Prediction, which hosts your trained model, so that you can send the prediction requests to it.

Below is the command to deploy the model to Cloud AI Platform Prediction:

# deploy the above exported model
gcloud ai-platform versions create V_1 --model=rpm_bqml_model --framework=tensorflow --python-version=3.7 --runtime-version=1.15 --origin=gs://bqml_sa/rpm_data_set/bqml/model/export/V_1/ --staging-bucket=gs://bqml_sa
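Note that the version is created under an existing model resource. If the rpm_bqml_model resource doesn't exist yet in your project, you would create it first with something like the command below (the region is an example):

# create the model resource that the version will be attached to
gcloud ai-platform models create rpm_bqml_model --regions=us-central1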

Now, you can predict online via a web request/response. You could use the endpoint in your web app to take an on-the-spot action, such as displaying personalized content, or to trigger an async process, such as sending a personalized email or a postcard. Below is a command with which you can quickly test the online prediction:

# Perform online prediction
# create a json file (input.json) with the below content
{"bounces": 0, "time_on_site": 7363}

# use the above json to predict
gcloud ai-platform predict --model rpm_bqml_model --version V_1 --json-instances input.json

Running the command produces output similar to the following:

Predicted results for {"bounces": 0, "time_on_site": 7363}:

PREDICTED_WILL_BUY_ON_RETURN_VISIT  WILL_BUY_ON_RETURN_VISIT_PROBS              WILL_BUY_ON_RETURN_VISIT_VALUES
['1']                               [0.9200436491721313, 0.07995635082786867]  ['1', '0']

In the output above, the model predicted that this particular customer has the propensity to purchase, as `PREDICTED_WILL_BUY_ON_RETURN_VISIT` returns 1. Given this test input of 0 bounces and 7363 seconds of time_on_site, the model tells us that there's a 92% chance they have the propensity to purchase. Using this information, you could then send the customer a coupon (or perhaps you only want to give coupons to people between 0.5 and 0.8 probability, because at a high probability they may purchase the item without incentives).

Of course, you don’t have to use only gcloud, you could certainly use your favorite tools (wget, curl, postman, etc.) to quickly check the REST endpoint.
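For instance, below is a sketch of how you might call the Cloud AI Platform Prediction REST endpoint directly with curl; replace <YOUR_PROJECT_ID> with your own project id:

# call the online prediction REST endpoint directly
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"bounces": 0, "time_on_site": 7363}]}' \
  "https://ml.googleapis.com/v1/projects/<YOUR_PROJECT_ID>/models/rpm_bqml_model/versions/V_1:predict"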

Join the customer data with a CRM system to gather customer details

So we can now predict whether a customer has the propensity to purchase, either in batch or online mode. Now what? Well, we are using the fullvisitorid in the dataset. We will need the details of the customer, such as an email address, which the dataset doesn't have. The idea is to gather them from a Customer Relationship Management (CRM) system, so we need to integrate with a CRM system to accomplish the objective.

You will get an idea of the integration here. The article talks about how to integrate Google Analytics 360 with Salesforce Marketing Cloud. The integration lets you publish audiences created in Analytics 360 to Marketing Cloud, and use those audiences in your Salesforce email and SMS direct-marketing campaigns. You will need to determine the appropriate integration mechanism based on your CRM platform.

The accompanying notebook doesn’t implement this step.

Rest of the steps of the solution

The rest of the steps in the solution are self-explanatory, though it might not be trivial to integrate and interoperate the systems. But, well, that's an ongoing challenge in the Software Development Life Cycle in general, isn't it? You can also check the guidance in the Next Steps section below for ways to continue building on the existing solution.

ML Pipeline

You are now going to build an ML pipeline to automate solution steps 1, 2, and 3. The rest of the steps are left either for a future article or as an exercise for the reader. We are going to use Kubeflow Pipelines (KFP), specifically its managed offering on Google Cloud, Cloud AI Platform Pipelines.

Below is the visual representation of the solution steps, which are available in a Jupyter notebook in the git repo:

Below is the visual representation of the solution steps that build on the current implementation. The notebook doesn't implement these steps:

The three links in the above diagrams are:

  • Refer to the article to get an idea about how to use the predicted customers that have the propensity to purchase with a CRM system.
  • Refer to the article to get an idea about how you can use Matrix Factorization for product recommendation. The article also refers to a notebook and uses the same BigQuery public dataset you used in the current article.
  • Refer to Chapter 6, “Frequent Itemsets”, of the book “Mining of Massive Datasets”, authored by a professor from Stanford University, to get an idea of how to build a frequently used product list.

Below is the output from Cloud AI Platform Pipelines when you run the KFP experiment from the notebook. When you execute the experiment, your output may vary:

Screenshot of Kubeflow Pipelines output of the experiment

Each box represents a KFP component. You can refer to the notebook for the semantics, syntax, parameter passing between the functions, etc. The notebook demonstrates the following features (a minimal KFP sketch follows the list):

* Environment Setup
- Setup Cloud AI Platform Pipelines (using the CloudConsole)
- Install KFP client
- Install Python packages for Google Cloud Services
* Kubeflow Pipelines (KFP) Setup
- Prepare Data for the training
-- Create/Validate a Google Cloud Storage Bucket/Folder
-- Create the input table in BigQuery
- Train the model
- Evaluate the model
- Prepare the Model for batch prediction
-- Prepare a test dataset (a table)
-- Predict the model in BigQuery
- Prepare the Model for online prediction
- Create a new revision (for model revision management)
-- Export the BigQuery Model
-- Export the Model from BigQuery to Google Cloud Storage
-- Export the Training Stats to Google Cloud Storage
-- Export the Eval Metrics to Google Cloud Storage
- Deploy to Cloud AI Platform Prediction
- Predict the model in Cloud AI Platform Prediction
* Data Exploration using BigQuery, Pandas, matplotlib
* SDLC methodologies Adherence (opinionated)
- Variables naming conventions
-- Upper case Names for immutable variables
-- Lower case Names for mutable variables
-- Naming prefixes with rpm_ or RPM_
- Unit Tests
- Cleanup/Reset utility functions
* KFP knowledge share (demonstration)
- Pass input params through function args
- Pass params through pipeline args
- Pass output from one component as input of another
- Create an external shared volume available to all the components
- Use built-in operators
- Build lightweight components
- Set a component not to cache
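To give a feel for what the pipeline code looks like, below is a minimal sketch using the KFP v1 SDK: a single lightweight component that runs the BigQuery ML training query, wrapped in a pipeline and submitted to a hosted KFP endpoint. The component, pipeline, and placeholder names here are illustrative, not the notebook's actual code:

import kfp
from kfp import dsl
from kfp.components import func_to_container_op


def train_propensity_model(project_id: str) -> str:
    """Runs the BigQuery ML training query and returns the model name."""
    from google.cloud import bigquery  # imported inside the lightweight component

    client = bigquery.Client(project=project_id)
    query = """
    CREATE OR REPLACE MODEL `rpm_ds.rpm_bqml_model`
    OPTIONS(MODEL_TYPE = 'logistic_reg',
            labels = ['will_buy_on_return_visit']) AS
    SELECT * EXCEPT (fullVisitorId)
    FROM `rpm_ds.rpm_ds_propensity_training_samples_tbl`
    """
    client.query(query).result()  # blocks until the training job finishes
    return 'rpm_ds.rpm_bqml_model'


# turn the plain Python function into a KFP lightweight component
train_op = func_to_container_op(
    train_propensity_model,
    packages_to_install=['google-cloud-bigquery'])


@dsl.pipeline(
    name='propensity-to-purchase',
    description='Train the propensity model in BigQuery ML')
def propensity_pipeline(project_id: str):
    train_task = train_op(project_id)
    # disable caching so the model is retrained on every run
    train_task.execution_options.caching_strategy.max_cache_staleness = 'P0D'


# submit to a Cloud AI Platform Pipelines (hosted KFP) endpoint
# client = kfp.Client(host='<YOUR_PIPELINES_ENDPOINT>')
# client.create_run_from_pipeline_func(
#     propensity_pipeline, arguments={'project_id': '<YOUR_PROJECT_ID>'})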

Next Steps

There are a number of improvements or alternatives that you could use for your specific use case.

Alternative Data Sources for Prediction Input Data

You could also use a federated data source, if that works for you.

Alternative Models

We have used logistic regression in this article, but you could use XGBoost or other model types. You could also train multiple models in parallel and then evaluate them to decide which model performs better for your scenario. You could do so in the MLOps pipeline and let BigQuery do the heavy lifting, taking care of the computation required to train the models.
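For example, below is a minimal sketch of an XGBoost-style alternative using BigQuery ML's boosted tree classifier; the model name and option values are illustrative, not tuned:

# an illustrative boosted tree (XGBoost-based) alternative to logistic regression
CREATE OR REPLACE MODEL `rpm_ds.rpm_bqml_xgb_model` OPTIONS(
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = [ 'will_buy_on_return_visit' ],
  max_iterations = 50  # number of boosting rounds
) AS
SELECT
  * EXCEPT (fullVisitorId)
FROM
  `bqml.rpm_ds.rpm_ds_propensity_training_samples_tbl`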

Online prediction endpoint

Now that you have an endpoint for the online prediction, you could publish the API to your web developers and monitor its usage. You could use Google Apigee, an API management platform, for doing so and much more.

Determine which product(s) we should recommend to the customer

After we determine which customer(s) have the propensity to purchase, we need to find out which product(s) we should recommend to them. We could then launch a campaign that uses personalized messages for a targeted customer base. However, the accompanying notebook doesn't implement this step.

There are many approaches to determine the product(s) that we should recommend to the customer. You could use either a Matrix Factorization or a MarketBasket FrequentItemSet technique as your starting point.
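For the Matrix Factorization route, BigQuery ML offers a matrix_factorization model type. Below is a minimal sketch using implicit feedback (product views per visitor); the model name, the choice of rating signal, and the columns are assumptions for illustration, and note that matrix factorization training in BigQuery ML may require slot reservations rather than on-demand pricing:

# illustrative implicit-feedback matrix factorization model for recommendations
CREATE OR REPLACE MODEL `rpm_ds.rpm_bqml_recommender_model` OPTIONS(
  MODEL_TYPE = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'fullVisitorId',
  item_col = 'product_sku',
  rating_col = 'product_views'
) AS
SELECT
  ga.fullVisitorId,
  prod.productSKU AS product_sku,
  COUNT(*) AS product_views
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*` AS ga,
  UNNEST(ga.hits) AS hits,
  UNNEST(hits.product) AS prod
WHERE
  prod.productSKU IS NOT NULL
GROUP BY
  ga.fullVisitorId, product_sku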

A MarketBasket model could also support online stateful prediction: when a customer browses Product A (or takes an equivalent action such as adding it to the Shopping Cart), you could recommend the next Product B, and so on. You can embed this intelligence in your web application to better understand the customer's intent. As an illustration, if someone is purchasing diapers you might want to recommend beer (based on an interesting discovery; please check the Chapter 6 material referenced above for additional information), or if a customer has added potting soil to their cart, you could recommend plant food. The better you learn the customer's intent, the better you can recommend other pertinent products. The choices are endless.

Summary

Congratulations! You have now learned how to build a propensity model using BigQuery ML and how to orchestrate an ML pipeline in Kubeflow Pipelines. You can continue to build upon the current solution for a complete activation, using the notebook as a starting point.

Want more?

Please leave me your comments with any suggestions or corrections.

I work in Google Cloud. I help our customers to build solutions on Google Cloud.
