Productionising machine learning in retail

Wei Qu
Slalom Australia
Nov 11, 2021

Applying machine learning (ML) is becoming increasingly important in the retail industry. However, for machine learning to have a positive impact on any organisation, it needs to be productionised. Currently, up to 60 per cent of ML models aren't productionised correctly, so the ability to successfully deliver a machine learning project end-to-end has become a hot topic in the industry.

The hidden technical debt in ML systems

Back in 2015, ML specialists at Google pointed out that while developing and deploying ML systems is relatively fast and cheap, maintaining them over time is difficult and expensive [1].

The ML model itself is only a small piece of the overall system required for it to function successfully in a production ecosystem.

As demonstrated above, the ML code is heavily supported by the surrounding components, which require substantial engineering and DevOps skills (MLOps). Slalom has written on the importance of the MLOps capability in a separate blog.

Context

Recently, Slalom worked with a major Australian retail client to build a "share of wallet" model to predict each customer's annual spend within each of the client's subsidiaries. The model is trained on a dataset from MasterCard (millions of records, refreshed quarterly); however, the pattern it learns must generalise to all customers (inference runs fortnightly), regardless of whether they hold a MasterCard. The model aims to help the client re-target customers more accurately and tailor strategies to customers accordingly.

Modelling approach

At first glance, this should be modelled as a domain adaptation problem, meaning the pattern learned in the source domain (MasterCard data) can be transferred to the target domain (non-MasterCard customers). A classical solution is to use domain separation networks (DSNs) [2] to separate the similarities and differences between the two domains while training a robust classifier that performs well on both.

Domain separation networks separate the similarities and differences between the source and target domains, using autoencoders to preserve each domain's distribution while training a robust classifier that performs well on both domains.

However, to simplify the solution and reduce training time and cost, we hypothesised that the spending behaviour of MasterCard users (~3M) and non-MasterCard users (~4M) follows the same distribution. This let us model the task as a simple classification problem: a classifier is trained on MasterCard user data and then used to infer on all customer data. In hindsight, this assumption worked well because most features were aggregated over a one-year window, which diluted the impact of individual differences while reducing processing time significantly.

Feature selection

Since a customer's share of wallet is directly related to their annual spend, we selected overall spend and average spend over three-, six-, and 12-month windows as primary features, based on an academic paper on this specific share-of-wallet problem [3]. Sales-related data for the customer's favourite stores and categories is also factored in. Competitor data should be useful too, because share of wallet is also determined by how much a customer spent with competitors in the past year, so we added geographical features such as the number of competitors within five kilometres of the customer's residence.

Adding this data also allowed us to treat features from the other subsidiaries as competitor data and cross-reference them across each subsidiary's model.
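As a rough illustration of the primary spend features (the actual transformations were written in dbt, described next), the windowed aggregations look roughly like the sketch below; the table and column names (customer_id, txn_date, amount) are assumed, not the client's real schema.

```python
import pandas as pd

def spend_features(txns: pd.DataFrame, anchor_date: str) -> pd.DataFrame:
    """Total and average spend per customer over 3-, 6- and 12-month windows."""
    anchor = pd.Timestamp(anchor_date)
    frames = []
    for months in (3, 6, 12):
        window = txns[(txns["txn_date"] > anchor - pd.DateOffset(months=months))
                      & (txns["txn_date"] <= anchor)]
        frames.append(
            window.groupby("customer_id")["amount"].agg(
                **{f"total_spend_{months}m": "sum", f"avg_spend_{months}m": "mean"}
            )
        )
    # Customers with no transactions in a window get zeros after the join.
    return pd.concat(frames, axis=1).fillna(0.0)
```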

To efficiently reuse the features we generated, we adopted dbt (data build tool) to transform the data. dbt allows teams to write transformations as SQL queries and orchestrate them in a more efficient way.

dbt helps data engineers and scientists focus on the transformation logic rather than on complex orchestration of data jobs and stored procedures.

The data sits in BigQuery, so in this scenario dbt can be installed simply with pip install dbt-bigquery [4]. We use the same features for training on MasterCard customers and for prediction on all customers. The only difference is the aggregation window: for training, features are aggregated over the year ending on the date up to which MasterCard data is available, while for prediction they are aggregated over the year ending on the run date.
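As a hedged sketch, a pipeline component can invoke dbt with that anchor date passed as a variable; the tag:features selector and the anchor_date variable below are illustrative names, not the real project's.

```python
import json
import subprocess

def build_feature_tables(anchor_date: str) -> None:
    """Run the dbt feature models with the one-year window anchored at anchor_date.

    For training, anchor_date is the latest date covered by the MasterCard data;
    for prediction, it is the pipeline run date.
    """
    subprocess.run(
        ["dbt", "run",
         "--select", "tag:features",
         "--vars", json.dumps({"anchor_date": anchor_date})],
        check=True,
    )
```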

Architecture design

To productionise the solution, we chose Kubeflow on the AI Platform of Google Cloud Platform (GCP) to automate the process, given the client was already using GCP. At its core, Kubeflow offers end-to-end orchestration of the ML stack to deploy, scale and manage complex systems. Kubeflow also has built-in support for metrics visualisation, e.g. confusion matrices. The overall architecture is illustrated below.

Overall architecture of the productionised system.

An invoker is deployed in Cloud Run and provides endpoints to trigger deployment of a new Kubeflow pipeline or to run an experiment using the existing pipeline. This way, the pipeline can be driven simply by sending POST requests to the endpoints, either manually or on a schedule via Cloud Scheduler.
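A minimal sketch of such an invoker, assuming a Flask app and the Kubeflow Pipelines (kfp v1) SDK; the environment variables, experiment name and job name are placeholders, not the actual deployment's values.

```python
import os

import kfp
from flask import Flask, jsonify, request

app = Flask(__name__)
KFP_HOST = os.environ["KFP_HOST"]        # Kubeflow Pipelines endpoint (placeholder)
PIPELINE_ID = os.environ["PIPELINE_ID"]  # ID of the already-deployed pipeline

@app.route("/run", methods=["POST"])
def trigger_run():
    """Kick off a run of the existing pipeline; called manually or by Cloud Scheduler."""
    params = request.get_json(silent=True) or {}   # e.g. {"run_training": "true"}
    client = kfp.Client(host=KFP_HOST)
    experiment = client.create_experiment(name="share-of-wallet")
    run = client.run_pipeline(
        experiment_id=experiment.id,
        job_name="share-of-wallet-run",
        pipeline_id=PIPELINE_ID,
        params=params,
    )
    return jsonify({"run_id": run.id}), 200
```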

dbt is triggered in the prepare-train/prepare-predict components in Kubeflow to build the feature tables. The training component then reads the data from BigQuery and feeds it into the model. Once training is finished, evaluation results such as the confusion matrix are exported as visualisation artifacts.

A sample of output metrics and confusion matrix
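Kubeflow Pipelines v1 renders a confusion matrix like the one above when a component writes an /mlpipeline-ui-metadata.json file describing it. A sketch of that export step is below; the GCS path and label names are illustrative.

```python
import json

import pandas as pd
from sklearn.metrics import confusion_matrix

def export_confusion_matrix(y_true, y_pred, labels, gcs_csv_path: str) -> None:
    """Write the confusion matrix where the Kubeflow Pipelines v1 UI can render it."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    rows = [(labels[i], labels[j], int(cm[i, j]))
            for i in range(len(labels)) for j in range(len(labels))]
    # Writing to a gs:// path requires the gcsfs package.
    pd.DataFrame(rows).to_csv(gcs_csv_path, index=False, header=False)

    metadata = {
        "outputs": [{
            "type": "confusion_matrix",
            "format": "csv",
            "schema": [{"name": "target", "type": "CATEGORY"},
                       {"name": "predicted", "type": "CATEGORY"},
                       {"name": "count", "type": "NUMBER"}],
            "source": gcs_csv_path,
            "labels": list(labels),
        }]
    }
    # Kubeflow Pipelines v1 picks this file up and renders the matrix in the run UI.
    with open("/mlpipeline-ui-metadata.json", "w") as f:
        json.dump(metadata, f)
```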

Finally, the predict component downloads the trained model from GCS (Google Cloud Storage), runs inference on all customers' features, and writes the predictions to the destination BigQuery table.
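A simplified sketch of that predict step is shown below, assuming the model artifact is a saved Keras model; the bucket, blob and table names are placeholders, and the real component also applies the saved preprocessing pipeline described later.

```python
import pandas as pd
from google.cloud import bigquery, storage
from tensorflow import keras

def batch_predict(bucket: str, model_blob: str, features: pd.DataFrame,
                  destination_table: str) -> None:
    """Score all customers with the trained model and write results to BigQuery."""
    # Pull the trained model artifact down from GCS.
    storage.Client().bucket(bucket).blob(model_blob).download_to_filename("model.h5")
    model = keras.models.load_model("model.h5")

    # In the real component the saved preprocessing pipeline is applied first.
    preds = model.predict(features.drop(columns=["customer_id"]))
    output = features[["customer_id"]].copy()
    output["prediction"] = preds.argmax(axis=1)   # predicted class per customer

    # e.g. destination_table = "project.dataset.share_of_wallet_predictions"
    bigquery.Client().load_table_from_dataframe(output, destination_table).result()
```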

Kubeflow pipeline

Kubeflow provides a built-in graph visualisation of the dataflow. The graph we defined is shown below.

The Kubeflow graph with two branches

There are two branches in the graph: the training branch on the left and the predict branch on the right. The first component at the top is a conditional operator that decides which branch to run. If training is required, both the training and predict components are launched in the left branch; the predict component waits for training to complete, then pulls the saved model from GCS to kick off batch inference.

If the conditional operator determines it is a predict-only run, the path of the latest saved model is retrieved as input for prediction. Before training or prediction happens, dbt is triggered in the prepare-train/prepare-predict component with different date ranges to create the corresponding feature tables. By default, the training branch runs when new MasterCard data becomes available (every three months) and the predict branch runs every two weeks.
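A condensed sketch of this pipeline structure in the kfp v1 DSL is shown below; the container image, component names and parameters are placeholders standing in for the real containerised components, not the project's actual definitions.

```python
import kfp
from kfp import dsl

def step(name: str, *args: str) -> dsl.ContainerOp:
    """Placeholder component factory; the container image is illustrative."""
    return dsl.ContainerOp(
        name=name,
        image="gcr.io/my-project/share-of-wallet:latest",
        arguments=list(args),
    )

@dsl.pipeline(name="share-of-wallet",
              description="Conditional training plus fortnightly batch prediction")
def share_of_wallet_pipeline(run_training: str = "false",
                             model_dir: str = "gs://my-bucket/models"):
    # Left branch: prepare features, train, then predict with the fresh model.
    with dsl.Condition(run_training == "true"):
        prep_train = step("prepare-train", "--mode", "train")
        train = step("train", "--model-dir", model_dir).after(prep_train)
        prep_pred = step("prepare-predict", "--mode", "predict")
        step("predict", "--model-dir", model_dir).after(train, prep_pred)

    # Right branch: predict-only run using the latest saved model.
    with dsl.Condition(run_training == "false"):
        prep_pred = step("prepare-predict", "--mode", "predict")
        step("predict", "--model-dir", model_dir).after(prep_pred)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(share_of_wallet_pipeline, "share_of_wallet.yaml")
```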

Model selection

We started with a random forest model for easy feature analysis and explainability, then tried XGBoost to improve performance. However, given the huge amount of data available from the client, we ultimately went with a deep learning approach and implemented it in Keras.

The deep learning based model performs best compared to the traditional models

We used a 60/20/20 train/validation/test split with early stopping, and incorporated dropout and L2 regularisation to avoid overfitting. We also tried a CNN (convolutional neural network), but its stronger learning capacity caused it to overfit the validation split, and it didn't perform as well as the aforementioned two-layer feed-forward network.
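A minimal sketch of such a two-layer feed-forward network in Keras, with dropout, L2 regularisation and early stopping, is shown below; the layer sizes and hyperparameters are illustrative, not the tuned values used in the project.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(n_features: int, n_classes: int) -> keras.Model:
    """Two-layer feed-forward network with dropout and L2 regularisation."""
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping monitors the validation split of the 60/20/20 split.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                            restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
#           epochs=100, batch_size=1024, callbacks=[early_stop])
```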

Categorical and numerical features were transformed by one-hot encoders and min-max scalers respectively before being fed into the model. The transformers were combined in an sklearn pipeline so that they could be saved and reloaded by the inference module.
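One way to sketch that preprocessing setup is with sklearn's ColumnTransformer, persisted with joblib so the inference component applies exactly the same transformation; the column names below are assumed for illustration.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Assumed column names, for illustration only.
categorical_cols = ["favourite_store", "favourite_category"]
numerical_cols = ["total_spend_12m", "avg_spend_3m", "competitors_within_5km"]

preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("numerical", MinMaxScaler(), numerical_cols),
])

# Fit on the training features, persist, and reload in the inference module:
# X_train_enc = preprocessor.fit_transform(X_train)
# joblib.dump(preprocessor, "preprocessor.joblib")
# preprocessor = joblib.load("preprocessor.joblib")
```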

Future work

As this is just a starting point built on the hypothesis that MasterCard customers behave similarly to non-MasterCard customers, we plan to experiment with domain adaptation approaches next. We won't include the end results here, as they will vary with data availability and quality and the particular customer dynamics, and because of confidentiality. However, we did achieve decent results through the project using transaction data to approximate share of wallet, and we are happy to discuss them in the future.

Acknowledgement

We would like to show our gratitude to Rene ESSOMBA and Ziping Rao for sharing their pearls of wisdom with us during the course of this engagement. We are also immensely grateful to Jacqui Lillyman, Bhrett Brockley, Robert Sibo, Lam Truong for their comments on an earlier version of the manuscript, although any errors are our own and should not tarnish the reputations of these esteemed persons.

References

  1. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
  2. https://arxiv.org/pdf/1608.06019.pdf
  3. https://cs.adelaide.edu.au/~zbyszek/Papers/AusDM_2019.pdf
  4. https://pypi.org/project/dbt-bigquery
