Clickstream Pattern Analysis and Prediction using Machine Learning [GHC ‘19]

Ananya Sen
Jul 17, 2020

This paper was presented at the Grace Hopper Conference 2019 in the poster format. It was originally submitted by Neha Kumari and me, Ananya Sen, as Software Engineers at Intuit. GHC is extremely competitive, with an acceptance rate of only 15% for GHC '19 from submissions across the world. I am grateful to have been a part of it!

ABSTRACT
Clickstream analysis is key to finding user behavioral patterns such as drop-off and anomalies. Predicting drop-off can improve customer conversion and retention. We'll cover the process and algorithms for clickstream data analysis, clustering, anomaly detection, and machine learning. We'll also showcase the process and algorithms we used for drop-off prediction to achieve a high accuracy of 93.36%.

TARGET AUDIENCE
Areas: Clickstream, Data Science, Machine Learning, Feature Engineering

Technical level: Beginner to intermediate experience in full-stack engineering, data engineering or data science.

INTRODUCTION

Today's fiercely competitive market depends heavily on user-generated data. Consider the use case of the Intuit apps QuickBooks Self-Employed (QBSE) and Mint, where users connect their banks so that they can run supply chain, payroll, etc., or get recommendations to reduce their expenses.

One of the biggest pains we see at Intuit in our clickstream analytics is customer churn and drop-off while connecting a bank in the product. Clickstream gives us a rich dataset to analyze and model in order to understand user behavior and predict user drop-off. This prediction, available over a simple REST API, can facilitate several user experience enhancements in real time.

USE CASE FOR CLICKSTREAM

Bank Connection UI

This is the part of the QBSE and Mint apps where the user searches for and connects their bank to the product. The types of clickstream events collected in this flow are browser_type, bank_selection, search_event, user_data_consent_agreed, recaptcha_success, oauth_launch, bank_connected, etc.

MACHINE LEARNING PROCESS

The basic steps followed for data analysis and machine learning are as follows:

  1. Data ingestion, data cleaning
  2. Data exploration, like cluster analysis, etc.
  3. Feature engineering
  4. Train/test data split
  5. Training the Machine Learning model
  6. Prediction on test data

For our specific use case, the high-level process entailed analyzing user behavior patterns from batch data, and then predicting drop-off for new users at runtime. The output of this algorithm can be used by the product to change the user experience in real time if the user is showing indications of drop-off with a high probability.

Machine learning process

In the sub-sections below, we will go through the details of each step.

DATA INGESTION ARCHITECTURE

High-level Data Ingestion Architecture

As shown in the architecture above, the UI uses different libraries for web and mobile to send the clickstream data over a REST API to the data lake. There the data is stream- and batch-processed into an Apache Hadoop datastore through an Apache Hive layer. The data processing includes data cleaning and data mapping. This data is then fed into a Vertica analytics database for reporting and visualizations.
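As a simple illustration of the first hop in this pipeline, here is a minimal sketch of how a client might post a clickstream event to an ingestion REST API. The endpoint URL and payload fields are hypothetical assumptions, not Intuit's actual schema:

```python
import time

import requests  # third-party: pip install requests

# Hypothetical ingestion endpoint; Intuit's actual API is not public.
INGEST_URL = "https://clickstream.example.com/v1/events"

# Illustrative event payload, modeled on the event types listed above.
event = {
    "user_id": "u-12345",
    "event_type": "bank_selection",
    "screen": "bank_search",
    "timestamp_ms": int(time.time() * 1000),  # client-side epoch millis
}

resp = requests.post(INGEST_URL, json=event, timeout=5)
resp.raise_for_status()  # surface ingestion failures to the caller
```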

DATA EXPLORATION

We used a sample data size of ~10 million clickstream events from 100k unique users. Different types of analysis can be performed on the sample data, such as cluster analysis, user segmentation, etc.

Here is an example of the K-Means cluster analysis we used to find the cluster of users who did not finish the flow to make the bank connection. This showed us interesting details, such as that some users take 3 steps to connect whereas others take 10.

K-Means clusters showing the group of failed connections

Once a cluster is found for a meaningful use case, one can focus on finding more specific patterns in the user’s behavior.
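To make the clustering step concrete, here is a minimal K-Means sketch using scikit-learn. The per-user features, the tiny inline dataset, and the choice of k = 2 are illustrative assumptions, not the exact setup behind the chart above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative per-user summary features:
# [steps_taken, total_time_sec, connected (1) or dropped off (0)]
X = np.array([
    [3, 40.0, 1],
    [10, 210.0, 0],
    [4, 55.0, 1],
    [12, 300.0, 0],
    [3, 35.0, 1],
])

# Scale features so the large time values do not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# k=2 is an assumption; in practice k is tuned (e.g., elbow method).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print(kmeans.labels_)  # cluster assignment per user
```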

FEATURE ENGINEERING

The most difficult part of building any machine learning platform is feature engineering: identifying the most informative characteristics of the data.

For clickstream data, the 2 types of models used are:

Click sequence model: if a flow consists of multiple screens, this captures the number of times a user lands on each page in the flow. From this, a user traversal vector can be created to visualize the user's navigation through the product.

Time-based model: the amount of time spent on each screen tells a rich story about a customer's affinity towards the product and their behavioral patterns.

Features that are derived from these 2 models are:

  1. no_times_page_x
  2. total_time_page_x

where x = {1, 2, 3, …, N} and N is the number of screens in the flow.

The target labels from these 2 models are the 0/1 values at the end of each feature row, for example:

2, 5, 6, …, 0.2, 0.5, 1, … → 0

5, 8, 2, …, 0.1, 0.9, 1, … → 1
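A minimal pandas sketch of deriving these two feature families from raw events could look like the following. The raw column names (user_id, page, time_sec) are assumptions for illustration, not the production schema:

```python
import pandas as pd

# Assumed raw clickstream shape: one row per page view.
events = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u2", "u2"],
    "page":     [1, 2, 1, 1, 3],
    "time_sec": [0.2, 0.5, 1.0, 0.1, 0.9],
})

# no_times_page_x: how many times each user hit each page.
counts = events.pivot_table(index="user_id", columns="page",
                            values="time_sec", aggfunc="count",
                            fill_value=0).add_prefix("no_times_page_")

# total_time_page_x: total time each user spent on each page.
times = events.pivot_table(index="user_id", columns="page",
                           values="time_sec", aggfunc="sum",
                           fill_value=0).add_prefix("total_time_page_")

features = counts.join(times)  # one row per user, 2N feature columns
print(features)
```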

TRAINING/TEST DATA SPLIT

Once we have access to the clean data, it is critical to split it into a training set for fitting the machine learning model and a held-out test set for evaluating it, so that overfitting or underfitting can be detected. For this methodology, some good splits are 70:30, 75:25, and 80:20.
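With scikit-learn this split is a single call. The sketch below uses a small synthetic stand-in for the engineered feature matrix and an illustrative 70:30 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and 0/1 drop-off labels.
rng = np.random.default_rng(42)
X = rng.random((100, 6))      # 100 users, 6 features
y = rng.integers(0, 2, 100)   # 0/1 target labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,   # 70:30; 75:25 and 80:20 are also common choices
    stratify=y,       # preserve the class ratio in both sets
    random_state=42,  # reproducible split
)
print(X_train.shape, X_test.shape)  # (70, 6) (30, 6)
```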

PREDICTIVE ANALYSIS ALGORITHMS

Among the many machine learning algorithms, ensemble learning methods tend to have better predictive performance, since they combine multiple learners to produce the result. Random Forest is one of the most popular ensemble learning methods; we also evaluated the Support Vector Machine (SVM). Both can be used as binary classifiers to estimate the probability of a decision from the given training data.

For our clickstream analysis, we focused on these algorithms to predict user drop-offs. The best prediction accuracy of 93.36% was found with the Random Forest algorithm.
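Here is a minimal Random Forest sketch, continuing from the train/test split above. The hyperparameters are illustrative, and the synthetic data will of course not reproduce the 93.36% figure:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hyperparameters are illustrative, not the tuned production values.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Per-user drop-off probability, usable for real-time UX decisions.
proba = model.predict_proba(X_test)[:, 1]
```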

We also produced the following feature importance chart that helps with feature scoring to further refine the model.

Feature Importance Chart
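With a trained Random Forest, scikit-learn exposes these scores directly through feature_importances_, which is what a chart like the one above is built from. Continuing from the sketch above, with hypothetical feature names:

```python
# Hypothetical names matching the 6 synthetic feature columns above.
feature_names = ([f"no_times_page_{i}" for i in range(1, 4)] +
                 [f"total_time_page_{i}" for i in range(1, 4)])

# Rank features by their importance score, highest first.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```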

OUTCOMES AND NEXT STEPS

The below diagram shows the average precision, recall, and F1 score found for the 2 classifications while learning the user behavior patterns from the batch data. The 2 decisions were:

  1. notSuccessfullyConnected: users who dropped off
  2. successfullyConnected: users who connected to their bank
93.36% User drop-off prediction accuracy achieved
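scikit-learn's classification_report produces exactly this kind of per-class precision/recall/F1 breakdown. Here is a sketch reusing the test predictions from above; the mapping of labels 0 and 1 to the two decisions is an assumption:

```python
from sklearn.metrics import classification_report

# Assumed mapping: 0 = notSuccessfullyConnected, 1 = successfullyConnected.
print(classification_report(
    y_test, y_pred,
    target_names=["notSuccessfullyConnected", "successfullyConnected"],
))
```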

The output of this algorithm can now be consumed by the product over an HTTP API to change the user experience in real time if the user is showing indications of drop-off with a high probability. This can help reduce customer churn and increase customer conversion and retention. For further research, we can use cross-validation to get a more robust estimate of how well the model performs.
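As a sketch of that next step, scikit-learn's cross_val_score runs k-fold cross-validation in a single call, reusing the model and data from above (5 folds is an illustrative choice):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation gives a more robust accuracy estimate than a
# single train/test split.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```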

CONCLUSION

This presentation demonstrates the basic principles and algorithms for data analysis, clustering, anomaly detection, and machine learning. We also work through a real-world problem with the clickstream analysis example, so the audience gets a sense of predictive analysis on high-volume data. Ultimately, with the focus on user drop-off prediction, we demonstrate how to find an actionable insight. These steps can now be adapted to specific use cases for further benefits.


Ananya Sen

Product & Engineering, Swimmer, Health Food enthusiast, nascent Writer & Singer, continuous Learner