For almost every business, the number of customers that generate revenue is far smaller than the total number of customers the business interacts with. So it is really important for a business to understand, analyse and predict where its revenue comes from.
Kaggle Challenge Description:
Visit: Google Analytics Customer Revenue Prediction
The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.
In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.
Source Of Data & Methodology:
Machine Learning Methodology:
In this challenge, we are tasked with predicting the revenue generated by each customer over the period of December 1st 2018 to January 31st 2019, using historical data from August 1st 2016 to October 15th 2018. Please note that we don't have any data for the time period we are predicting.
We have been provided with:
train_v2.csv - the updated training set - contains user transactions from August 1st 2016 to April 30th 2018.
test_v2.csv - the updated test set - contains user transactions from May 1st 2018 to October 15th 2018.
sample_submission_v2.csv - an updated sample submission file in the correct format. It contains all fullVisitorIds in test_v2.csv. Our PredictedLogRevenue column should make forward-looking predictions for each of these fullVisitorIds for the timeframe of December 1st 2018 to January 31st 2019.
File Descriptions and Features Provided:
Each .csv file contains one row per store visit. The features provided are:
fullVisitorId — A unique identifier for each user of the Google Merchandise Store. Our final submission is keyed on this field, and it will be used for aggregation.
channelGrouping — The channel via which the user came to the Store.
date — The date on which the user visited the Store.
device — The specifications for the device used to access the Store.(json column)
geoNetwork — This section contains information about the geography of the user.(json column)
socialEngagementType — Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
totals — This section contains aggregate values across the session.(json column)
trafficSource — This section contains information about the Traffic Source from which the session originated.(json column)
visitId — An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
visitNumber — The session number for this user. If this is the first session, then this is set to 1.
visitStartTime — The timestamp (expressed as POSIX time).
hits — This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
customDimensions — This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.(json column)
json columns: As we can see, a few columns are compressed in JSON format; we will normalize them and convert each JSON key into a regular feature.
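A minimal sketch of this flattening step, assuming pandas' `json_normalize`; the function name `load_and_flatten` and the `nrows` convenience parameter are my own additions, not the project's exact code:

```python
import json
import pandas as pd

# The JSON columns named in the text; customDimensions is stored in a
# slightly different list-like format and may need extra handling.
JSON_COLUMNS = ('device', 'geoNetwork', 'totals', 'trafficSource')

def load_and_flatten(path, json_cols=JSON_COLUMNS, nrows=None):
    """Read the csv, parse each JSON column, and expand every JSON key
    into its own 'parent.key' feature column."""
    df = pd.read_csv(path,
                     converters={c: json.loads for c in json_cols},
                     dtype={'fullVisitorId': str},  # keep ids as strings
                     nrows=nrows)
    for col in json_cols:
        flat = pd.json_normalize(df[col].tolist())
        flat.columns = [f'{col}.{sub}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df
```

This yields columns such as `device.browser` or `totals.transactionRevenue`, matching the feature names used below.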
Evaluation Metric and Target Variable:
For each user, we have to aggregate all of that user's transactions during the test period and predict the natural log of that sum; submissions are scored on the RMSE of this log value.
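Concretely, the per-user target can be built from the session rows with a groupby; a minimal sketch using the dataset's column names:

```python
import numpy as np
import pandas as pd

def user_level_target(df):
    """Sum totals.transactionRevenue over all sessions of each user,
    then take ln(sum + 1), the value the competition scores with RMSE."""
    per_user = df.groupby('fullVisitorId')['totals.transactionRevenue'].sum()
    return np.log1p(per_user)
```

Users with no transactions sum to zero, so their target is log1p(0) = 0.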
Why use log on target variable?
We are asked to use log on the target variable:
As we can see from this plot, the data is very highly skewed: only 1.2193% of the customers generate any revenue at all. That is why we are asked to use the log of the target variable. (Source)
Let’s look at all the features we get after flattening all the json columns.
Exploratory Data Analysis & Data Pre-processing:
totals.transactionRevenue is our target variable. We have 1,708,337 data points in the train data. Let's look at the univariate plots for the target variable.
The plot on the right is the PDF of the actual target variable; the plot on the left is the PDF of log(target_variable).
Observations: The PDF of log(totals.transactionRevenue), over the revenue-generating sessions, is approximately normal with mean around 17.5.
This is the bar plot of channelGrouping, with counts on the y-axis. We can see that this feature has a reasonable number of unique categories.
But some of the features contain 100+ unique categories, for example device.browser.
As we can see in the plot, this feature has 129 categories, but only a few of them occur a reasonable number of times; the rest behave like outliers, most repeating fewer than ten times. So the categories that occur only a few times are unlikely to be driving revenue.
Since all non-revenue-generating data points are recorded as zero, our model is only expected to predict values greater than or equal to 0. Let's look at the percentage of customers per browser.
As we can see, every browser after Opera is used by very few customers. Hence, we only keep the most popular browsers.
With this code we are limiting the number of categories in this feature. After preprocessing, we will be getting the below bar plot of browser counts. Now we can see that the number of unique browsers has reduced significantly.
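The code itself is not reproduced here, so this is a minimal sketch of the idea, assuming a top-N cutoff; the helper name and the default cutoff are illustrative:

```python
import pandas as pd

def keep_popular(series, top_n=10, other='others'):
    """Keep only the top_n most frequent categories (e.g. the most
    popular browsers) and bucket everything else under one label."""
    popular = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(popular), other)
```

Applied to `device.browser`, this collapses the long tail of rare browsers into a single `'others'` category.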
Please note that the goal of this preprocessing is not to prepare for one-hot encoding but to remove outliers during the data preprocessing phase; it simply reduces the cardinality of some of the features.
This is particularly helpful when a feature has many unique categories. A few features have more than 500 unique categories; in those cases we can limit the number of categories by looking at the distribution of the category value counts.
This feature (geoNetwork.city) contains the name of the user's city. We have the same problem as above: there are about 950+ unique cities, so it would not be possible to manually select a few cities by intuition.
Let’s look at the PDF of the value counts of all the cities.
Observation: As we can observe from the percentile plot on the left, many cities repeat with very low frequency compared to the cities between the 90th and 100th percentiles. The plot on the right is the same plot, restricted to the 0th to 90th percentiles.
As we can see from these plots, only after roughly the 80th percentile do cities repeat more than 250 times, so we group all cities with lower frequency into a single category. We don't lose much information by doing this, because 250 is only a tiny fraction of the total train data size.
After performing this pre-processing we are still left with about 185 unique cities.
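The same grouping can be expressed with a frequency threshold instead of a fixed top-N; a sketch using the 250 cutoff mentioned above (the helper name is mine):

```python
import pandas as pd

def group_rare(series, min_count=250, other='other'):
    """Replace every category occurring fewer than min_count times
    (e.g. rarely occurring cities) with one catch-all label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    return series.where(series.isin(frequent), other)
```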
There are a few other features with the same problem, and we applied the same pre-processing methodology to them. Columns with many categories: geoNetwork.country, geoNetwork.city.
totals.hits
totals.hits shows the number of times a store or product page URL has been accessed before making a transaction.
- Most of the transactions happen below hits = 150.
- There are a few outliers in the data, but since the percentage of outliers is very small, we keep the data as is.
This feature (trafficSource.source) holds the value of the source, i.e. the website from which the user was redirected. The problem is that several categories refer to the same source but are worded differently. Please see the example below: as highlighted above, google, sites.google.com and group.google.com all belong to the same parent website but are recorded as different sources, so we group all of them into 'google'. Similar grouping has been done for the other major websites.
With this script, we group all the source websites into a single group based on the parent website.
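The script itself is not shown; a sketch of the substring-based grouping, where the parent list is illustrative rather than the full list used in the project:

```python
def group_source(source, parents=('google', 'youtube', 'facebook', 'reddit')):
    """Map source variants such as 'sites.google.com' to their parent
    site via a simple substring match; unmatched sources pass through."""
    s = str(source).lower()
    for parent in parents:
        if parent in s:
            return parent
    return s
```

This would typically be applied with `df['trafficSource.source'].map(group_source)`.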
The revenue generated has been fairly consistent over the years. But please do note that the data for 2018 is only available until April 2018.
- We can see from this plot that there is a pattern in the way revenue has been distributed over the months.
- The revenue peaks in June and August; generally we can expect more sales during the summer sale season.
- We can also note that sales drop drastically during the fall season, suggesting that the products in the store are not catered towards the fall holiday season.
- This is a plot of date vs. transaction.revenue.
- The sales are mostly evenly split over all the days, except for the first part of the first week.
All the features containing categorical data go through data preprocessing techniques like the ones above. Now let's understand the provided data for building a model. Including both train and test data, we were provided with data from August 2016 to October 2018, but we are predicting for December 2018 to January 2019. This means we have no data for the submission period: even though we have test data, the transaction revenue it contains is for May to October 2018, while the target variable we must predict lies in the future.
Since we are predicting the future, we train the model several times, each time fixing a particular training window for X and a future window for our response variable y. The table shows all the train and response windows. The training phase is repeated for each period, but the test set remains the same across all the training phases.
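The window construction can be sketched as follows. All the day counts here are illustrative, chosen only to mirror the real setup, where a 46-day gap and a 62-day target window separate the end of the test data (Oct 15 2018) from the submission period (Dec 1 2018 to Jan 31 2019):

```python
from datetime import date, timedelta

def make_windows(start, end, feat_days=168, gap_days=46,
                 target_days=62, step_days=60):
    """Slide a (feature window, gap, target window) triple through the
    historical data, mimicking the gap between test data and the
    future submission window."""
    windows = []
    cur = start
    while True:
        feat_end = cur + timedelta(days=feat_days)
        tgt_start = feat_end + timedelta(days=gap_days)
        tgt_end = tgt_start + timedelta(days=target_days)
        if tgt_end > end:
            break  # the target window would run past the available data
        windows.append(((cur, feat_end), (tgt_start, tgt_end)))
        cur += timedelta(days=step_days)
    return windows
```

Each pair then supplies one (X, y) training set: features aggregated over the feature window, per-user log revenue aggregated over the target window.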
Why are we doing this?
The objective is to predict the total transaction.revenue for each customer, not the revenue of each session or visitId.
We split the data based on time because, if a person bought something in 2016 and has been a regular visitor over the years, the probability that they will spend a large amount in 2019 is much lower than for a person who first visited the store in 2018.
That is the reason we train the model across various time frames and predict the revenue on test data at each time frame.
We will average all of the observations at the end for our final prediction.
We first tried three advanced ML models and then tried stacking them for better results: XGBoost, LightGBM and CatBoost.
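The per-window training loop can be sketched generically; `model_factory` stands in for any of the three regressors (e.g. `xgboost.XGBRegressor`), and the function name is my own:

```python
import numpy as np

def fit_predict_windows(model_factory, windows, X_test):
    """Fit one fresh model per (X_train, y_train) time window, predict
    on the fixed test set each time, and average the predictions,
    giving every window equal weight."""
    preds = []
    for X_train, y_train in windows:
        model = model_factory()       # e.g. xgb.XGBRegressor()
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)
```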
Here are the results of my submissions on Kaggle.
From the results we conclude that XGBRegressor is the best performing model.
As you can see, we split the entire data into a number of time-based segments, then took the average of the test predictions from the various time frames; averaging means we give equal importance to every time frame.
If we were given the actual totals.transactionRevenue for the submission window, we could instead train a meta-model that takes the test predictions from the various time frames as a feature vector (10 in this case) and the actual totals.transactionRevenue as the target variable.
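That hypothetical meta-model could be as simple as a linear fit over the per-window predictions; a sketch, where least squares stands in for whatever regressor one would actually choose:

```python
import numpy as np

def stack_windows(window_preds, y_true):
    """Instead of equal-weight averaging, learn one weight per time
    window by least squares against the (hypothetically known) actual
    per-user log revenue."""
    P = np.column_stack(window_preds)   # shape: (n_users, n_windows)
    weights, *_ = np.linalg.lstsq(P, y_true, rcond=None)
    return P @ weights, weights
```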
- Test_score by Konstantin Nikolaev — it is licensed Apache 2.0, and his approach to time-based training helped me a lot.
- Simple Exploration+Baseline — GA Customer Revenue — gave me the intuition for the EDA part and helped in modelling my first-cut approach, which was current-day prediction.
- Recruit Restaurant Visitor Forecasting — this Kaggle challenge had a similar problem statement.
- Applied AI:
LinkedIn: Harsha Vardhan Maagalam