Google Analytics Customer Revenue Prediction

[Image: GStore]

Business Problem:

As with every business, the number of customers who generate revenue is far smaller than the total number of customers the business interacts with. So for every business it is really important to understand, analyse and predict where its revenue comes from.

Kaggle Challenge Description:

Visit: Google Analytics Customer Revenue Prediction

The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

‌RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.

‌In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.

Source Of Data & Methodology:

In this challenge, we are tasked with predicting the revenue generated by each customer over the period December 1st 2018 to January 31st 2019, using historical data from August 1st 2016 to October 15th 2018. Please note that we have no data at all for the time period we are predicting.

We have been provided with train_v2.csv and test_v2.csv. train_v2.csv, the updated training set, contains user transactions from August 1st 2016 to April 30th 2018. test_v2.csv, the updated test set, contains user transactions from May 1st 2018 to October 15th 2018.

sample_submission_v2.csv is an updated sample submission file in the correct format. It contains all fullVisitorIds in test_v2.csv. The PredictedLogRevenue column should contain forward-looking predictions for each of these fullVisitorIds for the timeframe of December 1st 2018 to January 31st 2019.
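For reference, the submission file simply maps each fullVisitorId to a PredictedLogRevenue value. A minimal sketch, assuming a hypothetical pandas Series user_pred indexed by fullVisitorId that holds the final per-user log-revenue predictions:

import pandas as pd

# Load the sample submission to get the full list of fullVisitorIds, then fill in predictions.
submission = pd.read_csv("sample_submission_v2.csv", dtype={"fullVisitorId": str})
submission["PredictedLogRevenue"] = (
    submission["fullVisitorId"].map(user_pred).fillna(0.0)
)
submission.to_csv("submission.csv", index=False)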

Each .csv file holds the data for a number of store visits. The columns are:

fullVisitorId — A unique identifier for each user of the Google Merchandise Store. Our final submission is keyed on this field, and it will be used for aggregation.

channelGrouping — The channel via which the user came to the Store.

date — The date on which the user visited the Store.

device — The specifications of the device used to access the Store. (JSON column)

geoNetwork — This section contains information about the geography of the user. (JSON column)

socialEngagementType — Engagement type, either “Socially Engaged” or “Not Socially Engaged”.

totals — This section contains aggregate values across the session. (JSON column)

trafficSource — This section contains information about the traffic source from which the session originated. (JSON column)

visitId — An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.

visitNumber — The session number for this user. If this is the first session, then this is set to 1.

visitStartTime — The timestamp (expressed as POSIX time).

hits — This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.

customDimensions — This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set. (JSON column)

JSON columns: as we can see, a few columns are stored as JSON strings. We will normalize them and flatten them into regular features.
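The loading code isn't reproduced in this post, so here is a minimal sketch of the usual flattening approach with pandas (the hits and customDimensions columns need extra handling and are left aside here):

import json
import pandas as pd
from pandas import json_normalize

JSON_COLUMNS = ["device", "geoNetwork", "totals", "trafficSource"]

def load_flat(path):
    # Parse the JSON-encoded columns while reading, then expand them into flat columns.
    df = pd.read_csv(path, dtype={"fullVisitorId": str},
                     converters={c: json.loads for c in JSON_COLUMNS})
    for col in JSON_COLUMNS:
        flat = json_normalize(df[col].tolist())
        flat.columns = [f"{col}.{sub}" for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df

train = load_flat("train_v2.csv")
test = load_flat("test_v2.csv")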

Evaluation Metric and Target Variable:

[Image: Metric]

For each user, we have to aggregate all of that user's transactions during the test period; the target is the natural log of that aggregated value, and submissions are scored with the RMSE of our predictions against it.
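Written out (this is the standard formulation of the competition's RMSE-on-log-revenue metric, restated here rather than copied from the image):

y_{\text{user}} = \ln\!\left(1 + \sum_{i} \text{transaction}_{\text{user},\,i}\right), \qquad
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{\text{user}=1}^{n} \left(\hat{y}_{\text{user}} - y_{\text{user}}\right)^{2}}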

Why use log on target variable?

We are asked to use log on the target variable:


As we can see from this plot, the data is highly skewed: only a very small fraction of customers, about 1.2193%, generate any revenue at all. That is why we are asked to take the log of the target variable. Source

Let’s look at all the features we get after flattening all the json columns.

[Image: Feature details]

totals.transactionRevenue is our target variable. We have 1,708,337 data points in the train data. Let's look at the univariate plots of the target variable.

[Image: target variable PDF]

The plot on the right is the PDF of the actual target variable; the plot on the left is the PDF of log(target_variable).

Observations: The PDF of log(totals.transactionRevenue) is approximately normal, with a mean of about 17.5.

[Image: channelGrouping counts]

This is a bar plot of channelGrouping, with counts on the y-axis. We can see that this feature has a reasonable number of unique categories.

But some of the features contain 100+ unique categories, for example device.browser.

[Image: device.browser category counts]

As we can see in the plot, there are a lot of categories: 129 for this feature. Only a few categories appear a reasonable number of times; the rest behave like outliers, most of them appearing fewer than ten times. So we can conclude that the rarely occurring categories are not the ones generating revenue.

Since all non-revenue-generating data points are treated as zero, our model is only expected to predict values greater than or equal to zero. Let's look at the percentage of customers per browser.

[Image: percentage of customers per browser]

As we can see, the browsers after Opera are each used by very few customers. Hence, we only keep the most popular browsers.

With a short preprocessing step (sketched below) we limit the number of categories in this feature. After preprocessing, we get the bar plot of browser counts shown below; the number of unique browsers has reduced significantly.
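A minimal sketch of that step, assuming the flattened train/test dataframes from earlier; the exact cut-off used in the post isn't shown, so the top-10 threshold here is an assumption:

# Keep only the most frequent browsers; everything else is grouped into "others".
top_browsers = train["device.browser"].value_counts().nlargest(10).index
for df in (train, test):
    df["device.browser"] = df["device.browser"].where(
        df["device.browser"].isin(top_browsers), "others"
    )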

[Image: browser counts after preprocessing]

Please note that the goal of this preprocessing is not to prepare for one-hot encoding, but to remove outlier categories during the data preprocessing phase; it simply reduces the cardinality of some of the features.

This will be particularly helpful when we have many unique categories. There are a few features with more than 500 unique categories; in those cases we can limit the number of categories by looking at the distribution of the category value counts.

The geoNetwork.city feature contains the name of the user's city, and it has the same problem as above: there are 950+ unique cities, so it is not feasible to manually pick a handful of cities by intuition.

Let's look at the PDF of the value counts of all the cities.

[Image: percentile plots of city value counts]

Observation: As we can see from the percentile plot on the left, many cities appear with very low frequency compared to the cities between the 90th and 100th percentile. The plot on the right is the same plot, restricted to the 0th to 90th percentile.

As we can see from these plots, only beyond roughly the 80th percentile do cities appear more than 250 times, so we group all the cities with lower frequency into a single category. We don't lose much information by doing this, because 250 occurrences is only a tiny fraction of the total train data.

Code:
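(The original snippet isn't reproduced here; below is a minimal sketch of the idea, using the 250-occurrence cut-off described above.)

# Group cities that appear 250 times or fewer in the train data into a single "others" bucket.
city_counts = train["geoNetwork.city"].value_counts()
frequent_cities = city_counts[city_counts > 250].index
for df in (train, test):
    df["geoNetwork.city"] = df["geoNetwork.city"].where(
        df["geoNetwork.city"].isin(frequent_cities), "others"
    )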

After performing this pre-processing we still get about 185 unique cities.

There are a few other features with a similar problem, and we apply the same pre-processing methodology to them. Columns with many categories include geoNetwork.country and geoNetwork.city.

totals.hits — this feature shows the number of times a store or product page URL has been accessed before a transaction is made.

[Image: totals.hits vs. transaction revenue]

Observation:

  • Most of the transactions are happening below hits = 150.
  • As we can see there are a few outliers in the data, but since the percentage of outliers is very small, we will keep the data as is.

This feature (trafficSource.source) holds the traffic source, i.e. the website from which the user was redirected. The problem is that several categories refer to the same source but with different wording; see the example below. As highlighted there, google, sites.google.com and group.google.com all belong to the same parent website but are recorded as different categories. So we group them all into ‘google’, and similar grouping is done for the other major websites.

[Image: example trafficSource.source values]

With the script sketched below, we group all the source websites into a single group based on their parent website.
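A rough sketch of such a grouping step; the full list of parent websites handled in the post isn't shown, so the list below is illustrative:

# Collapse traffic sources that share a parent website into a single label.
PARENT_SITES = ["google", "youtube", "facebook", "reddit", "baidu", "yahoo", "bing"]

def clean_source(source):
    source = str(source).lower()
    for parent in PARENT_SITES:
        if parent in source:
            return parent
    return source

for df in (train, test):
    df["trafficSource.source"] = df["trafficSource.source"].apply(clean_source)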

[Image: Year Wise Revenue]

Observations:

The revenue generated has been fairly consistent over the years. But please do note that the data for 2018 is only available until April 2018.

[Image: Month Wise Revenue]

Observation:

  • We can see from this plot that there is a pattern in the way revenue is distributed over the months.
  • The revenue peaks in June and August, and generally we can expect more sales during the summer sale season.
  • We can also note that sales drop drastically during the fall season, suggesting that the products in the store are not catered towards the fall holiday season.
[Image: Day Wise Revenue]
  • This is a plot of date vs. transaction revenue.
  • The sales are mostly evenly split over all the days, except for the first part of the first week.

All the features that contain categorical data go through preprocessing steps like the ones above. Now let's understand how the provided data is used to build a model. Taking train and test together, we have data from August 2016 to October 2018, but we are predicting for December 2018 to January 2019, a period for which we have no data at all. So even though we have the test data, the target we must predict lies in the future; the transaction revenue provided with the test data covers only the test period itself, not the prediction window.

Since we are predicting the future, we train the model several times, each time fixing a particular training window for X and a future window for the response variable (totals.transactionRevenue).

The table below shows all the train and response windows. Training is repeated for each of these periods, but the test set remains the same across all training phases.

[Image: Time Stamps]
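A minimal sketch of how one (training window, response window) pair can be built, assuming the flattened train dataframe from earlier with a parsed date column and numeric totals.* columns; the exact window boundaries from the table and the full feature set are not reproduced here:

import numpy as np

def build_fold(sessions, feat_start, feat_end, target_start, target_end):
    # Sessions inside the feature window become X; revenue inside the future window becomes y.
    feat = sessions[(sessions["date"] >= feat_start) & (sessions["date"] < feat_end)]
    targ = sessions[(sessions["date"] >= target_start) & (sessions["date"] < target_end)]

    X = feat.groupby("fullVisitorId").agg(
        visits=("visitId", "count"),
        total_hits=("totals.hits", "sum"),
        past_revenue=("totals.transactionRevenue", "sum"),
    )
    # Target: log(1 + total revenue per user) during the response window, 0 if no purchase.
    y = np.log1p(
        targ.groupby("fullVisitorId")["totals.transactionRevenue"].sum()
    ).reindex(X.index, fill_value=0.0)
    return X, y

One such fold is built for each row of the table, and each trained model predicts on the same fixed set of test users.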

The objective is to predict the total transaction revenue for each customer, not the revenue of each session or visitId.

We split the data based on time because, if a person bought something in 2016 and has been a regular visitor over the years, the probability that they will spend a large amount in 2019 is much lower than for a person who first visited the store in 2018.

‌That is the reason we train the model across various time frames and predict the revenue on test data at each time frame.

‌We will average all of the observations at the end for our final prediction.

We first try three advanced ML models and then try stacking for better results.

We tried XGBoost, LightGBM and CatBoost.
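As a rough sketch of one of these models, an XGBoost regressor on the per-user log-revenue target could be trained as below; X_train, y_train, X_valid, y_valid and X_test stand for the prepared numeric matrices, and the hyperparameters are illustrative rather than the tuned values behind the reported scores:

import xgboost as xgb

# Illustrative settings only; tune on a validation window in practice.
model = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=8,
    subsample=0.9,
    colsample_bytree=0.8,
    objective="reg:squarederror",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=100)
# Clip at zero, since log revenue can never be negative.
test_pred = model.predict(X_test).clip(min=0)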

Here are the results of my submissions on Kaggle.

[Image: Results of various models]

From the results we conclude that the XGBoost regressor is the best performing model.

As you can see, we split the entire data into a number of segments based on time and then take the average of all the test predictions obtained from the various time frames; taking a plain average means we give equal importance to all the time frames.
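In code, that last step is just an unweighted mean over the per-fold prediction vectors (fold_predictions below is a hypothetical list of per-user prediction arrays, one per time frame):

import numpy as np

# Equal weight for every time frame.
final_pred = np.mean(np.column_stack(fold_predictions), axis=1)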

If we were given the actual totals.transactionRevenue for the prediction window, we could instead train a meta-model that takes the test predictions from the various time frames as a feature vector (10 values in this case) and the actual totals.transactionRevenue as the target variable.
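That idea is essentially stacking. A sketch of what it could look like if such labels existed (they don't in this competition, so true_log_revenue below is purely hypothetical):

import numpy as np
from sklearn.linear_model import Ridge

# Each row: the per-time-frame predictions for one user; target: that user's true log revenue.
meta_X = np.column_stack(fold_predictions)
meta_model = Ridge(alpha=1.0)
meta_model.fit(meta_X, true_log_revenue)
final_pred = meta_model.predict(meta_X)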

References:

  1. Test_score by Konstantin Nikolaev — it is released under the Apache 2.0 license, and his approach to time-based training helped me a lot.
  2. Simple Exploration+Baseline — GA Customer Revenue — gave me the intuition for the EDA part and helped with the modelling of my first-cut approach, which was a current-day prediction.
  3. Recruit Restaurant Visitor Forecasting — this Kaggle challenge had a similar problem statement.
  4. Applied AI:

LinkedIn: Harsha Vardhan Maagalam
