In Search of Spending — Part 1

Jim Fay
Published in The Startup
Oct 25, 2020

Predicting Customer Spending in Google’s Online Store

This blog is part one of a two-part walkthrough of a recent machine learning project. View the entire project (including all code and the accompanying slide deck) on GitHub. Keep an eye out for part two, where I’ll go through the modeling process and results.

Introduction:

E-commerce is becoming an ever-larger part of our lives. In the US, $602 billion was spent online in 2019. About 75% of online shoppers buy something at least once per month. Online shopping made up 16% of all retail revenue in 2019, and that share is only growing.

This growth means that an online presence matters more than ever for businesses. It also means there is more e-commerce data available than ever before. This is a perfect storm for a data scientist.

One problem that can be tackled using this wealth of data is the problem of predicting customer spending. Can we anticipate how much a customer will spend based on factors like what device they’re using, the time of day, and their browsing behavior? If this problem can be solved, there are numerous benefits to the business in question:

  • Personalized Marketing to Encourage Spending
Knowing how much a customer is likely to spend can be instrumental in persuading them to spend a bit more. For example, if we predict a customer will spend $150 in their visit to the store, we could offer them free shipping or loyalty points for any purchase above $200. Additionally, if a customer is unlikely to buy anything, they could be sent a time-sensitive coupon for 10% off their next purchase.
  • Identification of Bottlenecks and Spending Blockers
    Regression models like random forests allow us to see which features are most important in determining spending. For example, if we see that mobile users or customers browsing the site with Firefox are spending less, we could then investigate further to see if slow load times or other errors are to blame.
  • More Accurate Revenue Forecasting
    Improving this model to production-level quality could give more granular insights into how much a company can expect to earn from an online store.

The Data:

The bulk of the data used in this project was taken from Kaggle. It includes 717k rows, with each row being one visit to Google’s online store between 2016 and 2018. Features describing geography, traffic source, device properties, page views, time, price, and spending are included in the dataset.

Data Cleaning & Preprocessing:

  • Unpacking the nested structure of the original data:
The dataset from Kaggle contains columns that pack multiple features into a nested format. To enable proper modeling, this nested data was separated into distinct columns.
  • Choosing features:
A series of nested columns labeled ‘hits’ were tricky to deal with. These columns included long lists of information about the pages and products the customer viewed in their visit to the store. Long lists of products don’t fit well into regression algorithms, so I chose to extract a few key values from these nested columns, most notably the product category of the most recently viewed item and that item’s price.
  • Dealing with categorical values:
The dataset includes categorical variables, many of which have hundreds or thousands of unique values. In order to model this data, values with fewer than 500 instances were grouped into a single ‘Other’ category.
  • Dealing with missing values:
Some features were not made available by Google; these columns were removed from the dataset. Values of ‘(not set)’, ‘(not applicable)’, and similar were all standardized to a single ‘None’ value. Missing values for continuous variables were imputed with column averages.
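The unpacking and rare-category steps above can be sketched in pandas. The column and field names here (‘device’, ‘browser’, ‘country’) are illustrative stand-ins, not the exact Kaggle schema, and the toy data is far too small for the 500-instance threshold to be meaningful:

```python
import json
import pandas as pd

# Toy rows mimicking the export: the 'device' column holds JSON strings
# with several nested fields (names are illustrative, not the real schema)
raw = pd.DataFrame({
    'device': ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Firefox", "isMobile": true}'],
    'country': ['US', 'FR'],
})

# Unpack the nested column into flat, distinct columns
device_flat = pd.json_normalize(raw['device'].apply(json.loads).tolist())
flat = pd.concat([raw.drop(columns='device'), device_flat], axis=1)

# Group categorical values with fewer than 500 instances into 'Other'
# (with only two toy rows, every value ends up in 'Other')
counts = flat['country'].value_counts()
rare = counts[counts < 500].index
flat['country'] = flat['country'].where(~flat['country'].isin(rare), 'Other')
```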

Feature Engineering:

  • K-Means Clustering
Clustering is a common unsupervised learning technique that can add a bit of extra ‘oomph’ to a supervised learning model. In this case I used the K-Means clustering algorithm with k = 60. Both the cluster label and the silhouette score for each point were added to the dataset as additional columns. Selected code from the clustering process is shown below. Due to constraints on time and processing power, a PCA-reduced version of the dataset was used for clustering. Full code can be found in the project’s GitHub repo.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import pandas as pd

def add_clustering(data, reduced_data, clusters):
    # Create best model
    best_cluster = KMeans(n_clusters=clusters, random_state=70, algorithm='full')

    # Fit model
    best_cluster_fit = best_cluster.fit(reduced_data)
    print('Done Fitting Model')

    # Get labels for use in silhouette score calculations
    best_cluster_labels = best_cluster_fit.predict(reduced_data)
    print('Done Predicting')

    # Per-point silhouette scores measure how well each point fits its cluster
    sil_scores = silhouette_samples(reduced_data, best_cluster_labels)
    print('Done with Silhouette Scores')

    # Turn into Series objects
    sil_scores = pd.Series(sil_scores, name='sil_score')
    best_cluster_labels = pd.Series(best_cluster_labels, name='cluster_label')

    # Merge with original data (reset the index so rows align by position)
    data_merged = data.reset_index(drop=True)
    data_merged = pd.concat([data_merged, sil_scores, best_cluster_labels], axis=1)

    return data_merged
  • Feature Transformation
The target variable (spending/revenue per visit) was transformed so that we predict the natural log of spending instead of the raw USD value.
  • Additional Data
To account for trends in overall consumer spending over the period the data was collected, I incorporated several economic indicators into the dataset: daily values for the S&P 500, the US Dollar Index, and the Consumer Confidence Index. A time-lag variable could also be added in future iterations of this project to improve the modeling results.
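The log transform above can be sketched in a few lines. One detail to note: most visits have zero spending, and log(0) is undefined, so this sketch uses log1p (log(1 + x)), which maps zero-spend visits cleanly to 0. That is an assumption about how the zeros were handled, not necessarily the exact choice made in the project:

```python
import numpy as np

# Revenue per visit is heavily right-skewed and mostly zero, so we
# train on log-transformed values instead of raw dollars
revenue = np.array([0.0, 24.99, 150.0, 24000.0])
y = np.log1p(revenue)

# After predicting in log space, invert with expm1 to get dollars back
recovered = np.expm1(y)
```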

Exploratory Data Analysis:

Modeling itself does not make a complete data science project. An equally important part of any analysis is exploring the data in order to better understand it. Some key findings and graphs from this process are shown below.

  • Purchase Count

Most visits to Google’s store did not result in a purchase of any size. In fact, only 2.46% of visits resulted in a purchase. The average purchase size of that 2.46% of visits was $124 USD.
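Figures like these fall straight out of the cleaned visit-level data. A minimal sketch, assuming a ‘revenue’ column holding spending per visit (the column name is hypothetical, and the toy numbers are chosen to make the arithmetic visible, not to reproduce the real 2.46%):

```python
import pandas as pd

# Toy visit-level data: most visits spend nothing
visits = pd.DataFrame({'revenue': [0.0, 0.0, 0.0, 120.0, 0.0, 128.0]})

# Share of visits ending in a purchase, and average size of those purchases
purchase_rate = (visits['revenue'] > 0).mean() * 100
avg_purchase = visits.loc[visits['revenue'] > 0, 'revenue'].mean()

print(f'{purchase_rate:.2f}% of visits ended in a purchase')
print(f'Average purchase size: ${avg_purchase:.2f}')
```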

  • Purchase Size

Most purchases are small in size (under ~$100). In the graph below we can see that the number of purchases sharply decreases as the total purchase amount increases. This graph is cropped along the x-axis, but there were a small number of very high-value purchases ranging up to $24,000.

Count of Visits by Amount of Spending in USD
  • Page Views vs Visit Result

There is a stark difference in the number of pages viewed by customers who make a purchase vs. those who don’t. This could be because customers who can’t quickly find what they’re looking for leave without purchasing, while customers who are interested in a product may be inclined to click through more pages. This idea that keeping customers on the site could increase spending is confirmed in the results of the modeling.

Average Number of Page Views by Result
  • Page Views vs % of Visits Ending in Purchase

Here we dig a bit deeper into the relationship between page views and spending. This graph shows a clear relationship between increasing page views and a greater likelihood of making a purchase.

Percent of Visits Ending with a Purchase by Number of Pages Viewed
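A conversion-by-page-views breakdown like the one graphed above is a one-line groupby. As before, the ‘pageviews’ and ‘revenue’ column names and the toy values are illustrative assumptions:

```python
import pandas as pd

# Toy data: pages viewed per visit and the resulting revenue
pv_data = pd.DataFrame({
    'pageviews': [1, 1, 2, 2, 5, 5, 10, 10],
    'revenue':   [0, 0, 0, 30, 0, 80, 120, 200],
})

# Percent of visits ending in a purchase, by number of pages viewed
conversion = (pv_data.assign(purchased=pv_data['revenue'] > 0)
                     .groupby('pageviews')['purchased']
                     .mean() * 100)
print(conversion)
```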
  • Page Views vs Purchase Size

Here we can see that not only does a higher number of page views increase the likelihood of a ‘purchase’ result, it also has a positive relationship with the size of the purchase made.

Average Purchase Size by Number of Pages Viewed
  • # of Site Visits vs Visit Result

If a customer returns to the store after a previous visit, they may be more likely to make a purchase.

Average Number of Visits to the Store by Result

EDA Takeaways:

  • Purchases are the Exception — Most customers who visit the store do not make a purchase.
  • Returning Customers Make Purchases — Customers who have visited the store before are more likely to make a purchase.
  • More Active Customers Buy More — Customers that visit more pages in a visit tend to spend more, and make purchases more often.

Stay tuned for part two of this blog series where I’ll go through the iterative modeling process, modeling results, and final conclusions.
