Creating a Pageviews-Based Recommendation System Using GA4 Data in BigQuery

Introduction

In the digital age, the vast amounts of data generated by eCommerce platforms are invaluable for understanding consumer behavior and enhancing the shopping experience. Google Analytics 4 (GA4) plays a pivotal role in this landscape by providing comprehensive insights into user interactions across websites. For eCommerce websites in particular, GA4 data, once exported to BigQuery, becomes a powerful tool for detailed analysis and strategic decision-making.

BigQuery, Google’s serverless data warehouse, serves as a robust platform for working with the raw event data that GA4 exports. It lets analysts access distinct anonymous identifiers such as user_pseudo_id, follow individual users’ product-page navigation through page_location, and examine the events they perform on those pages. This granular data can be leveraged to enhance user engagement through personalized recommendations, a critical component of today’s data-driven decision-making.

This article demonstrates how to build a simple recommendation system using the public GA4 dataset in BigQuery: bigquery-public-data.ga4_obfuscated_sample_ecommerce. Using a straightforward approach built on sparse matrices and cosine similarity, we will calculate item similarities and generate user-specific recommendations. This not only showcases the utility of GA4 data for developing practical eCommerce solutions but also highlights how easily significant insights can be extracted with BigQuery.

Fundamentals of Recommendation Systems

Recommendation systems are pivotal in enhancing user experience by providing personalized content and suggestions. Essentially, these systems analyze the relationships between users and items and predict future interactions based on various patterns derived from historical data.

Components of Recommendation Systems

At its core, a recommendation system is built upon three fundamental elements:

  • Users: These are the entities receiving recommendations. Users interact with items, and those interactions generate the data that recommendation systems analyze.
  • Items: These are the products, services, or content recommended to users. Items vary widely by application, from movies, books, and products to news articles and search results.
  • Relationships (Rating or Interaction): These are the interactions between users and items. They can be explicit, such as ratings given to movies on streaming platforms, or implicit, such as browsing history or purchase records.

Application to Our Case Study

In our scenario, where we utilize GA4 data within BigQuery for crafting a recommendation system, the components translate into the following:

  • User (user_pseudo_id): Users are anonymous in our dataset, identified through user_pseudo_id. This anonymity is crucial for respecting privacy while still allowing us to analyze behavior patterns.
  • Item (page_location): The items in our context are the product pages that users visit, captured via page_location. These pages are the subjects of the recommendations and play a central role in understanding what content engages users.
  • Rating (event count): Instead of traditional ratings, our system uses the number of events recorded on each page as a proxy for user interest. This can include any interaction, from clicks to time spent on a page, giving us a quantitative measure of engagement.

By focusing on these interactions, our recommendation system can predict which items are likely to interest users based on their past behavior, thereby enhancing their browsing experience on the eCommerce platform.
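To make these mappings concrete, the training table we build in the next chapter boils down to rows like the following toy example (the user IDs, URLs, and counts below are made up for illustration):

import pandas as pd
# Made-up example rows in the (user, item, rating) shape we will derive from GA4
interactions = pd.DataFrame({
    "userId": ["12345.67890", "12345.67890", "98765.43210"],
    "pageURL": [
        "https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Hats",
        "https://shop.googlemerchandisestore.com/basket.html",
        "https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Hats",
    ],
    "rating": [4, 1, 2],  # number of events recorded on that page (implicit rating)
})
print(interactions)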

Data Preparation and Analytical Environment

A robust recommendation system hinges on high-quality data and a flexible analysis environment. This chapter details the processes involved in preparing our dataset and setting up our analytical tools.

Data Source

For this project, we use bigquery-public-data.ga4_obfuscated_sample_ecommerce, a public dataset available in BigQuery. It offers a wealth of anonymized user interaction data from an eCommerce website, making it an ideal resource for developing a behavior-based recommendation system.

Preparing Training Data

To build our recommendation system, we first need to structure our data appropriately. The goal is to create a table where each entry describes how many times a user interacted with a particular product page. This interaction count will serve as our implicit rating metric, indicating user preference.

The SQL query to generate this data is as follows:

CREATE OR REPLACE TABLE `your_project.recommendation_ga4.user_page_events` AS
SELECT
  user_pseudo_id AS userId,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS pageURL,
  COUNT(*) AS rating
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY
  userId, pageURL
HAVING
  pageURL LIKE '%shop.googlemerchandisestore.com%'

This query aggregates the total number of interactions each user has with specific product pages, focusing only on those within the ‘shop.googlemerchandisestore.com’ domain. By doing so, we ensure that our recommendation system is tailored to the most relevant part of the dataset.

Analysis Environment

Because matrix factorization models in BigQuery ML require reserved slots that go beyond BigQuery’s free on-demand tier, we opt to export our data to a Jupyter Notebook for further analysis. Jupyter Notebooks provide a versatile and interactive environment suitable for complex data manipulations and are widely used in the data science community for such tasks.

In our Jupyter Notebook, we will represent the user-item interactions as a sparse matrix and generate recommendations from it. Matrix factorization, which decomposes this matrix into latent factors, is the classic next step; for this demonstration, however, a simpler memory-based technique, cosine similarity between item columns, is enough to produce personalized recommendations based on user behavior.

Model Selection and Implementation of the Recommendation System

Matrix Factorization in Recommendation Systems

Matrix factorization is a class of collaborative filtering algorithms used in recommendation systems. Its primary purpose is to decompose the user-item interaction matrix into lower-dimensional matrices. This decomposition uncovers latent factors associated with users and items, making it possible to predict a user’s preference for items they have not yet interacted with. Matrix factorization is attractive for data like ours because it handles sparse data well, where many interactions are unknown or missing; in this demo, however, we work directly on the same sparse user-item matrix and measure item-item similarity with cosine similarity rather than learning latent factors.

(Image: factorizing a user-item ratings matrix into user and item factor matrices. Source: https://www.linkedin.com/pulse/fundamental-matrix-factorization-recommender-system-saurav-kumar)
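Although the pipeline below does not train a factorization model, a toy sketch helps make the idea concrete. The following illustrative snippet (made-up matrix, scikit-learn’s NMF used purely for demonstration) factorizes a small ratings matrix into two low-rank factors whose product approximates the original:

import numpy as np
from sklearn.decomposition import NMF
# Toy user-item matrix: 4 users x 5 items, 0 means "no interaction observed"
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 0],
    [0, 0, 5, 4, 4],
], dtype=float)
# Decompose R into user factors W (4 x 2) and item factors H (2 x 5)
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)
H = model.components_
# W @ H approximately reconstructs R; the filled-in zeros act as predicted preferences
print(np.round(W @ H, 2))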

Why Sparse Matrices?

In scenarios involving large datasets, such as our user-page interactions from GA4 data, the interaction matrix becomes extremely large and mostly empty (sparse), which can overwhelm system memory if handled in its dense form. Using sparse matrices efficiently represents this data by only storing non-zero entries, significantly reducing memory usage and computational overhead. This approach ensures that our system remains scalable and responsive even as the dataset grows.
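To see the savings concretely, here is a small, self-contained comparison using made-up dimensions (independent of our actual dataset):

import numpy as np
from scipy.sparse import random as sparse_random
# Illustrative sizes: 100,000 users x 10,000 pages with roughly 0.05% of cells populated
interactions = sparse_random(100_000, 10_000, density=0.0005, format="csr", random_state=0)
# A dense float64 matrix of the same shape would need rows * cols * 8 bytes
dense_gb = np.prod(interactions.shape) * 8 / 1e9
csr_mb = (interactions.data.nbytes + interactions.indices.nbytes + interactions.indptr.nbytes) / 1e6
print(f"dense equivalent: ~{dense_gb:.1f} GB, CSR actually stored: ~{csr_mb:.1f} MB")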

Implementing the Recommendation System

The implementation of our recommendation system involves several steps, meticulously designed to ensure optimal performance and accuracy. Below is a breakdown of the code along with explanations for each segment:

  1. Data Retrieval: Here, we set up a query to retrieve data from BigQuery, focusing on user interactions with product pages under a specific domain.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder
from google.cloud import bigquery
# Initialize a BigQuery client (uses your default Google Cloud credentials)
client = bigquery.Client()
# Pull the user/page/rating triples prepared in the previous step
query = """
SELECT userId, pageURL, rating
FROM `your_project.recommendation_ga4.user_page_events`
WHERE pageURL LIKE '%shop.googlemerchandisestore.com%'
"""
query_job = client.query(query)
data = query_job.to_dataframe()

2. Data Preprocessing: We encode user IDs and page URLs into contiguous integer indices suitable for building a matrix, and convert ratings to floats for the similarity calculations.

user_encoder = LabelEncoder()
page_encoder = LabelEncoder()
data['encoded_userId'] = user_encoder.fit_transform(data['userId'])
data['encoded_pageURL'] = page_encoder.fit_transform(data['pageURL'])
data['rating'] = pd.to_numeric(data['rating'], errors='coerce').fillna(0).astype(float)
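As an optional sanity check, the encoders are reversible, so you can map the encoded indices back to the raw values and confirm the dimensions of the matrix we are about to build (this check is illustrative and can be skipped):

# The encoders are reversible: indices map back to the original IDs and URLs
row = data.iloc[0]
assert user_encoder.inverse_transform([row['encoded_userId']])[0] == row['userId']
assert page_encoder.inverse_transform([row['encoded_pageURL']])[0] == row['pageURL']
print(f"{len(user_encoder.classes_)} users x {len(page_encoder.classes_)} pages")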

3. Sparse Matrix Construction: We create a sparse matrix to handle our large, but sparsely populated, user-item interaction data efficiently.

ratings_matrix = csr_matrix(
    (data['rating'], (data['encoded_userId'], data['encoded_pageURL'])),
    shape=(len(user_encoder.classes_), len(page_encoder.classes_))
)

4. Similarity Calculation: Using cosine similarity, we compute how similar pages are to one another based on which users interacted with them. This item-item similarity is what lets us recommend pages similar to those a user has already engaged with.

item_similarity = cosine_similarity(ratings_matrix.T, dense_output=False)
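To build intuition for what this matrix contains, you can look up the pages most similar to any given page. The helper below is an illustrative sketch rather than part of the core pipeline; the example URL is a placeholder and should be replaced with a pageURL that actually appears in your table:

def most_similar_pages(page_url, top_n=5):
    # Index of the page in the encoder (raises an error if the URL was never seen)
    page_idx = page_encoder.transform([page_url])[0]
    # One row of the item-item similarity matrix, as a dense vector
    sims = item_similarity[page_idx].toarray().ravel()
    sims[page_idx] = -1  # exclude the page itself
    top_idx = np.argsort(sims)[::-1][:top_n]
    return list(zip(page_encoder.inverse_transform(top_idx), sims[top_idx]))
# Placeholder URL; replace with a pageURL present in your data
print(most_similar_pages('https://shop.googlemerchandisestore.com/basket.html'))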

5. Recommendation Function: This function fetches recommendations for a user by comparing their interaction profile with the item similarity matrix.

def recommend_items(user_id, item_similarity, user_item_matrix, top_n=5):
    # Map the raw user_pseudo_id to its encoded row index
    encoded_user_id = user_encoder.transform([user_id])[0]
    # The user's interaction row (1 x n_items sparse vector)
    user_interactions = user_item_matrix[encoded_user_id, :]
    # Score every page by its similarity to the pages the user interacted with
    scores = item_similarity.dot(user_interactions.T).toarray().ravel()
    # Exclude pages the user has already visited
    known_items = user_interactions.nonzero()[1]
    scores[known_items] = -1
    # Take the top_n highest-scoring pages and map them back to URLs
    recommended_item_indices = np.argsort(scores)[::-1][:top_n]
    recommended_items = page_encoder.inverse_transform(recommended_item_indices)
    return recommended_items

6. Testing the System: We test the recommendation system with a specific user ID to see the output in action.

print("Recommend pages:", recommend_items('7535394.4942370715', item_similarity, ratings_matrix))

Demo output

Recommend pages: ['https://shop.googlemerchandisestore.com/signin.html'
'https://shop.googlemerchandisestore.com/basket.html'
'https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Hats'
'https://shop.googlemerchandisestore.com/Google+Redesign/eco+friendly'
'https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Mens']

The output from the system provides a list of recommended pages based on the user’s past interactions, demonstrating the practical application of sparse-matrix representations and item-similarity calculations in generating personalized content recommendations.

Conclusion

In this article, we’ve explored how to harness GA4 data in BigQuery to construct a behavior-based recommendation system, using a straightforward item-based collaborative filtering approach built on sparse matrices and cosine similarity. This demonstration has not only provided a clear path for creating personalized recommendations from web browsing data but has also illuminated some of the broader applications of machine learning in e-commerce settings.

Challenges in Real-World Applications

While the demo offers a glimpse into the potential of such systems, implementing them in real-world environments introduces several challenges:

  1. Data Streaming and Real-Time Processing: In an operational setting, user data often streams in real-time. Handling this data efficiently, updating recommendation models dynamically, and ensuring that the system can respond in real time are critical requirements that add complexity to the deployment of recommendation systems.
  2. Scalability: As the number of users and items grows, the scalability of the system becomes paramount. The techniques used must be capable of scaling not just in terms of data volume, but also with respect to the speed at which recommendations are generated and updated.
  3. Privacy and Ethics: Ensuring user privacy and adhering to ethical standards in how data is collected, processed, and used is another significant challenge, especially with systems that track user behavior so closely.

Future Directions

Looking ahead, there are several exciting directions in which this work could be extended:

  1. Incorporation of Deep Learning: By integrating deep learning techniques, we could enhance the system’s ability to learn from complex patterns in the data. Deep learning could allow the incorporation of additional features such as user demographics, time spent on pages, and even content analysis of the pages themselves to refine recommendations further.
  2. Feature Enrichment: Expanding the data model to include more nuanced features such as session duration, user engagement metrics, and sequential interaction patterns could provide a more detailed view of user preferences and lead to more accurate recommendations.
  3. Hybrid Models: Combining matrix factorization with other recommendation approaches, such as content-based filtering and rule-based systems, could leverage the strengths of multiple methods to improve recommendation quality and relevance.

This project serves as a foundational step towards building sophisticated, real-time recommendation engines that can drive user engagement and business growth. By continuously adapting to new data and integrating more complex algorithms, businesses can significantly enhance the effectiveness of their recommendation systems, ensuring they remain competitive in an increasingly data-driven marketplace.
