Building a Real-Time Product Recommender System with Graph Databases: Leveraging Neo4j and BigQuery for E-commerce Data Analysis

Majid · Published in Badal-io · May 1, 2023

Introduction

E-commerce has revolutionized the way people shop, providing convenience, accessibility, and a vast array of products to choose from. However, with the explosive growth of online shopping, it has become increasingly challenging for businesses to keep up with the volume and complexity of data generated by customer interactions, transactions, and other activities online.

One important application of e-commerce data is building product recommender systems, which analyze user behavior to provide personalized product recommendations. This can help customers find products they are interested in and increase the likelihood of making a purchase. However, building an effective recommender system for e-commerce data requires understanding the complex relationships between various entities, such as users, products, and sales data. This is where graph databases can play a crucial role, providing a powerful tool for modeling and analyzing these relationships in a flexible and efficient way. These systems aim to enhance customer experience, increase sales, and improve overall business performance.

This article explains how to create a real-time product recommender system for the Google merchandise store using Neo4j and its Graph Data Science ecosystem. The e-commerce data used for this project is publicly available on BigQuery and is provided by Google. You can find the schema and field descriptions by following this link.

What is a recommender system?

At its core, a classical recommender system is the product of factorizing a co-occurrence matrix. This factorization yields vectors for our customers and products such that the similarity between two vectors serves as a proxy for the similarity between two customers, between two products, or between a customer and a product. This similarity is typically measured as the cosine similarity of the vectors.
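Concretely, cosine similarity is just the normalized dot product of two vectors. A minimal self-contained sketch (the three-dimensional vectors are made up for illustration; real embedding vectors are much longer):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A customer vector compared against a scaled copy of itself scores ~1.0,
# while orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 0.5], [2.0, 4.0, 1.0]))  # ≈ 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```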

The two primary types of recommender systems are content-based filtering and collaborative filtering. Content-based filtering focuses on the similarities between products; for instance, products that belong in the same category as those previously purchased by the user are recommended. Conversely, collaborative filtering examines the interactions between the users and the store such that the target user is recommended products based on other users who share similar habits and preferences.

Purposes of a graph database for the recommender system

Casting the data in a graph is intuitive for recommender systems as it captures the relationships between customers and products dynamically. Relationships that might otherwise stay hidden, or require complicated logic to surface, become apparent at first glance as paths between nodes. Customers interacting with products can be represented explicitly through nodes and edges. Furthermore, it is straightforward to include or exclude information in a graph structure, and to make adjustments to the initial schema.

Graphs vs. Tabular Data

Graphs make it easier to see connections between different pieces of information.

For example, if you wanted to find a category that is a subcategory of a subcategory of a category, it would be much easier and quicker to do this using a graph instead of a table. Graphs make it easier to find long paths between different pieces of information, like finding all the connections between visitors and the products they have viewed, without having to use complicated logic.
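In Cypher, for instance, a "subcategory of a subcategory" is simply a two-hop path. A hypothetical sketch, assuming categories are linked by the CHILD_OF relationship used in the data model later in this article (the category name is illustrative):

```cypher
// Categories that sit two CHILD_OF hops below "Apparel"
MATCH (c:Category)-[:CHILD_OF*2]->(:Category {name: "Apparel"})
RETURN c.name
```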

Advantages of Using Graphs for Embeddings in Recommendation Systems

Graphs are intuitive data structures with a lot more expressive power than tables. A graph can take into account a variety of qualities including the weights of the relationships and distinctive attributes of each node. An embedding algorithm running on a graph can distill a wider spectrum of both implicit and explicit information on a deeper, more detailed level. What’s more, incorporating additional information into the embedding vectors is relatively straightforward through a graph structure. For example, if we decide that traffic source is an important factor, we can simply add it as a node attribute which can be taken into consideration by our embedding algorithms. By contrast, traditional embeddings for recommendation systems are derived from the decomposition of co-occurrence matrices, which is a more limited representation.

Neo4j

Neo4j is a graph database that uses the SQL-inspired query language called Cypher. Cypher syntax is both logical and expressive in that it matches patterns of the paths in the graph. We will first leverage pattern matching to derive both content-based and collaborative filtering recommendations and will take a step further with the graph-data-science library to create node embeddings.

The extracted fields from the Google Analytics dataset are from the year 2016 and can be divided into:

(1) information on store visitors’ sessions;

(2) product information, such as prices and categories; and

(3) the relationship between the previous two categories — for example, the amount of time a particular session lasted on a product page.

In the graph, a product node holds the product’s id, name, category, and price as its properties. Each category also has its own node, and a category that is a subcategory of another is linked to its parent with a CHILD_OF relationship. For example, if there is a category called “Men’s Clothing” and a subcategory called “Shirts,” then the “Shirts” category node has a CHILD_OF relationship to the “Men’s Clothing” category node.

A visitor node is identified by the combination of the user_id and session_id, and carries the visitor’s country as a property. A PURCHASE relationship is established when a visitor buys a product in a given session; this relationship also bears the purchase quantity as an attribute. Lastly, the duration of time spent on a product’s page in a session is collected as a proxy for that session’s rating of the product: the longer the time spent on the page, the higher the implied rating of that product by the visitor in that session.

The graph data model is represented as follows:

At first glance, a number of opportunities become evident. One of the simplest recommendations is to suggest products in the same category as the products that the user of interest has previously purchased. This will be a Cypher query as simple as:
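The original query is not reproduced in this copy of the article; a plausible sketch, assuming products link to their category node via an IN_CATEGORY relationship and that visitors are looked up by a $userId parameter (both names are our assumptions, not the article's):

```cypher
// Recommend products from the same categories as past purchases
MATCH (v:Visitor {user_id: $userId})-[:PURCHASE]->(:Product)
      -[:IN_CATEGORY]->(c:Category)<-[:IN_CATEGORY]-(rec:Product)
WHERE NOT (v)-[:PURCHASE]->(rec)
RETURN DISTINCT rec.name AS recommendation
LIMIT 5
```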

For example, a random user who resides in the United States and has purchased an “Android 24 oz Contigo Bottle” would be recommended Google water bottles.

Instead of randomly selecting five products, we can make recommendations from the customer’s most favorite category:
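One way to express this, as a sketch: take the favorite category to be the one with the highest total purchase quantity (IN_CATEGORY is an assumed relationship name):

```cypher
// Pick the visitor's favorite category by total quantity purchased,
// then recommend unpurchased products from it
MATCH (v:Visitor {user_id: $userId})-[b:PURCHASE]->(:Product)-[:IN_CATEGORY]->(c:Category)
WITH v, c, sum(b.quantity) AS bought
ORDER BY bought DESC
LIMIT 1
MATCH (rec:Product)-[:IN_CATEGORY]->(c)
WHERE NOT (v)-[:PURCHASE]->(rec)
RETURN rec.name AS recommendation
LIMIT 5
```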

An example result for a customer who has previously purchased $250 gift cards.

We can also narrow down to the most popular products in a category of interest:
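A possible formulation, ranking by total quantity purchased across all visitors (IN_CATEGORY is again an assumed relationship name):

```cypher
// Rank products in the visitor's categories by overall popularity
MATCH (v:Visitor {user_id: $userId})-[:PURCHASE]->(:Product)-[:IN_CATEGORY]->(c:Category)
MATCH (:Visitor)-[b:PURCHASE]->(rec:Product)-[:IN_CATEGORY]->(c)
WHERE NOT (v)-[:PURCHASE]->(rec)
RETURN rec.name AS recommendation, sum(b.quantity) AS popularity
ORDER BY popularity DESC
LIMIT 5
```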

As an example, we have Google t-shirts and hoodies offered to someone who has purchased from the category of Apparel.

So far we have explored making recommendations based on content-based filtering. Another method would be: for a product p purchased by a visitor v1, find a visitor v2 who has also visited or purchased p and recommend product q that they visited or purchased of the same category to v1. Instead of randomly selecting product(s) q in the same category, we can easily rank the most popular products and make recommendations:
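A sketch of that pattern, ranking candidate products q by how many overlapping visitors touched them (relationship and parameter names are illustrative):

```cypher
// v2 shares a product p with v1; rank same-category products q that v2 touched
MATCH (v1:Visitor {user_id: $userId})-[:PURCHASE|VISIT]->(p:Product)
      <-[:PURCHASE|VISIT]-(v2:Visitor)-[:PURCHASE|VISIT]->(q:Product)
MATCH (p)-[:IN_CATEGORY]->(c:Category)<-[:IN_CATEGORY]-(q)
WHERE v1 <> v2 AND NOT (v1)-[:PURCHASE|VISIT]->(q)
RETURN q.name AS recommendation, count(DISTINCT v2) AS score
ORDER BY score DESC
LIMIT 5
```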

We can further explore collaborative filtering for recommendations. For instance, a minor adjustment in the above query can yield the customers with the most similar preferences and recommend their favorite purchases (not necessarily from the same category). This similarity in preference is inferred from mutually purchased products:
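Sketched out, the adjustment ranks other visitors by the number of mutually purchased products and then recommends their purchases regardless of category:

```cypher
// Find the 10 visitors with the largest purchase overlap with v1,
// then recommend what they bought most
MATCH (v1:Visitor {user_id: $userId})-[:PURCHASE]->(p:Product)<-[:PURCHASE]-(v2:Visitor)
WHERE v1 <> v2
WITH v1, v2, count(DISTINCT p) AS overlap
ORDER BY overlap DESC
LIMIT 10
MATCH (v2)-[b:PURCHASE]->(rec:Product)
WHERE NOT (v1)-[:PURCHASE]->(rec)
RETURN rec.name AS recommendation, sum(b.quantity) AS score
ORDER BY score DESC
LIMIT 5
```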

In another scenario, for each visitor, find other visitors who have rated products similarly and recommend their favorite products to the target visitor. Recall that the time spent viewing a product is a proxy for rating:
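One way to write this uses the GDS `gds.similarity.cosine` function over the aligned duration lists of mutually viewed products; the SIMILAR_TO relationship name, the minimum overlap of 3, and the 0.9 cutoff below are our choices for illustration:

```cypher
// Compare visitors on products both have viewed; durations act as ratings
MATCH (v1:Visitor)-[r1:VISIT]->(p:Product)<-[r2:VISIT]-(v2:Visitor)
WHERE id(v1) < id(v2)
WITH v1, v2,
     collect(r1.duration) AS ratings1,
     collect(r2.duration) AS ratings2
WHERE size(ratings1) >= 3   // require a minimum overlap of shared products
WITH v1, v2, gds.similarity.cosine(ratings1, ratings2) AS similarity
WHERE similarity > 0.9
MERGE (v1)-[s:SIMILAR_TO]->(v2)
SET s.score = similarity
```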

This calculates the cosine similarity between customers by examining how they rated mutually viewed products.

Now for a certain visitor v1, we can use this similarity relationship to recommend products that similar visitors have highly rated or purchased:
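A sketch, weighting each candidate product by the neighbor's similarity score and viewing duration (SIMILAR_TO is the relationship assumed to have been created in the previous step):

```cypher
// Recommend what similar visitors rated highly, weighted by similarity
MATCH (v1:Visitor {user_id: $userId})-[s:SIMILAR_TO]-(v2:Visitor),
      (v2)-[r:VISIT]->(rec:Product)
WHERE NOT (v1)-[:VISIT|PURCHASE]->(rec)
RETURN rec.name AS recommendation, sum(s.score * r.duration) AS weighted
ORDER BY weighted DESC
LIMIT 5
```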

You can see how the flexibility of a graph data model and the intuitive syntax of Cypher can enable various use cases. Neo4j provides a data science library package called Graph Data Science. This package comes with several algorithms for machine learning, path finding, community detection, node embeddings, etc. We can take advantage of the embedding algorithms to calculate vectors for our nodes such that these vectors capture the information pertaining to relationships and similarities between nodes. This way, two similar products or visitors end up having high cosine similarities, which will make it easier to detect and offer relevant recommendations.

We use Vertex AI to host our Jupyter Notebook environment and carry out the experimentation. We first instantiate a Graph Data Science client with a connection to our Neo4j AuraDS instance, which houses the data:
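With the official `graphdatascience` Python client, the connection looks roughly like this (the URI and credentials are placeholders):

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    "neo4j+s://<your-aurads-instance>.databases.neo4j.io",
    auth=("neo4j", "<password>"),
    aura_ds=True,  # apply AuraDS-recommended driver settings
)
print(gds.version())  # sanity-check the connection
```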

As per the documentation, we need to create projections of our graph in order to calculate the node embeddings; we include all node labels and relationships. To highlight the importance of a PURCHASE relationship relative to a VISIT relationship, and the difference in durations of different VISITs, we can assign a new weight property to the relationships and project this property as well. In this iteration, weight is equal to the session’s duration for a VISIT, and is equal to the purchase quantity multiplied by 9000, the maximum duration of any session that occurred on the website. This factor can be tuned during later iterations if necessary. Moreover, a similar large weight is assigned to CHILD_OF relationships to enforce the affinity of the categories.
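Assuming the weight property has already been written onto the relationships as described, the projection might look like this with the Python client (the projection name and label set follow the schema above):

```python
# Project all three labels and carry the precomputed `weight` property
G, result = gds.graph.project(
    "ecommerce",
    ["Product", "Category", "Visitor"],
    {
        "VISIT":    {"orientation": "UNDIRECTED", "properties": "weight"},
        "PURCHASE": {"orientation": "UNDIRECTED", "properties": "weight"},
        "CHILD_OF": {"orientation": "UNDIRECTED", "properties": "weight"},
    },
)
```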

The next step is to execute the embedding algorithm over this projection. We chose the Fast Random Projection (FastRP) algorithm due to its accuracy and speed:
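A sketch of the call with the Python client, where `G` is the projection created above and the parameter names follow the GDS FastRP documentation:

```python
# Compute weighted FastRP embeddings into the in-memory projection
result = gds.fastRP.mutate(
    G,
    embeddingDimension=256,
    relationshipWeightProperty="weight",
    mutateProperty="embedding",
)
```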

The embedding dimension of 256 was chosen according to the documentation’s recommendation, which is based on the size of the graph; that said, the dimension can be experimented with. Once the embeddings are calculated, they can be written back to the graph as a new node property:
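With the client, one way to persist the computed property (the exact helper name has shifted across `graphdatascience` versions, so treat this as indicative):

```python
# Write the in-memory `embedding` property back onto the database nodes
gds.graph.writeNodeProperties(G, ["embedding"])
```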

We can now make another projection of the graph to run the k-nearest neighbors (k-NN) algorithm, which would yield the most similar products to a given product:
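A possible sketch: project only products with their stored embeddings, then have k-NN write SIMILAR relationships back to the database (projection and relationship names are our choices):

```python
# Project products with the persisted embedding property
G_products, _ = gds.graph.project(
    "products", {"Product": {"properties": "embedding"}}, "*"
)

# Link each product to its 2 nearest neighbours in embedding space
gds.knn.write(
    G_products,
    nodeProperties=["embedding"],
    topK=2,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)
```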

This command will create a SIMILAR relationship between a product p and its top two closest products p1 and p2, as we have set topK=2. It is worth mentioning that this is not a production-ready system capable of providing recommendations in real time and/or at massive scale. It is not practical to recreate SIMILAR relationships for the entire database every time recommendations are requested, and upon the introduction of new products, the similarity relationships need to be recalculated over the entire dataset.

Vertex AI Matching Engine can import these embeddings to enable fast online similarity search. It offers scalable, low-latency approximate nearest neighbor (ANN) matching as a service, which is desirable as our data and recommendations will grow rapidly. To use this service, we can simply provide our embedding vectors from a cloud storage bucket and create indexes to be deployed.

Once the embeddings are calculated, they are stored in a JSON list format, where each line is a JSON object containing an id and its corresponding vector. This file is made available to the indexing algorithm via a cloud storage bucket link:
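The layout is one JSON object per line, each holding an id and its vector. A toy sketch that writes such a file (the ids and three-dimensional vectors are made up; the real vectors have 256 dimensions):

```python
import json

# Made-up embeddings standing in for the FastRP output
embeddings = {
    "product_123": [0.12, -0.40, 0.05],
    "product_456": [0.33, 0.10, -0.27],
}

# One JSON object per line, as expected by the indexing service
with open("embeddings.json", "w") as f:
    for pid, vec in embeddings.items():
        f.write(json.dumps({"id": pid, "embedding": vec}) + "\n")
```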

Once the index is created, we create an endpoint and deploy it:
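With the `google-cloud-aiplatform` SDK, index creation, endpoint creation, and deployment look roughly like this; project, bucket, network, and display names are all placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Build a Tree-AH ANN index from the embeddings file in Cloud Storage
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="product-embeddings",
    contents_delta_uri="gs://my-bucket/embeddings/",
    dimensions=256,
    approximate_neighbors_count=10,
)

# Create an endpoint on the VPC network and deploy the index to it
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="product-embeddings-endpoint",
    network="projects/<project-number>/global/networks/<vpc-name>",
)
endpoint.deploy_index(index=index, deployed_index_id="products_v1")
```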

Now we can query the deployed index through the online querying gRPC API (the matching service) from virtual machine instances in the same region. The input to this command is a list of embedding vectors for the objects whose neighbors we want to find. A list of object ids and their respective distances is returned as output:
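A sketch of the call, continuing from the deployment above (`query_embedding` stands in for the 256-dimensional vector of the product of interest):

```python
# Ask the deployed index for the 3 nearest neighbours of one product
response = endpoint.match(
    deployed_index_id="products_v1",
    queries=[query_embedding],
    num_neighbors=3,
)
for neighbor in response[0]:
    print(neighbor.id, neighbor.distance)
```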

As an example, the product “Rubber Grip Ballpoint Pen 4 Pack” returns the following products as recommendations: “YouTube Leatherette Notebook Combo”, “Ballpoint Pen Blue”, and “Maze Pen”.

How are the recommendations from Vertex AI different?

The Vertex index searches an embedding space of products in order to find ones that are most suitable to a user’s taste. This embedding space is achieved via the FastRP algorithm which encodes the wealth of information available in the graph model and therefore is an enriched implicit approach to a recommendation. New information can be simply integrated into the graph model to be accounted for in the embeddings. This approach can offer recommendations that may not be apparent at first glance. Previously, however, we traversed the graph to arrive at recommendations that could be considered a more explicit approach.

Conclusion

In conclusion, the use of graphs as a data representation is a powerful way to analyze data with various attributes and relationships. By using tools such as Neo4j, we have demonstrated the ability to utilize the intuitive Cypher syntax and the Graph Data Science library to create node embeddings and make recommendations using collaborative filtering and content-based filtering. In addition, we showcased how to leverage Vertex AI to host our Jupyter Notebook environment, create projections of our graph, and use the Matching Engine to import embeddings and perform fast online similarity searches. The combination of a flexible graph data model and advanced tools such as Neo4j and Vertex AI allows for complex data structures to be analyzed and accurate recommendations to be made.

Overall, the use of graph databases and product recommendation systems in e-commerce can bring several business values, such as:

  1. Increased sales: By providing personalized and relevant product recommendations to customers, businesses can increase the likelihood of making a sale.
  2. Enhanced customer experience: Customers who receive personalized recommendations are more likely to feel that the business understands their preferences and needs, which can lead to increased customer loyalty and repeat business.
  3. Improved operational efficiency: By leveraging graph databases to analyze customer behavior and preferences, businesses can optimize their product offerings and supply chain operations to better meet customer demands.
  4. Competitive advantage: Having a robust recommendation system can give businesses a competitive edge over others in the market, as it allows them to better understand their customers and provide more tailored and relevant products and services.
