AI in CVS Front Store E-commerce: Building A Complementary Product Bundle Recommender

Tom Zhang
CVS Health Tech Blog
11 min read · Sep 5, 2024

Section I: Problem Statement

At CVS Health, we are constantly seeking ways to enhance our e-commerce platform cvs.com. One promising avenue is to implement a complementary product bundle recommendation feature on our product description pages (PDPs). Imagine a customer browsing for a toothbrush and seeing recommendations for toothpaste, dental floss, mouthwash, teeth whitening kits, etc. — products that naturally go together. A version of this can already be found on our site under the Frequently Bought Together (FBT) section of each PDP.

Traditionally, techniques such as association rule mining or market basket analysis have been used to identify products that are frequently purchased together. While these classical methods are effective, we aim to leverage more advanced recommendation system techniques to create even more meaningful and synergistic bundles. This involves utilizing modern algorithms such as Graph Neural Networks (GNN) and generative AI.

The focus of this article is on extending the FBT feature into FBT Bundles. Unlike the regular FBT, FBT Bundles offer a smaller set of recommendations (a bundle includes the source product plus two other recommendations), with each item in the bundle having strong synergy with the others. We envision this feature algorithmically assembling high-quality bundles, such as:

  • Self-care Package (scented candle + bath salt + face mask)
  • Travel Kit (neck pillow + travel adapter + toiletries)

This strategy not only boosts sales but also enhances the customer experience, fostering greater loyalty.

Currently, we do not have the FBT Bundles feature in production, but we are exploring the development of a Minimum Viable Product (MVP). This article outlines our journey and the methodologies we employed to achieve this goal.

Section II: High-Level Approach

The core of our solution involves the Graph Neural Network (GNN) architecture. Inspired by Yan et al. (2022), we adapted their GNN framework to fit our use case, incorporating our own modifications and enhancements.

The implementation consists of three main components:

  1. Product embeddings with a GNN
  2. User embeddings with a transformer
  3. A re-ranking scheme to personalize recommendations

Here is a high-level overview of our technical solution:

Section III: In-Depth Methodology

Part 1: Product Embeddings

Module A: Discovering Product Segment Complementarity Relations With GPT-4

The concept of embeddings plays a significant role in this solution. This technique converts text, such as product names, into numerical vectors, enabling machine learning models to capture semantic relationships between words.

We will use a GNN to produce an embedding for each product, such that relevant and complementary products sit closer together in the embedding space. Training this GNN requires a product-relation graph.

In Yan et al. (2022), the authors referenced a method by Hao et al. (2020) for building a graph based on user interaction data. This approach analyzed patterns of product co-purchase, co-view, and purchase-after-view. We also experimented with this implementation but found the results unsatisfactory due to extremely noisy data. Specifically, we observed that CVS customers often have multiple purposes when shopping; therefore, purchasing two items in one session does not necessarily mean the items are complementary.

To illustrate this, here are some highlights from an exploratory data analysis (EDA) we conducted:

  • We sampled 1 million recent transactions involving more than one item. On average, 85% of the items in each transaction came from distinct product categories.
  • Additionally, we manually inspected 500 randomly selected transactions and labeled which items were complementary. In 81% of these sampled transactions, at least one item was completely unrelated to the others.

The takeaway from our analysis is that transaction data alone is not a reliable method for identifying product complementarity. We needed a more accurate approach, and fortunately, we had the power of generative AI at our disposal (additionally, CVS has an AI governance review procedure to assess and mitigate risks, and we complied with the necessary processes). By leveraging GPT-4, we could ask which other products are complementary to a given product from a list of all available products. However, given that the e-commerce site curates over 25,000 products, including all product titles in a prompt would be time-consuming and expensive. Therefore, we decided to work at a higher level in the product hierarchy tree. The following is an example of what one might look like:

Note that the product catalog at CVS is organized into several hierarchy levels, with SKU at the bottom. As we move up, we encounter levels such as brand, segment, category, etc. We determined that the segment level offers a meaningful granularity to work with. There are approximately 600 distinct product segments. For each segment, we used the GPT-4 API to identify the top 10 most complementary segments from the available list. Running this process locally on a Mac took approximately 2 hours, with an estimated cost of $162 based on OpenAI API’s public pricing:
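The article does not reproduce the exact prompt used. As an illustration only, a query of this kind could be assembled and its reply parsed roughly as follows; the prompt wording and the helper names `build_prompt` and `parse_segments` are hypothetical, not the production implementation:

```python
# Sketch of asking an LLM for complementary product segments.
# The prompt text and reply format below are illustrative assumptions.

def build_prompt(source_segment: str, all_segments: list[str], k: int = 10) -> str:
    """Assemble a prompt asking for the top-k complementary segments."""
    catalog = "\n".join(f"- {s}" for s in all_segments)
    return (
        f"From the list of product segments below, pick the {k} segments most "
        f"complementary to '{source_segment}'. Reply with one segment per line.\n"
        f"{catalog}"
    )

def parse_segments(response_text: str, all_segments: list[str]) -> list[str]:
    """Keep only reply lines that match a known segment, preserving rank order."""
    valid = set(all_segments)
    picked = []
    for line in response_text.splitlines():
        name = line.strip().lstrip("- ").strip()
        if name in valid and name not in picked:
            picked.append(name)
    return picked
```

Constraining the parse to the known segment list also guards against the model inventing segment names that don't exist in the catalog.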

Module B: Evaluating GPT-4 Output

To ensure the accuracy and relevance of GPT-4's output, we implemented a thorough evaluation process, outlined as follows.

1. Prepare a Curated Dataset

  • Sampling: We took a sample of approximately 10% of all product segments.
  • Modified Prompt: For this sample, we ran the GPT-4 step with a modified prompt asking for 30 complementary segments instead of the usual 10.
  • Manual Review: Each segment in the sample was manually reviewed to filter out irrelevant segment mappings, creating a ground truth dataset.
  • Minimizing Bias: To minimize bias, multiple reviewers evaluated the same dataset, with the final ground truth taken as the overlapping consensus.

2. Full Pass of GPT-4

  • Empirical Mappings: We then ran GPT-4 normally to generate the empirical complementary segments mapping, where each source segment corresponds to 10 complementary segments.

3. Calculate Metrics

Metrics Calculation: Using the empirical mappings and the curated ground truth, we calculated the following metrics:

  • Precision @ K (P@K): Measures the proportion of the top-k segments identified by GPT-4 that are actually complementary according to the ground truth (k=10).
  • Mean Reciprocal Rank (MRR): The reciprocal rank of the first relevant complementary segment in the GPT-4 output, averaged over all mappings.
  • Normalized Discounted Cumulative Gain (NDCG): Assigns higher scores to lists where relevant segments appear earlier.
  • Hit Rate: Proportion of segments for which at least one complementary segment identified by GPT-4 is correct based on the ground truth.
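With binary relevance (a predicted segment is either in the ground truth or not), the four metrics above can be computed directly; a minimal sketch:

```python
import math

def precision_at_k(predicted: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k predictions that appear in the ground truth."""
    return sum(1 for p in predicted[:k] if p in relevant) / k

def reciprocal_rank(predicted: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant prediction (0 if none is found)."""
    for i, p in enumerate(predicted, start=1):
        if p in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(predicted: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: correct predictions score more when ranked earlier."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, p in enumerate(predicted[:k], start=1) if p in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def hit_rate(mappings: dict[str, list[str]], truth: dict[str, set[str]]) -> float:
    """Share of source segments with at least one correct predicted segment."""
    hits = sum(1 for seg, preds in mappings.items()
               if set(preds) & truth.get(seg, set()))
    return hits / len(mappings)
```

MRR and the aggregate hit rate are then averaged or computed over all source segments in the curated sample.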

4. Notes on Metrics

  • Range: All metrics range from 0 to 1, with higher values indicating better performance.
  • Caveat: The effectiveness of these metrics relies on the assumption that the ground truth dataset is sufficiently comprehensive. While we made significant efforts to curate a high-quality dataset through manual reviews and multiple reviewers to minimize bias, no dataset can be entirely free of subjectivity or errors.

The following is our evaluation output.

The evaluation results indicated strong performance.

Module C: Learning Product Embeddings

With complementary relationships identified at the segment level, we can now narrow down to the SKU level to build a graph (each node being a SKU). The logic is that if two segments, A and B, are complementary, then all SKUs under segment A would have an edge connecting to all SKUs under segment B.
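The segment-to-SKU expansion described above amounts to a cross product over SKU lists; a small sketch (function and input names are illustrative):

```python
from itertools import product

def build_sku_edges(comp_segments: dict[str, list[str]],
                    skus_by_segment: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Connect every SKU in a segment to every SKU in each complementary segment.

    Edges are stored as sorted tuples so (a, b) and (b, a) dedupe to one edge.
    """
    edges = set()
    for seg_a, complements in comp_segments.items():
        for seg_b in complements:
            for sku_a, sku_b in product(skus_by_segment.get(seg_a, []),
                                        skus_by_segment.get(seg_b, [])):
                if sku_a != sku_b:
                    edges.add(tuple(sorted((sku_a, sku_b))))
    return edges
```

In practice the edge set then gets annotated with the edge-level features described below (e.g. co-purchase counts) before training.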

One of the key strengths of a GNN is its scalability potential, allowing the incorporation of both node-level and edge-level features. For node-level features, we considered attributes such as product sales volume and price. As an edge-level feature, we included the co-purchase count.

Implementation-wise, we followed the model architecture described in Yan et al. (2022), utilizing an advanced version of GNN known as Graph Attention Network (GAT). To train the GAT, we defined a custom loss function that prioritized pairs of nodes with the following qualities:

  • An edge exists between them
  • High values in the edge-level feature co-purchase count
  • High values in the node-level feature sales volume
  • Low values in the node-level feature price
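The exact loss from Yan et al. (2022) is not reproduced in this article. Purely to illustrate how the four priorities above could enter an objective, here is a toy contrastive-style sketch in which a per-pair weight grows with co-purchase count and sales volume and shrinks with price; the functional form is an assumption, not the production loss:

```python
import math

def pair_weight(copurchase: int, sales_a: float, sales_b: float,
                price_a: float, price_b: float) -> float:
    """Illustrative pair weight: co-purchases and sales volume increase it,
    higher prices damp it (log scaling keeps heavy-tailed counts in check)."""
    return (math.log1p(copurchase) * math.log1p(sales_a + sales_b)
            / (1.0 + math.log1p(price_a + price_b)))

def weighted_pair_loss(sim: float, connected: bool,
                       weight: float, margin: float = 0.5) -> float:
    """Contrastive-style term on a pair's embedding similarity: pull connected
    pairs together (scaled by weight), push unconnected pairs below the margin."""
    if connected:
        return weight * (1.0 - sim)
    return max(0.0, sim - margin)
```

Under such a scheme, high-weight connected pairs dominate the gradient, so frequently co-purchased, high-volume, affordable pairs end up closest in the embedding space.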

Practically, we can already make recommendations based on our product embeddings alone using the following logic: For a given product, we retrieve its embedding and then find the top k products in the embedding space that have the highest cosine similarity to it. This approach makes sense because, after training the GAT, the embedding space is structured so that relevant and complementary products are geometrically closer together (see illustration of this idea below). It’s worth noting that the recommendations we make at this stage are generic and not personalized for individual users.
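The nearest-neighbor retrieval described above is straightforward; a minimal NumPy sketch (for 25,000 products, a vectorized or approximate-nearest-neighbor implementation would be preferable):

```python
import numpy as np

def top_k_complements(source_id: str,
                      embeddings: dict[str, np.ndarray],
                      k: int = 2) -> list[str]:
    """Return the k products whose embeddings have the highest cosine
    similarity to the source product's embedding."""
    src = embeddings[source_id]
    src = src / np.linalg.norm(src)
    scored = []
    for pid, vec in embeddings.items():
        if pid == source_id:
            continue  # don't recommend the product to itself
        scored.append((float(src @ (vec / np.linalg.norm(vec))), pid))
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]
```

With k=2, the source product plus the two retrieved neighbors form one candidate FBT Bundle.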

Part 2: User Embeddings

The purpose of user embeddings is to personalize our recommendations at the user level. Here is the logic:

  1. Retrieve Recent Purchases: For each user, we retrieve their 10 most recently purchased products.
  2. Encode Textual Features: We encode the textual features (title, description, etc.) of these products into vectors using Word2Vec.
  3. Transform to User Context Vector: These encoded textual feature vectors are then passed into a transformer that outputs a single vector. We call this the user context vector because it captures a user’s recent purchase history and provides a good representation of their preferences. Each user, therefore, has one user context vector.

Note that currently, we only consider the 10 most recent items from a user’s purchase history for personalization. In the future, we intend to enhance this framework by incorporating additional features such as demographic and socioeconomic variables.
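The production model is a learned transformer over the encoded item vectors. Purely as an illustration of the pooling idea, here is a single attention-weighted pooling step in NumPy, with a fixed query vector standing in for the learned attention parameters:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def user_context_vector(item_vecs: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Attention-pool the (up to 10) recent-item vectors into one context
    vector. item_vecs has shape (n_items, dim); the query stands in for
    learned attention weights in this toy version."""
    scores = item_vecs @ query    # (n_items,) relevance of each item
    weights = softmax(scores)     # attention weights summing to 1
    return weights @ item_vecs    # weighted average -> (dim,)
```

Items the attention scores as relevant dominate the pooled vector, so recent purchases that best reflect the user's preferences shape the context most.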

Part 3: Re-Ranking Scheme

The user embeddings obtained from the previous step alone do not provide personalization; we need an additional step to achieve this. Here is the logic:

  1. Identify User Context Embedding: First, we retrieve the user context embedding we want to personalize.
  2. Generate Non-Personalized Recommendations: For a given product, we obtain the top k non-personalized recommendations based on cosine similarity in the embedding space.
  3. Apply Personalization: For each recommended product embedding, we compute the Hadamard product (element-wise multiplication) with the user embedding vector.
  4. Calculate Re-ranking Scores: We average the elements of each resulting vector to get a single re-ranking score (higher is better).
  5. Re-rank Recommendations: Using these scores, we re-rank the non-personalized recommendations to create a set of customized recommendations for the user in question.

This process ensures that the final set of recommendations is tailored to the specific preferences and recent purchase history of the user.

Why does this work? When we combine the non-personalized product embedding with the user embedding through element-wise multiplication, the resulting vector emphasizes the dimensions (features) where both the user’s preferences and the product’s characteristics align. Averaging the resulting values further reduces dimensionality, providing a concrete score that is easy to interpret. This score reflects the degree of alignment between the user’s preferences and the product’s attributes, enabling us to personalize recommendations effectively.
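The Hadamard-product re-ranking described in the steps above fits in a few lines; a minimal sketch:

```python
import numpy as np

def rerank(candidates: dict[str, np.ndarray], user_vec: np.ndarray) -> list[str]:
    """Score each candidate by the mean of its element-wise (Hadamard) product
    with the user context vector, then sort candidates by that score."""
    scored = [(float(np.mean(vec * user_vec)), pid)
              for pid, vec in candidates.items()]
    scored.sort(reverse=True)  # higher score = better alignment with the user
    return [pid for _, pid in scored]
```

The input `candidates` here would be the top-k non-personalized recommendations from the cosine-similarity step, and the output ordering is what the user finally sees.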

Section IV: Evaluation of Recommender Output

Note that we trained our model using a product-relation graph with unlabeled data, making it infeasible to use classical metrics such as accuracy or F1-score for evaluation. Therefore, we adopted a slightly unconventional offline testing approach by referencing a framework outlined in Chen et al. (2023) named reference-free text quality evaluation with LLMs. For our FBT Bundles use case, the evaluation logic is as follows:

  1. Generate Recommendations: We produce recommendations for each available product.
  2. Evaluate with GPT: We ask GPT to score each bundle (1 source product + 2 recommendations) using the same prompt. GPT is expected to return a score from 1 to 5 (5 being the best) for each metric:
  • Relevance: Are the recommendations individually complementary to the source product?
  • Synergy: When considering the bundle as a whole, are all items complementary with each other?

  3. Aggregate Scores: After evaluating all bundles, we take an average of the scores for the two metrics to get an aggregated score for each bundle.
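The scoring prompt itself isn't shown in the article; assuming the LLM replies in a simple "Metric: score" format (an illustrative choice, not the production format), the parse-and-aggregate step might look like:

```python
import re

def parse_bundle_scores(response_text: str) -> dict[str, float]:
    """Pull the 1-5 Relevance and Synergy scores out of the LLM's reply.
    Assumes a 'Metric: score' line format, which is an illustrative choice."""
    scores = {}
    for metric in ("Relevance", "Synergy"):
        m = re.search(rf"{metric}\s*:\s*([1-5])", response_text, re.IGNORECASE)
        if m:
            scores[metric.lower()] = float(m.group(1))
    return scores

def aggregate_score(scores: dict[str, float]) -> float:
    """Average the per-metric scores into one bundle-level score."""
    return sum(scores.values()) / len(scores)
```

Restricting the regex to digits 1-5 also catches replies that fall outside the requested scale.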

Here are the results:

These results indicate that our model is effective in generating high-quality complementary product bundles.

Section V: Use Cases

Implementing a complementary product bundle recommendation system at CVS can drive significant value across its diverse business areas. Here are two key use cases tailored to the company’s operations:

1. Enhancing Sales and Customer Experience: In the retail segment, which includes both front store and pharmacy operations, complementary product recommendations can boost sales by suggesting related products.

2. Advanced Analytics for Improved Decision-Making: The data analytics team can leverage insights from complementary product recommendations to refine marketing strategies and inventory management. For example, analyzing which products are frequently bought together can help optimize stock levels and promotional campaigns.

Section VI: Future Work

Looking ahead, there are several avenues to further enhance our complementary product bundle recommender:

  1. Integrate More Data Sources (e.g., clickstream)
  2. Extensive Parameter Fine-Tuning
  3. Conduct Extensive A/B Testing

Section VII: Acknowledgements

I would like to extend my heartfelt thanks to several individuals who contributed to the success of this project and the writing of this article.

  • Jyothsna Krishnamurthy Santosh: Thank you for helping me brainstorm and solidify the business-side logistics, and for your assistance with presentations to business partners.
  • Astha Puri & Madhumita Jadhav: Your continuous and insightful feedback has been invaluable throughout this project.
  • Sarah Boukhris-Escandon: I appreciate your efforts in organizing meetings with our business partners, which facilitated crucial discussions.
  • Sowmya Vasuki Jallepalli: As our team’s data engineer partner, your help in productionizing the pipeline was instrumental in bringing our prototype to life.

Lastly, I am grateful to our business partners for their constructive feedback, which has helped shape and refine our approach.

References

Chen, Y., Wang, R., Jiang, H., Shi, S., & Xu, R. (2023). Exploring the use of large language models for reference-free text quality evaluation: An empirical study. arXiv. https://arxiv.org/abs/2304.00723

Hao, J., Zhao, T., Li, J., Dong, X. L., Faloutsos, C., Sun, Y., & Wang, W. (2020). P-Companion: A principled framework for diversified complementary product recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM ’20) (pp. 2517–2524). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412732

Yan, A., Dong, C., Gao, Y., Fu, J., Zhao, T., Sun, Y., & McAuley, J. (2022). Personalized complementary product recommendation. In Companion Proceedings of the Web Conference 2022 (WWW ’22) (pp. 146–151). Association for Computing Machinery. https://doi.org/10.1145/3487553.3524222

© 2024 CVS Health and/or one of its affiliates. All rights reserved.
