Strengthening Data-Driven Decisions for Recommendation Systems

Mert Gurkan
Insider Engineering
Nov 20, 2023
Image Credit: https://online.hbs.edu/blog/post/data-driven-decision-making

Data-driven decision-making can be crucial in guiding a product and a development team. Basing decisions on data and metrics can reveal improvement and development areas that may not be visible otherwise. Recommendation system initiatives, in particular, frequently rely on data-supported decisions.

In this post, we will cover two systems that enable data-driven decision-making for our Recommendation System. The first, Fallback Ratio Management, allows us to evaluate the performance of existing recommendation strategies based on the responses that users receive. The second part of the post covers the A/B Test System used for conducting experiments on recommendation algorithm versions. Both systems provide insights into our recommendations and increase our reporting capabilities regarding the performance metrics of the recommendation system.

Tracking Fallback Ratios of Recommendations

In our Recommendation System, we serve recommendations through the Recommendation API. Recommendation campaigns created through the Smart Recommender Panel construct Recommendation API requests based on the configurations selected during campaign creation.

In some cases, the Recommendation API cannot fetch enough items to satisfy the requested size configuration in the API request. In such cases, instead of serving fewer items than the requested size, the Recommendation API fills the response with fallback recommendations. For most recommendation strategies, fallback options are generated with strategies different from the one selected for the campaign. Even though we cannot produce the items originally intended for the user, fallback scenarios aim to approximate the intent of the original strategy.

Unfortunately, serving fallback recommendations does not guarantee satisfaction for the end user. Because fallback items come from a different strategy than the requested one, we aim to minimize the cases where we serve them. While we take actions to decrease the overall ratio of fallback recommendations, statistics about how often fallback recommendations were served to end users were not visible in our system. Tracking these statistics would allow us to evaluate the effectiveness of existing strategies and identify those that could use improvement. Additionally, monitoring fallback recommendation statistics would highlight recommendation strategies that often fail to deliver the desired results.

Returning to the Recommendation API, configurations defined in Smart Recommender campaigns are utilized as request parameters and filters for the Recommendation API requests. Then, the Recommendation API fetches recommended items based on the selected recommendation strategy and configurations from the campaign. Below, a sample Recommendation API request and its response are shared in a template format.

GET <recommendation_api_host>/<requested_recommendation_strategy>?size=10&<other_query_parameters>
{
  "success": true,
  "total": 10,
  "types": {
    "requested_recommendation_strategy": 4,
    "fallback_strategy_1": 3,
    "fallback_strategy_2": 3
  },
  "data": [
    "item1",
    "item2",
    "item3",
    "item4",
    "item5",
    "item6",
    "item7",
    "item8",
    "item9",
    "item10"
  ]
}

In the response payload shared above, “requested_recommendation_strategy” is the recommendation strategy requested in the Recommendation API endpoint. However, the API response contains items fetched with three different recommendation strategies, listed in the “types” object of the response. “fallback_strategy_1” and “fallback_strategy_2” refer to the fallback strategies used to fill the response up to the requested amount, and their values denote the number of recommended items returned by each. In the example above, four out of ten items come from the strategy requested in the API endpoint; the rest are fetched with fallback strategies.
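To make the calculation concrete, below is a minimal Python sketch, not our production code, of how the fallback ratio of a single response could be derived from its “types” object. The helper name is hypothetical; the payload mirrors the template above.

# Hypothetical helper: derives the fallback share of a single Recommendation API
# response from the counts in its "types" object.
def fallback_ratio(response: dict, requested_strategy: str) -> float:
    counts = response.get("types", {})
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Items not served by the requested strategy were filled by fallbacks.
    return (total - counts.get(requested_strategy, 0)) / total

response = {
    "success": True,
    "total": 10,
    "types": {
        "requested_recommendation_strategy": 4,
        "fallback_strategy_1": 3,
        "fallback_strategy_2": 3,
    },
}

print(fallback_ratio(response, "requested_recommendation_strategy"))  # 0.6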

To increase the visibility of fallback recommendations displayed to users, we created a flow that allows us to observe and analyze Recommendation API responses. The diagram below demonstrates the Fallback Ratio Management architecture.

Detailing the diagram above:

  1. The Recommendation API logs all API responses into a Kinesis stream. In addition to the API response served to end users, we enrich the response log with the API response time and the number of filters applied in the endpoint.
  2. With Kinesis Firehose, Recommendation API response logs are stored in an S3 bucket with daily partitions.
  3. With a Spark job, we process daily response logs to produce daily fallback metrics. Within the job:
  • First, Recommendation API response logs are aggregated based on partner and requested strategy partitions.
  • This aggregation gives us the requested strategy and the number of items returned for it, as well as the number of items fetched from fallback strategies.

4. With this information, the following metrics are calculated (a minimal sketch of this aggregation is shared after the list):

  • The daily number of requests received by the Recommendation API, based on partner and recommendation strategy partitions,
  • The overall ratio of fallback recommendations served, for partner and recommendation strategy partitions,
  • The fallback ratio for each index in the Recommendation API response, for partner and recommendation strategy partitions. These metrics read as “the ratio of serving a fallback recommendation as the Xth item for partner Y and recommendation strategy Z”.

5. We share the produced daily metrics with the Data Analytics team, which maintains reporting and analysis efforts on them. Developers of the Recommendation System also have access to the daily outputs to gain insights about the system.
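As referenced in step 4, below is a minimal PySpark sketch of how the daily fallback metrics could be aggregated from the stored response logs. The bucket paths and column names (partner, strategy, total_items, fallback_items) are illustrative assumptions, not our actual schema.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fallback-ratio-metrics").getOrCreate()

# Daily partitioned Recommendation API response logs written to S3 by Kinesis Firehose.
logs = spark.read.json("s3://<response-log-bucket>/dt=2023-11-20/")

# Aggregate per partner and requested strategy, then compute the fallback ratio.
daily_metrics = (
    logs.groupBy("partner", "strategy")
    .agg(
        F.count("*").alias("request_count"),
        F.sum("total_items").alias("total_items"),
        F.sum("fallback_items").alias("fallback_items"),
    )
    .withColumn("fallback_ratio", F.col("fallback_items") / F.col("total_items"))
)

# Per-index fallback ratios would follow the same pattern after exploding item positions.
daily_metrics.write.mode("overwrite").parquet("s3://<metrics-bucket>/dt=2023-11-20/")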

Internal A/B Test System for Recommendation System

Our recommendation system offers numerous strategies and a wide array of configuration choices for producing recommendations for users. With the number of available recommendation strategies, hyperparameters, and configuration options, there are many opportunities to introduce changes that could yield performance optimizations. However, changing details of the recommendation generation or serving process would be reflected in every partner campaign. Because of this, experimenting with changes on a smaller subset of users or campaigns was not straightforward.

Additionally, as we increased our pace of developing new versions of existing recommendation strategies, we also needed a way to evaluate these versions for a subset of partners before taking them to production. Hence, a framework for A/B testing was becoming a necessity. This framework addresses different requirements than the A/B testing capabilities of Smart Recommender campaigns: while A/B testing on the Insider panel allows partners to evaluate the performance of Smart Recommender campaigns, we needed an internal A/B testing mechanism to evaluate the recommendation generation process itself by experimenting with the recommendation algorithms. Thus, changes within the recommendation system are supported by data-driven experimentation.

Image Credit: https://hellodarwin.com/blog/about-ab-testing

The diagram above summarizes the steps of a successful A/B testing framework. The terminology described below and the components of our framework are built to address these steps.

Before starting with the detailed description of the developed framework, establishing related terminology can be helpful.

Although titled the A/B Test System, the framework detailed in this section also allows for conducting A/A tests. For the rest of the post, “experiment” denotes a single A/B or A/A test. Each experiment in the framework includes a “control” group and a “variant” group. The control group refers to the group that does not receive the changes being experimented with, while the variant group receives them. “Variant recommendations” denote the recommendation results produced for the variant group configurations. In our initial experiments with the framework, we mainly conducted A/B tests to evaluate the performance metrics of recommendation algorithm versions. However, the system is not limited to evaluating different versions of the algorithms: the effects of changing hyperparameter values or algorithm configurations can also be evaluated.

At the center of the framework, we have a new API called the ABC (A/B Config) API. In our framework, the ABC API:

  • Distributes users evenly (based on hashed user IDs) across control and variant groups for experiments (a minimal sketch of this assignment is shared after the list),
  • Allows conducting targeted experiments by segmenting user groups,
  • Allows weighted distributions across control and variant groups in experiments,
  • Guides the recommendation generation process for variant groups,
  • Guides the serving process of variant recommendations from the Recommendation API,
  • Provides a web interface for creating and managing experiments.
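A minimal sketch of the even, hash-based group assignment referenced in the first item is shared below. This is assumed logic for illustration, not the actual ABC API implementation; the function name, experiment identifier, and weighting scheme are hypothetical.

import hashlib

def assign_group(user_id: str, experiment_id: str, variant_weight: float = 0.5) -> str:
    # Hashing the user ID together with the experiment ID keeps the assignment
    # stable across requests while staying independent between experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform value in [0, 1)
    return "variant" if bucket < variant_weight else "control"

# Example: a 30% variant / 70% control split for a hypothetical experiment.
print(assign_group("user-42", "substitute-v2", variant_weight=0.3))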

To integrate the ABC API with the components of our recommendation system, the ABC API response object includes the fields necessary for guiding the recommendation generation and serving processes. To initiate an experiment, users of the ABC API web interface provide the partner, platform, and algorithm under test, along with the experiment status. Additionally, users add values for fields named “function” and “middleware”. The “function” field denotes the function to be executed for generating the recommended products in the Recommendation Engine, our main Apache Spark job that produces daily recommendations in batch format.

For the Recommendation API, the “middleware” field specifies which middleware class should be routed to in the API to serve recommended items for the variant groups of the experiment. Experiments that are marked as active and have all required fields start in the desired environment.

Below, other integrated components of the framework are discussed in three subsections.

Generating Variant Recommendations

As the first step, we needed a workflow to generate recommendations based on the configurations given for the variant groups in experiments. To establish this, we needed a procedure to pass these configurations to the Recommendation Engine.

To this end, we developed an additional Spark job. The job is responsible for transferring experiment configurations from the ABC API to the Recommendation Engine and triggering recommendation generation with these configurations. With a daily running Airflow DAG, we automated the process of generating variant recommendations. The diagram below illustrates this process.

The recommendation generation process for this flow is summarized below (a minimal sketch follows the list).

  1. Experiments defined by users are stored by the ABC API.
  2. When the Recommendation Experiment Runner Spark job is triggered, it fetches all active experiments. The job iterates over each experiment to generate variant recommendations based on the values of the “algorithm” and “function” fields. This way, we can generate desired recommendations based on the experiment definition made with the ABC API.
  3. With the configuration parameters coming from the wrapper Recommendation Experiment Runner job, the required flow is executed within the Recommendation Engine. The resulting dataset contains the variant recommendations that will be served to users in the variant group of the experiment.
  4. Lastly, we store these variant recommendation results in the Product Catalog Database to be served to users with Recommendation API endpoints.
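As mentioned above, a minimal Python sketch of the Recommendation Experiment Runner flow is shared below. The ABC API endpoint path, the experiment fields, and the engine entry points are illustrative assumptions rather than the actual implementation.

import requests

# Hypothetical registry mapping an experiment's "function" field to a
# Recommendation Engine entry point that produces variant recommendations.
ENGINE_FUNCTIONS = {
    "generate_substitute_v2": lambda experiment: print(
        f"generating variant recommendations for partner {experiment['partner']}"
    ),
}

def run_active_experiments(abc_api_host: str) -> None:
    # Fetch all active experiments defined through the ABC API web interface.
    experiments = requests.get(f"{abc_api_host}/experiments?status=active").json()
    for experiment in experiments:
        engine_function = ENGINE_FUNCTIONS.get(experiment["function"])
        if engine_function is None:
            continue  # no registered entry point for this experiment
        # Trigger variant recommendation generation with the experiment's
        # algorithm and configuration values.
        engine_function(experiment)

# Usage (placeholder host): run_active_experiments("http://<abc_api_host>")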

Serving Variant Recommendations

Storing and serving recommendations for variant groups in experiments posed additional challenges. In our recommendation system, recommendations are recorded and served through various fields on item documents in the Product Catalog Database. Establishing experimentation capabilities in the recommendation system made it possible for items to have multiple different recommendation results. To solve this problem, we created additional fields on the item documents in the Product Catalog Database, and variant group recommendations are updated in these fields.

To serve variant recommendations to users in variant groups, we also need additional logic in the Recommendation API. This logic routes incoming requests based on the group distribution of users determined by the ABC API. To this end, we integrated the ABC API with the Recommendation API by implementing a requester service within the Recommendation API. For recommendation strategies supported by the experiment framework, the Recommendation API sends an evaluation request to the ABC API. If the requesting partner has an active experiment, the user is placed in either the variant or the control group.

After obtaining the user’s group in the experiment, the incoming request is routed to the appropriate middleware class. While the default routing operation is performed for users in the control group, requests for variant groups are routed to the middleware specified by the experiment. This way, the ABC API guides the Recommendation API to serve variant recommendations to users in variant groups. As mentioned above, recommended items for variant groups are served from the results recorded in the “ab_recommendations” field of items in the Product Catalog Database. A minimal sketch of this routing is shared below.
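In this sketch, the middleware class names and the default recommendation field are assumptions for illustration; only the “ab_recommendations” field comes from the description above.

class DefaultMiddleware:
    def serve(self, item: dict) -> list:
        return item["recommendations"]      # assumed default recommendation field

class VariantMiddleware:
    def serve(self, item: dict) -> list:
        return item["ab_recommendations"]   # variant results stored for experiments

MIDDLEWARES = {
    "default": DefaultMiddleware(),
    "substitute_v2_middleware": VariantMiddleware(),
}

def route_request(item: dict, abc_evaluation: dict | None) -> list:
    # abc_evaluation is the ABC API response for the requesting user, containing
    # the assigned group and the middleware defined in the experiment.
    if abc_evaluation and abc_evaluation.get("group") == "variant":
        middleware = MIDDLEWARES[abc_evaluation["middleware"]]
    else:
        middleware = MIDDLEWARES["default"]
    return middleware.serve(item)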

Monitoring Experiments

We also wanted to monitor the performance metrics of the experiments created with the ABC API. It was crucial for us to have the ability to compare control and variant groups of experiments to enable decision-making based on experiment results. To this end, the flow displayed below was implemented. Experiment monitoring is achieved by utilizing Recommendation API response logs and user interactions with the recommended items.

The diagram above depicts the Airflow DAG that manages the experiment metric calculations and the storage of these metrics.

Recommendation API response logs and user interaction data from the Smart Recommender widgets on partner platforms constitute the base data for evaluating the variant and control groups of experiments. By merging these data sources, we obtain a dataset of the products recommended to users and their interactions with these items, such as viewing, clicking, and adding to cart.

After establishing this dataset, the AB Test Reporting Job compares the view, click, and CTR metrics of the control and variant groups. The job is triggered daily with an Airflow DAG, and the resulting metrics are stored in a MySQL table (a minimal sketch of this comparison is shared below). Multiple Grafana dashboards use the stored metrics to visualize experiment metrics at the partner and algorithm levels, allowing us to track daily metrics obtained from the groups in experiments.
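In the following PySpark sketch, the dataset path, column names, and table name are illustrative assumptions rather than our actual schema.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ab-test-reporting").getOrCreate()

# Dataset obtained by merging Recommendation API response logs with user
# interaction events (views and clicks) for experiment traffic.
events = spark.read.parquet("s3://<experiment-events-bucket>/dt=2023-11-20/")

experiment_metrics = (
    events.groupBy("experiment_id", "group")  # group is "control" or "variant"
    .agg(
        F.sum("view_count").alias("views"),
        F.sum("click_count").alias("clicks"),
    )
    .withColumn("ctr", F.col("clicks") / F.col("views"))
)

# Stored in MySQL for the Grafana dashboards (driver and credential options omitted).
(
    experiment_metrics.write.format("jdbc")
    .option("url", "jdbc:mysql://<mysql_host>/<database>")
    .option("dbtable", "ab_test_daily_metrics")
    .mode("append")
    .save()
)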

An example of the Grafana dashboard is also shared below.

The line charts above compare impressions and revenue metrics of control and variant groups. Within the experiment period, we obtain time series for these metrics to observe their changes. Also, pie charts display the total values of revenue and impression metrics to compare experiment groups based on their total interactions with users.

Insights Obtained from These Systems

The two systems detailed in this post have been active for around six months. Below are the key insights from these systems and the decisions taken with their help.

Fallback Ratio Management:

  • It allowed us to track the performance of recommendation campaigns daily. With the introduction of the system, we could also monitor the overall fallback ratio of the entire recommendation system.
  • After observing that we could improve the fallback ratios of almost all recommendation strategies, our team started a new KR (key result) to decrease the fallback ratios of our algorithms.
  • It allowed us to evaluate the performance of different recommendation strategies based on their fallback ratios. Recommendation strategies that often resulted in high fallback ratios signaled future improvement needs. In fact, our recent decision to work on a new version of an existing recommendation algorithm was based on the fallback ratio findings for its current version.

Internal A/B Test System:

  • After releasing the system, internal A/B tests became a staple of our workflow for releasing new recommendation algorithms or new versions of existing ones. We have evaluated new versions of the Substitute and Complementary recommendation strategies with internal A/B tests.
  • Having the metrics of conducted experiments visually available to both developers and product managers provided us with an additional entry point for diagnosing possible issues in experiments. With Grafana dashboards, we were able to track the health of experiments as well as their problematic metrics.

Final Remarks

This post detailed our challenges, design choices, solution implementations, and the results obtained with the discussed systems. While the former part of the post focused on evaluating the quality of served recommendation results, the latter discussed a system that facilitates experimentation on how recommendations are generated and served. As our recommendation system generates recommendations for nearly 900 partners every day, more visibility and insights about their quality can never hurt :)

If you have further questions regarding the two systems, please do not hesitate to ask. You can also reach out to me on LinkedIn.

Stay tuned for more news about our Recommendation System, and do not forget to check other blog posts from Insider Engineering. An example post detailing our machine learning flows and Apache Iceberg Feature Store is shared here.
