Building a Multi-Armed Bandit System from the Ground Up: A Recommendations and Ranking Case Study, Part II

Sam Cohan, Principal Machine Learning Engineer at Udemy
Udemy Tech Blog · Nov 17, 2022

Introduction

This is Part II of a two-part blog post about how we built a multi-armed bandit (MAB) system for recommendation and ranking use cases at Udemy. In Part I, we covered the theoretical and practical data science details for such a system. In this second and final part, we will provide an overview of our production architecture and development environment, and discuss the engineering challenges that teams may face when deploying a similar system.

Overview of System Architecture

The high-level architecture of the multi-armed bandit system for recommendations is represented in the diagram below. Note that, as the red dotted arrows show, it is a near real-time, closed-loop system, meaning user feedback is continuously used to improve the recommendations.

Let’s look more closely at this loop and explain how it works and what components are involved.

1. User interacts with Udemy

The loop begins when a user interacts with the Udemy website or mobile app. As the requested page loads, the current best recommendations are fetched from Redis and served to the user. We will explain later how these recommendations are kept up to date.
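
As a rough illustration of this read path, the sketch below looks up precomputed recommendations in Redis at request time. The Jedis client, the key scheme and the fallback key are assumptions for illustration, not our actual serving code.

```scala
import redis.clients.jedis.Jedis

// Minimal sketch: fetch the current best recommendations for a user and page.
// The "recs:{useCase}:{userId}" key scheme and the default fallback key are hypothetical.
object RecommendationLookup {
  private val redis = new Jedis("localhost", 6379)

  def currentRecommendations(useCase: String, userId: String): Seq[String] = {
    val personalized = Option(redis.get(s"recs:$useCase:$userId"))
    val fallback     = Option(redis.get(s"recs:$useCase:default"))
    personalized.orElse(fallback)
      .map(_.split(",").toSeq) // assume a comma-separated list of unit/course IDs
      .getOrElse(Seq.empty)
  }
}
```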

2. Events are generated and published to Kafka

As UI elements are loaded and interacted with, events are generated (from both frontend and backend systems) and sent to the Eventing System. These events carry tracking IDs, which allow them to be stitched together into funnel events, meaning we can tell whether a UI element was viewed, was clicked on or led to an interesting downstream event like a course enrollment or lecture view (i.e., a reward). The tracking system ensures that the schema and consistency of the events are maintained over time and that the events are reliably published to Kafka topics to be consumed by downstream systems.
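
To make the funnel idea concrete, here is a hypothetical, heavily simplified set of event shapes showing how a shared tracking ID lets downstream jobs stitch an impression, a click and an enrollment together. The field names are illustrative, not Udemy's actual event schemas.

```scala
import java.time.Instant

// Illustrative event shapes: the shared trackingId is what allows the funnel join.
case class ImpressionEvent(trackingId: String, userId: String, unitId: String, eventTime: Instant)
case class ClickEvent(trackingId: String, userId: String, eventTime: Instant)
case class EnrollmentEvent(trackingId: String, userId: String, courseId: String, eventTime: Instant)

// A funnel event answers: did this impression lead to a click and/or an enrollment?
case class FunnelEvent(
  trackingId: String,
  impression: ImpressionEvent,
  click: Option[ClickEvent],
  enrollment: Option[EnrollmentEvent]
)
```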

3. Streaming apps create observations from the events

There are two Spark Structured Streaming applications dedicated to creating observations (a.k.a., rewards) from the Kafka events. The first is the Funnel Join App, and as the name indicates, it performs a streaming join to create joined events (a.k.a., funnel events), which combine impressions, clicks and downstream events like enrollments to answer the question, "Did showing a recommendation lead to a reward within a fixed amount of time?"
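
The sketch below shows the general shape of such a stream-stream join in Spark Structured Streaming. The topic names, payload fields, watermarks and the 30-minute attribution window are assumptions for illustration rather than our exact configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("FunnelJoinApp").getOrCreate()

// Read one Kafka topic as a streaming DataFrame of (json, timestamp).
def readTopic(topic: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS json", "timestamp")

// In practice each side is parsed with from_json against its registered schema;
// here we keep only what the join needs.
val impressions = readTopic("impression-events")
  .select(get_json_object(col("json"), "$.trackingId").as("impTrackingId"),
          col("timestamp").as("impTime"))
  .withWatermark("impTime", "10 minutes")

val clicks = readTopic("click-events")
  .select(get_json_object(col("json"), "$.trackingId").as("clickTrackingId"),
          col("timestamp").as("clickTime"))
  .withWatermark("clickTime", "40 minutes")

// Left outer join: every impression survives, with null click columns when
// no reward arrived inside the attribution window.
val funnelEvents = impressions.join(
  clicks,
  expr("""
    impTrackingId = clickTrackingId AND
    clickTime >= impTime AND
    clickTime <= impTime + interval 30 minutes
  """),
  "leftOuter")

funnelEvents
  .selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "funnel-events")
  .option("checkpointLocation", "/tmp/checkpoints/funnel-join")
  .start()
```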

These joined events are published back to Kafka and consumed by a second Spark Structured Streaming application called the Observation App. The job of the Observation App is to apply sessionization logic to the joined events to create the final observations. This app makes use of Spark’s stateful stream processing and writes the observations to a new topic on Kafka to be consumed by the Bandit Model Apps.
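
A heavily simplified sketch of the sessionization step is shown below, using Spark's flatMapGroupsWithState. The session key, the idle timeout and the reward aggregation are illustrative assumptions; the production logic is richer.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Simplified shapes: the real joined events and observations carry more fields.
case class JoinedEvent(userId: String, unitId: String, clicked: Boolean, eventTime: Timestamp)
case class SessionState(impressions: Long, clicks: Long)
case class Observation(userId: String, unitId: String, impressions: Long, clicks: Long)

def sessionize(
    key: (String, String),
    events: Iterator[JoinedEvent],
    state: GroupState[SessionState]): Iterator[Observation] = {
  val current = state.getOption.getOrElse(SessionState(0L, 0L))
  val updated = events.foldLeft(current) { (s, e) =>
    SessionState(s.impressions + 1, s.clicks + (if (e.clicked) 1 else 0))
  }
  if (state.hasTimedOut) {
    // Session is over: emit one observation for this (user, unit) and drop the state.
    state.remove()
    Iterator(Observation(key._1, key._2, updated.impressions, updated.clicks))
  } else {
    state.update(updated)
    state.setTimeoutDuration("30 minutes") // close the session after 30 idle minutes
    Iterator.empty
  }
}

// joinedEvents would be a Dataset[JoinedEvent] parsed from the "funnel-events" topic:
// val observations = joinedEvents
//   .groupByKey(e => (e.userId, e.unitId))
//   .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout)(sessionize _)
```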

4. Bandit model apps use the observations to update recommendations and refresh Redis

The final set of Spark Structured Streaming applications is the Bandit Model Apps. We designed these apps to support several bandit algorithms, but in practice we saw that Thompson Sampling produced the best results. These apps consume observations from Kafka in micro batches, update the arm reward probabilities and update the resulting recommendations in Redis. Note also that these apps log all of their intermediate state to a database so it can be used for offline analysis later. And thus, the full cycle is complete.
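
The sketch below captures the essence of one micro-batch update: fold observations into per-arm Beta counters, draw one Thompson sample per arm, and push the resulting ranking to Redis. The state container, the Redis key and the Beta(1, 1) prior are assumptions for illustration, not our production implementation.

```scala
import org.apache.commons.math3.distribution.BetaDistribution
import redis.clients.jedis.Jedis

// Per-arm sufficient statistics for a Bernoulli bandit with a Beta(1, 1) prior.
case class ArmState(successes: Long = 0L, failures: Long = 0L) {
  def updated(reward: Double): ArmState =
    if (reward > 0) copy(successes = successes + 1) else copy(failures = failures + 1)
  def sample(): Double =
    new BetaDistribution(successes + 1.0, failures + 1.0).sample()
}

object BanditMicroBatch {
  private var arms  = Map.empty[String, ArmState]
  private val redis = new Jedis("localhost", 6379)

  /** Called once per micro batch with the (armId, reward) pairs collected from Kafka. */
  def process(observations: Seq[(String, Double)]): Unit = {
    observations.foreach { case (armId, reward) =>
      val state = arms.getOrElse(armId, ArmState())
      arms += armId -> state.updated(reward)
    }
    // Thompson Sampling: rank arms by one posterior draw each.
    val ranking = arms.toSeq
      .map { case (armId, state) => armId -> state.sample() }
      .sortBy { case (_, score) => -score }
      .map(_._1)
    redis.set("recs:featured-carousel:default", ranking.mkString(","))
  }
}
```

In practice this kind of logic runs inside foreachBatch, and the arm states are also snapshotted to a database so they can be recovered after a restart (more on this below).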

Candidate Generation Workflow

You might be wondering how the bandit model apps know what candidates (a.k.a., arms) to consider for the recommendation use case. Well, that's done via a Candidate Generation Workflow, which reads configuration tables to build a list of eligible UI elements and their source pages. These UI elements can be courses, carousels of courses, etc. The workflow is scheduled to run nightly to refresh the arm set, but the system is designed to be robust enough that the refresh can also happen ad hoc at any time.
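
As a toy example of what such a workflow might look like, the batch job below reads a configuration table and publishes the refreshed arm set. The table and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nightly batch job: read configured UI elements and publish the arm set.
object CandidateGeneration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CandidateGenerationWorkflow").getOrCreate()

    val arms = spark.table("config.recommendation_units")
      .where("is_active = true")
      .select("source_page", "unit_id", "unit_type")

    // The bandit model apps read this snapshot on startup (and on ad-hoc refreshes).
    arms.write.mode("overwrite").saveAsTable("bandit.candidate_arms")
  }
}
```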

Monitoring, Alerting and Supervision

To make sure the multi-armed bandit recommendation system remains healthy around the clock, we built a real-time dashboard of vital metrics such as the count of records written to each Kafka topic, Redis write throughput and Redis read latencies. We defined monitors on top of these metrics and hooked the alerting up to Slack and a paging solution. In addition, we have a supervisor job that is scheduled to check the status of the streaming applications every five minutes and restart them if they have failed or are in a bad state.
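
Our supervisor and dashboards are separate scheduled jobs and tooling, but as one illustration of the hooks Spark itself provides for this kind of visibility, the sketch below registers a StreamingQueryListener that pushes an alert to a (placeholder) Slack webhook when a streaming query terminates.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Illustrative in-process alerting hook; the webhook URL is a placeholder.
class SlackAlertListener(webhookUrl: String) extends StreamingQueryListener {
  private val http = HttpClient.newHttpClient()

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  // Progress events carry metrics such as numInputRows that can feed the dashboards.
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    val reason  = event.exception.getOrElse("no exception reported")
    val payload = s"""{"text": "Streaming query ${event.id} terminated: $reason"}"""
    val request = HttpRequest.newBuilder(URI.create(webhookUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()
    http.send(request, HttpResponse.BodyHandlers.ofString())
  }
}

// Registered once on the driver:
// spark.streams.addListener(new SlackAlertListener(sys.env("SLACK_WEBHOOK_URL")))
```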

Engineering Challenges

Learning Curve of Developing Real-Time Systems

One of the most significant engineering challenges in this project was overcoming the steep learning curve of the technologies involved. Things become an order of magnitude harder to reason about when you go from batch to real-time systems. Specifically, if you come from a Python background and do not have extensive experience with the JVM ecosystem, the tooling alone takes some time to get familiar with. On top of that, you have to deal with the idiosyncrasies of Scala (a strongly typed functional language) and Spark Structured Streaming (a streaming framework with often long and cryptic error messages).

In addition to the challenges resulting from language and frameworks, the very nature of streaming applications poses its own unique challenges for development and testing. It is important to invest in tooling and environments that can help increase your speed of development. For example, you may want to consider how to have a minimal end-to-end system on your local machine to be able to quickly iterate on and test your code (hint: containerization is your friend). The full details of these efforts are out of the scope of this article, but needless to say, this is a pretty significant effort that should not be underestimated.
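
As one example of what a minimal local setup can look like, the sketch below runs Spark in local mode and drives the pipeline with a MemoryStream and an in-memory sink, so the transformation logic can be exercised quickly without a cluster; Kafka and Redis can be added via containers when the full path is needed. The details are illustrative, not our actual test harness.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object LocalHarness {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("bandit-local-harness")
      .getOrCreate()
    import spark.implicits._
    implicit val sqlContext: SQLContext = spark.sqlContext

    // Feed synthetic events in place of the real Kafka topics.
    val input = MemoryStream[String]
    input.addData("""{"trackingId":"t-1","clicked":true}""")

    val query = input.toDS()
      .writeStream
      .format("memory") // results land in an in-memory table for quick assertions
      .queryName("observations")
      .outputMode("append")
      .start()

    query.processAllAvailable()
    spark.table("observations").show(truncate = false)
    query.stop()
  }
}
```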

Maintaining Uptime and Support Efforts

One of the main differences between batch and streaming applications is that streaming applications have to run reliably around the clock and are more susceptible to subtle bugs that compound over time and eventually crash the application. This is challenging because it requires dedicated on-call support who can be paged at any time: day or night, holiday or not. Of course, good visibility into the health of the applications via extensive dashboards, decent alerting and an on-call schedule are table stakes for launching such a product. But the key to supporting such a product without driving the support engineers insane is to design your applications with the assumption that they will inevitably go down, and so you must handle restarts gracefully, ideally automatically.

Of course, Spark has a built-in checkpointing system, which should in theory help with the graceful restart scenario. However, in our experience, it is not uncommon for checkpoints to become corrupt and prevent the application from restarting. So, to avoid data loss, you must have application logic that can recover the state of your streaming application from an external database. In our case, we snapshot the bandit arm states and Kafka offsets to an external database, so if the Spark checkpoint is corrupted, we can safely delete it and load the last valid arm states from our backup database. This recovery logic is baked into a supervisor job, so 99 out of 100 times that is good enough to get things going without having to escalate to the on-call engineer.
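
The sketch below illustrates the snapshot-and-restore idea: persist the arm states together with the Kafka offsets of each micro batch, and on a cold start with a deleted checkpoint, feed the last snapshotted offsets back to the Kafka source via startingOffsets. The table names and helper functions are hypothetical; only the startingOffsets JSON format comes from Spark's Kafka source.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object StateSnapshot {

  /** Called from foreachBatch after the arm states have been updated.
    * offsetsJson is the batch's Kafka offsets, e.g. taken from the query's progress events.
    */
  def snapshot(armStates: DataFrame, offsetsJson: String, batchId: Long): Unit = {
    armStates
      .withColumn("batch_id", lit(batchId))
      .withColumn("kafka_offsets", lit(offsetsJson))
      .write.mode(SaveMode.Append)
      .saveAsTable("bandit.arm_state_snapshots")
  }

  /** On startup: if the Spark checkpoint was deleted, resume from the snapshotted offsets. */
  def startingOffsets(spark: SparkSession): String = {
    spark.table("bandit.arm_state_snapshots")
      .orderBy(col("batch_id").desc)
      .select("kafka_offsets")
      .head()
      .getString(0) // e.g. {"observations":{"0":12345,"1":67890}}
  }
}

// val offsets = StateSnapshot.startingOffsets(spark)
// spark.readStream.format("kafka")
//   .option("subscribe", "observations")
//   .option("startingOffsets", offsets) // only honored when no checkpoint exists
//   ...
```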

Future Directions

Reward Enhancement

Currently, we use a weighted sum of clicks and enrollments as the reward function for our multi-armed bandit system. We have seen that this reward definition generally correlates well with our desired online A/B test metrics. However, there are several modifications we could make to the reward function to further improve our bandit system's ability to learn.

  1. Add new rewards such as wishlist and add-to-cart events, which could provide more frequent high-intent signals beyond enrollments alone and help the MAB differentiate between arms more easily (see the sketch after this list).
  2. Use predicted long-term rewards as an input to the decision-making process to better handle delayed or long-term feedback. One way to do this would be to capture short-term signals and use them to predict a long-term metric such as user lifetime value. The bandit could then use this predicted reward in its short-term decision about which arm to select next.
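
As a sketch of how such a reward function could be extended, the snippet below combines the current click and enrollment signals with hypothetical wishlist and add-to-cart terms. The weights are made-up placeholders, not our production values.

```scala
case class FunnelOutcome(clicked: Boolean, enrolled: Boolean, wishlisted: Boolean, addedToCart: Boolean)

object Reward {
  // Made-up placeholder weights, not our production values.
  private val clickWeight     = 0.1
  private val enrollWeight    = 1.0
  private val wishlistWeight  = 0.3 // proposed: more frequent high-intent signal
  private val addToCartWeight = 0.5 // proposed

  def apply(o: FunnelOutcome): Double =
    clickWeight     * (if (o.clicked)     1 else 0) +
    enrollWeight    * (if (o.enrolled)    1 else 0) +
    wishlistWeight  * (if (o.wishlisted)  1 else 0) +
    addToCartWeight * (if (o.addedToCart) 1 else 0)
}

// Reward(FunnelOutcome(clicked = true, enrolled = false, wishlisted = true, addedToCart = false))
// => 0.4 (click + wishlist)
```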

Contextual Bandits

In addition to reward enhancements, our main next direction will be to evolve our MAB system into a contextual bandit system. In a contextual bandit system, every arm-reward pair is coupled with a context vector. This context vector can include information such as user-specific features, page features, item content and so on. By including a context vector, we can personalize the MAB decision-making process: the decision about what and when to explore versus exploit will depend on both the reward history and the context that led to each reward. In the unit ranking example, instead of a single, dynamically updated global ranking for all users, each user will have a dynamically updated ranking that is specific to them.

From a systems perspective, two main new components are required to handle contextual bandits. The first is a Realtime Model Service, which can handle both inference and incremental training. The second is a Realtime Feature Service, which can serve the contextual features to the model for inference. Note that this is a significantly more challenging setup: whereas before, the predictions were updated in micro batches and served via Redis, now the personalized predictions need to be calculated on the fly as the page is loading (i.e., with sub-50 ms latency). Additionally, the model service is not your typical inference-only model service; it also has to handle model updates without interrupting the inference SLA.
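
To make the contextual idea concrete, the sketch below shows one common way to implement it, linear Thompson Sampling, where each arm keeps sufficient statistics that can be updated incrementally and sampled from at inference time. This is an illustration of the kind of model such a Realtime Model Service could host, not a description of our implementation; the prior and state shapes are assumptions.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, cholesky, inv}
import scala.util.Random

/** One arm's state for linear Thompson Sampling (sketch).
  * B accumulates x * x^T and f accumulates reward * x, the standard LinTS bookkeeping.
  */
class LinTsArm(dim: Int, priorPrecision: Double = 1.0) {
  private var B = DenseMatrix.eye[Double](dim) * priorPrecision
  private var f = DenseVector.zeros[Double](dim)
  private val rng = new Random()

  /** Incrementally fold in one (context, reward) observation. */
  def update(context: DenseVector[Double], reward: Double): Unit = {
    B += context * context.t
    f += context * reward
  }

  /** Draw a coefficient vector from the posterior and score the given context. */
  def sampleScore(context: DenseVector[Double]): Double = {
    val cov   = inv(B)            // posterior covariance (noise variance folded into the prior)
    val mean  = cov * f           // posterior mean of the coefficients
    val l     = cholesky(cov)     // lower-triangular factor for sampling
    val z     = DenseVector.fill(context.length)(rng.nextGaussian())
    val theta = mean + l * z      // one posterior draw
    theta dot context
  }
}

// Selecting an arm: score every candidate with a fresh posterior draw and pick the best.
// val chosen = arms.maxBy { case (_, arm) => arm.sampleScore(userContext) }._1
```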

Concluding Remarks

In this two-part blog post, we presented a wide range of material covering what multi-armed bandits are, why they are so useful in recommendation applications, how to treat a ranking problem as a bandit problem in a real-life case study and what it takes to build a near real-time, fully productionized MAB system that serves millions of users at a time. We provided an overview of our architecture, discussed some data science and engineering challenges that can arise when building a MAB system from scratch and shared our thoughts on the future evolution of our system.

Bandits are a powerful tool with demonstrated success, but they can be particularly challenging to implement well in an industry environment, especially in a scalable, near real-time way. We hope these blog posts help others understand the various challenges that arise in the development of such a system, and serve as a guide and launching point for teams building their own MAB systems from the ground up.
