Building a real-time recommendation system

Rishav Ray
Vector Database for AI
8 min read · Jun 16, 2022

Have you ever wondered how YouTube's suggestions change the instant you click on a new video, or how Amazon swiftly updates its recommendations based on your last few clicks and searches? Both are made possible by real-time recommendations, where companies personalise suggestions for your profile on the fly, based on your latest interactions. This article aims to demystify both the data science and the engineering aspects of such systems so that you can get started on your real-time recommendation journey. We will discuss the following sub-topics as we progress:

  • Vector generation
  • Vector search/Candidate set retrieval
  • Ranking
  • System design

Why are real-time recommendations required?

Compared to real-time recommendations, batch recommendations are much easier to implement and computationally cheaper, and most recommendation use-cases can be solved with a batch system. So why are real-time recommendations such a buzzword these days? Their true power comes to light when your system depends on context.

Say you are watching football highlights on YouTube and your interest suddenly shifts to cat videos. Would you like it if the system failed to recognise the shift and kept recommending football matches? The same applies to e-commerce: you may have searched for shoes last week and received recommendations accurately tailored to your tastes, but today you are in the market for a new laptop. You would want the website or app to pick up your interests in real time and send meaningful suggestions your way.

In the following sections, we will take the example of building a real-time recommendation system for a video streaming platform but the same architecture can be followed for e-commerce as well as for search use cases with minimal tweaks. The terms vectors and embeddings will be used interchangeably throughout this article.

Part 1: Vector generation

In the first stage, we need to break down the videos in our catalogue into condensed embeddings. For multi-modal use-cases, CLIP by OpenAI is a good place to start, because the model encodes both text and images into the same vector space. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs.

One of the major reasons to choose CLIP is its excellent zero-shot capability, and if it doesn't exactly fit the use-case, we can always fine-tune it. A video can be broken down into images by extracting its most important frames, which are then converted into image embeddings using CLIP. Similarly, we can convert the tags and description associated with the video into text embeddings. Once we have the image and text embeddings, we can either combine them or store them separately (experiment to find out which performs better for your use-case). You can also try out more recent models like MURAL and Sim-VLR.
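As a rough illustration of the "combine them" option, here is a minimal sketch. It assumes the frame and text embeddings have already been produced by a model such as CLIP and therefore share one vector space. The function names and the `text_weight` knob are hypothetical, and a weighted sum is only one of several ways to merge the two modalities.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def video_embedding(frame_embeddings, text_embedding, text_weight=0.5):
    """Combine per-frame image embeddings and one text embedding into one vector.

    Assumption: both modalities live in the same space, so a weighted sum is
    meaningful. text_weight is a hypothetical parameter to tune per use-case.
    """
    image_vec = l2_normalize(mean_vector([l2_normalize(f) for f in frame_embeddings]))
    text_vec = l2_normalize(text_embedding)
    mixed = [(1 - text_weight) * i + text_weight * t
             for i, t in zip(image_vec, text_vec)]
    return l2_normalize(mixed)

# Toy 3-dimensional vectors standing in for real model outputs,
# which have hundreds of dimensions.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [0.0, 0.0, 2.0]
combined = video_embedding(frames, text)
```

Normalising each input first keeps one long vector from dominating the average. Storing the image and text vectors separately instead would simply mean skipping the final mixing step.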

Fig: Summary of the CLIP model (source)

Part 2: Vector search/Candidate set retrieval

Now that we have decomposed our videos into vectors, let's see how to store and retrieve them efficiently. Several vector search engines are readily available for this task, such as Milvus, Vespa, Pinecone and Qdrant (even Elasticsearch has joined the race for new-age vector search). I have personally used Milvus and can wholeheartedly recommend it.

What is Milvus?

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment.

Once the data has been inserted into Milvus, we can efficiently retrieve similar videos using Approximate Nearest Neighbour (ANN) search algorithms. Some of the algorithms supported by Milvus are IVF_FLAT, IVF_SQ8, ANNOY, HNSW.

Index types

  • FLAT: FLAT is best suited for scenarios that seek perfectly accurate and exact search results on a small, million-scale dataset.
  • IVF_FLAT: IVF_FLAT is a quantisation-based index and is best suited for scenarios that seek an ideal balance between accuracy and query speed.
  • IVF_SQ8: IVF_SQ8 is a quantisation-based index and is best suited for scenarios that need a significant reduction in disk, CPU, and GPU memory consumption because these resources are very limited.
  • IVF_PQ: IVF_PQ is a quantisation-based index and is best suited for scenarios that seek high query speed even at the cost of accuracy.
  • HNSW: HNSW is a graph-based index and is best suited for scenarios that have a high demand for search efficiency.
  • ANNOY: ANNOY is a tree-based index and is best suited for scenarios that seek a high recall rate.

You can experiment with the different index types and tune them for your use case, but either HNSW or ANNOY is an excellent choice to begin with.
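To make the tuning knobs concrete, the sketch below collects index and search parameters for two of the index types above. The parameter names (`nlist`, `nprobe`, `M`, `efConstruction`, `ef`) follow the Milvus documentation. The numeric values, the inner-product metric and the `get_index_config` helper are illustrative assumptions, so check them against the Milvus version you run.

```python
# Illustrative starting points only, not tuned recommendations.
INDEX_CONFIGS = {
    "IVF_FLAT": {
        # nlist: number of clusters the vectors are partitioned into.
        "index": {"index_type": "IVF_FLAT", "metric_type": "IP",
                  "params": {"nlist": 1024}},
        # nprobe: number of clusters scanned per query (higher = more accurate, slower).
        "search": {"metric_type": "IP", "params": {"nprobe": 16}},
    },
    "HNSW": {
        # M: max graph links per node; efConstruction: build-time search breadth.
        "index": {"index_type": "HNSW", "metric_type": "IP",
                  "params": {"M": 16, "efConstruction": 200}},
        # ef: query-time search breadth (higher = more accurate, slower).
        "search": {"metric_type": "IP", "params": {"ef": 64}},
    },
}

def get_index_config(index_type):
    """Return the (hypothetical) index and search settings for an index type."""
    try:
        return INDEX_CONFIGS[index_type]
    except KeyError:
        raise ValueError(f"no config defined for index type {index_type!r}") from None

hnsw = get_index_config("HNSW")
```

With pymilvus, the `"index"` dictionary would typically be passed to `Collection.create_index` and the `"search"` dictionary to `Collection.search`. Keeping both in one place makes it easier to compare index types in experiments.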

Now, for any given user, we can take the embeddings of the last 5 or 10 videos they interacted with (experiment to find the ideal number for your use case) and compute their average, or a weighted average based on recency. With this vector we can query the database to retrieve the most similar videos.
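A minimal sketch of the recency-weighted average could look like the following. The exponential `decay` scheme is an assumption chosen for illustration, and any weighting that favours recent interactions would serve the same purpose.

```python
def recency_weighted_average(embeddings, decay=0.8):
    """Average embeddings ordered oldest -> newest, weighting recent ones more.

    The newest embedding gets weight 1, the one before it `decay`, then
    decay**2, and so on. decay=1.0 reduces to a plain average.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, embeddings)) / total
        for d in range(dim)
    ]

# Two toy 2-dimensional embeddings, oldest first.
history = [[1.0, 0.0], [0.0, 1.0]]
query_vector = recency_weighted_average(history, decay=0.5)
```

Here the newer video contributes twice as much as the older one, so the query vector leans towards the user's most recent interest.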

Using nearest neighbour search, we can now retrieve a set of a hundred relevant videos from a catalogue of millions in a matter of milliseconds. At this point, business logic such as filtering the retrieved videos by language and removing obscene content can be applied, and the results can be shown directly to end users. But if you would like to rank the retrieved videos more carefully, follow along to the next part.
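The retrieve-then-filter step can be sketched in plain Python as below. This is an exact brute-force search, a stand-in for the approximate search a vector database would perform. The catalogue layout with the keys `id`, `embedding`, `language` and `flagged` is hypothetical.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_candidates(query, catalogue, k, language=None):
    """Take the k most similar videos, then apply business-rule filters.

    Filtering happens after retrieval, so fewer than k ids may be returned.
    """
    scored = ((cosine_similarity(query, v["embedding"]), v) for v in catalogue)
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [
        v["id"] for _, v in top
        if not v["flagged"] and (language is None or v["language"] == language)
    ]

catalogue = [
    {"id": "a", "embedding": [1.0, 0.0], "language": "en", "flagged": False},
    {"id": "b", "embedding": [0.9, 0.1], "language": "en", "flagged": True},
    {"id": "c", "embedding": [0.0, 1.0], "language": "en", "flagged": False},
    {"id": "d", "embedding": [0.8, 0.2], "language": "hi", "flagged": False},
]
english_only = retrieve_candidates([1.0, 0.0], catalogue, k=3, language="en")
any_language = retrieve_candidates([1.0, 0.0], catalogue, k=3)
```

Because filtering can shrink the result, it is common to retrieve somewhat more candidates than you intend to show.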

Part 3: Ranking

Fig: Real-time recommendation architecture for YouTube (source)

Candidate set generation is a fast process in which we trade accuracy for efficiency to reduce the search space. Ranking is a more meticulous process that selects and sorts the top suggestions. Here we can use user features (age, gender, category affinity, engagement statistics, etc.) and video features (duration, type, language, number of impressions, etc.) to formulate a classification problem (for example, whether the user will click on the video, or whether the user will watch it for at least 30 seconds, depending on the use case). The probabilities predicted by this classifier then serve as the ranking score for our candidate set. XGBoost or LightGBM is a good enough starting point for building the classification model.
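The ranking step can be sketched as follows. To stay self-contained, the example uses a tiny logistic model with hand-set weights as a stand-in for a trained XGBoost or LightGBM classifier. The feature names, weights and data layout are all hypothetical.

```python
import math

# Hypothetical hand-set weights standing in for a trained classifier;
# a real model would learn these from interaction logs.
WEIGHTS = {"category_affinity": 2.0, "same_language": 1.0, "duration_minutes": -0.05}
BIAS = -1.0

def build_features(user, video):
    """Join user and video metadata into one feature row."""
    return {
        "category_affinity": user["category_affinity"].get(video["category"], 0.0),
        "same_language": 1.0 if user["language"] == video["language"] else 0.0,
        "duration_minutes": video["duration_minutes"],
    }

def click_probability(features):
    """Logistic score in (0, 1), interpreted as the probability of a click."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(user, candidates, top_n):
    """Sort candidates by predicted probability and keep the best top_n ids."""
    scored = [(click_probability(build_features(user, v)), v["id"]) for v in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_n]]

user = {"language": "en", "category_affinity": {"cats": 0.9, "football": 0.2}}
candidates = [
    {"id": "v1", "category": "football", "language": "en", "duration_minutes": 10},
    {"id": "v2", "category": "cats", "language": "en", "duration_minutes": 5},
    {"id": "v3", "category": "cats", "language": "hi", "duration_minutes": 5},
]
top_two = rank_candidates(user, candidates, top_n=2)
```

Swapping in a real model only changes `click_probability`; the feature join and the sort stay the same.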

We have now used the user and video metadata to sort and pick the top recommendations for our user. If needed, additional business logic can be applied after this step. To evaluate the quality of the recommendations we can use metrics such as recall@k, precision@k, NDCG and MRR, and once we are satisfied with the offline results we can run an online A/B test to validate the system. A detailed discussion of evaluation practices is reserved for another day. In the next part we will discuss the engineering aspects of designing our real-time recommendation system.
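For reference, the offline metrics named above can be sketched for a single user as follows, assuming binary relevance (an item is either relevant or not). MRR is then the mean of the reciprocal rank over all evaluated users.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

recommended = ["a", "b", "c", "d"]   # ranked output of the system
relevant = {"b", "d", "e"}           # items the user actually engaged with
metrics = {
    "precision@3": precision_at_k(recommended, relevant, 3),
    "recall@3": recall_at_k(recommended, relevant, 3),
    "reciprocal_rank": reciprocal_rank(recommended, relevant),
    "ndcg@3": ndcg_at_k(recommended, relevant, 3),
}
```

Precision and recall ignore the order within the top k, while the reciprocal rank and NDCG reward placing relevant items higher.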

Part 4: System Design

Fig: System design for recommendations and search (source)

Before moving forward, let's look at a few technologies that help power such a system.

Kafka

Apache Kafka is a distributed data store optimised for ingesting and processing streaming data in real-time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records in simultaneously. A streaming platform needs to handle this constant influx of data, and process the data sequentially and incrementally.

Kafka provides three main functions to its users:

  • Publish and subscribe to streams of records
  • Effectively store streams of records in the order in which records were generated
  • Process streams of records in real time

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.

Redis

Redis, which stands for Remote Dictionary Server, is a fast, open source, in-memory, key-value data store. Redis delivers sub-millisecond response times, enabling millions of requests per second for real-time applications in industries like gaming, ad-tech, financial services, healthcare, and IoT. Because of its fast performance, Redis is a popular choice for caching, session management, gaming, leaderboards, real-time analytics, geospatial, ride-hailing, chat/messaging, media streaming, and pub/sub apps.

Kubernetes

Kubernetes, also known as K8s, is an open source system for managing containerized applications across multiple hosts. It provides basic mechanisms for deployment, maintenance, and scaling of applications. Kubernetes helps us to fully implement and rely on a container-based infrastructure in production environments. Production apps span multiple containers, and those containers must be deployed across multiple server hosts. Kubernetes gives us the orchestration and management capabilities required to deploy containers, at scale, for these workloads.

Now let us see how these technologies come together to power a real-time recommendation system.

Workflow

Starting from the user end, each user's interactions are streamed through Kafka, and the ids of the last 5 or 10 videos they interacted with are stored in Redis. Whenever an API call requests recommendations for a user, these cached video ids are used to build the average embedding discussed previously. Once we have the query embedding, we can run a vector search on our vector database to get the candidate set.
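Keeping only the last few video ids per user maps naturally onto a capped Redis list. The sketch below assumes a client exposing `lpush`, `ltrim` and `lrange`, which are the method names redis-py uses for the corresponding Redis commands. The in-memory class is a hypothetical stand-in so the example runs without a server, and the `recent:` key prefix and `MAX_HISTORY` value are illustrative.

```python
class InMemoryListStore:
    """Tiny stand-in for the three Redis list commands used below.

    Only non-negative indices are supported, which is all this sketch needs.
    """

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        # Insert at the head of the list, like Redis LPUSH.
        self._lists.setdefault(key, []).insert(0, value)

    def ltrim(self, key, start, stop):
        # Redis LTRIM keeps the inclusive range [start, stop].
        self._lists[key] = self._lists.get(key, [])[start:stop + 1]

    def lrange(self, key, start, stop):
        # Redis LRANGE returns the inclusive range [start, stop].
        return self._lists.get(key, [])[start:stop + 1]

MAX_HISTORY = 5  # how many recent interactions to keep per user

def record_interaction(store, user_id, video_id):
    """Called once per interaction event, e.g. by a Kafka consumer."""
    key = f"recent:{user_id}"
    store.lpush(key, video_id)
    store.ltrim(key, 0, MAX_HISTORY - 1)

def recent_video_ids(store, user_id):
    """Most recent video ids for a user, newest first."""
    return store.lrange(f"recent:{user_id}", 0, MAX_HISTORY - 1)

store = InMemoryListStore()
for n in range(1, 8):
    record_interaction(store, "user-1", f"v{n}")
recent = recent_video_ids(store, "user-1")
```

With a real Redis client the two functions stay the same. Note that redis-py returns bytes unless the client is created with `decode_responses=True`.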

The user and video metadata must be stored in a feature store so that it is available with low latency. We can use Redis as the feature store, but you can experiment with alternatives. The classification model then uses this metadata to generate probabilities for our defined task, and the final recommendation set is sorted accordingly and returned to the user. We can use FastAPI to build the underlying APIs at every step of the process (you can also use Go, since latency is a major concern here). These APIs can then be deployed on Kubernetes so that the system is scalable.
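Putting the workflow together, a request handler might be sketched like this. Every dependency is injected as a plain callable, so the function and parameter names are hypothetical and no particular client library is implied. Skipping videos the user has just watched is an extra assumption, not something the architecture requires.

```python
def recommend(user_id, recent_ids_for, embedding_for, search, score, top_n=10):
    """Glue the stages together: history -> query vector -> candidates -> ranking.

    recent_ids_for(user_id)  -> list of video ids (e.g. read from Redis)
    embedding_for(video_id)  -> list of floats (e.g. read from the vector database)
    search(query_vector)     -> list of candidate video ids (vector search + filters)
    score(user_id, video_id) -> float, higher is better (the ranking model)
    """
    recent = recent_ids_for(user_id)
    if not recent:
        return []  # no history yet, so there is nothing to build a query from
    vectors = [embedding_for(video_id) for video_id in recent]
    # Plain average of the recent embeddings; a recency-weighted one also works.
    query = [sum(col) / len(vectors) for col in zip(*vectors)]
    seen = set(recent)
    candidates = [v for v in search(query) if v not in seen]  # assumption: skip watched
    candidates.sort(key=lambda v: score(user_id, v), reverse=True)
    return candidates[:top_n]

# Toy stand-ins for the real services.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
model_scores = {"c": 0.2, "d": 0.9}
result = recommend(
    "user-1",
    recent_ids_for=lambda user_id: ["a", "b"],
    embedding_for=lambda video_id: embeddings[video_id],
    search=lambda query: ["a", "c", "d"],
    score=lambda user_id, video_id: model_scores[video_id],
)
```

Injecting the stages as callables keeps the handler easy to test and lets each stage be replaced independently, for example by swapping the vector database or the ranking model.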

Conclusion

I hope this article has improved your understanding of how to get started with building a real-time recommendation system. Shifting from a batch to a real-time recommendation system may seem like a herculean task, but I would say the benefits outweigh the effort.

Well, that’s a wrap! I hope you liked it. If you have any suggestions or questions, comment below or reach out to me directly on LinkedIn.
