NVIDIA Triton Unleashed: Elevate Your App’s Performance Game! 🌐🔥

How we compute recommendations in real-time with low latency.

Simon Hawe
ProSiebenSat.1 Tech Blog
9 min read · Feb 15, 2024



Imagine you open a web page or an app and it takes ages to load. You would probably be quite annoyed. Worse, you most likely wouldn’t even wait that long but rather drop off, never come back, and tell your friends that it’s all crap. That would be a bummer 🤦‍♀! What do we learn from that? Loading speed, or latency, is decisive for an application’s user experience.

Still, we run into these issues in many apps. Why is that? Because serving the most relevant content to a user with low latency can be a huge technical challenge. This is especially true when you serve personalized content, where caching is not an option to speed things up. It becomes even more challenging when your personalization is based on machine learning models that have to be executed at request time.

In this article, Victoria (Associate Data Scientist), Nikita (Senior Data Scientist), and Simon (Vice President Engineering) show you how to serve content recommendations computed with a machine learning model at request time with low latency.

In short, we do some clever engineering 🤓 and use NVIDIA Triton under the hood for the heavy lifting 🏋. Want to know more details? Read on!

Background

At Joyn, we use the Two Tower model to create context-aware content recommendations. At a high level, this model consists of two neural nets. The first is called the Query Tower, which creates embeddings, i.e. long vectors, that encode users and context. The second is called the Item Tower, which creates embeddings that represent items and their metadata. Similarity between users and items is measured as the dot product between the created embeddings. As the item embeddings can be precomputed, this model is well suited for efficient real-time serving. If you are interested in more details, have a look at our previous article.
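To make this scoring step concrete, here is a minimal NumPy sketch of the idea; the array shapes and names are purely illustrative and not our production code.

```python
import numpy as np

# Illustrative shapes: 10,000 items with 64-dimensional embeddings (made-up numbers).
item_embeddings = np.random.rand(10_000, 64).astype(np.float32)  # precomputed by the Item Tower
user_embedding = np.random.rand(64).astype(np.float32)           # produced by the Query Tower at request time

# One dot product per item gives a relevance score for this user.
scores = item_embeddings @ user_embedding

# The highest-scoring items become the recommendation candidates.
top_n = 10
top_items = np.argsort(scores)[::-1][:top_n]
```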

For model training and serving, we use the NVIDIA Merlin framework. We train our models using Google Cloud Vertex AI Custom Jobs and export the created artifacts in the format required by the NVIDIA Triton Inference Server, which we use for efficient serving. To understand our approach and the necessary steps, let’s start by looking at what exactly we export.

The Training Artifacts

There are three essential parts that we export:

  1. A Keras model for the Query Tower and an NVTabular Workflow for data preprocessing. Both are exported together as an ensemble in Triton server format.
  2. The matrix of pre-computed item embeddings from the Item Tower, which we export in feather format.
  3. A JSON file containing the model version string that is a unique identifier for each training job.

We put all artifacts into a Google Cloud Storage (GCS) folder that is unique per model version. Apart from that, we also put them into a fixed-location GCS folder that always holds “the latest model”. The former we use for debugging and archiving purposes; the latter is the one used by our prediction service.
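To give you an idea, exporting the second and third artifact can be as simple as the following sketch; the file names and the embedding content are placeholders, and the Triton ensemble export itself is handled by Merlin’s export utilities.

```python
import json

import numpy as np
import pandas as pd

# Placeholder content, for illustration only.
model_version = "2024-02-15-083000-abc123"  # unique identifier of the training job

# 2. Pre-computed item embeddings from the Item Tower, stored in feather format.
item_embeddings = pd.DataFrame(
    np.random.rand(10_000, 64).astype(np.float32),
    columns=[f"dim_{i}" for i in range(64)],
)
item_embeddings["item_id"] = np.arange(10_000)
item_embeddings.to_feather("item_embeddings.feather")

# 3. The model version string as a small JSON file.
with open("model_version.json", "w") as f:
    json.dump({"model_version": model_version}, f)

# 1. The Query Tower plus the NVTabular workflow are exported separately as a
#    Triton ensemble by Merlin, and all artifacts are then uploaded to both the
#    versioned and the fixed "latest" GCS folders.
```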

Speaking of the prediction service, how does this look in more detail and how does it efficiently load the artifacts and reload models in production? Let’s have a look at this in the next section.

Prediction Service Architecture

In the picture below, we visualize our prediction service architecture.

Prediction service architecture (Picture by the authors drawn with excalidraw)

This is essentially a FastAPI application that receives HTTP requests containing user features and responds with a list of recommendations. Next to the FastAPI app, we have a Triton Inference Server running in parallel. The FastAPI app and the Triton server communicate locally and asynchronously via gRPC. The entire service, consisting of the app and the inference server, is containerized: both run within the same Docker container but as separate processes. We will discuss later why exactly we ended up with this one-container-two-processes setup.
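As a rough sketch, such a container entry point can be a small Python script that launches both processes; the model repository path, ports, and module names below are assumptions, not our exact setup.

```python
import subprocess

# Start Triton as a child process. gRPC is served on localhost:8001 by default;
# explicit model control mode enables the reload API used for model updates.
triton = subprocess.Popen(
    [
        "tritonserver",
        "--model-repository=/models",
        "--model-control-mode=explicit",
        "--load-model=*",  # still load everything at startup despite explicit mode
    ]
)

# Start the FastAPI app via uvicorn as a second child process ("app.main:app" is a placeholder).
api = subprocess.Popen(["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"])

# Keep the container alive as long as the two processes run.
for process in (triton, api):
    process.wait()
```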

In addition, we have a hot storage where we keep real-time values for user features that are not available in the request itself, such as demographics or recent interaction data of our users. We use Google Cloud Spanner for this, as it provides low-latency key lookups as well as support for more complex queries.

We query this database at request time to enrich the user context that the Query Tower turns into user embeddings.
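A point lookup against this hot storage could look roughly like the snippet below, using the google-cloud-spanner client; instance, database, table, and column names are placeholders.

```python
from google.cloud import spanner

# Placeholder identifiers, for illustration only.
client = spanner.Client()
database = client.instance("recsys-instance").database("user-features")


def fetch_user_features(user_id: str) -> dict:
    """Low-latency lookup of real-time user features by primary key."""
    with database.snapshot() as snapshot:
        rows = snapshot.read(
            table="user_features",
            columns=("user_id", "age", "gender"),
            keyset=spanner.KeySet(keys=[[user_id]]),
        )
        for _, age, gender in rows:
            return {"age": age, "gender": gender}
    return {}
```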

Service Startup

Before diving into what happens when the service starts, let’s first recap what we aim to do. We want to compute a score for each item given the user data. We get this score as the dot product between the item embeddings and the user embedding. Unsurprisingly, we need access to those embeddings at request time.

As mentioned, item embeddings are precomputed at training time and loaded on service startup. To get the user embeddings, we have to run the given user context through the Query Tower. This requires loading the corresponding weights and setting up the respective neural network.

With this in mind, let’s break down the startup procedure:

  1. The Docker container entry point launches two parallel processes — the FastAPI app and the Triton server.
  2. Both the FastAPI app and the Triton server load the model artifacts from the same fixed GCS location given by an environment variable. Whenever we retrain and hence update our model, the respective artifacts get written to exactly this GCS location.
  3. The FastAPI app loads the pre-trained item embeddings from the feather file into a global Pandas DataFrame object. Additionally, it loads the model version from the JSON file into a global string object. This is the cornerstone of our model update procedure, which we will break down later.
  4. The Triton server loads the preprocessing workflow and the Query Tower Ensemble.

Steps three and four run simultaneously as the two processes are executed in parallel.
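On the FastAPI side, steps two and three roughly boil down to the following sketch; the file names are placeholders, and in reality the artifacts are read from the fixed GCS location given by the environment variable.

```python
import json

import pandas as pd

# Global objects shared between the request handlers and the update thread (see below).
ITEM_EMBEDDINGS = None  # Pandas DataFrame with the pre-computed item embeddings
MODEL_VERSION = None    # unique identifier of the currently loaded model


def load_artifacts(artifact_dir: str) -> None:
    """Load the item embeddings and the model version into global objects at startup."""
    global ITEM_EMBEDDINGS, MODEL_VERSION
    ITEM_EMBEDDINGS = pd.read_feather(f"{artifact_dir}/item_embeddings.feather")
    with open(f"{artifact_dir}/model_version.json") as f:
        MODEL_VERSION = json.load(f)["model_version"]
```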

Now, everything is ready to serve incoming requests.


Note: To run the Triton inference server we use the official NVIDIA Triton Docker image. However, due to its huge size as well as the described data loading steps, the startup of the service can take up to one minute, which is sub-optimal for serverless platforms like Cloud Run.

Serving Recommendations

Now, the service is up and running and ready to serve recommendations. Each request that comes to the service contains a user ID and some optional context features such as device information. As a first step, the FastAPI app enriches this data with time-based features, i.e. time of day and day of week. Furthermore, it obtains user features like age and gender from the hot storage. All this data is then sent as input features to the Triton server via gRPC, where our Query Tower runs.
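The call from the FastAPI app to the local Triton server could look roughly like this, using the asyncio flavour of the Triton gRPC client; the model name, input datatypes, and output name are assumptions that depend on the exported ensemble.

```python
import numpy as np
from tritonclient.grpc import InferInput, InferRequestedOutput
from tritonclient.grpc.aio import InferenceServerClient
from tritonclient.utils import np_to_triton_dtype

# Triton's gRPC endpoint runs in the same container, so we talk to localhost.
client = InferenceServerClient(url="localhost:8001")


async def compute_user_embedding(features: dict[str, np.ndarray]) -> np.ndarray:
    """Send the enriched user features to the Triton ensemble and return the user embedding."""
    inputs = []
    for name, values in features.items():
        infer_input = InferInput(name, list(values.shape), np_to_triton_dtype(values.dtype))
        infer_input.set_data_from_numpy(values)
        inputs.append(infer_input)

    response = await client.infer(
        model_name="query_tower_ensemble",  # placeholder name of the exported ensemble
        inputs=inputs,
        outputs=[InferRequestedOutput("user_embedding")],  # placeholder output name
    )
    return response.as_numpy("user_embedding")
```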

In Triton, the preprocessing workflow and the Query Tower run together as a Directed Acyclic Graph (DAG). When Triton receives a request with raw user features, it first runs them through the preprocessing workflow, where the features undergo the same transformations as during training. This is a crucial step for correct inference and a core strength of using Merlin and NVTabular.

Next, the processed features are fed into the Query Tower, which produces an embedding for the given user features. This embedding is returned to the FastAPI app, where we compute the dot product between the user embedding and all item embeddings. This gives us a score for each item representing how likely this user is to like it. Finally, we sort the items by score and return the top n as recommendations to the user.

As you can see, this setup is pretty simple and serves recommendations efficiently, as we only need to perform resource-heavy neural net inference with half of the model, the Query Tower, and for only a single user. Computing a dot product, in turn, is very cheap thanks to NumPy.

Model Update

Until now, we have covered the easy part: starting our service from scratch. However, we re-train the model every day to include new users and items. So how do we rotate the model seamlessly and efficiently without incurring any service downtime? And how do we make sure that the item embeddings in the FastAPI app and the Query Tower in Triton stay in sync, meaning they come from the same model version?

We cannot simply have both processes poll the GCS bucket with the latest artifacts independently, as there would inevitably be a time window in which the FastAPI app has finished loading the new item embeddings but Triton is still using the old Query Tower, or vice versa. Why is that a problem? Because the user and item embeddings would then not come from the same vector space, so their similarity would be completely random. And what if we change the model architecture or add new features? All of that would lead to different embedding spaces and random recommendations for the end users. That would be a bummer 🤦‍♀. Here is how we solved this challenge.

For two independent entities to do something in sync, they need to “see” each other’s actions so that each can do its part at the right time. That means the FastAPI app and the Triton server need to communicate when a model update is running and when it has finished.

Luckily, Triton exposes an API to control model updates. We can call this API on the local Triton server to command it to reload the model. What is great is that Triton keeps serving the old model while loading the new one. Once it is done, we get notified, and the switch between versions is seamless.

As Triton can be commanded and we have the FastAPI app fully under our control, the FastAPI app coordinates everything needed during a model update. But how do we do that in a non-blocking way? You guessed it: threads for the win.

The main thread in our FastAPI app serves requests asynchronously, while another background thread coordinates model updates. The whole process is depicted in the image below. Let’s get through it step by step.

Model artifacts loading (Picture by the authors drawn with excalidraw)

The FastAPI app loads the item embeddings and the unique model version identifier on startup. In addition, we launch a background thread inside the app that periodically loads the JSON file containing the model version and compares it with the global model version string. Once the background thread detects a change in version, it loads the new item embeddings into a temporary object.

However, it does not swap the in-memory global item embeddings yet. First, it tells Triton to update the model. Triton then loads the new model from GCS in the background and responds only when it has finished the switch. To repeat: while the new model is loading, we can still query Triton, and it keeps using the old model. After Triton responds with “OK, I’m now using the new model”, the background thread swaps the item embeddings and model version global objects with the newly loaded ones.
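Putting it together, the background thread could look roughly like the sketch below; the polling interval, file names, and model name are placeholders, error handling is omitted, and the global objects are the ones loaded at startup.

```python
import json
import time

import pandas as pd
import tritonclient.grpc as triton_grpc


def watch_for_model_updates(artifact_dir: str, poll_seconds: int = 300) -> None:
    """Poll the model version file and swap the globals only after Triton has reloaded."""
    global ITEM_EMBEDDINGS, MODEL_VERSION
    # Synchronous client is fine here, since this runs in its own background thread.
    triton = triton_grpc.InferenceServerClient(url="localhost:8001")

    while True:
        with open(f"{artifact_dir}/model_version.json") as f:
            new_version = json.load(f)["model_version"]

        if new_version != MODEL_VERSION:
            # 1. Load the new item embeddings into a temporary object.
            new_embeddings = pd.read_feather(f"{artifact_dir}/item_embeddings.feather")

            # 2. Tell Triton to reload; it keeps serving the old model until the
            #    switch has happened and only then returns.
            triton.load_model("query_tower_ensemble")  # placeholder model name

            # 3. Only now swap the global objects so both sides stay in sync.
            ITEM_EMBEDDINGS = new_embeddings
            MODEL_VERSION = new_version

        time.sleep(poll_seconds)
```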

This way we (almost) always keep item embeddings and the Query Tower in sync and do not incur any downtime. Mission accomplished🕺.


Results

This setup allows us to serve thousands of requests per minute at under 100 ms latency per request. To emphasize it again, these 100 ms include both the neural network inference and at least two Spanner queries. All of that runs with only 2 CPUs and without any GPU acceleration. We think that is pretty fair 😃.

Wrap up

Effectively adapting a recommender system that performs request-time computation to your use case and serving it in a production environment that scales to millions of requests is challenging. In this article, we showed how we used the NVIDIA Merlin framework, FastAPI, and the Triton Inference Server to solve this challenge at Joyn and enable real-time, context-aware content recommendations.

In the next article, we will provide several practical snippets and hacks that you might find useful (or sometimes even necessary) when building your real production recommender system using NVIDIA Merlin and Triton. Stay tuned.

Thank you for following this post. As always, feel free to contact or follow us for questions, comments, or suggestions either here on Medium or via LinkedIn Victoria, Nikita, and Simon. 🙌


Simon Hawe
ProSiebenSat.1 Tech Blog

Tech and programming enthusiast working at Joyn, mainly focusing on data science, machine learning, data engineering, and Python coding.