Scaling ML — REST & Transformers Don’t Mix

Yashu Seth · Published in Ushur Engineering · Jun 22, 2021

Why we dropped REST and picked up queues.

The Ushur Language Intelligence stack (LISA) is a proprietary library responsible for a variety of machine learning capabilities in our platform. This article takes you through our ML services and the challenges of scaling them, and explains how we serve many (and large) models at scale, including transformers! This is one perspective on MLOps from Ushur Engineering.

RabbitMQ

At Ushur, we provide a platform for business users to build customer experience automation workflows. That means we have to simplify ML for our users and expose only the easiest-to-use capabilities when it comes to our models. Simplicity, scalability, and predictability are key for our users.

REST is a popular way to serve ML models, even though it's not particularly well suited to large language models like those used at Ushur. It's not hard to understand why REST is popular: it's simple to learn, it's flexible in that it serializes data as XML or JSON, and it handles high load with HTTP proxy servers and caches. But REST falters when we have to serve large ML models. Why so?

  • If your ML model server ever goes down, in-flight REST requests are lost. When the server comes back up, the clients are responsible for resending them.
  • Today, most transformer-based models have inference latencies on the order of a few seconds on CPUs (and GPUs are costly!). In a microservices architecture, this can result in a lot of back-pressure.
  • We also let users train their own ML models, a long-running task that aggravates the back-pressure problem even further.

In a distributed environment where more and more services need to communicate with each other, message queues are an ideal solution to the common problems REST services face. Requests are queued and processed asynchronously without blocking the client, and your ability to absorb bursts of heavy activity improves because the actual processing can be spread across time.

RabbitMQ is one of the most popular open-source message brokers. It is best suited for long-running tasks and for integrating applications. It works as a middleman between microservices, and we use it for our inter-service communications.
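
As a rough illustration, here is how a client service might enqueue a training request with kombu. The queue, exchange, broker URL, and payload names below are made up for the example; they are not Ushur's actual ones:

    from kombu import Connection, Exchange, Queue

    # Illustrative names; the real Ushur queues and payloads are internal.
    ml_exchange = Exchange("ml.requests", type="direct", durable=True)
    train_queue = Queue("ml.train", exchange=ml_exchange,
                        routing_key="train", durable=True)

    with Connection("amqp://guest:guest@localhost:5672//") as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            {"task": "train", "model_id": "intent-classifier-v1"},
            exchange=ml_exchange,
            routing_key="train",
            declare=[train_queue],  # create the queue if it does not exist yet
            delivery_mode=2,        # persistent: survives a broker restart
        )

Because the message is persistent and the queue is durable, the request outlives both a client disconnect and a broker restart, which is exactly what a plain REST call cannot offer.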

LINode (Language Intelligence Node)

We went with a microservices-based architecture for serving our ML models. By modularizing these services, we can deal with Python- and machine-learning-specific intricacies independently.

The Language Intelligence Node (aka LINode) is the first point of interaction for all machine learning-based Ushur services. It is a Python multiprocessing-based service responsible for all the ML features in the LISA stack.

LINode Flow

LINode Multiprocessing

LINode deals with various tasks, some of them CPU intensive: letting users train a model, for example. Requests that depend on other ML services (for example, sentiment as a service or data anonymizer as a service) are mostly I/O bound. With such varied traits in the incoming requests, we use multiprocessing plus green threads to optimize for latency.

Every task (for example, train, infer, information extraction, and so on) has a set of dedicated processes: CPU-intensive tasks get a pre-configured process pool, while I/O bound tasks get one process with multiple greenlets. As we scale and support more features in the future, separating every LINode subprocess into an independent service is also a feasible option.
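
Here is a minimal sketch of that layout using Python's multiprocessing module; the task names and pool sizes are illustrative, not LINode's actual configuration:

    import multiprocessing as mp

    # Hypothetical task-to-pool-size mapping; LINode's real config is internal.
    TASK_POOL_SIZES = {"train": 2, "infer": 4, "extract": 2}

    def task_loop(task_name: str) -> None:
        # Stand-in for the real loop: each process would attach a queue
        # consumer for its task and process messages as they arrive.
        print(f"worker process for '{task_name}' started")

    if __name__ == "__main__":
        processes = []
        for task, pool_size in TASK_POOL_SIZES.items():
            for _ in range(pool_size):
                proc = mp.Process(target=task_loop, args=(task,))
                proc.start()
                processes.append(proc)
        for proc in processes:
            proc.join()

Because each task type owns its processes, a slow training job cannot starve inference traffic.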

LINode Green Threads

For I/O bound tasks, multithreading can be used to spawn worker pools. But regular threads have a relatively high context-switch overhead, which forces us to cap the number of concurrent threads at a low number. Greenlets do not carry this overhead and let us overcome the problem.

Greenlets (also known as green threads) are lightweight thread-like structures scheduled and managed inside the process. Like POSIX threads, green threads are also a mechanism to support the multithreaded execution of programs.

This is what the Bottle documentation says about greenlets:

Most servers limit the size of their worker pools to a relatively low number of concurrent threads, due to the high overhead involved in switching between and creating new threads. While threads are cheap compared to processes (forks), they are still expensive to create for each new connection.

The gevent module adds greenlets to the mix. Greenlets behave similar to traditional threads but are very cheap to create. A gevent-based server can spawn thousands of greenlets (one for each connection) with almost no overhead. Blocking individual greenlets has no impact on the server’s ability to accept new requests. The number of concurrent connections is virtually unlimited.

Greenlets are a perfect fit for all I/O bound tasks on LINode because they allow us to serve multiple requests concurrently.
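
A minimal gevent sketch of the idea, with urllib standing in for a downstream I/O-bound service call:

    from gevent import monkey
    monkey.patch_all()  # make blocking socket I/O cooperative

    import gevent
    from gevent.pool import Pool
    import urllib.request  # stand-in for a downstream service call

    def call_downstream(url: str) -> int:
        # The greenlet yields while waiting on the network, so thousands
        # of these can run concurrently inside a single process.
        with urllib.request.urlopen(url) as resp:
            return resp.status

    pool = Pool(1000)  # far larger than any practical OS-thread pool
    jobs = [pool.spawn(call_downstream, "https://example.com") for _ in range(50)]
    gevent.joinall(jobs)
    print([job.value for job in jobs])

Blocking any single greenlet here has no effect on the others, which is exactly the property the Bottle documentation describes above.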

RabbitMQ Connection in Python

Kombu is an open-source Python library that provides an interface for consuming and publishing messages with RabbitMQ. It supports reconnection strategies, connection pooling, and failover strategies, saving us the time of building message-handling capabilities ourselves. We run one Kombu consumer per LINode subprocess. Each Kombu consumer running in a LINode subprocess is responsible for consuming messages, publishing messages, handling internal errors, and generally ensuring messages are handled correctly.
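
A minimal sketch of what one such consumer might look like using kombu's ConsumerMixin (queue names are again illustrative, not Ushur's):

    from kombu import Connection, Exchange, Queue
    from kombu.mixins import ConsumerMixin

    # Illustrative names; the real LINode queues are internal to Ushur.
    ml_exchange = Exchange("ml.requests", type="direct", durable=True)
    train_queue = Queue("ml.train", exchange=ml_exchange,
                        routing_key="train", durable=True)

    class TrainConsumer(ConsumerMixin):
        def __init__(self, connection):
            self.connection = connection  # ConsumerMixin expects this attribute

        def get_consumers(self, Consumer, channel):
            return [Consumer(queues=[train_queue], callbacks=[self.on_message])]

        def on_message(self, body, message):
            print("received:", body)  # application-specific processing goes here
            message.ack()  # acknowledge only after successful handling

    with Connection("amqp://guest:guest@localhost:5672//") as conn:
        TrainConsumer(conn).run()  # reconnects automatically on connection errors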

Ushur ML Serving

With so many common concerns to take care of, Ushur created a higher-level abstraction over the Kombu consumers. We can thereby ensure that any new development involves only new application-specific logic while we reuse all existing RabbitMQ and Kombu code. This need and freedom together brought into existence Ushur's internal Python library, ushur_ml_serving, and this section is a detailed walkthrough of the logic behind it. This library currently has two main modules:

  • Ushur RMQ Module: consists of all Kombu and RabbitMQ consumer abstractions.
  • Ushur ZMQ Module: consists of all ZeroMQ abstractions.

The Ushur RMQ Module isolates the RabbitMQ and application-specific code in two separate abstractions: UshurRMQHandler and UshurMLWorker.

Detailed LINode Flow

UshurMLBaseWorker

UshurMLBaseWorker is the worker base class in the Ushur RMQ Module. A TrainWorker, for example, inherits from UshurMLBaseWorker and handles all the model training-related code. Essentially, the base class ensures that the input and output formats are consistent. What's important is that these workers are purposely unaware of all the RMQ and Kombu details; they deal only with application-specific logic. This level of abstraction is essential because it lets new developers simply drop their application code inside a worker and let the library take care of the rest. It also promotes reuse and makes it easy for us to continually build on top of the existing architecture.
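
Since ushur_ml_serving is internal, the following is only a hypothetical sketch of the shape such a worker hierarchy might take:

    from abc import ABC, abstractmethod

    class UshurMLBaseWorker(ABC):
        # Workers see only application logic; RMQ and Kombu details live elsewhere.

        @abstractmethod
        def process(self, request: dict) -> dict:
            """Take a request payload and return a response payload."""

    class TrainWorker(UshurMLBaseWorker):
        def process(self, request: dict) -> dict:
            # Real model-training logic would live here; this stub just echoes back.
            return {"model_id": request["model_id"], "status": "trained"}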

UshurRMQHandler

The RMQ handler does not know how to process particular messages, so it takes a UshurMLBaseWorker object as input and uses it as a set of directions for that processing. It has a number of responsibilities when it comes to interacting with the RabbitMQ LINode connection (a sketch follows the list):

  • Takes a Kombu consumer as input and ensures it listens on request queues and consumes messages.
  • Ensures graceful recovery from any network issues between the RabbitMQ server and the Kombu consumer. The RabbitMQ queues are persistent, so requests can be re-read after recovering from a crash.
  • Processes the message (for example, train a model for a Train consumer) using the input worker object.
  • Ensures graceful recovery from any application-specific internal errors (for example, inference from a model that has not been trained). It does not let the worker object throw unexpected errors.
  • Handles loopback parameters. These are parameters in the request body that just need to be passed as is in the response body.
  • Responsible for publishing the response so that client services can consume messages and continue with the flow.
  • Uses UshurRMQGreenHandler for I/O bound tasks, which has the added responsibility of initializing and spawning greenlet pools.
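
Again, because the library is internal, the following is only a hypothetical sketch of how a handler with these responsibilities might look, including the loopback passthrough:

    class UshurRMQHandler:
        # Hypothetical shape only: pairs an application worker with queue plumbing.

        def __init__(self, worker, producer, response_exchange, routing_key):
            self.worker = worker
            self.producer = producer
            self.response_exchange = response_exchange
            self.routing_key = routing_key

        def on_message(self, body, message):
            try:
                response = self.worker.process(body)
            except Exception as exc:
                # Never let worker errors escape; report them in the response.
                response = {"status": "error", "reason": str(exc)}
            # Loopback parameters are copied from request to response as is.
            response["loopback"] = body.get("loopback", {})
            self.producer.publish(
                response,
                exchange=self.response_exchange,
                routing_key=self.routing_key,
            )
            message.ack()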

Conclusion

REST is great for quickly iterating on applications that serve ML models, but it struggles as models grow in size and complexity. State-of-the-art (SOTA) models keep getting bigger, so this problem of scale is clearly not going away. We've come up with a solution that works for our microservices architecture and makes it easy for Ushur customers to consume state-of-the-art models on our Customer Experience Automation (CXA) platform without compromising on user experience.

If you have thoughts on using queues to scale MLOps, let us know! I hope you enjoyed the post. Please stay tuned for more exciting content from Ushur Engineering.
