Serving machine learning models with Ray Serve

Vasil Dedejski
5 min read · Aug 25, 2023


Building intelligent, data-driven solutions that rely on machine learning models is one of the most interesting emerging topics. But anyone who has tried to build such a solution knows that there are crucial points in designing the architecture where things tend to get really complicated, really fast. In this post we will cover the challenges that you might encounter when “serving” models to an endpoint and how you can overcome them without building a complex architecture around your app.

Introduction

Traditionally, running machine learning models in a production environment has involved orchestrating complex setups with components like message queues (such as Redis or RabbitMQ) and worker frameworks like Celery. These systems were stitched together to handle tasks like load balancing, managing worker processes, and maintaining communication between different parts of the application. While effective, this approach often required significant engineering effort to ensure fault tolerance, efficient scaling, and low-latency responses.

This is where Ray Serve comes in, a framework that simplifies the journey. With Ray Serve, the process of deploying machine learning models is streamlined, abstracting away much of the intricate setup. Ray Serve elegantly combines the power of message queues and worker processes under the hood, providing a single cohesive solution for deploying, managing, and scaling models. This approach eliminates the need to manage multiple components separately and dramatically reduces the complexity associated with traditional setups. Ray Serve’s intuitive API allows developers to focus on the core functionality of their models, while the framework takes care of load distribution, fault tolerance, and dynamic scaling. This shift not only accelerates the deployment process but also enhances the reliability and responsiveness of the served models.

Implementation

One of the practices that I have personally found invaluable when working with model serving is the strategic decoupling of the served machine learning model into a separate Docker container. By isolating the model within its own container, it gains independence from the intricacies of the rest of the application. This not only fosters modularity and maintainability but also greatly simplifies scaling and incorporating new models into the system. Each model container becomes a self-contained unit, immune to changes or updates in other components, enabling seamless scaling without affecting the entire application. This modular approach aligns well with the agile nature of modern software development and harmonizes exceptionally well with frameworks like Ray Serve, where the independent containers can be dynamically managed and scaled according to demand. As we explore the nuances of this practice, its role in simplifying deployment pipelines and accelerating the development lifecycle becomes abundantly clear.

For this we develop the following file structure:

- Dockerfile
- model_deployment.py
- requirements.txt

This simple file structure enables us to launch a model just by running model_deployment.py.

from typing import Dict

import torch
from starlette.requests import Request

import ray
from ray import serve
from ray.serve.drivers import DAGDriver

from sentence_transformers import SentenceTransformer


# Asynchronous function to resolve incoming JSON requests
async def json_resolver(request: Request) -> dict:
    """
    Resolve incoming JSON requests asynchronously.

    Args:
        request: The incoming HTTP request containing JSON data.

    Returns:
        A dictionary representing the parsed JSON data.
    """
    return await request.json()


# Step 1: Wrap the pretrained sentence embedding model in a Serve deployment.
@serve.deployment
class ModelDeployment:
    def __init__(self):
        """
        Initialize the ModelDeployment class.

        This constructor initializes the class and loads the pretrained sentence embedding model.
        """
        self._model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

    def __call__(self, data: dict) -> Dict:
        """
        Embed texts using sentence transformers.

        Args:
            data: The input data containing a list of texts to embed.

        Returns:
            A dictionary containing embeddings of input texts, each represented as a list of floats.
        """
        # json_resolver has already parsed the request body into a dict,
        # so the list of texts can be read from it directly.
        input_texts = data['input']
        embeddings = [torch.from_numpy(self._model.encode(text, convert_to_numpy=True)).tolist() for text in input_texts]
        response = {'data': embeddings}
        return response


# Step 2: Deploy the model deployment.
ray.init(address='ray://localhost:10001')
serve.run(DAGDriver.bind(ModelDeployment.bind(), http_adapter=json_resolver), host="0.0.0.0", port=8888)

Here, by changing the values passed to the serve.deployment decorator, you can manage the deployment: add more replicas (num_replicas), set health check periods, cap the maximum number of concurrent queries, and tune many other options available in the docs (https://docs.ray.io/en/latest/serve/api/doc/ray.serve.deployment_decorator.html#ray-serve-deployment).
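For example, a configured decorator could look roughly like this (the specific values are illustrative, not recommendations):

# Illustrative values only; tune them to your workload.
@serve.deployment(
    num_replicas=2,               # run two copies of the model behind the endpoint
    max_concurrent_queries=16,    # limit in-flight requests per replica
    health_check_period_s=10,     # how often Serve health-checks each replica
    ray_actor_options={"num_cpus": 1},
)
class ModelDeployment:
    ...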

The ModelDeployment class wraps the pretrained sentence embedding model using the @serve.deployment decorator. The __init__ method initializes the class by loading the "sentence-transformers/all-mpnet-base-v2" model. The __call__ method takes a dictionary as input, extracts the list of texts to be embedded, and iterates through them, generating embeddings using the Sentence Transformers model. The embeddings are accumulated and returned as a dictionary response.

Deploying the Model: The ray.init call connects to the running Ray cluster at the given address. Then, the serve.run function binds the ModelDeployment class to a DAGDriver, using the json_resolver function to handle incoming requests. The deployment is hosted on "0.0.0.0" at port 8888.
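Once the deployment is running, the endpoint can be exercised with a plain HTTP request. A minimal client sketch could look like this (assuming port 8888 is reachable from where the client runs):

import requests

# Minimal client sketch; assumes the service above is running on localhost:8888.
payload = {"input": ["Ray Serve makes model serving simple.", "Another sentence to embed."]}
response = requests.post("http://localhost:8888/", json=payload)
embeddings = response.json()["data"]
print(len(embeddings), len(embeddings[0]))  # number of texts, embedding dimension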

# Dockerfile

# Set the base image to Python 3.10
FROM python:3.10

# Set the working directory within the container
WORKDIR /app

# Upgrade pip
RUN pip3 install --upgrade pip

# Copy the requirements file and install the dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

# Copy the model_deployment.py script into the container
COPY model_deployment.py ./

# Expose port 8888 for external access
EXPOSE 8888

# Define the command to run when the container starts
CMD [ "bash", "-c", "ray start --head --block --object-manager-port=8076 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=8266"]

This Dockerfile is a concise and effective way to containerize your machine learning model deployment. It specifies the necessary steps to set up the environment, install dependencies, and start the Ray cluster.

In summary, this Dockerfile sets up a Python 3.10 environment, installs the required dependencies from the requirements.txt file, and copies the model_deployment.py script into the container's working directory. The EXPOSE directive makes port 8888 accessible from outside the container, allowing communication with the deployed model. Lastly, the CMD instruction launches the Ray cluster with specific configurations, including starting the Ray dashboard accessible on port 8266.

# requirements.txt

ray[serve]~=2.0.1
starlette==0.20.4
sentence-transformers==2.2.2
torch==2.0.0
pandas==2.0.1

The requirements.txt file lists the dependencies needed for the model deployment with Ray Serve, pinned to specific versions to ensure compatibility. Some of these versions may need to be updated depending on when you are reading this article.

Now we can use the docker-compose up command, which builds and starts the containerized environment (a minimal compose file is sketched below). By employing sudo docker exec -it container-id /bin/bash, you can open a terminal inside the running container. This lets you interact with your deployment directly, inspect its behavior, and make real-time adjustments as needed. Finally, running the model_deployment.py script from within the container deploys the machine learning model, leveraging Ray Serve's capabilities for smooth deployment and scaling.
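A docker-compose.yml along these lines should be enough (the service name and shared-memory size are illustrative assumptions, not part of the original setup):

# docker-compose.yml (minimal sketch; service name and shm_size are illustrative)
version: "3.8"
services:
  model-serving:
    build: .
    ports:
      - "8888:8888"   # Ray Serve HTTP endpoint
      - "8266:8266"   # Ray dashboard
    shm_size: "2gb"   # Ray keeps its object store in shared memory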

Moreover, we can configure the Dockerfile to run model_deployment.py on startup, deploying the model as soon as the container comes up.
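One way to do that is to adjust the final CMD instruction, for example along these lines (an untested sketch; --block is dropped so the deployment script can run after ray start returns, and tail keeps the container alive):

# Start the Ray head node, deploy the model, then keep the container running.
CMD [ "bash", "-c", "ray start --head --object-manager-port=8076 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=8266 && python model_deployment.py && tail -f /dev/null" ]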

Conclusion

In this post we have quickly demonstrated how to spin up a sentence-transformer model (or any other kind of model) in a scalable environment without developing a complex architecture, simply by leveraging the Ray Serve library.

Model deployed. Happy coding :)


