The Simplest Way to Serve your NLP Model in Production with Pure Python
From scikit-learn to Hugging Face Pipelines, learn the simplest way to deploy ML models using Ray Serve.
Training a machine learning model is only half the battle; once a model is trained and validated it needs to be deployed in production. This comes with a slew of operational challenges that are far outside the realm of designing neural networks and tuning hyperparameters. There are many existing tools for serving ML models in production, but these are often complex to deploy, test, and scale, and they require a number of trade-offs. For instance, while TorchServe integrates nicely with PyTorch and provides a lot of essential features, you’re probably going to need experience running Java services in production in order to use it successfully.
We’ve watched the ecosystem around model serving go through various iterations — keeping our eye on what makes great infrastructure for serving machine learning models. We’ve concluded that there are several key properties for general-purpose ML serving infrastructure:
- Framework Agnostic: Model serving frameworks must be able to serve TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Why? People use many different machine learning frameworks, and machine learning models are typically surrounded by lots of application logic. Limiting serving to a simple forward pass through a TensorFlow model is too restrictive.
- Pure Python: Model serving should be intuitive for developers and dead simple to run. That means pure Python and no verbose configurations or YAML files. Developers use Python to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This is growing more critical as online learning applications combine training and serving in the same application.
- Simple & Scalable: Model serving must be simple to scale out of the box across many machines, along with simple ways of upgrading models over time. Meeting production uptime and performance requirements is essential for success.
To satisfy these requirements, we built Ray Serve. Ray Serve is built on top of Ray and sits alongside other Ray libraries for hyperparameter optimization, distributed training, and reinforcement learning.
In this post, we’ll highlight Ray Serve’s capabilities by building a simple sentiment classifier with scikit-learn and deploying it as a production service. Then, we’ll upgrade the same service to a more advanced Hugging Face pipeline that uses PyTorch under the hood — with zero downtime. Finally, we’ll conclude with some of the other baked-in capabilities such as scaling out and batching.
To get up and running with Ray Serve, you’ll need to install a few Python dependencies along with Ray. To install everything needed for the examples below, run:
# Install Ray Serve.
pip install "ray[serve]"
# Install the other packages required for this example.
pip install -U requests==2.26.0 s3fs==2021.8.0 joblib==1.0.1 scikit-learn==0.23.0 transformers==4.9.2 pygments==2.10.0
Initializing Ray Serve
First, we need to start Ray Serve, which runs as a service on top of a Ray cluster. Here, we’ll start a single-node Ray cluster for simplicity; to scale out to multiple nodes, you would start a multi-node Ray cluster first (docs). Start the Ray cluster:
ray start --head
Tip: Open localhost:8265 to view the Ray dashboard.
Then, to start Ray Serve, run the following Python script:
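A minimal sketch of such a script, assuming the Ray 1.x Serve API that matches the package versions pinned above (newer Ray releases changed parts of this API). The function name `start_serve` and the `serve` namespace are illustrative assumptions, and the imports are deferred into the function so the file can be imported without connecting to a cluster.

```python
def start_serve(address="auto", namespace="serve"):
    """Connect to the running Ray cluster and start a detached Serve instance.

    Assumes `ray start --head` has already been run on this machine and that
    the installed Ray version accepts `serve.start(detached=True)`.
    """
    # Imported here so that importing this module has no side effects.
    import ray
    from ray import serve

    # "auto" attaches to the cluster started by `ray start --head`.
    # The namespace name is an assumption made for this sketch.
    ray.init(address=address, namespace=namespace)

    # detached=True keeps Serve running after this script exits.
    serve.start(detached=True)
```

Calling `start_serve()` once should be enough; later scripts would then connect to the same cluster and namespace before deploying anything.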
When serve.start(detached=True) runs, it starts a few Ray actors that Ray Serve uses to proxy HTTP requests and route them to the appropriate models. To see this in action, we’ll need to actually deploy a model, so let’s jump right into it!
Starting simple: sentiment classification with scikit-learn
We’re big fans of the KISS principle, so naturally when we look to deploy a model we aim to do so in the simplest way possible. Therefore, we’ll start by writing a Python class to serve sentiment analysis requests using a simple pre-trained scikit-learn model (trained using the code from this blog post on building a simple sentiment classifier).
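A minimal sketch of what such a class might look like. `KeywordModel` is a hypothetical toy stand-in for the pre-trained scikit-learn model so that the example stays self-contained, and the `classify` helper and the `text` query parameter are assumptions rather than details of the original service.

```python
class KeywordModel:
    """Hypothetical stand-in for the pre-trained scikit-learn model.

    It mimics the scikit-learn `predict` convention: a list of inputs in,
    a list of integer classes out (1 = positive, 0 = negative).
    """

    POSITIVE_WORDS = {"good", "great", "love", "excellent"}

    def predict(self, texts):
        return [
            int(any(word in self.POSITIVE_WORDS for word in text.lower().split()))
            for text in texts
        ]


class SKLearnDeployment:
    def __init__(self):
        # A real service would load the trained model here (for example
        # with joblib); the toy model keeps this sketch self-contained.
        self.model = KeywordModel()

    def classify(self, text):
        # Translate the integer class (0 or 1) into a human-readable label.
        prediction = self.model.predict([text])[0]
        return "POSITIVE" if prediction == 1 else "NEGATIVE"

    def __call__(self, request):
        # Assumption: the text arrives as a `text` query parameter on a
        # Starlette-style request object.
        return self.classify(request.query_params["text"])
```

Keeping the prediction logic in `classify` means it can be exercised in a plain unit test, without an HTTP request or a running cluster.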
A couple of small notes about the previous code block. First, we’re leveraging core Python APIs, nothing special. You’ll notice that we define two methods: the __init__ method, which loads our trained model, and the __call__ method, which processes a given request from an API call.
What’s great about this is that there are virtually no expectations about the input format — it could be raw text (as we have here), a tensor, or anything else that’s convenient. In addition, we can run arbitrary code to process the input or output. In this case, we leverage this to translate the integer class (0 or 1) to a human-readable string.
Deploying the model using Ray Serve
Now we’ll add a couple of lines of code to deploy this class to our running Ray Serve instance. We’ll do this by making Ray Serve API calls in Python. This should be run in a Python script that also contains the SKLearnDeployment code defined above.
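A hedged sketch of that deployment step, again assuming the Ray 1.x Serve API, where a class is wrapped with `serve.deployment` and then deployed with `.deploy()` (later Ray releases replaced this call). The helper name `deploy_sentiment`, the deployment name, the route prefix, and the namespace are all assumptions made for illustration.

```python
def deploy_sentiment(deployment_cls, name="sentiment", route_prefix="/sentiment"):
    """Deploy `deployment_cls` to the running Serve instance (Ray 1.x API)."""
    # Imported here so that importing this module has no side effects.
    import ray
    from ray import serve

    # Connect to the cluster and namespace in which Serve was started.
    ray.init(address="auto", namespace="serve")

    # Wrap the plain Python class as a Serve deployment.
    deployment = serve.deployment(name=name, route_prefix=route_prefix)(
        deployment_cls
    )

    # Create the deployment, or update it if one with this name exists.
    deployment.deploy()
```

In a script that also defines the class, this would be invoked as `deploy_sentiment(SKLearnDeployment)`.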
Boom! We’ve got a Ray Serve application. All we’re doing is calling deploy on the class that we’ve defined.
Here’s what the architecture of our Serve application looks like after running the above script to deploy the initial serve deployment.
Testing the endpoint
Now we can test the endpoint by sending a request to it (the HTTP server runs at localhost:8000 by default). Here, we’re going to hit the endpoint using the requests Python library, but any other HTTP client would work as well.
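A minimal sketch of such a test script. The `/sentiment` route, the `text` query parameter, and the two helper names are assumptions; they have to match however the deployment was actually configured, and the sketch assumes the deployment returns the label as plain text.

```python
def format_result(text, label):
    """Build the line printed for one classified input."""
    return f"Result for '{text}': {label}"


def query_sentiment(text, url="http://localhost:8000/sentiment"):
    """Send `text` to the Serve endpoint and return the predicted label."""
    # Third-party HTTP client, imported lazily so that format_result can be
    # used on its own.
    import requests

    response = requests.get(url, params={"text": text}, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of hiding them
    return response.text
```

With the service running, `print(format_result(text, query_sentiment(text)))` would print one result line for the input `text`.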
Running the above script (test_request.py) gives the following result:
> python test_request.py
Result for ‘Ray Serve eases the pain of model serving’: NEGATIVE
While the model is up and running, it returns a completely unexpected answer: this sentiment should be positive, not negative!
Upgrading the model to Hugging Face pipelines with zero downtime
Our first model is running but we’ve noticed a serious issue with the model’s performance: it’s not catching the nuance of how much Ray Serve is easing our pain! Let’s try to upgrade to a better performing model.
Just like Ray Serve makes model serving simple, Hugging Face makes NLP simple! Let’s try to deploy one of their easy-to-use pre-trained PyTorch models instead (you can read more about Hugging Face Transformers here).
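A minimal sketch of a class wrapping the Hugging Face pipeline. The class name `HuggingFaceDeployment`, the optional `classifier` argument (added here so the class can be tested without downloading a model), and the `text` query parameter are assumptions for illustration.

```python
class HuggingFaceDeployment:
    def __init__(self, classifier=None):
        if classifier is None:
            # Building the default sentiment-analysis pipeline downloads a
            # pre-trained model the first time it runs.
            from transformers import pipeline

            classifier = pipeline("sentiment-analysis")
        self.classifier = classifier

    def classify(self, text):
        # The pipeline returns one dict per input with "label" and "score"
        # keys; only the label is passed back to the caller.
        return self.classifier(text)[0]["label"]

    def __call__(self, request):
        # Assumption: the text arrives as a `text` query parameter.
        return self.classify(request.query_params["text"])
```

Because the pipeline already returns string labels, no translation from integer classes is needed this time.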
Once again we just needed to define a simple class that wraps transformers.pipeline. Now all we need to do is deploy it.
After running the above script to update our deployment, it should look like the architecture above since we just replaced our existing deployment with an updated deployment.
Now we can query the endpoint just as before.
This time, it returns the following:
> python test_request.py
Result for ‘Ray Serve eases the pain of model serving’: POSITIVE
That’s more like it! Notice how easy it was to deploy a new version of our model: we just wrote a simple update script in Python and Ray Serve handled the rest.
On top of what we’ve discussed above, Ray Serve provides a number of other features out of the box.
- Here we only had a single replica of each model, but Ray Serve makes it easy to scale up the number of replicas, even across a cluster of machines (docs).
- Our deployment in this example only processed one request at a time, but inference for many ML models is more efficient in batches. Ray Serve comes with built-in batching support that makes it easy to process multiple requests in parallel (docs).
- Often, ML serving systems consist of multiple models composed into a pipeline or DAG. Ray Serve makes it easy to implement complex model compositions by implementing an “ensemble” deployment.
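The first two features in the list above can be sketched together. This assumes the Ray 1.x Serve API (`serve.batch`, `Deployment.options`, `.deploy()`); the names `classify_batch`, `deploy_batched`, and `BatchedSentiment`, the default sizes, and the `text` query parameter are hypothetical.

```python
def classify_batch(classifier, texts):
    """Classify a list of texts in one call and return one label per text.

    This is the list-in, list-out contract that a batched handler needs.
    """
    return [result["label"] for result in classifier(texts)]


def deploy_batched(classifier_factory, num_replicas=2, max_batch_size=8):
    """Deploy a batched, replicated sentiment service (Ray 1.x API assumed)."""
    # Imported here so classify_batch can be tested without Ray installed.
    import ray
    from ray import serve

    ray.init(address="auto", namespace="serve")

    @serve.deployment(name="sentiment", route_prefix="/sentiment")
    class BatchedSentiment:
        def __init__(self):
            # Each replica builds its own classifier.
            self.classifier = classifier_factory()

        @serve.batch(max_batch_size=max_batch_size)
        async def handle_batch(self, texts):
            # Serve collects concurrent single requests into `texts`.
            return classify_batch(self.classifier, texts)

        async def __call__(self, request):
            # Each caller passes one text and awaits its own result.
            return await self.handle_batch(request.query_params["text"])

    # Run several copies of the deployment behind the same route.
    BatchedSentiment.options(num_replicas=num_replicas).deploy()
```

Here `classifier_factory` would be any zero-argument callable that returns a classifier accepting a list of strings, such as a function that builds a Hugging Face pipeline.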
We’ve already seen a ton of excitement about Ray Serve in just a few short months. If you have any questions, feedback, or suggestions, please join our community through Discourse or Slack. If you would like to see how Ray is being used throughout industry, consider joining us at Ray Summit.