Simple end-to-end ML: From Selection to Serving with Ray Tune and Ray Serve

Kai Fricke
Distributed Computing with Ray
6 min read · Aug 31, 2020

Let’s be real, in today’s ecosystem going from a model on your laptop to scaling in production is a gauntlet. You’ll spend countless hours cleaning data, designing the model, and tuning your parameters only to have a brand new set of challenges when it comes to serving. You’ll want to monitor the application load, update models in real time (as well as test out new versions), and switch between model versions (and frameworks) with ease.

To do all of this is challenging to say the least, especially because there’s so much complexity in the ecosystem. When it comes to what’s important, there are only two things that truly matter.

  1. Does your model do its job predicting or classifying for your end users?
  2. Can you make these predictions or classifications available to (lots of) end users?

In this post, we’ll cover how you can go end-to-end with two scalable libraries:

  • Ray Tune for hyperparameter tuning and model selection
  • Ray Serve for scalable model serving

Let’s go ahead and get started!

Step 1: Get the best model you can!

Machine learning models require you to choose hyperparameters, and these can have a huge effect on how well your model performs. Hyperparameters can concern optimizer properties, like the learning rate or momentum, model parameters, like the dropout rate, or the model complexity, like the maximum depth of a decision tree or the layer sizes of a neural network.

Ray Tune allows you to use automatic hyperparameter tuning to find the optimal values for these parameters. This part is also called model selection, because we aim to select the best model we can get.

We’ll run through this with the MNIST dataset, just to keep things nice and simple.

Using Ray Tune to get the parameters

As we mentioned above, Ray Tune provides a simple way to scale your ML tuning by enabling you to run on a cluster of machines (or multiple GPUs) with little effort.
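If you already have a Ray cluster running (for example, one started with the Ray cluster launcher or with ray start), pointing your tuning script at it is a one-line change. A minimal sketch:

import ray

# Connect to a running Ray cluster; omit the address argument to run
# everything on the local machine instead. Once connected, tune.run()
# distributes trials across whatever CPUs and GPUs the cluster provides.
ray.init(address="auto")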

Running a hyperparameter search with Ray Tune is as simple as defining a search space and calling tune.run(). Of course, you can also use PyTorch Lightning or other libs as well.

Let’s take a look at what that looks like:

from ray import tune

# Define the search space for the hyperparameters.
config = {
    "batch_size": tune.choice([16, 32, 64]),
    "layer_size": tune.choice([32, 64, 128, 192]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "momentum": tune.uniform(0.1, 0.9)
}

# Sample 10 configurations and save a checkpoint at the end of each trial.
analysis = tune.run(
    train_mnist,
    config=config,
    num_samples=10,
    checkpoint_at_end=True)

# Retrieve the trial (and its checkpoint) with the highest accuracy.
best_trial = analysis.get_best_trial("mean_accuracy", "max")
best_checkpoint = best_trial.checkpoint.value

See the full code for the train_mnist function and running tune here.

Ray Tune is backed by a ton of cutting-edge research and supports features like early stopping, which terminates poorly performing trials to save resources.
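For example, one way to enable early stopping (not shown above) is to pass a trial scheduler such as ASHA to tune.run(). A minimal sketch, reusing the train_mnist function and config from the previous snippet:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

# ASHA terminates underperforming trials early, freeing resources for
# more promising hyperparameter configurations.
analysis = tune.run(
    train_mnist,
    config=config,
    num_samples=10,
    scheduler=ASHAScheduler(metric="mean_accuracy", mode="max"),
    checkpoint_at_end=True)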

Step 2: Put that Model into Production

Frequently when it comes to putting a model into production, you’re going to have to use an entirely new tool or ecosystem of tools. This is painful at best. Because Ray is built to be a complete framework for production applications, it has all the tools that you’re going to need to train and subsequently put models into production.

But what is production really?

Different teams and use cases have different production requirements. Some models need to be updated weekly and predictions can be made offline while other models need to be 100% online and updated in real time, all the time.

Luckily Ray Serve supports both of these use cases out of the box. In this tutorial, we’ll focus on how to serve over a scalable HTTP endpoint. Ray Serve does all the work of:

  • Scaling our model over the cluster
  • Giving us an easy way to upgrade the model over time
  • Enabling us to scale up and down

Let’s now get started with Ray Serve. There are two key concepts in Serve: endpoints and backends.

To create a Ray Serve backend, we define a simple callable class that loads our model checkpoint and handles incoming requests:

import os

import torch


class MNISTBackend:
    def __init__(self, checkpoint_dir, config, use_gpu=False):
        self.checkpoint_dir = checkpoint_dir
        # config holds the hyperparameters of the tuned model (e.g. layer_size)
        self.config = config
        use_cuda = use_gpu and torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")

        # ConvNet is our PyTorch neural network model
        model = ConvNet(
            layer_size=self.config["layer_size"]).to(self.device)

        # Restore the weights from the Ray Tune checkpoint.
        model_state, optimizer_state = torch.load(
            os.path.join(self.checkpoint_dir, "checkpoint"),
            map_location=self.device)
        model.load_state_dict(model_state)
        model.eval()

        self.model = model

    def __call__(self, flask_request):
        """Take an `images` list as input, output predicted digits"""
        images = torch.tensor(flask_request.json["images"])
        images = images.to(self.device)
        # Run inference without tracking gradients.
        with torch.no_grad():
            outputs = self.model(images)
        predicted = torch.max(outputs, 1)[1]
        return {"result": predicted.cpu().numpy().tolist()}

In our example the model checkpoint is stored locally, but we could also use a distributed storage system like S3. Now that we have defined the model loading and prediction code, it’s time to create a Serve backend to execute it and a Serve endpoint to make it available over HTTP.

from ray import serve

serve.create_backend(
    "mnist:0", MNISTBackend, best_checkpoint, best_trial.config, True)
serve.create_endpoint(
    "mnist", backend="mnist:0", route="/mnist", methods=["POST"])

The backend identifier mnist:0 is arbitrary, but we use it to indicate the first version of this model.
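Upgrading to a newer model later on follows the same pattern: register the retrained model as a second backend and shift traffic over to it. Here is a sketch assuming a later tuning run produced new_checkpoint and new_config, and using Serve’s traffic-splitting API from this release:

# Register the retrained model as version 1 of the MNIST backend.
serve.create_backend("mnist:1", MNISTBackend, new_checkpoint, new_config, True)

# Route all traffic for the "mnist" endpoint to the new version. A split
# like {"mnist:0": 0.9, "mnist:1": 0.1} would let us A/B test it first.
serve.set_traffic("mnist", {"mnist:1": 1.0})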

These two calls have started an HTTP service on our local machine, and your new endpoint is now available at http://localhost:8000/mnist! Although this simple example took only a few lines of code, Ray Serve comes with out-of-the-box support for scalability, batching, and more. Check out these tutorials if you would like to learn more!
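As a quick sanity check, you can query the endpoint with any HTTP client. A sketch using the requests library; the exact input shape depends on how your ConvNet and train_mnist preprocess images, so the flattened 28x28 batch below is just an assumption:

import requests

# A batch containing a single all-zero "image"; in practice you would
# send real, normalized MNIST pixel values in the shape your model expects.
payload = {"images": [[0.0] * (28 * 28)]}

response = requests.post("http://localhost:8000/mnist", json=payload)
print(response.json())  # e.g. {"result": [3]}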

Extending to Online Learning

The only thing left to do now is to turn this process into a slightly more principled workflow. We do this in the tutorial with functions that wrap the model selection and model serving parts.

In some settings (and in most toy examples) you have all the data you would ever want your model to train on available up front. In the real world, you usually receive new data periodically and would like to use it to make your models better.

When new data arrives, we have two choices: We can either train our whole model from scratch with all data (including the new data), or we can continue to train an existing model with new data only.

In our example we employ a hybrid approach: on the first day we train a new model from scratch, on each of the next six days we continue to train our existing best model with only the new data, and every seventh day we again train a new model completely from scratch. If we assume that training from scratch costs more resources than training incrementally, this limits the amount of resources we need each week.
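In code, choosing between the two training modes boils down to a check on the day index. A minimal sketch; the helper functions mirror those used in the tutorial script, but their exact signatures here are assumptions:

def train_for_day(day, model_dir, num_samples, num_epochs, gpus_per_trial):
    """Retrain from scratch once a week, otherwise fine-tune the served model."""
    if day % 7 == 0:
        # Weekly full retraining on all of the data accumulated so far.
        acc, config, checkpoint, _ = tune_from_scratch(
            num_samples, num_epochs, gpus_per_trial, day=day)
    else:
        # Daily incremental training: continue from the currently served
        # model, using only the data that arrived since the last run.
        old_checkpoint, old_config, old_acc = get_current_model(model_dir)
        acc, config, checkpoint, _ = tune_from_existing(
            old_checkpoint, old_config, num_samples, num_epochs,
            gpus_per_trial, day=day)
    return acc, config, checkpoint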

Our example uses the MNIST dataset and simulates that new data arrives each day. We then call our training script like this:

# Train with all data available at day 0
python tune-serve-integration-mnist.py --from_scratch --day 0
# Train with data arriving between day 0 and day 1
python tune-serve-integration-mnist.py --from_existing --day 1
...
# Retrain from scratch every 7th day:
python tune-serve-integration-mnist.py --from_scratch --day 7

See here for the implementation of this script.

Building the workflow is then really just gluing the pieces together. Here is a condensed example for training the existing model with new data and serving the new model:

old_checkpoint, old_config, old_acc = get_current_model(model_dir)

acc, config, best_checkpoint, num_examples = tune_from_existing(
    old_checkpoint,
    old_config,
    num_samples,
    num_epochs,
    gpus_per_trial,
    day=args.day)

serve_new_model(
    model_dir, best_checkpoint, config, acc, args.day, serve_gpu)

Ray Tune and Ray Serve go great together: serving your tuned models is just a matter of a few lines of code. Hopefully this blog post gave you an idea of how to build your own model training and serving pipeline!

Be sure to check out the full tutorial, and if you’re using Ray Tune, Ray Serve, or both in production, please let us know! We would like to hear about what you’ve built! If you would like to learn more about how these libraries are being used, please consider attending Ray Summit!
