Using PyTorch Models in Production with Cortex

Caleb Kaiser
Published in PyTorch
Dec 19, 2019 · 4 min read

This post has been authored by Caleb Kaiser, a team member at Cortex.

This year was, among other things, the year PyTorch — only three years old at this point — became the most popular machine learning (ML) framework for researchers.

The Pythonic feel of the framework, the gentleness of its learning curve, and its emphasis on fast and easy prototyping have made PyTorch the clear favorite of researchers. As a result, it is powering some of the coolest projects in machine learning:

  • Transformers, the wildly popular natural language processing (NLP) library produced by Hugging Face, is built on PyTorch.
  • Selene, the cutting edge ML library for biology, is built on PyTorch.
  • CrypTen, the hot new privacy-focused machine learning framework, is built on PyTorch.

In virtually any field of ML, from computer vision to NLP to computational biology, you will find PyTorch powering experiments on the cutting edge.

The natural question, however, is how you can take these experiments and incorporate them into software. How do you go from the “Cross-lingual Language Model” to Google Translate?

In this blog post, we’re going to look at what it means to use PyTorch models in production, and then introduce an approach that will allow you to deploy any PyTorch model for use within your software.

What does it mean to use PyTorch in production?

Running machine learning in production can mean different things depending on your production setting. Generally, there are two categories of design patterns for machine learning in production:

  • Hosting an inference server that serves predictions via an API. This is the standard approach used in common software development — i.e. not mobile software or standalone devices.
  • Embedding your model into your application directly. This is typically used in robotics and standalone devices, as well as sometimes in mobile applications.

If you are going to be directly embedding your model into your application, you should look into PyTorch’s TorchScript. Using just-in-time compilation, PyTorch can compile Python down to TorchScript, which runs without a Python interpreter—very useful for resource-constrained deployment targets like mobile devices.
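As a quick illustration, here is a minimal sketch of tracing a model to TorchScript. The ResNet-18 is just a stand-in model, not anything from this post:

import torch
import torchvision

# Stand-in model: any nn.Module can be traced or scripted the same way.
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Trace the model with an example input to produce a TorchScript module.
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# The saved module can be loaded and run from C++ or mobile runtimes,
# with no Python interpreter required.
traced_model.save("resnet18_traced.pt")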

In most cases, you will be using a model server. Many of the applications of ML you see today—from the recommendation engines behind your favorite streaming services to the autocomplete feature in your online search bar—rely on this form of deployment, and more specifically, on real-time inference.

In real-time inference, a model is typically deployed as a microservice (often a JSON API) through which a piece of software can query the model and receive predictions.
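To make the pattern concrete, here is a rough sketch of that kind of microservice using Flask. This is only to illustrate the shape of the API; Cortex, introduced below, handles this plumbing for you, and the model and request format here are placeholders:

import torch
import torchvision
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a stand-in model once at startup; a real service would load your own model.
model = torchvision.models.resnet18(pretrained=True)
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"input": [[...]]} holding the model's input tensor.
    payload = request.get_json()
    inputs = torch.tensor(payload["input"], dtype=torch.float32)
    with torch.no_grad():
        outputs = model(inputs)
    return jsonify({"prediction": outputs.argmax(dim=1).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)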

Let’s take a look at Facebook AI’s RoBERTa, a leading NLP model, as an example. It serves inferences by analyzing a sentence in which a word is removed (or “masked”), and guessing what the masked word is. For example, if you were to use a pre-trained RoBERTa model to guess the next word in a sentence, the Python method you’d use to serve inference is this simple:

roberta.fill_mask(input_text + " <mask>")
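For reference, the roberta object used above can be loaded from PyTorch Hub via fairseq. A minimal sketch (the exact Hub identifier and fill_mask arguments may vary between fairseq releases):

import torch

# Load a pretrained RoBERTa model from PyTorch Hub (fairseq).
roberta = torch.hub.load("pytorch/fairseq", "roberta.large")
roberta.eval()

input_text = "The weather today is very"
# fill_mask returns ranked candidate completions for the masked token.
predictions = roberta.fill_mask(input_text + " <mask>", topk=3)
print(predictions)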

Predicting the missing word in a sequence, as it turns out, is the exact functionality behind features like autocomplete.

To implement autocompletion within your application, you would deploy RoBERTa as a JSON API, and then from within your application, query your RoBERTa endpoint with your user’s input.
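On the application side, that query can be as simple as an HTTP request. A sketch, where the endpoint URL and response format are placeholders rather than anything Cortex-specific:

import requests

# Placeholder URL: in practice, this is the endpoint your deployment exposes.
API_URL = "https://api.example.com/roberta"

def autocomplete(user_input: str) -> str:
    # Send the user's partial sentence to the model API and return the
    # predicted next word from the JSON response.
    response = requests.post(API_URL, json={"text": user_input})
    response.raise_for_status()
    return response.json()["prediction"]

print(autocomplete("The weather today is very"))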

Setting up a JSON API sounds fairly trivial, but deploying a model as a microservice actually requires a good amount of infrastructure work.

You need to autoscale to handle fluctuations in traffic. You need to monitor your predictions. You need to handle model updates. You need to figure out logging. It’s a lot.

The question, then, is how do you deploy RoBERTa as a JSON API without hand-rolling all of this custom infrastructure?

Taking PyTorch models to production with Cortex

You can automate most of the infrastructure work required to deploy PyTorch models using Cortex, an open source tool for deploying models as APIs on AWS.

This article isn’t intended to be a full guide to using Cortex (you can find that here), but at a high level, all you need is the following (a rough sketch follows the list):

  • a Python script to serve inferences
  • a config file to define your API
  • the Cortex CLI to launch your deployment
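The inference script is ordinarily a small Python class that loads the model once and exposes a predict method. The class name, method signatures, and config fields vary between Cortex versions, so treat this as an outline rather than a copy-paste example:

# predictor.py (sketch): loads RoBERTa once, then answers each request.
import torch

class PythonPredictor:
    def __init__(self, config):
        # Load the model when the API starts, not on every request.
        self.roberta = torch.hub.load("pytorch/fairseq", "roberta.large")
        self.roberta.eval()

    def predict(self, payload):
        # payload is the parsed JSON body of the incoming request.
        text = payload["text"]
        predictions = self.roberta.fill_mask(text + " <mask>", topk=1)
        # fill_mask returns ranked candidates; return the top one.
        return predictions[0]

The config file then points Cortex at this script and names the API, and running the Cortex CLI's deploy command pushes the whole thing to AWS.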

You can see all of the files and tools listed above in the GIF below, which demonstrates how to use Cortex to deploy RoBERTa:

A brief demonstration of deploying RoBERTa with Cortex

And this approach isn’t limited to RoBERTa.

Want to automatically generate alt text for your images to make your site more accessible? You can deploy an AlexNet model to label your images using PyTorch and Cortex (and here’s the code).
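The linked example contains the full deployment, but the inference step itself looks roughly like this sketch, using torchvision's pretrained AlexNet (the file path is a placeholder):

import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Pretrained AlexNet from torchvision, used as an image labeler.
model = torchvision.models.alexnet(pretrained=True)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg")  # placeholder path
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)

# argmax gives an index into the ImageNet class list, which maps to a label.
print(logits.argmax(dim=1).item())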

How about a language classifier, like the one Chrome uses to detect when a page isn’t written in your default language? fastText is the perfect model for the job, and you can deploy it with PyTorch and Cortex like this.
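As a sketch of the model itself (separate from the Cortex deployment linked above), fastText's pretrained language-identification model can be used like this; the lid.176.bin file is downloaded separately from fasttext.cc:

import fasttext

# Pretrained language-identification model (downloaded from fasttext.cc).
model = fasttext.load_model("lid.176.bin")

# predict returns labels like "__label__fr" along with their probabilities.
labels, probabilities = model.predict("Je voudrais un café, s'il vous plaît.")
print(labels[0], probabilities[0])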

Using Cortex, you can add any number of PyTorch-powered ML features to your applications, all served through real-time inference.

PyTorch in production

There are over 25 research models stored in PyTorch Hub, ranging in focus from NLP to computer vision. All of them can be deployed with Cortex, using the same process we’ve just demonstrated.
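Loading any of them follows the same torch.hub pattern. A quick sketch using the pytorch/vision repo (any Hub repo works the same way):

import torch

# List the models a Hub repo publishes, then load one of them.
print(torch.hub.list("pytorch/vision"))

model = torch.hub.load("pytorch/vision", "resnet50", pretrained=True)
model.eval()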

The PyTorch team no doubt has more production-focused features on its roadmap, but judging by the progress made so far, the old notion that PyTorch isn’t a framework built for production is already outdated.

PyTorch is production-ready.
