Deploy a PyTorch Model with Flask on GCP Vertex AI

A full guided example of deploying a PyTorch model on GCP

Fabio Chiusano
NLPlanet
8 min read · Jun 28, 2022


Hello fellow NLP enthusiasts! In this article we see how to deploy a PyTorch model in the cloud so that anyone can make requests to it! We'll use GCP specifically, but a similar process can be followed to deploy models on other cloud services as well (e.g. AWS SageMaker). Enjoy! 😄

First, a small clarification: The Google Cloud Platform (GCP) service Vertex AI is the next generation of AI Platform, with many new features.

These are the steps that we are going to see in this article to deploy a PyTorch model on Vertex AI:

  1. Train and save a PyTorch model
  2. Write the code of the Flask app
  3. Containerize the Flask app with Docker
  4. Test the Flask container locally
  5. Push the Flask image to GCP Container Registry
  6. Deploy the Flask container to GCP Vertex AI
  7. Test the deployed container

Let’s start!

Train and Save a PyTorch Model

To make things simpler, let's consider the case where we have trained a PyTorch model locally and we simply want to deploy it in the cloud.

So, we have trained a Convolutional Neural Network (CNN) to recognize images from CIFAR-10. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes (i.e. airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 6,000 images per class. There are 50,000 training images and 10,000 test images.

This is how our CNN is structured.
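A minimal sketch of such a ConvNet for 32x32 RGB inputs, in the style of the standard PyTorch CIFAR-10 tutorial (the exact architecture in your training script may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3 input channels (RGB), 6 output channels, 5x5 kernel
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 16 feature maps of 5x5 after two conv+pool stages
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # 10 CIFAR-10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```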

Then, we save the network state dictionary, which we’ll later load from our inference service.
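Assuming the trained network lives in a variable named net, saving the state dictionary is a one-liner (the model/model.pth path is just the layout we'll use in the next section):

```python
# Persist only the learned weights, not the whole pickled model object.
torch.save(net.state_dict(), "model/model.pth")
```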

We could also save the whole model with torch, but then we would have to manage its loading in the inference service keeping in mind that torch uses pickle under the hood, so we would need to provide the definition of the ConvNet class in the same module where it was defined in the training script. This is doable, but the simplest way to manage it is to pickle only the state dictionary (which can always be unpickled without such dependencies) and then load it into an instance of the ConvNet model in the inference service. We'll see later how to do it.

Write the Code of the Flask App

Let’s build the directory structure of the Flask app that will perform inference using the trained model. We create an app directory with a main.py file containing the main logic of the app and a model directory containing the trained model. We also add a requirements.txt file containing the required Python libraries.

How your directories and files should look.
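In text form, the layout looks roughly like this (the model.pth filename is illustrative):

```
.
├── app/
│   ├── main.py
│   └── model/
│       └── model.pth
└── requirements.txt
```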

Here follows the content of the requirements.txt file. Notice that we are installing torch and torchvision from a specific index URL, as suggested by the PyTorch website for installing these libraries with CPU support only. Remove those URLs to install PyTorch with GPU support as well.
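A plausible requirements.txt along these lines (version pins are omitted here; the exact index URL may differ in your setup):

```text
flask
gunicorn
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchvision
```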

Now let’s fill the main.py file. We import the necessary libraries and create a Flask app.
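Something along these lines (numpy is included because the request payload arrives as nested lists that we'll convert back into arrays):

```python
import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from flask import Flask, request, jsonify

app = Flask(__name__)
```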

Then, we add the ConvNet class definition, exactly as we did in our training scripts.

Next, we load the model and instantiate some global variables, such as device (containing the “cuda” string if a GPU is available, “cpu” otherwise), transform (the preprocessing applied to the inputs of our trained model, exactly the same preprocessing used during training), and classes (containing string labels for the numerical classes predicted by the model). If you are interested, you can find the full training code in this article that explains how to train PyTorch models on GCP, but it’s not necessary for understanding the rest of this guide.
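A sketch of this setup; the normalization constants and the model path are assumptions that must match your own training script and directory layout:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the trained weights into a fresh ConvNet instance.
MODEL_PATH = os.path.join(os.path.dirname(__file__), "model", "model.pth")
model = ConvNet()
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.to(device)
model.eval()

# Same preprocessing used during training (these values match the standard
# PyTorch CIFAR-10 tutorial; use whatever your training script used).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

classes = ("airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")
```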

Now, as explained in the custom container requirements article of Vertex AI, we need to define two routes:

  • A route that signals the health of the Flask app and returns a response with status code 200 if the app is ready for inference. We call this route /isalive.
  • A route for predictions on new images. We call this route /predict.

Here is the definition of our /isalive route.
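A minimal version of it:

```python
@app.route("/isalive")
def isalive():
    # Vertex AI periodically calls this route; an empty 200 response means "ready".
    return "", 200
```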

As specified by the aforementioned requirements, the /predict route must manage POST requests with a JSON body of the form { "instances": [<sample_dict>, ...]}. A <sample_dict> is a dictionary containing the data of one sample, which in our case takes the form { "image": [...]}, i.e. it contains Python lists of integers describing an RGB image.

As a response, the /predict route must return a JSON object with a predictions key whose value is the list of predictions, one per sample. The body of the function contains the logic for going from input data to predictions: (1) apply the preprocessing transform, (2) use the trained model to predict the labels, and (3) apply some postprocessing to convert numerical labels into meaningful texts.
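A sketch of the route, assuming each "image" is a 32x32x3 nested list of integers in the 0-255 range:

```python
@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json()
    predictions = []
    for instance in body["instances"]:
        # (1) Preprocessing: rebuild the image array and apply the training transform.
        image = np.asarray(instance["image"], dtype=np.uint8)  # assumed shape (32, 32, 3)
        x = transform(image).unsqueeze(0).to(device)
        # (2) Prediction: pick the class with the highest score.
        with torch.no_grad():
            outputs = model(x)
            predicted = outputs.argmax(dim=1).item()
        # (3) Postprocessing: map the numeric label to a human-readable class name.
        predictions.append(classes[predicted])
    return jsonify({"predictions": predictions})
```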

Last, we start the Flask app.
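When the file is executed directly we just run the development server (inside the container, gunicorn will import app instead). Port 8080 is an assumption here; it has to match the Dockerfile and the Vertex AI settings later on:

```python
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=True)
```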

Containerize the Flask App

Vertex AI creates a prediction service by creating a Docker container from a Docker image deployed on GCP. Therefore, our next step is to create a Docker image of our Flask app and test it locally.

Add a Dockerfile to the repository, so that its structure looks as follows.

How your directories and files should look.

Here follows the content of the Dockerfile. We inherit from a Python Docker image, install the libraries listed in the requirements.txt file, copy the app directory containing both the code and the model, and launch a gunicorn server.
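A Dockerfile in that spirit might look like this (the base image, port, and gunicorn options are assumptions, not necessarily the exact file used here):

```dockerfile
FROM python:3.9-slim

WORKDIR /usr/src

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ ./app/

# Serve the Flask app defined in app/main.py on port 8080.
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "app.main:app"]
```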

We are now ready to test our Flask app!

Test the Flask Container Locally

Let’s build a Docker image tagged my-docker-api and run a local container from it.
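Assuming the container listens on port 8080 as in the Dockerfile sketch above:

```bash
docker build -t my-docker-api .
docker run --rm -p 8080:8080 my-docker-api
```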

You should see the following lines in your terminal, which indicate that the app has successfully started.

The Flask app is running inside the Docker container.

The app is now running, so let’s write some code to test it. For simplicity, I’ve launched a new notebook with Anaconda that I’ll use only for these tests.

Let’s import some libraries and download the CIFAR10 test dataset.
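For example (download=True fetches the dataset on the first run; no tensor transform is applied because we want raw images to send over HTTP):

```python
import requests
import numpy as np
import matplotlib.pyplot as plt
import torchvision

testset = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)
```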

We select a sample from the test dataset and plot it with matplotlib. The selected image is a ship!
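Something like this; the sample index is arbitrary, so pick any image whose label you want to check:

```python
image, label = testset[1]  # a PIL image and its numeric label
plt.imshow(image)
plt.show()

# The app expects plain nested lists of integers, not a PIL image or a tensor.
image_list = np.asarray(image).tolist()
```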

Visualization of a sample from the CIFAR-10 test set representing a ship. Plot made with matplotlib.

Let’s first test the /isalive route and check if it answers as expected.
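With the container running locally on port 8080:

```python
response = requests.get("http://localhost:8080/isalive")
print(response.status_code)  # expecting 200
```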

Good! Now we test the /predict route with a POST request containing only one sample, i.e. the ship image.
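The request body follows the { "instances": [...]} format required by Vertex AI:

```python
payload = {"instances": [{"image": image_list}]}
response = requests.post("http://localhost:8080/predict", json=payload)
print(response.json())  # e.g. {"predictions": ["ship"]}
```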

The app answered with ship, great! Everything works as expected and we are ready to deploy our app to the cloud.

Push the Flask Image to GCP Container Registry

Next, we need to push our my-docker-api image to the Container Registry of our GCP project. Container Registry is simply a service that stores and manages Docker images securely and privately on a specific GCP project.

First, we activate the Container Registry API. Then, we configure our local gcloud CLI so that our local Docker installation can authenticate to GCP. Read this guide to install and configure gcloud on your computer.
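The configuration step is a single command, which registers gcloud as a credential helper for Docker:

```bash
gcloud auth configure-docker
```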

Now we just need to (1) build our Docker image with a tag of the form gcr.io/<YOUR-PROJECT-ID>/<IMAGE-TAG> and (2) push the image to the Container Registry. In the following commands, substitute your own project ID.
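With <YOUR-PROJECT-ID> as a placeholder for your own GCP project ID:

```bash
docker build -t gcr.io/<YOUR-PROJECT-ID>/my-docker-api .
docker push gcr.io/<YOUR-PROJECT-ID>/my-docker-api
```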

If everything works as expected, you should see the my-docker-api image inside Container Registry.

Container Registry showing a private my-docker-api image.

Deploy the Flask Container to GCP Vertex AI

Let’s head back to Vertex AI, open the “Models” section, and click the “Import” button.

Vertex AI Models section.

We are going to create a “Model” resource in Vertex AI by importing our Docker image from Container Registry. Select “Import as new model” and give it a name, such as “cifar10”.

Creation of a Vertex AI model.

In the “Model settings” section, select “Import an existing custom container” and select the my-docker-api image. Notice that we are not selecting “Import model artifacts into a new pre-built container” because Vertex AI has only pre-built containers for TensorFlow, sklearn, and xgboost models, and therefore we need to provide a custom container for PyTorch.

Creation of a Vertex AI model.

Then, we specify the prediction route, the health route, and the port that our Flask app listens on.

Creation of a Vertex AI model.

Leave all the rest with the default values and proceed to create the model. It may take some minutes.

Vertex AI Models section. There’s a new imported model named “cifar10”.

Now that we have created a model resource, we need to deploy it to an endpoint. While a model is a generic resource that indicates how to load the PyTorch model and how to interact with it, an endpoint is like a virtual machine with specific hardware specs to which one or more models can be deployed.

Click on the three dots at the right of the created model and select “Deploy to endpoint”. Then, select “Create new endpoint” and give it a name.

Creation of a Vertex AI endpoint.

Next, specify the hardware specs for the endpoint. You may also add a GPU as an accelerator. If so, remember to modify the requirements.txt file to install PyTorch with GPU support and to build and push the Docker image again.

Creation of a Vertex AI endpoint.

After some minutes, on the Vertex AI Endpoints page you’ll see the new endpoint with “Active” status!

Vertex AI Endpoint section. There’s a new active endpoint named “cifar10-endpoint”.

Test the Deployed Container

We are almost done! Let’s make a call to our newly deployed model and receive a prediction.

First, we need the permissions to make calls to our Vertex AI endpoint. Create a new service account, give it the “Vertex AI User” role, create a JSON key and download it. Read this guide to learn more about managing service accounts.

Then, head back to the “Vertex AI Endpoints” page and click on “Sample Request” in the row of the active endpoint. Open the “Python” tab and you’ll find some boilerplate code to copy and paste for making requests to the endpoint.

On a local Jupyter notebook, let’s install the google-cloud-aiplatform library and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON service account file that we created earlier.
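For example (the key path is a placeholder for wherever you saved the service account JSON file):

```python
# In a notebook cell, install the client library first:
#   !pip install google-cloud-aiplatform

import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```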

Then, we copy and paste the boilerplate prediction code from GCP.

Last, we use that code to make a sample request, to which the deployed model answers with ship as expected!
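GCP's boilerplate may look slightly different; an equivalent, shorter sketch using the high-level aiplatform SDK is shown below (project, region, and endpoint ID are placeholders taken from the Sample Request panel):

```python
from google.cloud import aiplatform

aiplatform.init(project="<YOUR-PROJECT-ID>", location="<YOUR-REGION>")
endpoint = aiplatform.Endpoint("<YOUR-ENDPOINT-ID>")

# Same payload format used for the local test.
response = endpoint.predict(instances=[{"image": image_list}])
print(response.predictions)  # e.g. ["ship"]
```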

Next steps

Possible next steps are:

Thank you for reading! If you are interested in learning more about NLP and Data Science, remember to follow NLPlanet on Medium, LinkedIn, Twitter, and join our new Discord server!
