ML horoscope generation pipeline as a REST API using GPT, Transformers, FastAPI and GCP (Part 2)

Kamen Zhekov
7 min read · Dec 9, 2021

Hello again! If you’ve missed the first part of this series, you can read it by clicking on the box below. The first part covers the introduction to NLP and the overview of the project as a whole. It also covers preparing the data for training, an introduction to the technologies and tools used for fine-tuning the model, the training itself and the link to a small demo website that shows what kind of horoscopes can be generated with the end result.

This part will cover the model wrapper used for text pre-processing and post-processing, the API created with FastAPI and serving the API through the Google Cloud Platform.

If you wish to follow along with the steps and recreate your own version of the API, feel free to read on, but if you’re more interested in the whole code and a working version of the API, you can head to the project’s GitHub page.

The Model Wrapper

The idea behind the model wrapper is simple: we are using a PyTorch model and tokenizer for our predictions, but that’s all the model does: predictions. We need an interface that takes the input sent by the API and preprocesses it into the same format the model received its training sentences in.

It will also take care of post-processing, making the horoscopes look better by fixing capitalization and formatting and ironing out some of the model’s glitches.

First things first: loading the model that we trained in the previous part. The model is saved and loaded through PyTorch’s very convenient torch.save(…) and torch.load(…) functions.

The model wrapper will use a zodiac mapping, because the API will be sending us an integer that corresponds to the zodiac sign we want to generate a horoscope for. Later on, we will transform that integer to the correct zodiac sign and feed a specific input format to the model.

We won’t be training the model on the fly, so we can cache the loading function’s result, in case future functionality requires it to be called multiple times.

It’s important to check whether the server we’re using has a GPU available, as that makes a big difference to prediction speed. We’re of course not hosting on a GPU-ready server (because we want a free server for a proof of concept), but it’s always a good idea to be prepared for scaling your models, and using a GPU is one of the easy ways to do it.
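Put together, the loading step could look roughly like the sketch below. The file path, the tokenizer checkpoint and the exact mapping layout are assumptions on my part rather than the repository’s actual code.

```python
from functools import lru_cache

import torch
from transformers import GPT2Tokenizer

# Integer-to-sign mapping used by the API; Aries is index 0 (the fallback).
ZODIAC_MAPPING = {
    0: "aries", 1: "taurus", 2: "gemini", 3: "cancer",
    4: "leo", 5: "virgo", 6: "libra", 7: "scorpio",
    8: "sagittarius", 9: "capricorn", 10: "aquarius", 11: "pisces",
}

# Use the GPU if the server has one, otherwise fall back to the CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


@lru_cache(maxsize=1)
def load_model(model_path: str = "models/horoscope_model.pt"):
    """Load the fine-tuned model once and reuse the cached result afterwards."""
    # The model was saved with torch.save(...) in part 1, so torch.load(...)
    # restores the whole model object directly onto the chosen device.
    model = torch.load(model_path, map_location=DEVICE)
    model.eval()
    tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
    return model, tokenizer
```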

Next comes the post-processing, where we use regular expressions and standard string manipulation to format the output of the model correctly (a sketch of these steps follows the list below).

This includes:

  • Removing the zodiac that is at the start of every horoscope generated by the model, because the user already knows what sign they are
  • Normalizing capitalization, because sometimes the model capitalizes letters that should be lowercase, probably due to some bad examples in the training data
  • Making sure the punctuation is there when merging horoscopes, which helps when you want to output longer horoscopes instead of shorter ones
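A minimal sketch of those three clean-up steps, assuming the sign is passed in alongside the raw output (the actual regexes in the repository may differ):

```python
import re


def postprocess(raw_text: str, zodiac_sign: str) -> str:
    """Clean up a raw model output into a presentable horoscope."""
    text = raw_text.strip()

    # 1. Remove the leading "aries," (or similar) that starts every
    #    horoscope the model generates.
    text = re.sub(rf"^{zodiac_sign}\s*,?\s*", "", text, flags=re.IGNORECASE)

    # 2. Normalize capitalization: lowercase everything, then re-capitalize
    #    the first letter of each sentence.
    text = text.lower()
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

    # 3. Make sure the snippet ends with punctuation, so several generated
    #    horoscopes can be merged into a longer one cleanly.
    if text and text[-1] not in ".!?":
        text += "."

    return text
```

For example, postprocess("aries, today is A good Day", "aries") would come out as "Today is a good day."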

And finally, we generate the horoscopes, using the tokenizer to process our input, which follows the “{zodiac_sign}, {horoscope}” format defined during training.

The default behavior, if a wrong input or nothing is given to the model to generate the horoscope, is to generate one for the first zodiac sign in the mapping: Aries.

The model is then used to generate the horoscope with a few hand-tuned sampling parameters (top_k, max_length, top_p and temperature) specific to GPT-style text generation, which are very well explained in Hugging Face’s GPT-Neo repository.

In the end, the model’s output is decoded and post-processed, resulting in a pretty well-written horoscope, albeit probably wrong since our model can’t read and interpret star alignment (yet).
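Building on the helpers sketched above, the generation step could look roughly like this. The parameter values shown are illustrative defaults, not the exact ones tuned for the project:

```python
def generate_horoscope(zodiac_id: int) -> str:
    """Generate a post-processed horoscope for the given zodiac ID."""
    model, tokenizer = load_model()

    # Fall back to Aries (index 0) on a missing or invalid ID.
    zodiac_sign = ZODIAC_MAPPING.get(zodiac_id, ZODIAC_MAPPING[0])

    # Training sentences followed the "{zodiac_sign}, {horoscope}" format,
    # so the prompt is simply the sign followed by a comma.
    prompt = f"{zodiac_sign},"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(DEVICE)

    # Sampling parameters as described in Hugging Face's GPT-Neo docs.
    output = model.generate(
        input_ids,
        do_sample=True,
        max_length=100,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
    )

    raw_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return postprocess(raw_text, zodiac_sign)
```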

So, as you can see, the wrapper is pretty minimalistic and easy to implement!

The API

The API basically has one purpose: to serve as an intermediary between the website or application that needs the horoscopes and the model wrapper.

Its functionality is therefore very basic: it takes an integer corresponding to a zodiac sign ID and uses the model wrapper to generate a horoscope for it.

Most of that is taken care of by the FastAPI library, so we just need to define the URL to use for the predictions, the Request and Response types we will be using for the input and output, and set up the CORS Middleware so we can make requests through websites and apps that are not hosted on the same server as our API.
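A minimal version of that setup could look like the sketch below; the route name, field names and the wrapper import are my own choices for illustration, not necessarily the ones used in the repository.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

# Hypothetical module name for the model wrapper sketched earlier.
from model_wrapper import generate_horoscope

app = FastAPI(title="Horoscope API")

# Allow requests from websites and apps hosted on other origins.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


class HoroscopeRequest(BaseModel):
    zodiac_id: int


class HoroscopeResponse(BaseModel):
    horoscope: str


@app.post("/predict", response_model=HoroscopeResponse)
def predict(request: HoroscopeRequest) -> HoroscopeResponse:
    # All the heavy lifting happens in the model wrapper.
    return HoroscopeResponse(horoscope=generate_horoscope(request.zodiac_id))
```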

And that’s basically it! It’s a very simple API, since all of our functionality is taken care of by the model wrapper. If you want to fiddle around with it and try out the API, you can use the Swagger UI that FastAPI sets up automatically, hosted through the link below.
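For example, once deployed, calling the hypothetical /predict route from the sketch above is a single POST request:

```python
import requests

# Hypothetical Cloud Run URL -- replace it with your own deployment's URL.
response = requests.post(
    "https://horoscope-api-xyz-uc.a.run.app/predict",
    json={"zodiac_id": 3},
)
print(response.json()["horoscope"])
```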

Hosting it on GCP

Google Cloud Run is a fully managed platform that takes a Docker container image and runs it as a stateless, auto-scaling HTTP service. This means that we need to create a Docker image for our API, so that the deployment is as simple as possible.

The difference between Cloud Run and the first generation of serverless platforms, such as AWS Lambda, Google Cloud Functions or Azure Functions, is that it allows you to run arbitrary applications serving multiple endpoints, not just small functions with a specific interface. The older services wouldn’t be able to host a small ML model such as ours, but Cloud Run can (and for free!).

Once our API and model wrapper are packaged in a Docker container, all it takes is to:

  • Push the container image to Google Container Registry
  • Deploy it through the Cloud Run interface on GCP
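Assuming the Dockerfile described in the next section and a placeholder project ID, those two steps boil down to something like this with the Docker and gcloud CLIs:

```bash
# Build the image from the /api folder (Dockerfile described below)
docker build -t gcr.io/my-project-id/horoscope-api ./api

# Let Docker authenticate against Google's registries, then push the image
gcloud auth configure-docker
docker push gcr.io/my-project-id/horoscope-api
```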

Dockerfile

Creating the Dockerfile is very easy for what we’re trying to achieve. Docker has a huge amount of functionality, but we’re only going to do the very basic stuff for our API.

The Dockerfile (that’s what you should name the file itself, no extension) is located in the /api folder, and every line gives an instruction for how to containerize the application; a sketch of the full file follows the list below.

  1. Set the application’s programming language to Python 3.9
  2. Set the working directory to /code
  3. Copy the requirements.txt, which lists all the Python libraries the model depends on
  4. Install all the dependencies
  5. Copy everything to /code
  6. Run the API using Uvicorn, a lightning-fast ASGI server that is recommended for FastAPI implementations
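Put together, the Dockerfile could look roughly like this; the requirements file location and the main:app module path are assumptions based on the steps above:

```dockerfile
# 1. Base image with Python 3.9
FROM python:3.9

# 2. Work from /code inside the container
WORKDIR /code

# 3. Copy the dependency list first, so Docker can cache the install layer
COPY requirements.txt /code/requirements.txt

# 4. Install all the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# 5. Copy the rest of the application code to /code
COPY . /code

# 6. Serve the FastAPI app with Uvicorn on port 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```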

And that’s it! If you want to learn more about how powerful Docker is and why it’s used for millions of applications, you can check it out below.

Hosting it on Cloud Run

The first condition for hosting our API is met: we have a Docker container. Now let’s look at actually hosting it on GCP. We are going to use Cloud Run’s free tier, because we want to pay as little as possible for this proof of concept.

So far, the project has cost me around 7 cents (over 3 months), so I’m pretty happy with that. Bear in mind that even with the free tier, you need to set up an actual billing account. Click here for an overview of the free tier services offered through GCP.

Now, the best source to set up your Cloud Run instance is of course the tutorials made by Google, so if you’re not familiar with the service, go ahead and click on the link below. It will take you through the whole setup for Cloud Run. For me, the simplest way to do it was through the GCP Console, but if you prefer scripts or CLI, then you’re all covered in the tutorial below.

If you’re having trouble pushing your Docker image to GCP’s artifact storage (you’ll need to upload it there first), this guide really helped me out.

After your artifact is uploaded, you can use the following configuration for deploying the Docker container, and everything should work out of the box once the API has loaded.

First, choose your Docker image after uploading it to GCP Artifacts. Pay attention to the Region you are choosing: us-central1 is part of the free tier, but hosting in Europe, for example, isn’t.

We want the minimum number of instances set to 0, which makes the API less responsive (it has to cold-start), but you pay far less for memory and CPU usage, since they’re allocated only when a request comes in.

Here, we set up the port we are going to use for the API requests. Previously, we configured it as port 80, so we will be using that one. A more common choice is 8080, or pretty much anything you want, but it has to match the port in the Dockerfile’s uvicorn command.

We only need 1 CPU, since this is a proof of concept, and 2 GiB of RAM is more than enough to spin up the container. It might work with 1 GiB, but better safe than sorry!

The rest of the parameters can be left to their default values, and your API should run without a problem.
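For reference, roughly the same configuration can be expressed through the gcloud CLI (service name, project ID and image name are placeholders):

```bash
gcloud run deploy horoscope-api \
  --image gcr.io/my-project-id/horoscope-api \
  --region us-central1 \
  --port 80 \
  --min-instances 0 \
  --cpu 1 \
  --memory 2Gi \
  --allow-unauthenticated
```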

Thank you for reading, and have a nice day 😊


Kamen Zhekov

I am a Python Engineer with experience in ML Engineering, full stack and API architectures. Currently, I am working with ACA Group's amazing Python team!