Published in Analytics Vidhya

Hosting your Text Generating Infilling Micro-Service with FastAPI on GCP

This micro-service generates text that fills in missing words in a given context, a task known as infilling. Researchers from Stanford addressed this task, and I made their work available through an Application Programming Interface (API).

This article explains how to deploy an NLP architecture trained for infilling, which generates different infilling texts for a given context. The article is structured into a demo of the model, an explanation of how to deploy it, and a quick dive into the underlying code concepts.

Auto-generated infilling by the deep neural net. These are suggestions for the _ slot.

1. Testing the Infilling Micro-Service

With this micro-service, you can generate infillings for a specific context by simply adding a ‘_’ token where text is missing. An infilling can be a single word, an n-gram (multiple words), or a whole sentence.

Text infilling is the task of predicting missing spans of text which are consistent with the preceding and subsequent text.
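To make the task concrete, here is a minimal sketch of how a context with ‘_’ blanks and a list of predicted spans combine into the final text. The function name fill_blanks is hypothetical and is not taken from the service's code, and the spans below are written by hand rather than produced by a model.

```python
from typing import List

BLANK = "_"


def fill_blanks(context: str, spans: List[str]) -> str:
    """Replace each blank in `context`, left to right, with the matching span."""
    pieces = context.split(BLANK)
    if len(pieces) - 1 != len(spans):
        raise ValueError(f"expected {len(pieces) - 1} spans, got {len(spans)}")
    result = [pieces[0]]
    for span, tail in zip(spans, pieces[1:]):
        result.append(span)  # the infilled text
        result.append(tail)  # the original text that follows the blank
    return "".join(result)


if __name__ == "__main__":
    # Hand-written spans standing in for model predictions.
    print(fill_blanks("She ate _ for _.", ["leftover pasta", "lunch"]))
```

An infilling model has to pick spans that fit both the text before and the text after each blank, which is what distinguishes the task from ordinary left-to-right text generation.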

This micro-service is based on the Infilling by Language Modeling (ILM) framework, which was outlined in the ACL 2020 paper “Enabling Language Models to Fill in the Blanks” by Donahue et al. (2020).
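As I read the paper, the core idea of ILM is to train a language model on sequences in which the masked text is followed by the missing spans. The sketch below builds such a sequence from a tokenized sentence. The token names [blank], [sep], and [answer] follow the paper's notation and are illustrative; the released code may spell its special tokens differently. The helper name make_ilm_example is hypothetical, and the sketch assumes the spans do not overlap.

```python
from typing import List, Tuple

BLANK, SEP, ANSWER = "[blank]", "[sep]", "[answer]"


def make_ilm_example(tokens: List[str], spans: List[Tuple[int, int]]) -> str:
    """Mask the token ranges [start, end) and append the masked spans as answers."""
    masked: List[str] = []
    answers: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        masked.extend(tokens[cursor:start])  # keep the text before the span
        masked.append(BLANK)                 # replace the span with one blank
        answers.extend(tokens[start:end] + [ANSWER])
        cursor = end
    masked.extend(tokens[cursor:])           # keep the text after the last span
    return " ".join(masked + [SEP] + answers)


if __name__ == "__main__":
    words = "She ate leftover pasta for lunch".split()
    print(make_ilm_example(words, [(2, 4), (5, 6)]))
```

At inference time, the model receives everything up to and including the separator and generates the answers, which are then placed back into the blanks.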

- If you want to test the infilling model, go ahead and click here.

- A guide on how to use this service can be found here.

- The git repository with the source code is stored here.

This is how this Micro-Service looks. You can try it out right there!

2. Deploying Your Own Infilling Micro-Service on Google Cloud Platform

To deploy your own infilling micro-service, follow these seven steps.

1. To deploy this micro-service yourself, create a new project on Google Cloud Platform (GCP). To create the project, choose a project name, project ID, and billing account, and then click Create.


2. Next, activate Google’s Cloud Shell in the upper right corner of GCP to execute a few lines of code to host the ILM-API.

Activate the Cloud Shell.

3. Now, we clone the ILM-API repository from GitHub. For that, type the following command into the Cloud Shell:

git clone

4. As a next step, we install the requirements. These project-related libraries are listed in the file requirements.txt. To install all of them in one go, run the following command from inside the cloned repository directory:

pip3 install -r requirements.txt

5. To prepare for the deployment of the FastAPI micro-service on Google App Engine, we need to create a gcloud app. Note: When you execute the next command, you will be prompted to select a region. Simply choose the region nearest to the geographical location from which your consumers access this micro-service. In the Cloud Shell, type the following command to create the app:

gcloud app create

6. You are almost there! Now, we actually deploy the FastAPI app on GCP App Engine. When you run the next command in the shell, you will be prompted to continue. Press Y and hit Enter to proceed. Hosting the service takes a few moments (roughly 3–5 minutes) once you execute:

gcloud app deploy app.yaml

7. Now, you can browse to your micro-service. Generally, your app is deployed at a URL that is derived from your project ID. If you are not sure what the project ID is, type the following command to open your application in the web browser:

gcloud app browse

Congratulations! You just hosted your own infilling micro-service, based on the deep neural architecture GPT-2, via FastAPI.
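Once the service is running, it can also be called from code instead of the browser. The following sketch only builds the HTTP request and does not send it. The endpoint path /infill, the JSON field context, and the base URL are assumptions made for illustration; check the service's own documentation page for the real route and schema.

```python
import json
import urllib.request


def build_infill_request(base_url: str, context: str) -> urllib.request.Request:
    """Build a JSON POST request for a hypothetical /infill endpoint."""
    payload = json.dumps({"context": context}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/infill",  # assumed route, not confirmed
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host; replace it with the URL of your own deployment.
    request = build_infill_request("https://example.invalid", "I like _ .")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```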

Next, we take a brief look at the source code to better understand what we just did.

3. Remarks About the Code Files

The source code of the micro-service mainly consists of the infilling model and the files presented below.

The inference method is wrapped in the class ‘Infiller()’, which the API needs. This class loads the model, sets an example context, appends the special tokens to the context, and runs the inference to generate the infilling sentences.
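The shape of such a wrapper can be sketched as follows. This is not the repository's implementation: the constructor arguments, the infer method, and the Generator type are hypothetical, and a plain function stands in for the loaded GPT-2 model so that the example runs without any model weights.

```python
from typing import Callable, List

# Hypothetical interface: takes a context with blanks, returns one infilled text.
Generator = Callable[[str], str]


class Infiller:
    """Wraps a text generator behind a small inference interface."""

    def __init__(self, generate: Generator, blank_token: str = "_") -> None:
        self.generate = generate  # stands in for the loaded model
        self.blank_token = blank_token

    def infer(self, context: str, num_infills: int = 3) -> List[str]:
        """Return `num_infills` generated versions of `context`."""
        if self.blank_token not in context:
            raise ValueError(f"context must contain {self.blank_token!r}")
        return [self.generate(context) for _ in range(num_infills)]


if __name__ == "__main__":
    # Dummy generator: always fills the blank with the same word.
    infiller = Infiller(lambda text: text.replace("_", "pizza"))
    print(infiller.infer("I like _ .", num_infills=2))
```

Keeping the model behind one class means the API layer only has to create the object once at start-up and call a single method per request.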

We start the uvicorn server via ‘uvicorn main:app’, which tells uvicorn to load the object named app from the module main. This app hosts the POST APIs that answer the queries sent to the server.


The requirements.txt is very brief, but it installs all required dependencies. This example uses torch (PyTorch), but you can also use TensorFlow if you prefer.


Google App Engine deploys an application based on a configuration defined in a YAML file. For this project, that configuration lives in app.yaml, the file we passed to ‘gcloud app deploy’ in step 6.


The Dockerfile is used via the app.yaml. It builds a Python 3.7 container, copies the app’s source code, installs the required dependencies, and starts the main app.


This micro-service is built on the shoulders of the following giants:

- Git Infilling by Language Modeling (ILM)

- HuggingFace Pipeline

- Deploy FastAPI App on Google Cloud Platform

- Build And Host Fast Data Science Applications Using FastAPI

- Deploying Transformer Models

- How to properly ship and deploy your machine learning model

- Setting Up Virtual Environments


Thanks to Javier and the team at Narrativa, who asked me to host the micro-service. Also, a big thanks to HuggingFace and to the team from Stanford around Mr. Donahue. Thanks as well to WifiTribe, with whom I am currently living as a digital nomad while doing my research and education remotely. They are an amazing bunch of people.

Who am I?

I am Sebastian, an NLP Deep Learning Research Scientist. In my former life, I was a manager at Austria’s biggest bank. I like to work on NLP problems.

Drop me a message if you want to get in touch or have any questions. Thank you for giving me your opinion.



