Creating a Paraphrase Generator Model using T5 and Deploying on Ainize

Jeff
12 min read · Nov 29, 2021


👋 Intro

This tutorial aims to give you a comprehensive walkthrough on modern NLP, from data collection to deploying a web app on Ainize! We’ll do this by creating a paraphrase generator model that allows the user to vary the output using the T5 architecture. We’ll then use FastAPI and Svelte to create the web application demo shown below.

If you’d like to try out the demo before reading the tutorial, check it out here.

In this tutorial, you’ll learn:

  1. About Paraphrasing and how to collect paraphrase data
  2. Fine-tuning T5 using PyTorch-Lightning
  3. Creating a simple web-app using Svelte and FastAPI
  4. Dockerizing
  5. Deploying on Ainize

About Paraphrasing

The goal of paraphrasing is to change the surface structure of text (i.e., the words used and their arrangement) while preserving the underlying meaning. Paraphrasing has many different types of applications. Some of these applications include:

  • Helping people improve their writing
  • Data augmentation for text
  • Plagiarism detection
  • Natural language understanding tasks

Data Collection

Before training the model, we must first collect our dataset. To do this, we’ll use two different sources. First, we’ll download a pre-existing dataset called Paraphrase Adversaries from Word Scrambling (PAWS). Second, we’ll create our own dataset using a method known as back-translation.

PAWS

PAWS was designed for paraphrase detection. As such, it contains both human-validated paraphrases and texts that are grammatically similar but have different semantics. Since we’re building a paraphrase generator, we’ll only require the former.

To collect this data, we’ll use HuggingFace’s datasets available here and extract the labeled paraphrases using the following code.
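
Roughly, that step looks like the following, assuming the labeled_final configuration of PAWS and its sentence1/sentence2/label columns:

from datasets import load_dataset

# Load the human-labeled PAWS split from the HuggingFace Hub
paws = load_dataset("paws", "labeled_final", split="train")

# Keep only the pairs labeled as true paraphrases (label == 1)
paraphrases = [
    {"Source": row["sentence1"], "Target": row["sentence2"]}
    for row in paws
    if row["label"] == 1
]

print(paraphrases[0])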

Let’s take a look at the first item in the dataset.

Here, the “Source” text refers to the item that we wish to paraphrase, and the “Target” text refers to what the model will attempt to generate during training. We’ll keep this same format throughout.

Back translation

Perhaps a more interesting way to collect data is a method known as back-translation. If you’re unfamiliar with the concept, back-translation means translating a piece of text into another language (usually called the pivot language) and then BACK to the source language. This works for paraphrasing because of how differently languages are structured: they frequently have different grammatical layouts and may combine multiple words from the source language into a single word, or vice versa. As a result, the back-translated text tends to preserve the original semantics while offering a different surface structure.

To perform back-translation, we’ll use Facebook’s many-to-many mBART model made available through HuggingFace. While you’re free to try any language as the pivot, I’ll be using Russian because one study found it introduces relatively few errors while still producing a sufficient number of changes. As for the text we’ll be back-translating, I’ll use sentences sourced from several OpenStax textbooks, available to download here; it’s simply a dataset I already have on hand. Feel free to use whatever dataset you like.

Let’s start by initializing the MBART model.
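
Something along these lines, assuming the mbart-large-50-many-to-many-mmt checkpoint:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Facebook's many-to-many mBART-50 translation model
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)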

We may now create the functions used to perform back-translation. For the model to understand what language we’re translating from and to, we must explicitly tell it using language codes. I’ll set English as the source and Russian as the target as defaults for simplicity.
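
A sketch of these helpers (the function names are mine), using the mBART-50 language codes en_XX and ru_RU:

def translate(text, src_lang, tgt_lang):
    # Tell the tokenizer which language the input is in
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start generating in the target language
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def back_translate(text, src_lang="en_XX", pivot_lang="ru_RU"):
    # Translate into the pivot language, then back to the source language
    pivot = translate(text, src_lang, pivot_lang)
    return translate(pivot, pivot_lang, src_lang)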

Quality Checks

Since back-translation is an automated method, it’s a good idea to perform some quality checks to ensure that each back-translated sentence is a good paraphrase before adding it to the training data. Below are the quality checks I use for this project. Of course, you could increase their strictness or add more checks for better results.

  1. Levenshtein distance: Levenshtein distance is the minimum number of edits required to change one string into another. For example, if you were to compute Levenshtein distance at the character level, the distance between the strings “cat” and “rat” would be one because the only operation required is a single substitution. You can learn more about this here. For this project, rather than looking at the character level, we’ll instead compute Levenshtein distance at the word level and require a minimum distance of 3 operations. We’ll do this using the pylev library. The goal here is to ensure that the back-translation changes the structure enough to count as a paraphrase.
  2. Sentence Similarity: Sentence similarity describes how semantically similar two sentences are. In this case, we want to ensure that the back-translation doesn’t significantly alter the semantics of the source text. We’ll do this using the sentence-transformers library, which lets us embed each sentence as a vector. We can then compute the cosine similarity between the two vectors to see how closely related the sentences are.
  3. Grammar Check: The end goal is for the paraphrase to be somewhat grammatically fluent. Therefore, we’ll perform a grammar check using the LanguageTool library to ensure there are no grammatical errors.

The functions used for this can be seen below.
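
Sketches of the three checks, using the language_tool_python wrapper for LanguageTool; the embedding model and thresholds here are illustrative rather than the exact values from the project:

import pylev
import language_tool_python
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-transformers model works
grammar_tool = language_tool_python.LanguageTool("en-US")

def get_distance(source, target):
    # Word-level Levenshtein distance between the two sentences
    return pylev.levenshtein(source.split(), target.split())

def get_similarity(source, target):
    # Cosine similarity between the two sentence embeddings
    embeddings = sim_model.encode([source, target], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

def passes_quality_checks(source, target, min_distance=3, min_similarity=0.85):
    if get_distance(source, target) < min_distance:
        return False  # not enough surface change
    if get_similarity(source, target) < min_similarity:
        return False  # meaning drifted too far from the source
    if len(grammar_tool.check(target)) > 0:
        return False  # grammatical errors found
    return True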

Allowing user variation

An additional feature I’d like to add to this project is letting the user vary aspects of the generated paraphrase. This variation is already implicitly present in the training data; we just need to define the types of variation and prepend them as labels to the source text so the model can learn them. There are two types of variation I’ll define:

  1. Word length: Allow the user to reduce, match, or expand the source text. This variation is captured by counting the number of words in the source and target texts and then prepending the corresponding label. For example, if the source text is longer than the target text, the pair receives the label “reduce”, since the paraphrase effectively reduces the number of words. Note that due to how tokenization works (see the section below), the results here are unlikely to be great without a large amount of data.
  2. Distance: We can also allow the user to vary how different the paraphrased text is from the original. We can do this using the Levenshtein distance described above and defining arbitrary ranges, each of which receives a label: small, medium, large, or gigantic. To see the exact definitions of these ranges, check the source code below.

We can implement this by creating the following functions:
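
Illustrative versions of these functions; the exact distance cut-offs live in the linked source code, so the ones below are placeholders:

def get_length_label(source, target):
    # Compare word counts to decide whether the paraphrase reduces,
    # matches, or expands the source text
    diff = len(target.split()) - len(source.split())
    if diff < 0:
        return "reduce"
    if diff > 0:
        return "expand"
    return "match"

def get_distance_label(source, target):
    # Bucket the word-level Levenshtein distance into named ranges
    # (get_distance is the helper from the quality checks above)
    distance = get_distance(source, target)
    if distance <= 4:
        return "small"
    if distance <= 8:
        return "medium"
    if distance <= 12:
        return "large"
    return "gigantic"

def build_prompt(source, target):
    # Prepend the variation labels so the model can learn them
    return (f"Paraphrase: {get_distance_label(source, target)} changes, "
            f"{get_length_label(source, target)} input. {source}")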

As an example, if the target text had fewer words than the source and only had a moderate Levenshtein distance, we would prepend the following to the source text:

"Paraphrase: medium changes, reduce input."

Final Dataset

Putting it all together, we can create our final dataset by looping through the OpenStax sentences and calling the functions created in this section.
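
Roughly, the loop looks like this, reusing the helpers sketched above; openstax_sentences stands in for however you load your own sentences:

paraphrase_data = []

for sentence in openstax_sentences:
    paraphrased = back_translate(sentence)
    # Only keep back-translations that pass the quality checks
    if passes_quality_checks(sentence, paraphrased):
        paraphrase_data.append({
            "Source": build_prompt(sentence, paraphrased),
            "Target": paraphrased,
        })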

The final dataset can be downloaded here. If you’d like to view the full source code, you may do so here.

T5 Model

About T5

T5 is a sequence-to-sequence (seq2seq) model created by Google that utilizes both the encoder and decoder sections of the popular Transformer architecture. As the name implies, seq2seq models are used to map one sequence of text to another. In this case, we’re mapping the source text to a paraphrased version of it. Let’s briefly cover some key concepts about T5.

Key Concepts

  • Tokenization: Tokenization makes text machine-readable. While there are a few different ways to perform tokenization, the primary idea is to give each word/sub-word in a vocabulary a numerical ID. Let’s look at an example to make this clearer (see the short snippet after this list). Using the T5Tokenizer, the word “paraphrase” is first broken down into [para, phrase], since “paraphrase” is a somewhat rare word. It is then encoded as [3856, 27111], with 3856 representing “para” and 27111 representing “phrase”.
  • Encoder: The encoder takes in the tokens of the source text (e.g., the sentence we want to paraphrase) and outputs a vector for each of these tokens. The idea is that these vectors capture the meaning of each token and the context in which it was used in the source text. They’re then fed into the decoder, which can refer to them during generation.
  • Decoder: As mentioned, the decoder receives the vectors produced by the encoder as inputs. From there, it autoregressively generates the output (e.g., the paraphrase), similar to decoder-only models such as GPT. If you’re unfamiliar with autoregressive models, this just means that the model uses its previous outputs as inputs for the next step of generation. The main difference compared to GPT is that the decoder refers to both the encoded vectors and its own previous outputs during generation.
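
Here’s a quick way to see the tokenization example above for yourself:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

print(tokenizer.tokenize("paraphrase"))
# roughly ['▁para', 'phrase']
print(tokenizer.encode("paraphrase", add_special_tokens=False))
# [3856, 27111]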

If you’d like to learn more about Transformers, HuggingFace offers an excellent course going over the basics in great depth.

Fine-tuning T5 on our data

Fine-tuning describes the process of taking the weights of a pre-trained model and re-training it on your own dataset. This is often done because the model has already learned a great deal about language and the world during its initial training, so we can leverage that knowledge on downstream tasks rather than teaching the model everything from scratch.

To fine-tune T5, we’ll use the pre-trained T5-base model available on HuggingFace and then train it on our dataset using PyTorch Lightning.

Defining the trainer and training the model:
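
A condensed sketch of the fine-tuning setup, reusing the paraphrase_data list built earlier; it skips validation, checkpointing, and generation settings, which you’d want in practice:

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

def collate(batch):
    # Tokenize source and target texts; mask padding tokens out of the loss
    enc = tokenizer([item["Source"] for item in batch],
                    padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer([item["Target"] for item in batch],
                       padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100
    return {"input_ids": enc.input_ids,
            "attention_mask": enc.attention_mask,
            "labels": labels}

class ParaphraseModel(pl.LightningModule):
    def __init__(self, lr=1e-4):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained("t5-base")
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # The HuggingFace model returns the cross-entropy loss when labels are given
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

train_loader = DataLoader(paraphrase_data, batch_size=8, shuffle=True, collate_fn=collate)
trainer = pl.Trainer(max_epochs=5, gpus=1)  # on Lightning 2.x use accelerator="gpu", devices=1
trainer.fit(ParaphraseModel(), train_loader)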

The full code can be seen here. If you run it yourself, be warned that it could take a few hours!

FastAPI

While there are several ways to serve our model, I’ll use FastAPI for this tutorial. FastAPI is an easy-to-use, high-performance backend framework. It gives us an easy way to call our model and make it available for other applications to use.

To create a FastAPI server, let’s make a Paraphrase.py file and create a POST request function that takes in a dictionary containing the source text along with numerical keys indicating the word-length and distance labels. Additionally, I’ll add the get_similarity and get_distance functions from the data-collection section so we can return metrics that show the user how the model is performing.
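
A trimmed-down sketch of what Paraphrase.py might look like; the endpoint name, request fields, and label orderings below are my own placeholders, so check the linked source for the real ones:

from fastapi import FastAPI
from transformers import T5ForConditionalGeneration, T5Tokenizer

app = FastAPI()

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("Model")  # folder with the fine-tuned weights

LENGTH_LABELS = ["reduce", "match", "expand"]
DISTANCE_LABELS = ["small", "medium", "large", "gigantic"]

@app.post("/paraphrase")
def paraphrase(request: dict):
    # Rebuild the same prompt format the model saw during training
    prompt = (f"Paraphrase: {DISTANCE_LABELS[request['distance']]} changes, "
              f"{LENGTH_LABELS[request['length']]} input. {request['source']}")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=256, num_beams=4)
    paraphrased = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # The full version also returns the get_similarity / get_distance metrics here
    return {"paraphrase": paraphrased}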

To run the server, cd into the folder containing the FastAPI file and type the following:

uvicorn Paraphrase:app --reload

By default, our application will run on localhost:8000. However, this will change once we deploy on Ainize, which will provide a new URL for calling our model.

Web App using Svelte

To create a frontend demo for our model, I’ll use Svelte. If you’re unfamiliar with Svelte, it’s a frontend, compiled JavaScript framework that is very fast and easy to use. If you’d like to learn more about it, check out the official Svelte tutorial that offers an interactive, comprehensive rundown.

To get started with Svelte, let’s first create a blank project. To do this, simply type the following in your desired directory:

degit sveltejs/template paraphrase-project

cd into paraphrase-project and type the following in the command line to install all dependencies:

npm install

If you open the new paraphrase-project folder in Visual Studio Code, you should now have the following code structure.

Since this is a relatively simple interface, we will only need to modify the “App.svelte” component. The code below shows the entirety of the JavaScript and HTML I used for this project. Though, if you’d like to see the styles as well, you can refer to this code on GitHub.
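
As a stripped-down illustration, a minimal App.svelte might look like the following, just to show the shape of the fetch call; the endpoint and field names match the FastAPI sketch above, so adjust them to your own code:

<script>
  let source = "";
  let length = 1;    // 0 = reduce, 1 = match, 2 = expand
  let distance = 1;  // 0 = small, 1 = medium, 2 = large, 3 = gigantic
  let result = "";

  async function paraphrase() {
    // Point this at the Ainize URL once the backend is deployed
    const response = await fetch("http://localhost:8000/paraphrase", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ source, length, distance })
    });
    const data = await response.json();
    result = data.paraphrase;
  }
</script>

<main>
  <textarea bind:value={source} placeholder="Text to paraphrase"></textarea>
  <button on:click={paraphrase}>Paraphrase</button>
  <p>{result}</p>
</main>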

This will ultimately give us the following frontend interface to demo our project:

Serving our Frontend with FastAPI

To make it easy to deploy, let’s serve our frontend interface by extending our FastAPI code. To do this, let’s first build a production-ready version of our app with the following command.

npm run build

This will compile an optimized version of our JavaScript and place it in the public folder of our Svelte directory. To serve it with FastAPI, copy and paste the public folder to your FastAPI code’s directory. Once complete, all we need to do is add the following to our FastAPI code:
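
Assuming the compiled files were copied into a folder named public next to Paraphrase.py, the addition is roughly:

from fastapi.staticfiles import StaticFiles

# Serve the compiled Svelte app (index.html plus its assets) at the root path.
# Mount it after the API routes so it doesn't shadow them.
app.mount("/", StaticFiles(directory="public", html=True), name="frontend")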

This will serve our built Svelte code on the “/” endpoint. The final FastAPI code can be seen here.

Docker

Docker containers are a lightweight way to package our model and code to work in different environments. In our case, this container will be running on the Ainize servers once we deploy. To create this container, we must first create a Dockerfile. A Dockerfile is merely a set of instructions used to build a Docker image. Let’s look at this more in-depth by first creating a Dockerfile that builds a Docker image of our model alone.

For reference, here is a picture of my directory.

The folder Model contains the fourth epoch of our fine-tuned T5 model. To create an image of this model, we’ll first import the base PyTorch image, which includes many of the necessary prerequisites. From there, we’ll create our working directory (WORKDIR) and copy the folder containing our model inside.
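
A Dockerfile along these lines should do it; the base-image tag here is just an example:

# Base image with PyTorch and its common dependencies pre-installed
FROM pytorch/pytorch:latest

# Create a working directory and copy the fine-tuned model into it
WORKDIR /app
COPY Model /app/Model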

We’re now ready to build the image. If you don’t have Docker installed already, you can install it here. Once that’s done, you can build the Docker image with the following command:

docker build --no-cache -t {yourDockerUsername}/paraphrasemodel .

We will now create a new Dockerfile for our FastAPI code. It is much the same as the Dockerfile used to create our model image. The only major differences are that we use our model’s image as the base image, pip install a requirements.txt file containing the prerequisites for our FastAPI code, and finally tell the container to run the Uvicorn server on port 8000.
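
A sketch of what that second Dockerfile might look like; swap the FROM line for the image name you built above:

# Use the model image built in the previous step as the base
FROM {yourDockerUsername}/paraphrasemodel:latest

WORKDIR /app

# Install the FastAPI/Uvicorn prerequisites
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the server code and the built Svelte frontend
COPY Paraphrase.py .
COPY public ./public

EXPOSE 8000
CMD ["uvicorn", "Paraphrase:app", "--host", "0.0.0.0", "--port", "8000"]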

Deploying on Ainize

Deploying a large model can be expensive and potentially technically challenging. Fortunately, Ainize makes this process both cheap and easy. If you’re unfamiliar with Ainize, it’s a hosting platform for open-source projects. It allows you to deploy three GitHub repositories entirely for free. They also offer subscription tiers that give you better GPU availability, more API requests, and more projects you can deploy.

The official documentation for Ainize is here, but let’s walk through a simple example for our project. To get started, either create an Ainize account or just log in with GitHub. Once that’s done, all that’s required is to paste the URL of the GitHub repository containing your Dockerfile.

From here, select the appropriate branch and change the port number to 8000, since that is what we used in our Dockerfile.

Ainize will then automatically build our Dockerfile and run the container, allowing us to call the paraphrase model. Once built, you should see the following screen.

Our project is now almost complete! All that is left to do is define the demo path. To do this, navigate to the demo tab and select “Define demo path”. Since our FastAPI code served our frontend code at the “/” endpoint, our demo path would look as follows:

Our model is now fully deployed! You can check out my deployment here.

Improving the Model

If you’d like to improve the paraphrase generation ability of the model, there are a few steps you can take. Here are some of the most obvious ones to me.

  • More variation in training data. Right now, we’re primarily using textbook sentences. While this is okay for our purposes, you may want to collect data from different sources if you’d like the model to do well with different kinds of texts/formats.
  • Use a larger model. T5-base is a 220-million-parameter model. While this is already quite large, Transformer models scale very well with parameter count, and with pre-trained T5 checkpoints of up to 11 billion parameters publicly available, using a bigger model is an easy way to increase performance if you have the compute.
  • Increase the quality of back-translation. While we attempted to ensure some quality control with the quality checks, these could likely be extended further. You could also try different, or even multiple, pivot languages to change the grammatical structure more.
