4 Steps to Create Synthetic Datasets with T5, PAWS and Your Text Corpus
Does your organization train deep neural nets? Let me guess, data quality is a problem!
Read this article to learn how to bring paraphrasing to Google’s T5, and how to fine-tune it with a custom text corpus to create synthetic datasets.
[Demo: the Paraphrase Generator in action]
Among the many challenges of supervised machine learning, the dependency on large labeled datasets ranks highest. Creating high-quality datasets is costly and difficult, and even more so for smaller organizations.
Fortunately, our generative method produces synthetic datasets that match the distribution of labeled datasets, making it possible to accelerate the training of machine learning models.
The process of creating such a synthetic dataset is explained in the following four steps. Find the corresponding code in this GitHub repository.
Step 1: Fine-Tune T5 with the PAWS Dataset to Teach ‘paraphrase: ‘
Since Google’s T5 has been trained on multiple tasks (e.g., text summarization, sentence correctness via CoLA, and sentence similarity via STS-B) solely through a text-to-text format, it is easy to teach it additional capabilities. We taught T5 to paraphrase by fine-tuning it with the PAWS dataset, which consists of approximately 50,000 labeled sentence pairs.
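Below is a minimal sketch of this fine-tuning step, assuming the Hugging Face transformers and datasets libraries. The checkpoint name, hyperparameters, and sequence lengths are illustrative choices, not the exact settings from our repository:

```python
# Sketch of Step 1: teach "paraphrase: " to T5 via PAWS (illustrative settings).
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration, T5TokenizerFast)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# PAWS "labeled_final" holds roughly 50k human-labeled sentence pairs.
paws = load_dataset("paws", "labeled_final", split="train")
paws = paws.filter(lambda ex: ex["label"] == 1)  # keep only true paraphrases

def preprocess(ex):
    # The "paraphrase: " prefix follows T5's text-to-text task convention,
    # just like its built-in "summarize: " or "cola sentence: " prefixes.
    inputs = tokenizer("paraphrase: " + ex["sentence1"],
                       truncation=True, max_length=128)
    inputs["labels"] = tokenizer(ex["sentence2"], truncation=True,
                                 max_length=128)["input_ids"]
    return inputs

train_ds = paws.map(preprocess, remove_columns=paws.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-paraphrase",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=2),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
model.save_pretrained("t5-paraphrase")
tokenizer.save_pretrained("t5-paraphrase")
```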
This resulted in a T5-P(araphrase) model that can create paraphrases of texts without a specific context. However, it is not optimally suited for our business-journal context. Hence, we used a text corpus to make its paraphrases sound more business-like.
Step 2: Use T5-P with Custom Data to Create Synthetic Data
Next, we use our custom data (i.e., a business text corpus) to create paraphrases with T5-P(araphrase)’s new paraphrasing capability. Although our corpus is small (330 samples), it is enough to create a dataset.
The resulting synthetic dataset has the 330 original samples in the first column and the generated paraphrases in the second.
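A minimal inference sketch for this step, assuming the model from Step 1 was saved as t5-paraphrase; the decoding settings, example sentences, and file name are illustrative:

```python
# Sketch of Step 2: generate paraphrases for the custom corpus.
import pandas as pd
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-paraphrase")
model = T5ForConditionalGeneration.from_pretrained("t5-paraphrase")

def paraphrase(sentence: str) -> str:
    ids = tokenizer("paraphrase: " + sentence, return_tensors="pt").input_ids
    # Sampling keeps the outputs varied; the exact decoding settings are a guess.
    out = model.generate(ids, max_length=128, do_sample=True, top_p=0.95)
    return tokenizer.decode(out[0], skip_special_tokens=True)

corpus = ["Our quarterly revenue grew by five percent.",
          "The board approved the new investment strategy."]  # 330 in our project
df = pd.DataFrame({"original": corpus,
                   "paraphrase": [paraphrase(s) for s in corpus]})
df.to_csv("synthetic_paraphrases.csv", index=False)
```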
Step 3: Use T5 to Ensure the Synthetic Data’s Syntactic & Semantic Quality
Before using our T5-P to create business paraphrases, we want to ensure that we only fine-tune it with semantically congruent and syntactically correct paraphrases. Hence, we have to evaluate our newly created synthetic dataset.
Evaluation techniques like ROUGE and BLEU are not useful in our context since they reward word overlap. Obviously, paraphrases are better when they use different words to communicate the same meaning.
Therefore, we use a trick: Google’s T5 is already trained for similarity (task prefix: ‘stsb sentence1: […] sentence2: […]’), which we can use to ensure that an original sentence and its corresponding paraphrase have the same semantic meaning. In our project, 95% of sentence pairs had a similarity score above 4.0 and were considered further.
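A sketch of this semantic check, assuming the stock t5-base checkpoint (which knows the STS-B task from its multi-task pre-training); the example sentences are hypothetical:

```python
# Sketch of the STS-B semantic check: T5 answers with the score as text.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def stsb_score(sent1: str, sent2: str) -> float:
    # T5 emits the similarity as a string, e.g. "4.2", on STS-B's 0-5 scale.
    prompt = f"stsb sentence1: {sent1} sentence2: {sent2}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_length=8)
    return float(tokenizer.decode(out[0], skip_special_tokens=True))

score = stsb_score("Our quarterly revenue grew by five percent.",
                   "Quarterly sales increased by 5%.")
keep = score >= 4.0  # we kept only pairs scoring above 4.0
```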
Yet, our sentences could still be syntactically incorrect, so we also evaluated them with another of T5’s built-in tasks. Syntactic correctness is checked with the ‘cola sentence: […]’ prefix, which we applied to the synthetic paraphrases. Only three out of 313 samples were rejected.
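The grammaticality check can be sketched the same way; T5 answers the CoLA prompt with the literal word ‘acceptable’ or ‘unacceptable’:

```python
# Sketch of the CoLA grammaticality check with the same stock t5-base model.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def is_acceptable(sentence: str) -> bool:
    # T5 replies with the literal string "acceptable" or "unacceptable".
    ids = tokenizer("cola sentence: " + sentence,
                    return_tensors="pt").input_ids
    out = model.generate(ids, max_length=8)
    return tokenizer.decode(out[0], skip_special_tokens=True) == "acceptable"

print(is_acceptable("Quarterly sales increased by 5%."))
```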
Step 4: Fine-Tune T5-P with the Qualified Synthetic Data for Custom Paraphrases
Finally, we fine-tune our T5-P(araphrase) model with the qualified synthetic dataset (both syntactically and semantically correct). Afterwards, it creates authentic paraphrases that resemble our business corpus. You can use this Colab for inference to create your own.
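This second fine-tuning round mirrors the Step 1 recipe, so a sketch only needs to swap in the filtered data and start from the T5-P checkpoint; the file and column names follow the earlier sketches and are assumptions:

```python
# Sketch of Step 4: fine-tune T5-P on the qualified synthetic pairs.
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration, T5TokenizerFast)

tokenizer = T5TokenizerFast.from_pretrained("t5-paraphrase")
model = T5ForConditionalGeneration.from_pretrained("t5-paraphrase")

# The pairs that survived the Step 3 checks (columns: original, paraphrase).
synthetic = load_dataset("csv", data_files="synthetic_paraphrases.csv",
                         split="train")

def preprocess(ex):
    inputs = tokenizer("paraphrase: " + ex["original"],
                       truncation=True, max_length=128)
    inputs["labels"] = tokenizer(ex["paraphrase"], truncation=True,
                                 max_length=128)["input_ids"]
    return inputs

train_ds = synthetic.map(preprocess, remove_columns=synthetic.column_names)

Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-p-business",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
).train()
```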
(Optional) Use the Fine-Tuned Model Through a GUI or FastAPI
If you’d like to make your service interactive, you can provide it as a Flask frontend, which we adapted from the work of Renato (kudos!). Alternatively, you can expose it via an API call through FastAPI (find a how-to here).
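For the FastAPI route, a minimal sketch could look like this; the endpoint path, payload schema, and checkpoint name are my assumptions, not the service’s actual API:

```python
# Sketch of a FastAPI wrapper around the fine-tuned paraphrase model.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import T5ForConditionalGeneration, T5TokenizerFast

app = FastAPI()
tokenizer = T5TokenizerFast.from_pretrained("t5-p-business")
model = T5ForConditionalGeneration.from_pretrained("t5-p-business")

class ParaphraseRequest(BaseModel):
    sentence: str

@app.post("/paraphrase")
def paraphrase(req: ParaphraseRequest):
    ids = tokenizer("paraphrase: " + req.sentence,
                    return_tensors="pt").input_ids
    out = model.generate(ids, max_length=128, do_sample=True, top_p=0.95)
    return {"paraphrase": tokenizer.decode(out[0], skip_special_tokens=True)}
```

Run it with uvicorn (e.g., `uvicorn main:app`) and POST a JSON body like `{"sentence": "..."}` to /paraphrase.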
Thank you for reading this article. I am curious to hear your opinion.
Hi there! I want to work in NLP.
I am Sebastian (M.Sc. in IT/Business), an NLP deep learning engineer. I help organizations use artificial intelligence through natural language processing.
In my former life, I was a manager at Austria’s biggest banking group. Going forward, I want to work remotely and flexibly in the field of NLP.
Feel free to contact me!
Kudos to Narrativa.com
This service was created in cooperation with Narrativa.com, a Spanish deep learning company that helps businesses create texts with AI. Javier García Ortíz (Chief Scientist) led the project from Narrativa’s side.