Data Augmentation: Increasing Data Diversity

Heena Rijhwani · Published in The Startup · Dec 21, 2020 · 8 min read

Data Augmentation is a technique that aims to expand existing data by making slight modifications to it. In NLP, it is often used to increase the size of the training data and improve the performance of the model. Here we will look at data augmentation using:

  • Word Embeddings
  • BERT
  • Back Translation
  • T5

Data Augmentation using Word Embeddings

Let's first look at augmenting data using word embeddings. Word embeddings let us represent words as vectors in a high-dimensional space in a way that captures their meaning. For instance, take these two sentences:

Winter is Coming.

Any man who must say “I am the king” is no true king.

To expand our data, we can replace keywords, i.e. the words that carry the most meaning in the sentence. In these sentences, the keywords are winter, king and man.

To augment the data, we can find similar words for these keywords, for instance summer, queen and woman respectively. This can be done automatically using word embeddings: words that are related to each other are likely to lie close together in the embedding space, so if we search in the neighbourhood of a keyword, we will find words similar to it.

Step 1: Find keywords

Step 2: Find similar words

Step 3: Replace keywords with the new words to augment the data.

Careful supervision is needed to shortlist only the relevant results from the generated outcomes. For example, using “throne” instead of “king” in the earlier example would change the meaning of the sentence and render it semantically incorrect. Let's try to implement this concept.

  1. Install libraries

Here we use the gensim library to work with word vectors.
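A minimal setup sketch, assuming a standard Python environment (the exact package versions used in the original article are not stated):

```python
# Install the library first (assumed): pip install gensim
import gensim

print(gensim.__version__)  # confirm gensim is available
```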

2. Download Google News vectors

The second step is to download and load the word embeddings. Here we use the Google News vectors. The vocabulary length shows that the downloaded model has word embeddings for 3 million words.
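One way to load the vectors, assuming the gensim downloader is used (the article may instead load the GoogleNews .bin file directly with KeyedVectors.load_word2vec_format):

```python
import gensim.downloader as api

# Download and load the pre-trained Google News word2vec vectors (300 dimensions).
model = api.load("word2vec-google-news-300")

# The vocabulary holds roughly 3 million words and phrases.
print(len(model.key_to_index))  # gensim 4.x; on 3.x use len(model.vocab)
```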

3. Next, for the given list of keywords, we can get vectors using the get_vector method.
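For example, assuming the model loaded above and our three keywords:

```python
keywords = ["winter", "king", "man"]

# Each keyword is mapped to a 300-dimensional vector in the embedding space.
vectors = {word: model.get_vector(word) for word in keywords}
print(vectors["winter"].shape)  # (300,)
```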

4. Cosine Similarity

We can determine whether two vectors are similar by using cosine similarity. Mathematically, it is the cosine of the angle between the two vectors: if two words are similar, their vectors lie close together, so the angle between them is small and its cosine is large, i.e. the cosine similarity is high. For instance, if we calculate the cosine similarity between winter and summer we get an output of 0.71. We can also use gensim's similarity function to get this result directly.
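A small sketch of both routes, computing the cosine manually and via gensim (the 0.71 figure is the one reported above):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(model["winter"], model["summer"]))  # ~0.71

# gensim computes the same quantity directly:
print(model.similarity("winter", "summer"))
```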

5. Similar words

We can pass a word to the most_similar function in order to get similar words. Here positive holds the word(s) that contribute positively to the similarity query, and topn specifies how many similar words will be returned.
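A sketch of the call, assuming the same loaded model:

```python
# Return the 10 most similar words for each keyword, ranked by cosine similarity.
for word in ["winter", "man", "king"]:
    print(word, model.most_similar(positive=[word], topn=10))
```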

For winter, we get similar words such as summer, spring, autumn and so on.

For man, we get words like woman, boy, girl, and more.

Similarly for king, we will get words queen, prince, sultan, and throne among the top 10.

As we can see, not all of the returned words are relevant. From the similar words we get, we can shortlist the ones that fit our context and create new sentences with them to augment our dataset. We can also filter the words using a minimum similarity threshold, as sketched below.
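A small filtering sketch; the 0.5 cut-off is illustrative and should be tuned for the dataset:

```python
SIM_THRESHOLD = 0.5  # illustrative value

def shortlist(word, topn=10, threshold=SIM_THRESHOLD):
    # Keep only candidates whose cosine similarity clears the threshold.
    return [(w, score) for w, score in model.most_similar(positive=[word], topn=topn)
            if score >= threshold]

print(shortlist("king"))
```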

Data Augmentation using BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model. It is a method of pre-training language representations and has a myriad of applications in NLP. It was trained on a large amount of text from Wikipedia and BookCorpus. The model is trained on two tasks:

  • Masked word prediction
  • Next sentence prediction

The idea behind masked word prediction is to mask keywords in sentences and let BERT predict the keyword. For instance, if we mask “king” in the earlier example, we get:

Any man who must say “I am the king” is no true [MASK].

We give this as input to BERT and it returns multiple sentences with possible predictions for the masked word, such as king, warrior and prince.

With word embeddings, we generated similar words, but not all of the generated words were relevant to the context. This is far less of a problem with BERT: it takes the entire sentence as input, understands the meaning of each word in context, and predicts only relevant outcomes. For example,

“The man was accused of robbing a bank.” “The man went fishing by the bank of the river.”

Word2Vec produces the same embedding for the word “bank” in both sentences, whereas BERT gives a different embedding for “bank” in each sentence.

BERT uses the words to the left and right of the masked word to understand the context and predict the masked word. It is thus a very useful language representation model and gives better results than context-free word embedding methods. Most of the predicted outcomes are relevant to the input sentences, so less supervision is required than with the Word2Vec approach. BERT is also available in multiple languages and has a multilingual version, which we will explore.

First, we mask the keywords and feed these sentences to BERT. Its predictions for the masked words then give us the augmented dataset. Now let's look at the implementation.

1. First, install the required libraries. Here the transformers library is essential.
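A minimal setup sketch (PyTorch is assumed as the backend):

```python
# Install the libraries first (assumed): pip install transformers torch
import transformers

print(transformers.__version__)  # confirm the library is available
```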

2. Create a masked word prediction pipeline using a pre-trained model

The transformers library supports various types of pipelines. We will use the fill-mask pipeline, and the model we have specified is bert-base-uncased, which is pre-trained on lower-cased English data. Several other models that support the fill-mask pipeline for masked word prediction are listed on the Hugging Face model hub.
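A sketch of the pipeline creation:

```python
from transformers import pipeline

# Masked word prediction pipeline backed by the pre-trained bert-base-uncased model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
```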

3. Predict masked words

For example, if we mask the word “tie” in the sentence “Men should wear mask and tie for tomorrow's event”, our predictions include jeans, shorts and trousers. If we mask “shirt” in the same sentence, we get jacket, suit and shirt among the predictions. However, the fill-mask pipeline does not support masking multiple words in the same input, so we must mask one word at a time.
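A sketch of the prediction step, using the example sentence above ([MASK] is the mask token for bert-base-uncased):

```python
# Mask the word "tie" and let BERT rank candidate replacements.
predictions = unmasker("Men should wear mask and [MASK] for tomorrow's event.")

for p in predictions:
    # Each prediction carries the filled-in sentence, the predicted token and a score.
    print(p["token_str"], round(p["score"], 3))
```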

4. Multilingual BERT for data augmentation in multiple languages

To apply the technique to languages other than English, we can create another pipeline with a model pre-trained on that particular language, or we can use the multilingual version of BERT, a single model pre-trained on many languages, although this might affect accuracy. For example, if a German input is given, the model detects the language and outputs relevant results in the same language. Better results can be achieved with a model pre-trained on the specific language instead of multilingual BERT.
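A sketch using the multilingual checkpoint; the German sentence is an illustrative example, not taken from the article:

```python
from transformers import pipeline

# A single model pre-trained on 100+ languages.
multi_unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

# German input ("The weather is very [MASK] today."); predictions come back in German.
print(multi_unmasker("Das Wetter ist heute sehr [MASK]."))
```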

Now we have generated variations for keywords using Word2Vec and BERT, and with these we can expand, or augment, our dataset. But the process is still not fully automated: finding keywords, masking them and shortlisting outcomes needs some human supervision.

Data Augmentation using Back Translation

Sometimes we might not have the right set of keywords, or finding them might require a lot of manual work. Using back translation, we can get paraphrases or variations of sentences, paragraphs or even entire documents in an automated way, and the generated data is grammatically correct and accurate. Moreover, certain languages like English have a vast quantity of training data while other languages have far fewer resources. In this case, back translation translates sentences from the target language back into the source language, and both the original source sentences and the back-translated sentences are mixed to train a model. In this way, the amount of training data from the source language to the target language can be increased.

For instance, neural machine translation can be used to translate between Hindi and English. A Hindi sentence is translated to English, processed through all the layers of a deep neural network, and then translated back, giving a new Hindi sentence. Often this generated sentence is not the same as the original, and in this way we can augment our training data. Back translation is thus an effective technique for reproducing data: we get sentences with the same meaning but different words.

  1. Install essential libraries
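A minimal setup sketch; the pre-trained fairseq hub models used below typically also need the Moses tokenizer and fastBPE packages (an assumption about this particular setup):

```python
# Install (assumed): pip install torch fairseq sacremoses fastBPE
import torch

print(torch.__version__)  # confirm PyTorch is available
```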

2. Load translation model

Here we use the fairseq library by Facebook.
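A sketch of loading an English-German model pair from the fairseq hub; the WMT19 single-model checkpoints are one option, since the article does not name the exact checkpoints:

```python
import torch

# Pre-trained WMT19 translation models published on the fairseq torch.hub.
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")
en2de.eval()
de2en.eval()
```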

3. Apply translation

We implement an English to German translation using the generate method. Here the beam size gives the number of outcomes (top 10). First the input is encoded using the encode method, then the encoded text is passed to the generate method. This outputs tokens which we can decode to get the German output.
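A sketch of the encode / generate / decode round, assuming the en2de model loaded above:

```python
sentence = "Winter is coming."

# Encode the English input, then run beam search to get the top 10 German candidates.
tokens = en2de.encode(sentence)
hypotheses = en2de.generate(tokens, beam=10)

# Decode each hypothesis back into a readable German sentence.
german_outputs = [en2de.decode(h["tokens"]) for h in hypotheses]
print(german_outputs)
```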

4. Back translation

Now we translate the results back to English to augment our data.
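A sketch of the reverse step, reusing the de2en model loaded earlier:

```python
augmented = []
for german in german_outputs:
    # Translate each German candidate back to English to get paraphrased variants.
    back_tokens = de2en.encode(german)
    back_hypotheses = de2en.generate(back_tokens, beam=10)
    augmented.append(de2en.decode(back_hypotheses[0]["tokens"]))

# Drop exact duplicates of the original sentence before adding to the training data.
augmented = sorted(set(s for s in augmented if s != sentence))
print(augmented)
```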

Data Augmentation using T5 (Text-to-Text Transfer Transformer)

T5 is trained on the Colossal Clean Crawled Corpus (C4) dataset. It is capable of performing various NLP tasks such as translation, summarization, question answering and classification, and it reframes every NLP task into a text-to-text format. We can use its text summarization capabilities for data augmentation: it takes an input, summarizes it, and rephrases sentences or uses new words to generate the summary. This is often used for NLP tasks involving long text documents.

Another approach to using T5 for data augmentation is transfer learning. We can fine-tune T5 for masked-word prediction, like BERT, using the same C4 dataset it was trained on; masking multiple words allows for different sentence structures and more variation in keywords. Yet another way is to fine-tune T5 for paraphrase generation. Paraphrasing means the output has the same meaning as the input but a different sentence structure and wording. Here we will use the PAWS dataset for paraphrase generation.

  1. Install libraries

First we install essential libraries.
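The article does not name the exact library; the sketches below assume the simpletransformers wrapper, whose T5Model class matches the workflow described here (a configuration object, a model class, and a predict method):

```python
# Install (assumed): pip install simpletransformers pandas
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
```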

2. Data Preprocessing

Then we upload the PAWS dataset that we need for fine-tuning T5. The dataset has three columns: sentence1, sentence2 and label. The label is 1 if the two sentences are paraphrases and 0 otherwise. For instance, “Data Augmentation is a part of NLP” and “NLP uses Data Augmentation” are paraphrases, so the label is 1. Then we pre-process the data, keeping for training only the rows where the label is 1, i.e. the sentence pairs that are paraphrases.
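A pre-processing sketch; the file name and column names follow the standard PAWS TSV release and may differ from the article's copy:

```python
import pandas as pd

# PAWS ships as TSV files with columns: id, sentence1, sentence2, label.
df = pd.read_csv("paws_train.tsv", sep="\t")

# Keep only sentence pairs that are actual paraphrases (label == 1).
paraphrases = df[df["label"] == 1]

# Reshape into the prefix / input_text / target_text layout expected for T5 fine-tuning.
train_df = pd.DataFrame({
    "prefix": "paraphrase",
    "input_text": paraphrases["sentence1"],
    "target_text": paraphrases["sentence2"],
})
```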

3. Fine-tune T5 for Paraphrase Generation

Next, we fine-tune T5 for paraphrase generation. We set some configuration parameters, such as the number of paraphrases to be returned.

Paraphrases are generated using a combination of top-k sampling and top-p (nucleus) sampling. The next step is to create an object of the T5 model class, passing in the configuration parameters and the type of T5 model.
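A fine-tuning sketch under the simpletransformers assumption above; the hyperparameter values are illustrative, not the article's:

```python
from simpletransformers.t5 import T5Model, T5Args

model_args = T5Args()
model_args.num_train_epochs = 2        # illustrative
model_args.max_seq_length = 128
model_args.do_sample = True            # enable sampling-based decoding
model_args.top_k = 50                  # top-k sampling
model_args.top_p = 0.95                # top-p (nucleus) sampling
model_args.num_return_sequences = 3    # number of paraphrases per input

# "t5" is the model type, "t5-base" the pre-trained checkpoint to fine-tune.
model = T5Model("t5", "t5-base", args=model_args)
model.train_model(train_df)
```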

After fine-tuning, we can load the model and generate paraphrases using the predict method.
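A prediction sketch, still under the same assumption; the "paraphrase:" prefix must match the one used during fine-tuning:

```python
to_paraphrase = ["paraphrase: Data Augmentation is a part of NLP."]

# Returns num_return_sequences candidate paraphrases for each input sentence.
outputs = model.predict(to_paraphrase)
print(outputs)
```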

The quality of such augmented data can be improved using a similarity filter. The augmented data might contain duplicates or near-identical samples. For instance, given “Purple mushrooms grow in the forest”, the augmented output might include sentences like “Purple mushroom grows in the forest” or “Purple mushrooms grow in the forests”. Such variations might not add value to our dataset; there is a chance that they merely increase its size without having any impact on accuracy. To get better performance, the augmented data should contain new keywords or varied sentence structures, and we can detect such near-duplicates using methods like lemmatization or stop-word removal.

Another issue with augmented data is that the generated output might not be relevant or semantically similar to the original data. To filter out such outcomes, we can measure the semantic similarity between input and output sentences and discard outputs whose similarity falls below a set threshold.
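One possible filtering sketch, assuming the sentence-transformers library (not mentioned in the article) for the semantic similarity check; both thresholds are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def filter_augmented(original, candidates, low=0.7, high=0.98):
    # Discard outputs that drift too far from the original (similarity below `low`)
    # as well as near-duplicates that add no diversity (similarity above `high`).
    original_emb = encoder.encode(original, convert_to_tensor=True)
    kept = []
    for candidate in candidates:
        candidate_emb = encoder.encode(candidate, convert_to_tensor=True)
        score = util.cos_sim(original_emb, candidate_emb).item()
        if low <= score <= high:
            kept.append(candidate)
    return kept
```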
