Hands on Data Augmentation in NLP using NLPAUG Python Library

Roshan Nayak · Published in CodeX · Apr 7, 2022 · 4 min read

A sentence with the same meaning can be written in multiple ways.

Photo by Brett Jordan on Unsplash

In this article I will take a deep dive into how to actually use the excellent nlpaug library. I will assume that the reader already has a good understanding of what data augmentation is and why it is used. If not, here is my understanding of data augmentation:

Data augmentation is a regularization technique that enhances a dataset by generating new samples from the existing ones. This adds variety to the data, helping the model generalize well and reducing overfitting.

To build a better understanding, I summarized a research paper in my previous article. Do give it a read; the link is here.

Data Augmentation in NLP

Data augmentation in NLP refers to modifying an existing sentence to obtain a new sentence that resembles it. The quality of an augmentation technique is measured by how closely the newly generated sentence resembles the sentence it was generated from. A large deviation in meaning can have serious consequences: for instance, over-modifying a sentence could end up inverting its sentiment in a sentiment analysis task.

Need for Data Augmentation in NLP

  1. An imbalanced dataset will not help the model generalize well. The model’s learning will be dominated by the majority class, so it will end up predicting the majority class most of the time. Such a model cannot be used in production. To tackle this problem, we can use data augmentation to generate new samples for the minority class. This technique is called over-sampling.
  2. Small datasets are not enough to train heavy models like BERT, RoBERTa, etc. Since these models have millions of parameters, they need a lot of data in order to generalize and extract meaningful patterns. We can generate new samples using data augmentation and expand our dataset.
  3. It can be used to increase the variety of the data. The model often has a tendency to overfit the training set; increasing the variety of the data helps it generalize better.

Exploring the Data Augmentation techniques for NLP

Before jumping into the techniques, let’s look at several parameters that control the augmentation process. They are common across all the techniques and are important to understand.

  • lang (str) — Language of your text. Default value is ‘eng’.
  • aug_p (float) — Percentage of words that will be augmented.
  • aug_min (int) — Minimum number of words to be augmented.
  • aug_max (int) — Maximum number of words to be augmented. If None is passed, the number of augmentations is calculated via aug_p. If the result calculated from aug_p is smaller than aug_max, the aug_p result is used; otherwise, aug_max is used.
  • stopwords (list) — List of words which will be skipped from augment operation.

Now that we are aware of the parameters that control the augmentation techniques, let’s dive into the implementation. There is a plethora of ways in which new sentences can be created from existing ones. Let’s examine the most popular and intuitive ones and their implementation, using the NLPAUG Python library.

To install NLPAUG, execute the command given below and import the necessary module.

!pip install nlpaug

import nlpaug.augmenter.word as naw
  • Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
syn_aug = naw.SynonymAug(aug_p=0.3)
sentence = 'climate change puts the squeeze on wine production'
mod_sentence = syn_aug.augment(sentence, n=1)
print('Original text :', sentence)
print('Augmented text :', mod_sentence)

Original text : climate change puts the squeeze on wine production

Augmented text : clime change position the squeeze on wine production

  • Random Swap: Randomly choose two adjacent words in the sentence and swap their positions.
sentence = 'climate change puts the squeeze on wine production'
aug = naw.RandomWordAug(action='swap', aug_p=0.3)
mod_sentence = aug.augment(sentence, n=1)
print('Original text :', sentence)
print('Augmented text :', mod_sentence)

Original text : climate change puts the squeeze on wine production

Augmented text : change climate puts squeeze the on production wine

  • Random Deletion: Randomly remove words from the sentence, each with some probability.
sentence = 'climate change puts the squeeze on wine production'
aug = naw.RandomWordAug(action='delete', aug_p=0.3)
mod_sentence = aug.augment(sentence, n=1)
print('Original text :', sentence)
print('Augmented text :', mod_sentence)

Original text : climate change puts the squeeze on wine production

Augmented text : climate puts on wine production

  • Antonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its antonyms chosen at random.
sentence = 'climate change puts the squeeze on wine production'
aug = naw.AntonymAug(aug_p=0.3)
mod_sentence = aug.augment(sentence, n=1)
print('Original text :', sentence)
print('Augmented text :', mod_sentence)

Original text : climate change puts the squeeze on wine production

Augmented text : climate change divest the squeeze on wine production

I have covered four techniques in this article. The key thing to keep in mind while augmenting is that the polarity of the sentence should remain the same before and after augmentation. Do explore the official documentation of the library to get acquainted with many more techniques; I have shared the link in the references section.

Thanks for reading the article. I hope you enjoyed it!

References:

If you haven’t read my latest blog on how to implement word2vec, then check out the link given below!!
