Two minutes NLP — A Taxonomy of Data Augmentation for Text Classification

Noise induction, rule-based transformations, synonym replacement, and embedding replacement

Fabio Chiusano
NLPlanet
5 min read · Apr 4, 2022


Hello fellow NLP enthusiasts! Data augmentation is an interesting technique that may improve the quality of your model predictions using simple transformations on your training data. In this article you’ll see a taxonomy of data augmentation methods for text classification problems. Enjoy! 😄

Why data augmentation

Data augmentation in machine learning consists of creating artificial data that increases the size of the training set and helps models reach better performance. It’s a widely studied research field across machine learning disciplines.

Data augmentation is useful for many reasons, among which:

  • It increases model generalization capabilities;
  • It’s useful for unbalanced datasets;
  • It minimizes the labeling efforts;
  • It increases robustness against adversarial attacks;
  • It limits the amount of data used to protect privacy.

Typically, data augmentation in text classification leads to better models, as the models see more linguistic patterns during training. However, this benefit is nowadays partly subsumed by transfer learning on large pre-trained language models, since these models are already invariant to many transformations. Indeed, it is hypothesized that data augmentation methods can only be beneficial if they create new linguistic patterns that have not been seen before.

A taxonomy of Data Augmentation methods for Text Classification

We report here a taxonomy of data augmentation methods for text classification from the paper A Survey on Data Augmentation for Text Classification. Keep in mind that a common technique is to combine several data augmentation methods to achieve more diversified instances.

Taxonomy and grouping for different data augmentation methods. Image from https://arxiv.org/pdf/2107.03158.pdf.

Data Augmentation in the Data Space

Data augmentation in the data space transforms the input data in its raw form, i.e., the human-readable text itself.

There are four types of data augmentation in the data space: character level, word level, phrase and sentence level, and document level.

Character Level

This type of data augmentation deals with creating new training samples from existing ones by changing single characters.

  • Noise Induction: Deals with random character deletion, swap, and insertion.
  • Rule-based Transformations: Valid transformations through the use of regular expressions like, amongst others, the insertion of spelling mistakes, data alterations, entity names, and abbreviations.
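The noise-induction operations above are easy to sketch in plain Python. The helper below (a hypothetical `char_noise` function, not from the survey) applies random character deletion, adjacent swap, and insertion with a tunable probability:

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def char_noise(text, p=0.1, seed=None):
    """Apply random character deletion, swap, or insertion with probability p."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p:
            op = rng.choice(["delete", "swap", "insert"])
            if op == "delete":
                i += 1                               # drop this character
                continue
            if op == "swap" and i + 1 < len(chars):
                out += [chars[i + 1], chars[i]]      # swap the adjacent pair
                i += 2
                continue
            out.append(rng.choice(LETTERS))          # insert a random letter
        out.append(chars[i])
        i += 1
    return "".join(out)
```

With `p=0` the text is returned unchanged; typical augmentation settings keep `p` small (around 0.05–0.15) so the label-relevant content survives the noise.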

Word Level

This type of data augmentation deals with creating new training samples from existing ones by changing single words entirely.

  • Noise Induction: With “unigram noising”, words in the input data are replaced by another word with a certain probability. By the method of “blank noising”, words get replaced with “_”. Other noise induction techniques are random word swap and deletion.
  • Synonym Replacement: This very popular form of data augmentation describes the paraphrasing transformation of text instances by replacing certain words with synonyms. Synonym replacement is usually done leveraging knowledge bases like WordNet.
  • Embedding Replacement: Comparable to synonym substitution, embedding replacement methods search for words that fit as well as possible into the textual context and additionally do not alter the basic substance of the text. To achieve this, words are translated into a latent representation space, where words of similar contexts are closer together, and then a word is replaced by a word that is close to it in the latent representation space.
  • Replacement by Language Models: Language models represent language by predicting subsequent or missing words on the basis of the previous or surrounding context. In this way, the models can, for example, be used to filter unfitting words, thus filtering bad words produced with the embedding replacement techniques. In contrast to embedding replacements by word embeddings that take into account a global context, language models enable a more localized replacement.
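The word-level noising variants (blank noising, unigram-style replacement, deletion, and swap) can be sketched in a few lines of pure Python; `word_noise` and its `vocab` parameter are illustrative names, and a real synonym-replacement pipeline would instead query a knowledge base such as WordNet:

```python
import random

def word_noise(sentence, p=0.1, vocab=None, seed=None):
    """Word-level noise induction: blank noising, replacement, deletion, swap."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if rng.random() >= p:
            out.append(word)
            continue
        op = rng.choice(["blank", "replace", "delete"])
        if op == "blank":
            out.append("_")                                   # blank noising
        elif op == "replace":
            out.append(rng.choice(vocab) if vocab else word)  # unigram-style replacement
        # op == "delete": skip the word entirely
    if len(out) > 1 and rng.random() < p:                     # random adjacent swap
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)
```

For true synonym replacement, the `vocab` draw would be restricted to synonyms of the word being replaced (e.g. synsets retrieved from WordNet), so the label-bearing meaning is preserved.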

Phrase and Sentence Level

This type of data augmentation deals with creating new training samples from existing ones by changing sentence structures.

  • Structure-based Transformation: Structure-based approaches to data augmentation may utilize certain features or components of a structure to generate modified texts. Such structures can be based on grammatical formalisms, for example, dependency and constituent grammars or POS tags. For example, some sentences can be cropped by putting the focus on subjects and objects; with the “rotation” technique, flexible fragments are moved.
  • Interpolation: This method works by substituting substructures of the training examples if they have the same tagged label. For example, the sentence substructure “a [DT] cake [NN]” in an instance (where [DT] and [NN] are English POS tags, which are Determiner and Singular Noun respectively) can be replaced with the new sentence substructure “a [DT] dog [NN]” of another instance, thus making an interpolation.
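The interpolation idea can be sketched with plain (word, POS-tag) tuples. The `find_span` and `interpolate_instances` helpers below are hypothetical; a real system would obtain the tags from a POS tagger:

```python
def find_span(tagged, pattern):
    """Return the start index of the first tag sequence matching pattern, else None."""
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        if [tag for _, tag in tagged[i:i + n]] == pattern:
            return i
    return None

def interpolate_instances(a, b, pattern=("DT", "NN")):
    """Swap the first substructure of b matching pattern into a."""
    pattern = list(pattern)
    i, j = find_span(a, pattern), find_span(b, pattern)
    if i is None or j is None:
        return a                      # no matching substructure: leave a unchanged
    return a[:i] + b[j:j + len(pattern)] + a[i + len(pattern):]

a = [("she", "PRP"), ("baked", "VBD"), ("a", "DT"), ("cake", "NN")]
b = [("he", "PRP"), ("walked", "VBD"), ("a", "DT"), ("dog", "NN")]
augmented = interpolate_instances(a, b)   # "she baked a dog"
```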

Document Level

This type of data augmentation deals with creating new training samples from existing ones by changing entire sentences in the documents.

  • Round-trip Translation: Round-trip translation is an approach to producing paraphrases with the help of translation models. A word, phrase, sentence, or document is translated into another language (forward translation) and afterward translated back into the source language (back-translation).
  • Generative Methods: As the capabilities of language generation increased significantly, the current models are able to create very diverse texts and can thus incorporate new information. Generative methods for document-level data augmentation consist of training language models (VAEs, RNNs, Transformers) to produce documents similar to the ones in the training data.
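The round-trip mechanism itself is just function composition. In the sketch below, the two translators are toy dictionary stand-ins (a real pipeline would call an actual MT model, such as a MarianMT checkpoint), chosen so the back-translation returns a slight paraphrase:

```python
def round_trip(text, forward, backward):
    """Round-trip translation: translate to a pivot language, then back to the source."""
    return backward(forward(text))

# Toy word-by-word "translators"; real MT models would go here.
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_EN = {"le": "the", "chat": "cat", "dort": "is sleeping"}  # back-translation paraphrases

def fwd(s):
    return " ".join(EN_FR.get(w, w) for w in s.split())

def bwd(s):
    return " ".join(FR_EN.get(w, w) for w in s.split())

paraphrase = round_trip("the cat sleeps", fwd, bwd)  # "the cat is sleeping"
```

With real MT systems, the lexical and syntactic choices of the two models are what introduce the paraphrastic variation that makes the augmented instance useful.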

Data Augmentation in the Feature Space

Data augmentation in the feature space transforms the input data in its feature form, i.e., the latent vector representations of the inputs.

There are two types of data augmentation in the feature space: noise induction and interpolation methods.

  • Noise Induction: As in the data space, noise can also be introduced in several variants in the feature space. For example, it’s possible to apply random multiplicative and additive noise to the feature representations.
  • Interpolation Methods: These create a new representation by interpolating the hidden states of two sentences, producing an instance that carries the meaning of both originals.
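Both feature-space variants fit in a few lines of plain Python, shown here on lists rather than tensors for clarity (function names are illustrative):

```python
import random

def feature_noise(vec, sigma=0.05, seed=None):
    """Apply random multiplicative scaling and additive Gaussian noise to a feature vector."""
    rng = random.Random(seed)
    return [x * (1.0 + rng.gauss(0.0, sigma)) + rng.gauss(0.0, sigma) for x in vec]

def feature_interpolate(vec_a, vec_b, lam=0.5):
    """Mixup-style linear interpolation between two hidden representations."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]
```

In practice these operations are applied to the model's hidden activations during training (e.g. mixup-style interpolation of both the representations and their labels), not to raw feature lists.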

Conclusions and next steps

This survey provides an overview of data augmentation approaches suited for the textual domain. Data augmentation is helpful to reach many goals, including regularization, minimizing label effort, lowering the usage of real-world data, particularly in privacy-sensitive domains, balancing unbalanced datasets, and increasing robustness against adversarial attacks.

On a high level, data augmentation methods are differentiated into methods applied in the feature and in the data space. These methods are then subdivided into more fine-grained groups.

Keep in mind that the increasing usage of transfer learning has made some data augmentation methods obsolete, since both techniques pursue similar goals.

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!

Fabio Chiusano, NLPlanet
Freelance data scientist and Top Medium writer in Artificial Intelligence