Shuffling Paragraphs: Using Data Augmentation in NLP to Increase Accuracy
While data augmentation is increasingly being used in machine learning to train models to classify images, when it comes to natural language processing (NLP) and text analytics, its application is still limited.
That’s partly because data augmentation is relatively new and simply hasn’t been explored much in the text analytics space yet. It’s also less intuitive to apply in NLP than it is with images.
But it is possible and in practice, deeply impactful. In this article, inspired by fast.ai’s Practical Deep Learning for Coders, a 7-week course taught by Enlitic founder and former Kaggle President and Chief Scientist Jeremy Howard, I share my hypothesis that we can use data augmentation in NLP to make a model significantly more accurate.
What is data augmentation?
Let’s first consider the challenge of classifying specific images.
Say we want our model to be able to correctly classify images of cats and dogs, but we only have a limited number of labeled images of cats and dogs. How can we create more to better train our model? And how can we make sure our model doesn’t just learn to recognize specific cats and dogs from among the few examples that we do have?
To create more images, we could slightly modify the existing ones. For example, we could flip an image of a dog horizontally, shift it slightly to the left or to the right, or zoom in or out. (See Exhibit 1.) We could also combine those modifications in various ways.
Each time we make one of those modifications, we are effectively “augmenting” our dataset by generating more examples of cats and dogs that we can feed into our model to make it “smarter” (i.e., better at classifying images of cats and dogs).
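The label-preserving modifications described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used in the experiment; in practice a library such as torchvision or Keras would handle flips, shifts, and zooms, and the `augment_image` helper and its shift range are assumptions for this sketch.

```python
import numpy as np

def augment_image(image, rng):
    """Generate simple label-preserving variants of an image array (H, W, C).

    Covers two of the modifications mentioned in the text:
    a horizontal flip and small left/right shifts.
    """
    variants = [np.fliplr(image)]                    # horizontal flip
    shift = int(rng.integers(1, 4))                  # shift by 1-3 pixels
    variants.append(np.roll(image, shift, axis=1))   # shift right
    variants.append(np.roll(image, -shift, axis=1))  # shift left
    return variants

# Each variant keeps the original label ("dog" stays "dog"),
# so all of them can be added to the training set.
```

Note that a vertical flip (`np.flipud`) is deliberately absent: as the next paragraph explains, an upside-down dog is not a realistic training example.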
But there are limits to this approach. If we take that original dog image and flip it vertically, for example, we now have an image of an upside-down dog. And because we don’t have upside-down dogs in real life, that image would “confuse” our model.
Text analytics: consider the message
Text, like a picture of a dog, cannot just be indiscriminately flipped and shifted.
If we did indiscriminately flip and/or shift it, we would end up with examples that would confuse our model, not make it smarter. Think of trying to train a model to classify movie reviews using both the review text and the same text written in reverse. It would make no sense.
So, while we can use data augmentation to get more text samples and improve our NLP models, we must first consider the message the text is conveying and the format in which it’s being presented.
Let’s take an example of a document with a specific format, like a résumé.
A résumé is usually laid out in sections, such as personal information, work experience, education, interests, hobbies, etc. (See Exhibit 2.)
From a content perspective, those sections are independent from each other. In other words, each section would convey the same message and meaning even if the order in which they appeared in the document changed. If we shuffled the paragraphs in a résumé, we’d get another résumé conveying the same message.
More importantly, for the purposes of training our model, each shuffled résumé would preserve the label we gave to its original version (e.g., a good résumé).
Say that we have identified 100 résumés as representing a “candidate we want to talk to.” A résumé with its paragraphs shuffled around would still be classified as representing a “candidate we want to talk to.” That’s because the order of the paragraphs on a résumé does not change the way we would classify it. Put another way, we would talk to a good candidate regardless of whether she listed her education before her work experience or vice versa.
And as I learned when I tested my hypothesis, even a rudimentary and random shuffling of résumé sentences will create thousands of “artificial” résumés that can be fed into our machine learning algorithm — and improve our classification accuracy by a striking 10 percent!
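The shuffling idea can be sketched in plain Python. This is a simplified version of the approach, assuming sections are separated by blank lines; the `shuffle_sections` helper, its delimiter, and the number of variants are illustrative choices, not the exact implementation behind the reported results.

```python
import random

def shuffle_sections(resume_text, n_variants=5, seed=42):
    """Create augmented copies of a résumé by shuffling its sections.

    Assumes sections (personal info, work experience, education, ...)
    are separated by blank lines. Because section order does not change
    how we would classify a résumé, every shuffled copy keeps the
    original label.
    """
    sections = [s for s in resume_text.split("\n\n") if s.strip()]
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        shuffled = sections[:]
        rng.shuffle(shuffled)
        variants.append("\n\n".join(shuffled))
    return variants

# Each (variant, label) pair can be appended to the training set:
# augmented = [(v, label) for v in shuffle_sections(resume)]
```

Applied to 100 labeled résumés with even a modest number of variants each, this yields hundreds of additional training examples at essentially no labeling cost.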
This approach could easily be applied to a real-world hiring situation. Since organizations don’t typically have thousands of résumés on hand to train a model, they could use this augmentation technique to build a machine learning classifier that prioritizes the efforts of their HR team, saving both time and money during the hiring process.
Context + algorithm = impact
As the résumé classification example makes clear, there are two components that we need to consider when using data augmentation in NLP to improve the accuracy of a machine learning model.
1. The context in which the augmentation technique is used. To fully harvest the boost to our NLP model, we need to base our data augmentation design on a deep understanding of our document’s structure and content.
With the résumé, the fact that the information contained within the different sections isn’t altered by being moved around is critical. That same data augmentation technique could never be applied to a format where the meaning of the text depends on its order, such as a novel. The application of data augmentation must be specific to the context in which it’s being used.
2. The augmentation technique being applied. The impact of applying data augmentation will depend on the specific technique itself. Every technique can potentially induce the machine to learn something different and in turn, generate a different impact.
Once we have a deep understanding of our document’s format and the information that it contains, we can experiment with different augmentation techniques to find the one that most meaningfully impacts our issue.
Although data augmentation techniques for NLP and text analytics are still underdeveloped, they are worth pursuing, as they can substantially boost NLP model performance. To fully harvest their potential, however, we must base our data augmentation design on a deep understanding of both our document’s structure and its content.
This technique was tested on a dataset of about 450 documents, 55% of which were labeled “positive” and 45% “negative”.
Because of the structure and semantics of those documents, their paragraphs can be assumed to be independent of each other.
After several training cycles, a DNN with word embeddings that were not pre-trained correctly classified about 68% of the documents in the test set.
The same DNN classified 75% of the documents correctly when the data augmentation technique discussed in this article was applied (a ~10% relative increase).