An Extrinsic Evaluation of Data Augmentation for NLP: Intents Classification case study

Maâli Mnasri
Opla
Jul 31, 2019

Applying AI algorithms to small data has always been one of our major interests at Opla, and data augmentation is one of the key solutions to this problem. This article extends the work presented in our previous article on data augmentation.

1. Previous Research Work

Data augmentation has shown noticeable results in fields such as image processing [11, 13] and speech recognition [6, 9]. Recently, research work has been conducted to apply this technique to text data, which is challenging because of its complex structure. While image processing relies on geometric transformations such as rotation and scaling, in NLU the transformations aim at creating new forms of text units through paraphrase generation, morphological transformations, etc. One of the most widely used techniques for text data augmentation relies on translation: the source text is translated into various intermediate languages and then translated back into the original language [1], which creates new expressions of the same sentence. Sahin and Steedman [16] augmented data by changing the structure of dependency trees (moving or trimming tree fragments). Du and Black [5] performed permutation and flipping for a neural dialog response selection task. Kurata et al. [12] trained an encoder-decoder to reconstruct the conversation utterances in the training data; to perform augmentation, the encoder’s output hidden states are randomly perturbed to generate new utterances.

2. How can we enlarge our training data?

In this work, we study the effect of data augmentation in the context of intent classification for dialog systems. Our training data consists of a collection of utterances, usually composed of a single sentence, and each utterance is tagged with the user intent. Our goal is to create, for each utterance in the training data, new similar utterances that inherit the label of the original one. This can be seen as paraphrasing the labeled samples. To paraphrase the available utterances, we combined three techniques: word shuffling, synonym replacement, and replacement with semantically related words.
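
To make the setup concrete, here is a minimal sketch of the augmentation loop, assuming a hypothetical `paraphrase` function that stands in for any of the three techniques detailed below; every generated utterance simply inherits the intent label of its source:

```python
from typing import Callable, List, Tuple

def augment_dataset(
    data: List[Tuple[str, str]],             # (utterance, intent) pairs
    paraphrase: Callable[[str], List[str]],  # one of the techniques below
) -> List[Tuple[str, str]]:
    """Return the original samples plus paraphrases carrying the same label."""
    augmented = list(data)
    for utterance, intent in data:
        for new_utterance in paraphrase(utterance):
            augmented.append((new_utterance, intent))  # label is inherited
    return augmented
```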

2.1. Word shuffling

This method simply reorders the words of a sentence to create a new one. Note that this variant only yields genuinely new samples if the classification algorithm, or the representation it consumes, is sensitive to the order of the words.
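
A minimal sketch of this variant, following the WS setting described in Section 4, where we generate n random shuffles per sentence, n being the sentence length:

```python
import random

def shuffle_augment(sentence: str, seed: int = 0) -> list:
    """Generate up to n random word shuffles of a sentence (n = word count)."""
    words = sentence.split()
    rng = random.Random(seed)
    shuffles = set()
    for _ in range(len(words)):
        permuted = list(words)
        rng.shuffle(permuted)
        shuffles.add(" ".join(permuted))
    shuffles.discard(sentence)  # keep only genuinely new orderings
    return sorted(shuffles)
```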

2.2. Synonym replacement

This variant uses an external thesaurus of synonyms or paraphrases. To paraphrase a sentence, we look up each of its words in the synonym database and, whenever a synonym is found, use it to replace the original word. The examples below are extracted from our results; the words in bold are those that have been replaced.

What to use to quickly cut Audio/Video ➜ What to use to speedily clip Audio/Video

Shut down without extra question ➜ Shut down without additional issues

We chose to replace every word for which a synonym is found in the database, in order to obtain paraphrases that are as different as possible from the original, as sketched below.
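
A minimal sketch of this replacement, assuming a hypothetical `synonyms` dictionary (it could be built from PPDB, as described in Section 3.5):

```python
from typing import Optional

def synonym_augment(sentence: str, synonyms: dict) -> Optional[str]:
    """Replace every word that has an entry in the synonym dictionary.
    Returns None when no word could be replaced."""
    words = sentence.split()
    replaced = [synonyms.get(w.lower(), w) for w in words]
    return " ".join(replaced) if replaced != words else None
```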

2.3. Similar word replacement

In this variant, we chose to use distributional word representations, known as word embeddings, to enrich our training data. For each word in a sentence, if the word is present in the embedding vocabulary, we replace it with its most similar word carrying the same Part-Of-Speech (POS) tag.
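
A sketch of this variant, assuming a gensim `KeyedVectors` model `kv` (loading is shown in Section 3.5) and NLTK's POS tagger; the tagger choice is an assumption, as the specific tagger used is not stated here:

```python
import nltk  # requires: nltk.download("averaged_perceptron_tagger")

def embedding_augment(sentence: str, kv, topn: int = 10) -> str:
    """Replace each in-vocabulary word with its nearest embedding
    neighbour that carries the same POS tag as the original word."""
    words = sentence.split()
    out = []
    for word, tag in nltk.pos_tag(words):
        replacement = word
        if word in kv:  # kv is a gensim KeyedVectors model
            for cand, _score in kv.most_similar(word, topn=topn):
                # tagging the candidate in isolation is a simplification
                if nltk.pos_tag([cand])[0][1] == tag:
                    replacement = cand
                    break
        out.append(replacement)
    return " ".join(out)
```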

3. Experimental settings

Since our ultimate goal is to improve the intent classification performance, we conducted an extrinsic evaluation of the described data augmentation methods by comparing the performance of the classification with and without data augmentation.

3.1. Datasets

We mainly used three publicly available datasets [3]:

  • Ask Ubuntu Corpus: 162 questions covering five intents, extracted from the AskUbuntu forum
  • Web Applications Corpus: 89 questions covering eight intents, extracted from StackExchange
  • Chatbot Corpus: 206 questions covering two intents, extracted from a Telegram chatbot for public transport in Munich

We used 10-fold cross-validation to split the data into train and test sets, as sketched below.
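
With scikit-learn this looks as follows, where `texts` and `labels` hold the utterances and their intents. Whether augmentation is applied before or after the split is not specified above; augmenting only the training folds, so that paraphrases of test utterances never leak into training, is the assumption made in this sketch:

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(texts):
    # augment the training fold only (assumption, see above)
    train = augment_dataset([(texts[i], labels[i]) for i in train_idx],
                            paraphrase=shuffle_augment)
    test = [(texts[i], labels[i]) for i in test_idx]
```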

3.2. Classification algorithms

We chose three well-established classification algorithms (instantiated in the sketch after this list):

  • Support Vector Machines (SVM) [4] with a linear kernel
  • K-Nearest Neighbors (KNN) [10] with k set to 5
  • Extra Trees Classifier (ETREE) [8] with 200 estimators.
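
With scikit-learn, the three classifiers can be instantiated as follows; all hyperparameters other than those listed above are left at the library defaults, which is an assumption:

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier

classifiers = {
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ETREE": ExtraTreesClassifier(n_estimators=200, random_state=0),
}
```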

3.3. Preprocessing

During the augmentation process, we remove all stop words. In the synonym replacement approach, we stem both our dataset and the vocabulary of the thesaurus. Finally, during classification, all words are stemmed to get more accurate word counts.
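
A minimal sketch of this pipeline; the stop-word list and the stemmer are not named above, so NLTK's English list and the Porter stemmer are assumptions:

```python
import nltk  # requires: nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(sentence: str) -> str:
    """Drop stop words, then stem the remaining tokens."""
    return " ".join(stem(w) for w in sentence.lower().split() if w not in STOP)
```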

3.4. Vectorization

We used TF-IDF (Term Frequency × Inverse Document Frequency) features to represent our training data, since our original training datasets are relatively small.
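
With scikit-learn, reusing the `preprocess` helper sketched in Section 3.3 (`train_texts` and `test_texts` come from the cross-validation split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(preprocessor=preprocess)
X_train = vectorizer.fit_transform(train_texts)  # fit on the training fold only
X_test = vectorizer.transform(test_texts)
```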

3.5. External resources

a- Synonyms dictionary: we used the ParaPhrase DataBase (PPDB) [7]; a minimal parsing sketch follows.
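
PPDB packs are distributed as plain text with '|||'-delimited fields; a parser for the lexical (unigram) entries might look as follows (the exact field layout should be checked against the downloaded pack):

```python
def load_ppdb_synonyms(path: str) -> dict:
    """Build a word -> paraphrase map from a PPDB lexical pack.
    Lines look like: LHS ||| source ||| target ||| features ||| ..."""
    synonyms = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 3:
                continue
            source, target = fields[1], fields[2]
            if " " not in source and " " not in target:  # unigrams only
                synonyms.setdefault(source, target)      # keep first candidate
    return synonyms
```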

b- Word embeddings: we used several sets of pre-trained word vectors (a loading sketch follows this list):

  • GV6B: GloVe vectors [15] pre-trained on Wikipedia 2014 + Gigaword 5, with a 400K vocabulary
  • FTXT: FastText vectors [2] pre-trained on Common Crawl and Wikipedia
  • W2V: Word2Vec vectors [14] pre-trained on the Google News corpus, with a vocabulary of 3 million words
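
These, or close substitutes, can be loaded through gensim's downloader; note that the FastText model available there is trained on Wikipedia and news text rather than Common Crawl, so it only approximates the FTXT vectors used here:

```python
import gensim.downloader

gv6b = gensim.downloader.load("glove-wiki-gigaword-300")          # ~GV6B
w2v = gensim.downloader.load("word2vec-google-news-300")          # ~W2V
ftxt = gensim.downloader.load("fasttext-wiki-news-subwords-300")  # FTXT substitute
```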

4. Evaluation and results

To evaluate the performance of the classification algorithms, we measure the micro-averaged F-measure of each run. The micro-averaged F-measure aggregates the contributions, in terms of precision and recall, of all classes to compute the average metric, which is more relevant when the class distribution may be imbalanced. Figure 1 shows the class distribution of each dataset: in none of them is there a single predominant class. We nevertheless use the micro-averaged F-measure to produce realistic scores.
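
With scikit-learn, given the gold labels `y_test` and the predictions `y_pred` of a fitted classifier:

```python
from sklearn.metrics import f1_score

# Micro-averaging pools true/false positives and negatives across all
# intents before computing precision and recall, so every utterance
# counts equally regardless of its class.
score = f1_score(y_test, y_pred, average="micro")
```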

The results of the multiple experiments are presented in Tables 1, 2 and 3, using the SVM, KNN and ETREE algorithms respectively. The variants we evaluated are the following (a sketch of how the combined variants chain the augmenters follows this list):

  • W/O DG: classification without data augmentation
  • WS: word shuffling. For each sentence we generate n random shuffles, where n is the length of the original sentence
  • GV6B & FTXT: most similar word replacement using the GV6B and FastText word vectors respectively
  • GV6B+WS: GV6B followed by word shuffling over the augmented data
  • PPDB: synonym replacement using the PPDB paraphrase database, which includes paraphrases of both words and phrases; in this work, we use only word (unigram) paraphrases
  • PPDB-W2V: most similar word replacement using the W2V word vectors, applied over samples already augmented with PPDB

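The combined variants simply chain the augmenters; for example, PPDB-W2V can be sketched with the helpers from Section 2 (`augment_dataset`, `synonym_augment`, `embedding_augment`) as follows:

```python
def ppdb_then_w2v(data, synonyms, kv):
    """PPDB-W2V: PPDB synonym replacement first, then most-similar-word
    replacement (W2V vectors) over the PPDB-augmented samples."""
    def ppdb_step(sentence):
        paraphrase = synonym_augment(sentence, synonyms)
        return [paraphrase] if paraphrase else []

    with_ppdb = augment_dataset(data, ppdb_step)
    return augment_dataset(with_ppdb, lambda s: [embedding_augment(s, kv)])
```
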
We notice, first of all, that without data augmentation, SVM and ETREE perform slightly better than KNN. We also note that the performance on the Web Applications dataset is considerably weaker than on the AskUbuntu and Chatbot corpora. Turning to the impact of data augmentation, we can see that word shuffling, as simple as it is, gains several points of performance. This gain is largest on the Web Applications dataset, at around +15%.

Regarding the paraphrasing techniques, all the presented approaches improved over both the initial results and the word shuffling technique. Performing word shuffling over data already augmented with paraphrases has, overall, led to even better scores.

Among the semantically related word replacement techniques (GV6B, GV6B+WS, and FTXT), it is hard to decide which one performs best since their scores are very close. On top of that, each classifier reached its best performance with a different combination of augmentation methods.

However, if we compare the scores of the PPDB variant with those of the embedding-based variants across the three tables, we can conclude that using word embeddings for data augmentation works better than using paraphrases or synonyms. We suppose that word embeddings, by providing semantically related words rather than exact synonyms, generate more diverse and more general data. Another possible reason is the augmented data size: using word embeddings, we multiplied the data size by 45 on average, whereas the synonym replacement method only doubled it.

Overall, the evaluation of the proposed data augmentation techniques shows that they considerably enlarge our datasets and thus improve classification performance.

5. Conclusion

Through this work, we tried to demonstrate the benefit of data augmentation for intent classification, a task where training data is usually insufficient. The reported results show that simple data enrichment techniques such as word shuffling, synonym replacement or semantically related word replacement can yield significant improvements.

Our future research will focus on context-aware data augmentation to handle intent detection systems that exploit context. We will also study applying data augmentation to generate complete conversations rather than isolated utterances, which should greatly help machine learning-based answer generation systems.

References

[1] Segun Taofeek Aroyehun and Alexander Gelbukh. 2018. Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). 90–97.

[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.

[3] Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken, Germany, 174–185. http://www.aclweb.org/anthology/W17-3622

[4] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.

[5] Wenchao Du and Alan W Black. 2018. Data Augmentation for Neural Online Chat Response Selection. arXiv preprint arXiv:1809.00428 (2018).

[6] Takashi Fukuda, Raul Fernandez, Andrew Rosenberg, Samuel Thomas, Bhuvana Ramabhadran, Alexander Sorin, and Gakuto Kurata. 2018. Data Augmentation Improves Recognition of Foreign Accented Speech. Proc. Interspeech 2018 (2018), 2409–2413.

[7] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT. Association for Computational Linguistics, Atlanta, Georgia, 758–764. http://cs.jhu.edu/~ccb/publications/ppdb.pdf

[8] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine learning 63, 1 (2006), 3–42.

[9] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

[10] Flip Korn, Nikolaos Sidiropoulos, Christos Faloutsos, Eliot Siegel, and Zenon Protopapas. 1998. Fast nearest neighbor search in medical image databases. Technical Report.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.

[12] Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Labeled Data Generation with Encoder-Decoder LSTM for Semantic Slot Filling. In INTERSPEECH. 725–729.

[13] Niall McLaughlin, Jesus Martinez Del Rincon, and Paul Miller. 2015. Data augmentation for reducing dataset bias in person re-identification. In 2015 12th IEEE International conference on advanced video and signal based surveillance (AVSS). IEEE, 1–6.

[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.

[15] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.

[16] Gozde Gul Sahin and Mark Steedman. 2018. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5004–5009.
