A Look at Data Augmentation | Towards AI
The more data we have, the better the performance we can achieve. However, annotating a large amount of training data is too expensive. Therefore, proper data augmentation is useful for boosting your model's performance. In Unsupervised Data Augmentation (Xie et al., 2019), the authors proposed Unsupervised Data Augmentation (UDA), which helps us build a better model by leveraging several data augmentation methods.
In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (e.g. a, an, the), and not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in the computer vision area is relatively easier: even after introducing noise or cropping out a portion of an image, the model can still classify it.
Xie et al. conducted several data augmentation experiments on image classification (AutoAugment) and text classification (back translation and TF-IDF based word replacement). After generating a large enough data set for model training, the authors noticed that the model can easily over-fit. Therefore, they introduced Training Signal Annealing (TSA) to overcome it.
This section introduces three data augmentation methods from the computer vision (CV) and natural language processing (NLP) fields.
AutoAugment for Image Classification
AutoAugment was introduced by Google in 2018. It is a way to augment images automatically. Unlike traditional image augmentation libraries, AutoAugment is designed to find the best policy for manipulating data automatically.
You may visit here for the model and implementation.
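To make the idea of a learned policy concrete, here is a minimal sketch of how an AutoAugment-style sub-policy is sampled and applied. The `invert` and `flip_lr` operations and the example policy are toy stand-ins (the real search space uses 16 PIL-based operations such as rotate, shear, and color); this is not the official implementation.

```python
import random

# Toy "image": a grid of grayscale pixel intensities (0-255).
def invert(img, _magnitude):
    """Invert pixel intensities (a stand-in for a real PIL operation)."""
    return [[255 - px for px in row] for row in img]

def flip_lr(img, _magnitude):
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in img]

# A policy is a list of sub-policies; each sub-policy is a list of
# (operation, probability, magnitude) triples applied in order.
POLICY = [
    [(invert, 0.8, None), (flip_lr, 0.5, None)],
    [(flip_lr, 0.9, None)],
]

def augment(img, policy, rng=random):
    """Pick one sub-policy at random, apply each op with its probability."""
    sub_policy = rng.choice(policy)
    for op, prob, magnitude in sub_policy:
        if rng.random() < prob:
            img = op(img, magnitude)
    return img

img = [[0, 64], [128, 255]]
augmented = augment(img, POLICY)  # same shape, randomly transformed
```

AutoAugment's contribution is searching for the `POLICY` contents (which operations, probabilities, and magnitudes) automatically rather than hand-tuning them.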
Back translation for Text Classification
Back translation is a method that leverages a translation system to generate data. Suppose we have models for translating English to Cantonese and vice versa. Augmented data can be obtained by translating the original data from English to Cantonese and then back to English.
Sennrich et al. (2015) used the back-translation method to generate more training data and improve translation model performance.
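The round trip can be sketched as follows. The `translate` function and the tiny phrase table are hypothetical stubs standing in for a real English↔Cantonese NMT system; only the shape of the pipeline matches the description above.

```python
# Hypothetical phrase tables standing in for two translation models.
EN_TO_YUE = {"the movie is great": "套戲好好睇"}
YUE_TO_EN = {"套戲好好睇": "the film is very good"}  # paraphrase on the way back

def translate(sentence, table):
    """Stub translator: look up the sentence, fall back to identity."""
    return table.get(sentence, sentence)

def back_translate(sentence):
    """English -> Cantonese -> English; the round trip yields a paraphrase."""
    pivot = translate(sentence, EN_TO_YUE)
    return translate(pivot, YUE_TO_EN)

original = "the movie is great"
augmented = back_translate(original)  # "the film is very good"
```

The augmented sentence keeps the original's label (here, positive sentiment) while varying its surface form, which is exactly what a classifier needs for extra training signal.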
TF-IDF based word replacing for Text Classification
Although back translation helps to generate a lot of data, there is no guarantee that keywords will be kept after translation. Some keywords carry more information than others and may be lost in translation.
Therefore, Xie et al. use TF-IDF to tackle this limitation. The idea behind TF-IDF is that high-frequency words may not provide much information gain; in other words, rare words contribute more weight to the model. A word's importance increases with its number of occurrences within the same document (i.e. training record), and decreases if it also occurs across the rest of the corpus (i.e. other training records).
The IDF score is calculated on the DBPedia corpus. A TF-IDF score is computed for each token, and tokens are replaced according to it: tokens with a low TF-IDF score have a high probability of being replaced.
If you are interested in using TF-IDF based word replacement for data augmentation, you may visit nlpaug for a Python implementation.
Training Signal Annealing (TSA)
After generating a large amount of data with the aforementioned techniques, Xie et al. noticed that the model over-fits easily. Therefore, they introduced TSA: during model training, examples with high confidence are removed from the loss function to prevent over-training.
The following figure shows the value range of ηt, where K is the number of categories. If the predicted probability is higher than ηt, the example is removed from the loss function.
Three schedules for ηt are considered for different scenarios.
- Linear schedule: grows at a constant rate.
- Log schedule: grows faster in the early stage of training.
- Exp schedule: grows faster at the end of training.
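The three schedules can be written down directly. This sketch follows the formulation in the UDA paper, where αt ramps from 0 to 1 over training and ηt = αt·(1 − 1/K) + 1/K ramps from 1/K to 1; treat the scaling constant 5 in the log and exp variants as a tunable hyperparameter.

```python
import math

def tsa_threshold(schedule, step, total_steps, num_classes):
    """Training Signal Annealing threshold eta_t.

    Labeled examples whose predicted probability for the correct class
    already exceeds eta_t are dropped from the loss at this step.
    """
    t = step / total_steps
    if schedule == "linear":
        alpha = t                      # grows at a constant rate
    elif schedule == "log":
        alpha = 1 - math.exp(-t * 5)   # grows fast early
    elif schedule == "exp":
        alpha = math.exp((t - 1) * 5)  # grows fast late
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Map alpha in [0, 1] onto eta_t in [1/K, 1].
    return alpha * (1 - 1 / num_classes) + 1 / num_classes
```

At the start of training ηt = 1/K, so only examples the model is still unsure about contribute to the loss; by the end ηt approaches 1 and (almost) every example contributes again.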
The above approaches are designed to solve the problems the authors faced. If you understand your data, you should tailor the augmentation approach to it. Remember the golden rule in data science: garbage in, garbage out.
Like to learn?
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or Github.
- Data Augmentation in NLP
- Data Augmentation for Text
- Data Augmentation for Audio
- Data Augmentation for Spectrogram
- Can your NLP model withstand an adversarial attack?
- Unofficial AutoAugment implementation
- R. Sennrich, B. Haddow and A. Birch. Improving Neural Machine Translation Models with Monolingual Data. 2015
- E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan and Q. V. Le. AutoAugment: Learning Augmentation Strategies from Data. 2018
- Q. Xie, Z. Dai, E. Hovy, M. T. Luong and Q. V. Le. Unsupervised Data Augmentation. 2019