An Overview of Data Augmentation Without Formulas

Masaya Mori 森正弥
6 min read · Apr 19, 2020


“Data” by Nick Youngson, under CC BY-SA 3.0

Data Augmentation has been getting a lot of attention recently, so I wanted to write a bit about it. It's important to understand this technique well, since it relates to overfitting, a general issue in machine learning, and to how deep learning models train and work.

The following video is helpful for getting a quick overview.

The Data Augmentation technique comes in handy when you want to build a deep learning model for a problem but don't have enough data. To train a good model, you usually need a huge amount of data. For many problems and applications, however, increasing the amount of data with a few techniques can be sufficient and effective.

Data Augmentation is a technique that expands the size of a dataset by creating modified versions of its samples. It works particularly well with Convolutional Neural Networks (CNNs) for image processing. Let's look at some example techniques for images; a minimal code sketch follows the list below. Mix-up requires creating new labels from the original labels, but the other techniques here don't risk changing the label.

Horizontal and Vertical Shift
Horizontal and Vertical Flip
Random Rotation
Random Brightness
Random Zoom In/Out
Random Crop
Random Erasing
Replace background
Change the background color
Mix-up
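
To make a few of these concrete, here is a minimal sketch of some of the label-preserving transforms above, written with plain NumPy on an image stored as an H x W x C array with values in [0, 1]. The function names and parameter values are my own illustration, not taken from any particular library.

```python
import numpy as np

def random_flip(img, rng):
    # Horizontal and/or vertical flip, each with 50% probability.
    if rng.random() < 0.5:
        img = np.fliplr(img)
    if rng.random() < 0.5:
        img = np.flipud(img)
    return img

def random_brightness(img, rng, max_delta=0.2):
    # Scale pixel intensities by a random factor around 1.0.
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return np.clip(img * factor, 0.0, 1.0)

def random_crop(img, rng, crop_h=48, crop_w=48):
    # Keep a randomly positioned window of the image.
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def random_erasing(img, rng, erase_h=16, erase_w=16):
    # Replace a random rectangle with noise (Random Erasing).
    img = img.copy()
    h, w, c = img.shape
    top = rng.integers(0, h - erase_h + 1)
    left = rng.integers(0, w - erase_w + 1)
    img[top:top + erase_h, left:left + erase_w] = rng.random((erase_h, erase_w, c))
    return img

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))  # stand-in for a real image
augmented = random_erasing(random_brightness(random_flip(img, rng), rng), rng)
```

Each call produces a slightly different image with the same label, so a single original photo can yield many training samples.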

Certainly, if you want AI to recognize flower or food images, rotating the images won't change the labels, because photos of flowers or dishes can be taken from many angles. Rather, rotation is a good way to multiply a sample dataset so that the model can handle real images, which makes it more robust.

Recent libraries for computer vision such as Keras, TensorFlow, Cognitive Toolkit and imgaug have functions for dataset expansion along the lines of the techniques above. Some of them can even apply data augmentation in real time during training.
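
For example, recent TensorFlow/Keras versions ship preprocessing layers such as RandomFlip, RandomRotation and RandomZoom that can sit at the front of a model and are only active during training; the exact layer names and arguments depend on your TensorFlow version, so treat this as a rough sketch.

```python
import tensorflow as tf

# Augmentation applied on the fly to each training batch.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),  # up to +/-10% of a full turn
    tf.keras.layers.RandomZoom(0.2),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    data_augmentation,                    # no effect at inference time
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Older Keras versions expose similar real-time augmentation through ImageDataGenerator instead.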

On the other hand, the situation is a bit different for natural language processing because of the high complexity of language. Changing one word can change the meaning of the whole sentence. So, while you need to be careful, there are approaches such as the following (a small sketch follows the list). Leveraging an existing thesaurus, randomly removing words, and so on can generate a lot of text data in a short time. Of course, some of the results will be totally meaningless sentences. Even so, they can help enhance the robustness of the model, which is counterintuitive but impressive.

Simple Synonym Replacement
Replacement of words by calculating similarity
Random Swap of Words
Random Deletion
Back Translation
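
Here is a minimal sketch of three of these text techniques, using only the Python standard library. The tiny synonym dictionary is a stand-in for a real thesaurus such as WordNet, and the probabilities are arbitrary illustration values.

```python
import random

# Toy thesaurus for illustration; in practice you would use a real
# resource such as WordNet or embedding-based nearest neighbours.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def synonym_replacement(words, n=1):
    # Replace up to n words that have an entry in the thesaurus.
    words = words.copy()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        words[i] = random.choice(SYNONYMS[words[i]])
    return words

def random_swap(words, n=1):
    # Swap the positions of n random pairs of words.
    words = words.copy()
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Drop each word with probability p, but never return an empty sentence.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "the quick brown fox is happy".split()
print(synonym_replacement(sentence))
print(random_swap(sentence))
print(random_deletion(sentence))
```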

Some interesting open-source implementations of these techniques are also available.

And to find similar words to substitute, you can use many techniques: unsupervised clustering, word embeddings such as word2vec, GloVe or fastText, and so on.
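
For instance, pretrained embeddings let you look up nearest neighbours as substitution candidates. A rough sketch using gensim's downloader follows; "glove-wiki-gigaword-50" is, as far as I know, one of the small pretrained sets it ships, and word2vec or fastText vectors can be loaded the same way.

```python
import gensim.downloader as api

# Download small pretrained GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-50")

def replacement_candidates(word, topn=5):
    # Nearest neighbours in embedding space, usable as substitutes.
    if word not in vectors:
        return []
    return [w for w, _score in vectors.most_similar(word, topn=topn)]

print(replacement_candidates("flower"))
```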

Back Translation is a method from the machine translation field for generating additional training data to improve translation models, and it has achieved strong BLEU scores. The same idea can be used to replace words, phrases or sentences of training data with back-translated ones. For instance, you use machine translation to translate part of a sample into another language and then translate it back into the original language, and you replace the original text with the back-translated version to expand the dataset. It's a somewhat roundabout trick, but an interesting method.
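
The back-translation loop itself is short. The sketch below assumes a hypothetical translate(text, src, dst) helper backed by whichever machine translation API or model you have at hand; that helper is a placeholder, not a real library call.

```python
def translate(text: str, src: str, dst: str) -> str:
    # Hypothetical helper: plug in your MT system of choice here
    # (a cloud translation API, a pretrained translation model, etc.).
    raise NotImplementedError

def back_translate(text: str, pivot: str = "fr") -> str:
    # Original language -> pivot language -> original language
    # usually yields a paraphrase with the same meaning and label.
    translated = translate(text, src="en", dst=pivot)
    return translate(translated, src=pivot, dst="en")

# back_translate("the service was really quick and friendly") might come
# back as "the service was very fast and friendly", which can be added
# to the dataset with the original label.
```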

By the way, I mentioned robustness earlier. In general, robustness can be defined as "the ability of a system to resist change without adapting its initial stable configuration". In computer science, it is the ability of a computer system to cope with errors during execution and with erroneous input. Thinking about computer vision use cases, especially in B2C or C2C services, image data often comes from photos taken by end users on their smartphones, and those photos are not always perfect. Some are taken at a wrong angle, some are partly obscured, some are blurry; the quality of such image data is unstable. NLP for B2C or C2C is no exception: the data can include incomplete or unstructured sentences, wrong phrases, and misused idioms or words. In those cases, you may want to increase the robustness of the model.

However, data augmentation is not a silver bullet. In some cases it can even worsen the accuracy of your model's inference. There are a few points you need to pay attention to when using it.

Overfitting

Overfitting is literally when a model fits the training data too closely and, as a result, fails to handle real data. In other words, it is a modeling error that occurs when a function is fit too closely to a limited set of data points.

When you use a data augmentation technique to increase the amount of data, what matters is whether the augmented data helps cover more of the data the model will see in production. In some scenarios, very simple augmentations of image data, such as rotating or zooming in/out, won't actually increase the coverage of possible images. Take a factory as an example: if you want to use a deep learning model for object recognition to detect defective products, the detection camera is usually fixed at a specific angle to keep the quality stable, so rotation or zooming may never occur. In that case you need to think about a different strategy for expanding the dataset so that it covers the distribution of real data.

Increase or Decrease Noise

In machine learning, it is often said that you should remove noise from training data: when preparing the training dataset, you check the data format, sort it out, and remove noisy samples so that you can train the model adequately. Generally that's true. In some cases, however, it is not. When you train a model to handle big data or end users' data coming over the internet, the actual data can include a lot of noise. In that case, you should keep some noisy samples in the training data.

Waymo, a subsidiary of Alphabet Inc. and a self-driving technology development company, published its first paper in December 2018.

In this paper, they argue that simply cloning (imitating) real driving data is insufficient for handling complex driving scenarios, and that synthesizing data by randomizing and perturbing the model's input after it has passed through the sensors works well. This is a sort of domain randomization and a good example of creating noise for training. Mix-up or random crop-out, as mentioned above, can also be seen as ways of adding noise to improve accuracy; a sketch of mix-up follows.
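
As a rough illustration of the noise idea, mix-up takes a convex combination of two samples and of their one-hot labels, with the mixing weight drawn from a Beta distribution. The alpha value below is a commonly used choice, but treat it as an assumption rather than a recommendation.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Blend two samples and their one-hot labels with a Beta-distributed weight.
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

rng = np.random.default_rng(0)
img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
label_a = np.array([1.0, 0.0])  # class 0
label_b = np.array([0.0, 1.0])  # class 1
mixed_img, mixed_label = mixup(img_a, label_a, img_b, label_b, rng=rng)
# mixed_label is a soft label such as [0.93, 0.07]: an image that could
# never be photographed, yet it often improves the trained model.
```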

Having said that, it is meaningless to add noisy data that you would never observe in the actual data distribution in production; you can think of that as another form of overfitting. Such noise should of course be removed from the sample data. Which noise to increase and which to decrease in the sample dataset is a genuine point of discussion.

As mentioned above, Data Augmentation is strongly related to how we address the overfitting problem, and it is a method focused on enlarging datasets to cover more of the actual data distribution. In other words, it is an attempt to deepen our understanding of the production data the model is going to face. There is more: some techniques like mix-up are very interesting because they generate impossible data and yet boost the performance of the model. That touches on the mystery of deep learning, which is non-linear learning with an enormous number of parameters.

Regarding how to make a deep learning model effective with a very small sample dataset, you can also consider transfer learning; a brief sketch follows.
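
A minimal sketch of that transfer-learning route, assuming a TensorFlow/Keras setup: load an ImageNet-pretrained backbone, freeze it, and train only a small head on your limited samples. The backbone, input size and number of classes here are placeholders.

```python
import tensorflow as tf

# ImageNet-pretrained backbone with its classification head removed.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 target classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_train_dataset, epochs=10)  # your own (small) dataset
```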
