Automatic Data Augmentation: an Overview and the SOTA
Want a state-of-the-art computer vision model? You need a gnarly data augmentation pipeline. That’s non-negotiable at this point in AI development.
But cobbling together a data augmentation pipeline is conventionally a manual, iterative process; it is a pain. Note, however, that I said ‘conventionally’. That’s because there is a blossoming (though so far not widely implemented) literature on automating the search for that gnarly augmentation pipeline.
Automating the Search: Overview
So far, there have been roughly two approaches. The ‘AI-model approach’ attempts to search through a big space of augmentation policies to find an optimal one using reinforcement learning or GANs. It has yielded remarkable results, with Adversarial AutoAugment achieving the current state-of-the-art performance. I think this approach is the future of automated data augmentation; however, it may not (yet!) lend itself well to individual developers. In it, we must train a whole GAN — a process requiring a tricky implementation and nontrivial computing resources. Not so good if our only GPU power is from Kaggle kernels and Colab notebooks (though Faster AutoAugment is helping to allay the computational expense, and I believe that the approach will become more accessible in at most a few years’ time).
I want an automatic search process that I can just plop down and stop thinking about. Fortunately, that’s what we get with the other search strategy, the ‘randomness-based approach’, which shrinks the search space (by using fewer parameters) and randomly samples policies. Sacrificing flexibility for speed, this approach, incarnated in the RandAugment algorithm, yielded performance competitive with the AI-model approach… as of a couple of years ago. The AI-model approach has since developed further and now outperforms RandAugment. However, RandAugment is still much faster, and if you just need a ‘good enough’ data augmentation pipeline (one that is easy to use and still better than manually and iteratively cobbling one together), it is a viable option.
There is another, not-yet-widely-known algorithm that falls somewhere between the poles of the randomness-based and AI-model approaches. It does use a deep learning model to select optimal transformations; however, that model is not a separate network like a GAN, but the very model being trained. This algorithm is much slower than RandAugment, but still a few times faster than Faster AutoAugment, the fastest AI-based method, while matching the performance of Adversarial AutoAugment, the best-performing AI-based method.
I would love to give you the name of this algorithm, but it seems to lack one, or at least a good one. The paper, “On the Generalization Effects of Linear Transformations in Data Augmentation” (2020), referred to this algorithm as the “uncertainty-based transformation sampling scheme”. We need a new name.
How about ‘MuAugment’, standing for Model Uncertainty-based Augmentation? Let’s go with that.
Diving into MuAugment
We have to understand RandAugment before we do MuAugment. Fortunately, RandAugment is dead simple.
We have a list of `K` transforms (e.g. HorizontalFlip, ChangeBrightness). Select `N` (`N` < `K`) of the `K` transforms uniformly at random without replacement, each applied with magnitude `M`. Chain those `N` transforms into a composition, and apply that composition to the incoming image. That’s RandAugment. Here’s a code sample:
    import numpy as np
    import albumentations as A

    def rand_augment(N, M):
        # K = 3 candidate transforms here; M scales each one's strength
        transforms = [A.HorizontalFlip(p=1),
                      A.Rotate(M*9, p=1),
                      A.RandomBrightness(M/20, p=1)]
        # Pick N distinct transforms uniformly at random
        composition = np.random.choice(transforms, N, replace=False)
        return A.Compose(list(composition))
MuAugment builds on that. We apply RandAugment `C` separate times to each image, giving `C` augmented versions. Using the model that is being trained, we select the `S` (`S` < `C`) most useful versions out of the `C`, and feed only those `S` versions of each image into the model for training. How do we determine which augmentations are most useful? We forward pass each of the `C` augmentations through the model, and the higher the loss, the more useful the augmentation. That’s MuAugment.
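To make the selection step concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not the paper's implementation or any library's API: `mu_augment_batch`, `augment_fn` and `loss_fn` are hypothetical names chosen for illustration. I'm assuming `augment_fn` returns one randomly augmented copy of an image (for instance, a RandAugment composition) and `loss_fn` returns the current model's loss for a single augmented image and its target, computed without updating the model.

```python
def mu_augment_batch(images, targets, augment_fn, loss_fn, C=4, S=2):
    """Keep the S highest-loss augmentations out of C for every image.

    augment_fn(image) -> one augmented copy of the image (hypothetical callback).
    loss_fn(image, target) -> scalar loss under the model being trained
    (hypothetical callback; assumed to be a forward pass only).
    """
    if not 0 < S <= C:
        raise ValueError("S must satisfy 0 < S <= C")

    selected_images, selected_targets = [], []
    for image, target in zip(images, targets):
        # C independently augmented versions of the same image
        candidates = [augment_fn(image) for _ in range(C)]
        # Score each version with the current model
        losses = [loss_fn(candidate, target) for candidate in candidates]
        # Indices of the S largest losses, hardest first
        hardest = sorted(range(C), key=lambda i: losses[i], reverse=True)[:S]
        for i in hardest:
            selected_images.append(candidates[i])
            selected_targets.append(target)
    return selected_images, selected_targets
```

The returned lists hold `S` entries per input image, so a batch of `B` images becomes a training batch of `B * S` augmented images. In a real training loop you would likely batch the `C` forward passes instead of looping image by image; the loop here is just the easiest form to read.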
Why does a high loss mean a useful augmentation? Well, a small loss means the model has already learned how to predict that type of image well, so if trained on it further, the model will only pick up incidental, possibly spurious patterns — overfitting. Conversely, a large loss means the model has not learned the general mapping between the type of image and its target yet, so we need to train more on those kinds of images.
So, MuAugment is a way of picking the hardest augmentations and training on those. RandAugment does not work as well because it produces both easy and hard augmentations and feeds all of them into the model. It is therefore more prone to overfitting on the easy augmentations and underfitting on the hard ones. The model learns more generalizable patterns when an algorithm like MuAugment ensures extra fitting on the hard augmentations while skipping the easy ones.
You might have thought of a problem with MuAugment. Sometimes the transforms applied to an image are so severe that the image becomes positively inscrutable, altogether losing its target information. So we end up feeding the model pure noise. Worse, pure noise yields a high loss when fed into the model, so MuAugment actively selects for those unrecognizable images whenever they are created. There’s no simple solution for this issue other than choosing hyperparameters that reduce the generation of inscrutable images. It’s a good idea to keep the number of transforms in a composition `N` under 4 and the magnitude of each transform `M` under 6.
If you have time, try a grid search. Input a range of values for `M` to find the optimal magnitude. To reduce the search space, just fix `N` at a value in the range [2, 4]. As a heuristic, larger models and datasets require more regularization and would accordingly perform better with a greater magnitude `M`. This is because bigger models are more prone to overfitting, and larger datasets have a higher signal-to-noise ratio, which should be reduced to an optimal point. So, keep this in mind when choosing the values of `M` for a grid search.
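The grid search over `M` can be sketched in a few lines. Again, this is only an illustration under assumptions: `grid_search_magnitude` and `train_and_evaluate` are hypothetical names, and I'm assuming `train_and_evaluate(N, M)` trains a model with the given augmentation settings and returns a validation score where higher is better.

```python
def grid_search_magnitude(train_and_evaluate, magnitudes, N=2):
    """Try every magnitude M with a fixed N and return the best one.

    train_and_evaluate(N=..., M=...) -> validation score, higher is better
    (hypothetical callback wrapping your own training and validation code).
    """
    scores = {}
    for M in magnitudes:
        # One full train/validate run per candidate magnitude
        scores[M] = train_and_evaluate(N=N, M=M)
    # Magnitude with the highest validation score
    best_M = max(scores, key=scores.get)
    return best_M, scores
```

Something like `grid_search_magnitude(train_and_evaluate, range(1, 6), N=2)` would cover the magnitudes under 6 suggested above. Each candidate costs a full training run, so for big models a coarser grid is probably the pragmatic choice.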
Summary
We surveyed the flora and fauna of data augmentation policy search algorithms. Some stacked another AI model on top of our task. Others used fewer parameters and a random sample of set transforms. The former performs more accurately than the latter, but is much slower. Enter MuAugment: a mix of the AI-model and randomness-based approaches. It randomly samples compositions from a list of transforms and uses only the most useful (i.e. highest loss) ones for training data. For best results, throw varying values of the transforms’ magnitude into a grid search.
If you wish to use MuAugment or RandAugment in your projects, consider using MuarAugment. It is a package that provides a simple API and implementations optimized for speed. I’ll be populating the MuarAugment GitHub with tutorials and posting a Medium article explaining its use soon.
Citations:
- Cubuk et al. 2019, “RandAugment: Practical automated data augmentation with a reduced search space”.
- Zhang et al. 2020, “Automating the Art of Data Augmentation”.
- Wu et al. 2020, “On the Generalization Effects of Linear Transformations in Data Augmentation”.
- Zhang et al. 2019, “Adversarial AutoAugment”.
- Hataya et al. 2019, “Faster AutoAugment: Learning Augmentation Strategies using Backpropagation”.