11. Introduction to Deep Learning with Computer Vision — Image Augmentations


Written by Praveen Kumar & Nilesh Singh.

It’s been a little while since our last publication. In this article, we will see why data augmentation is so important for model deployment and how it changes your strategy during the training phase. We will also cover several data augmentation techniques, which we like to categorize as Rich man’s data augmentation (RMDA), Middle man’s data augmentation (MMDA) & Poor man’s data augmentation (PMDA) 😜 . Let’s get started.

WARNING ⚠️-> This article is going to be deeply extensive. Make sure you grab a cup of coffee ☕️.

GitHub links are attached for each augmentation technique, along with an excellent resource for implementing general augmentations.

What is Data Augmentation?

Data augmentation is a strategy that enables us to significantly increase the diversity of data available for training models, without actually collecting new data. Let’s take a sample image.

Fig 1: Data Augmentation [source]

In the above image, our training set had just a single image of a lion, but we produced 6 more images using simple operations such as zooming, cropping, and rotation.

Why Data Augmentations?

During the training phase, models are exposed only to a sample of the population, whereas the real world contains far more diverse samples. Hence, if a model is trained only on the collected images, it will fail badly during deployment.

For example, if you are training a model to recognize faces, it is probable that you collect face images taken under good lighting conditions with the face properly aligned to the camera. In real life, such scenarios do occur, but it is also possible that people do not pose straight at the camera, or that the lighting conditions change. Your model will then fail because it has never seen these new scenarios during its training phase, and a model can only produce results based on the patterns it has learned during training.

NOTE: The names “poor man”, “rich man” & “middle man” are used to express the intuition behind the strategies based on computation & time complexity. We do not wish to categorize or discriminate against any race, ethnicity, or work done by any person in any field.

Types of Data Augmentations

I. Poor Man Data Augmentations

Why is it called poor man data augmentation? Because data augmentation is a compute-intensive strategy, and a poor man cannot afford high compute power. Hence, he has to settle for strategies that give slightly better results without having to spend much compute power & time.

Techniques such as:

  • Scaling
Fig 2: Scaling [Source]
  • Translation
Fig 3: Translate [Source]
  • Rotation
Fig 4: Rotation [Source]
  • Blurring
Fig 5: Blur [Source]
  • Image Mirroring
Fig 6: Mirroring [Source]
  • Color Shifting / Whitening
Fig 7: Color shift [Source]

fall under poor man’s data augmentation. These techniques do not require intensive compute power and are the best fit in scenarios where the test environment is unchanged and only minor changes to the image object are seen. Image classification tasks such as MNIST handwritten digits & character recognition, or number plate recognition in daylight, seem to be the right fit for such augmentations.

An excellent resource to learn and implement the above data augmentation techniques: [LINK]
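For a quick start, here is a minimal sketch of how most of the above techniques can be wired together with Keras’ ImageDataGenerator. The parameter values are illustrative, not tuned, and blurring is not built in (it would need a custom preprocessing_function).

# A minimal sketch: scaling (zoom), translation, rotation, mirroring and
# color shifting with Keras' ImageDataGenerator. Values are illustrative.
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,          # random rotation in degrees
    width_shift_range=0.1,      # horizontal translation (fraction of width)
    height_shift_range=0.1,     # vertical translation (fraction of height)
    zoom_range=0.2,             # random zoom in/out (scaling)
    horizontal_flip=True,       # image mirroring
    channel_shift_range=30.0,   # simple color shifting
)

# x_train: (num_samples, height, width, channels), y_train: labels
x_train = np.random.rand(8, 64, 64, 3)
y_train = np.random.randint(0, 2, size=(8,))

# flow() yields endlessly augmented batches for training
for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=4):
    print(x_batch.shape)  # (4, 64, 64, 3)
    break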

II. Middle Man Data Augmentations

These are the strategies most commonly used day to day nowadays. Anyone with a decent enough GPU can perform these augmentation techniques. Let’s discuss each in chronological order.

  • Elastic Distortion(2003)

Elastic distortion is a technique in which the input image is deformed with random displacements. Let’s look at an example to get a clear understanding.

Fig 8: Elastic Distortion

It is one of the simplest image augmentation techniques used. PyPI has a package named “elasticdeform” which allows us to easily manipulate the input image. [Check out: PyPI and GitHub if you want to implement it]

Simple distortions such as translations, rotations, and skewing can be generated by applying affine displacement fields to images. This is done by computing, for every pixel, a new target location with respect to the original location. For instance, if ∆x(x,y)=1 and ∆y(x,y)=0, the new location of every pixel is shifted by 1 to the right. If the displacement field were ∆x(x,y)=αx and ∆y(x,y)=αy, the image would be scaled by α from the origin (x,y)=(0,0). Since α can be a non-integer value, interpolation is necessary. [Source]

Fig 9: Elastic Distortion [Source]
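If you would rather not pull in the elasticdeform package, here is a minimal sketch of the same idea built directly on SciPy, following the displacement-field description quoted above. The alpha and sigma values are illustrative.

# A minimal sketch of elastic distortion in the spirit of the 2003 paper,
# built on SciPy rather than the elasticdeform package mentioned above.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(image, alpha=34.0, sigma=4.0, seed=None):
    """Apply a random, smoothed displacement field to a 2D grayscale image.
    alpha scales the displacement, sigma controls its smoothness."""
    rng = np.random.default_rng(seed)
    h, w = image.shape

    # Random displacement fields, smoothed with a Gaussian and scaled by alpha
    dx = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha

    # New sampling coordinates for every pixel
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([y + dy, x + dx])

    # Interpolate because the displaced coordinates are non-integer
    return map_coordinates(image, coords, order=1, mode="reflect")

distorted = elastic_distort(np.random.rand(28, 28), alpha=34, sigma=4)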
  • Dropout(2014)

Dropout has a very interesting concept behind it. We will dive into it intuitively, but before that, we would like you to understand kernels & channels to fully appreciate how dropout can be helpful in a fully connected layer rather than just in CNN layers as a whole.

Let’s take a look at an example. The following image shows the normal behavior of a CNN layer (left).

Fig 10: Dropouts

Now suppose we were to black out a few neurons (set them to 0) in the input layer; what would the output be? Of course it changes, but that is not all that changes. When we randomly drop out neurons (black out / set to 0), the model is forced to find different neurons that could help it better classify the input. Each neuron is capable of holding a set of features based on its kernel, and if we drop a few of those, the model has to classify based on the remaining neurons.

This way, we are indirectly telling the model: if we give you two images of a person, one perfectly fine and a second one with the hands blacked out (pixels set to 0 wherever a hand was in the image), are you capable of learning to classify the person even without those hand neurons? This is a more practical scenario, right?

Dropout can help against over-fitting, as stated in the original paper. How? Because we might randomly drop a few redundant features that the model has already learned very well. However, this is a topic of debate. Also remember that dropout is applied to random neurons at each training step, so the set of dropped neurons keeps changing with each training instance. It all seems to just click and work; however, we also assume that we do not randomly drop all the important features. If we do, accuracy will not necessarily improve. Dropout can be tricky, and it is difficult to judge it based on individual neurons alone. This was recognized by LeCun and colleagues in 2015, who proposed a different strategy: spatial dropout. Spatial dropout drops a whole channel instead of individual neurons, so the output of a kernel is not used at all. Again, the dropped channels change with each training instance. However, spatial dropout is used less frequently than normal dropout. Keras provides two separate layers.

# for the normal dropout layer, use
from keras.layers import Dropout
# for the spatial dropout layer, use
from keras.layers import SpatialDropout2D
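To see where the two layers typically sit, here is a minimal sketch of a hypothetical Keras model: SpatialDropout2D after a convolution (dropping whole feature maps) and plain Dropout before the final dense layers (dropping individual neurons). The rates and layer sizes are illustrative.

# A minimal sketch contrasting the two layers in a small, hypothetical model.
from keras.models import Sequential
from keras.layers import Conv2D, SpatialDropout2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    SpatialDropout2D(0.2),          # drops entire feature maps (channels)
    Conv2D(64, (3, 3), activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),                   # drops individual neurons
    Dense(10, activation="softmax"),
])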
  • Batch Normalization(2015)

To understand batch normalization, refer to our article on batch normalization.

  • Cutout(2017)
Fig 11: Cutout [Source]

Cutout is a very simple augmentation technique: we simply cut out a portion of the image. That’s it. How is it helpful? It follows the same reasoning as we just discussed. If we remove random features from the image, the model looks for other features to be able to classify that image. Hence, cutting out portions of the image forces the model to learn other features as well, rather than simply relying on the dominant features in an image.

If you want to learn the code, you can make use of the following links.

  1. Original research paper implementation
  2. Simpler, easy and direct implementation

NOTE: You must understand what your model is learning in order to integrate cutout into your project. If your model is not learning well, cutout might hurt it further. There are interpretability methods to find out what the model is looking at in an image; one of them is Grad-CAM.
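If you want to try it without the linked repositories, here is a minimal NumPy sketch of cutout; the 16-pixel patch size is an illustrative choice, not a value from the paper.

# A minimal NumPy sketch of cutout: zero out one random square patch per image.
import numpy as np

def cutout(image, size=16, seed=None):
    """image: (H, W, C) array. Returns a copy with a size x size patch zeroed."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    out = image.copy()

    # Pick the patch centre, then clip the patch to the image borders
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)

    out[y1:y2, x1:x2, :] = 0.0
    return out

augmented = cutout(np.random.rand(32, 32, 3), size=16)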

  • Mixup(2018)

Mixup is a very interesting augmentation. Let’s try to understand it with the help of an example.

Fig 12: Mixup [Source]

Mixup alpha-blends two images to produce a new image. This forces the model to predict 2 classes from a single image. Randomly mixing up images forces the model to learn the features of each class. In the above example, 20% of the dog’s features are visible, so the model is forced to detect those features and should be able to say that a dog is present with 20% confidence. [Mixup source]

Mixup is very similar to the label smoothing technique. If you don’t know label smoothing, here is a simple example. Normally, with 2 classes, the targets are hard 0/1 labels: 1 for the true class and 0 for the other, and the model is pushed to be fully confident. Label smoothing softens these targets, for example to 0.95 for the true class and 0.05 for the other, so the model is discouraged from becoming over-confident. Mixup produces soft targets in a similar way, except that the weights come from the blending ratio of the two images.

Mixup implementation in Keras
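As a rough sketch of the idea (independent of the Keras implementation linked above), mixup on a batch can be written in a few lines of NumPy; the alpha value is illustrative.

# A minimal NumPy sketch of mixup on a batch: blend images and one-hot labels
# with a weight drawn from a Beta(alpha, alpha) distribution.
import numpy as np

def mixup_batch(x, y, alpha=0.2, seed=None):
    """x: (N, H, W, C) images, y: (N, num_classes) one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)          # blending weight
    idx = rng.permutation(len(x))         # random partner for every sample

    x_mixed = lam * x + (1.0 - lam) * x[idx]
    y_mixed = lam * y + (1.0 - lam) * y[idx]
    return x_mixed, y_mixed

x = np.random.rand(8, 32, 32, 3)
y = np.eye(10)[np.random.randint(0, 10, size=8)]
x_mix, y_mix = mixup_batch(x, y, alpha=0.2)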

  • Smart Augmentation(2017)

Smart augmentation is a very specific image augmentation technique, introduced in 2017. It can only be applied to specific classes, but it provides great results. It works by learning to merge two or more samples of one class. The merged sample is then used to train a target network, and the loss of the target network is used to inform the augmenter at the same time. This has the effect of generating more data for the target network. The process often lets the network come up with unusual or unexpected but highly performant augmentation strategies.

Fig 13: Smart Augmentation[Source]

The only drawback is that it can only be applied to images where the object size is constant and the location of the object is fixed.

  • Sample pairing(2018)

In this technique, we synthesize a new sample from one image by overlaying another image randomly chosen from the training data (i.e., taking the average of the two images for each pixel). By using two images randomly selected from the training set, we can generate N² new samples from N training samples. [Source]

Fig 14: Sample pairing

This technique can be tricky when you implement it. It will most often give you a lower training accuracy; however, it will help you reduce your test error. Why? During the training phase, the model cannot properly classify both overlaid images, so accuracy is lower. This may be discouraging, but your model is also learning many features that could be part of an object, which helps it classify better at test time. This is what the research paper also reported. Let’s see some stats.

Fig 15: Error drop after sample pairing [Source]

As we can see, the error decreased with sample pairing during validation but increased during training. This is expected behavior, and we do not need to worry about our model’s performance.
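As a quick sketch of how sample pairing differs from mixup: the blend weight is fixed at 0.5 and the label of the first image is kept unchanged.

# A minimal NumPy sketch of sample pairing: average two images pixel-wise,
# but (unlike mixup) keep the label of the first image unchanged.
import numpy as np

def sample_pairing(x, y, seed=None):
    """x: (N, H, W, C) images, y: (N,) labels. Pairs each image with a random one."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    x_paired = 0.5 * (x + x[idx])   # per-pixel average of the two images
    return x_paired, y              # labels are left untouched

x_new, y_new = sample_pairing(np.random.rand(8, 32, 32, 3), np.arange(8))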

  • RICAP(2018)

RICAP crops four training images and patches them together to construct a new training image. It selects the images and determines the cropping sizes randomly, with the size of the final image identical to that of the originals. RICAP also mixes the class labels of the four images with ratios proportional to their areas, similar to label smoothing and mixup. Compared to mixup, RICAP has three clear distinctions:

  1. it mixes images spatially
  2. it uses partial images by cropping
  3. and it does not create features that are absent in the original dataset except for boundary patching.
Fig 16: RICAP[Source]

RICAP shares concepts with cutout, mixup, and label smoothing, and potentially overcomes their shortcomings.

Fig 17: RICAP results[Source]

As we see, with RICAP, the results are much better.

Fig 18: RICAP heatmap[Source]

Fig 18 shows heatmaps of where the model is looking when classifying the corresponding input image. The baseline model’s heatmaps are not focused on the right location for the output class, but after RICAP, the heatmaps are much more accurate and the model looks at the right location.

RICAP implementation on Keras
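As a rough NumPy sketch of the idea (simplified so that every image in the batch shares the same crop position per patch, unlike the per-image crops in the paper), RICAP can be written as follows; the beta value is illustrative.

# A minimal NumPy sketch of RICAP: patch random crops of four images into the
# four quadrants defined by a random boundary point, and mix labels by area.
import numpy as np

def ricap_batch(x, y, beta=0.3, seed=None):
    """x: (N, H, W, C) images, y: (N, num_classes) one-hot labels."""
    rng = np.random.default_rng(seed)
    n, h, w, c = x.shape

    # Random boundary point: widths and heights of the four patches
    wb = int(np.round(w * rng.beta(beta, beta)))
    hb = int(np.round(h * rng.beta(beta, beta)))
    widths = [wb, w - wb, wb, w - wb]
    heights = [hb, hb, h - hb, h - hb]
    offsets = [(0, 0), (0, wb), (hb, 0), (hb, wb)]

    out = np.zeros_like(x)
    y_mixed = np.zeros_like(y, dtype=float)

    for k in range(4):
        idx = rng.permutation(n)                     # donor images for patch k
        ph, pw = heights[k], widths[k]
        y0 = rng.integers(0, h - ph + 1)             # random crop position
        x0 = rng.integers(0, w - pw + 1)
        oy, ox = offsets[k]
        out[:, oy:oy + ph, ox:ox + pw, :] = x[idx, y0:y0 + ph, x0:x0 + pw, :]
        y_mixed += (ph * pw) / (h * w) * y[idx]      # label weight = patch area
    return out, y_mixed

x = np.random.rand(8, 32, 32, 3)
y = np.eye(10)[np.random.randint(0, 10, size=8)]
x_new, y_new = ricap_batch(x, y)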

  • CutMix(2019)

In this technique, patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. Instead of simply removing pixels, we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. CutMix now enjoys the property that there is no uninformative pixel during training, making training efficient while retaining the advantages of regional dropout to attend to non-discriminative parts of objects. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. The training and inference budgets remain the same. CutMix shares similarity with Mixup which mixes two samples by interpolating both the image and labels. While certainly improving classification performance, Mixup samples tend to be unnatural. CutMix overcomes the problem by replacing the image region with a patch from another training image.[Source]

Fig 19: CutMix stats [Source]

The table shows significant improvements in the results on ImageNet classification, localization, and VOC detection.

Keras implementation of CutMix
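Here is a minimal NumPy sketch of CutMix on a batch, independent of the Keras implementation linked above; alpha is illustrative, and the label weights are recomputed from the actual (clipped) patch area.

# A minimal NumPy sketch of CutMix: paste a random rectangular patch from a
# shuffled copy of the batch and mix labels by the pasted area.
import numpy as np

def cutmix_batch(x, y, alpha=1.0, seed=None):
    """x: (N, H, W, C) images, y: (N, num_classes) one-hot labels."""
    rng = np.random.default_rng(seed)
    n, h, w, c = x.shape
    idx = rng.permutation(n)
    lam = rng.beta(alpha, alpha)

    # Patch side lengths follow sqrt(1 - lam) so the area is (1 - lam) of the image
    ph, pw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(0, cy - ph // 2), min(h, cy + ph // 2)
    x1, x2 = max(0, cx - pw // 2), min(w, cx + pw // 2)

    out = x.copy()
    out[:, y1:y2, x1:x2, :] = x[idx, y1:y2, x1:x2, :]

    # Recompute lambda from the actual (clipped) patch area
    lam_adj = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)
    y_mixed = lam_adj * y + (1.0 - lam_adj) * y[idx]
    return out, y_mixed

x = np.random.rand(8, 32, 32, 3)
y = np.eye(10)[np.random.randint(0, 10, size=8)]
x_new, y_new = cutmix_batch(x, y)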

We know you feel this way, but just hold on for a few more minutes. Knowledge is worth the risk 😜 😉

III. Rich Man Data Augmentations

Very rich kids/companies can afford these kinds of techniques. They cost a lot and require thousands of GPU hours to find the optimal augmentation policy. These are strategies rather than augmentation techniques: we run them on the dataset and tell them which image augmentation techniques we wish to use. The strategy then tests hundreds of values for those augmentations and finds the optimal values to use. In other words, these strategies analyze the dataset multiple times and output those values. Even though we explore a search space to find the best values, this comes with advantages and disadvantages.

Advantages:

  1. Results can be directly applied to the dataset.
  2. Learned policies can be transferred to other datasets.
  3. Achieves state-of-the-art results on ImageNet, CIFAR-10 & CIFAR-100.

Disadvantages:

  1. Takes thousands of GPU hours.

Let’s discuss a few rich man’s data augmentations strategies.

  • AutoAugment(2018)

AutoAugment is a reinforcement-learning-based search technique that takes a set of values as its search space and explores this space over multiple augmentation techniques (also called operations). In the paper, they explain how each search space (also called a policy) can have multiple sub-policies. Each sub-policy consists of 2 parts: first, which operation to apply (rotation, zoom, scaling, etc.), and second, which values of that operation to explore. The search algorithm finds the best possible values for each operation. The resulting policies, when applied to current state-of-the-art models, improve them by a very significant margin and set new state-of-the-art results. The following tables depict this.

Fig 20: AutoAugment results [Source]

The search algorithm is responsible for finding the best policy. It has two components: a controller, which is a recurrent neural network, and the training algorithm. At each step, the controller predicts a decision produced by a softmax; the prediction is then fed into the next step as an embedding. Let’s see an image to gain more understanding.

Fig 21: Policy and sub policy in AutoAugment [Source]

In Fig 21, we have an original image (policy) and the corresponding 5 sub-policies. Each sub-policy has 3 parameters: operation type, probability, and magnitude. In total, the controller makes 30 softmax predictions (5 sub-policies × 2 operations × 3 parameters) to predict 5 sub-policies, each with 2 operations, and each operation requiring an operation type, probability, and magnitude.

Fact: The operations considered are ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout, and Sample Pairing. Each operation also comes with a default range of 10 magnitudes. There are roughly 2.9×10³² possibilities 😅. Now we know where those thousands of GPU hours go.
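To make the policy/sub-policy structure concrete, here is a minimal sketch of applying one AutoAugment-style sub-policy with PIL. The operations, the magnitude-to-value mappings, and the sub-policy itself are illustrative stand-ins, not learned policies from the paper.

# A minimal sketch of applying one (operation, probability, magnitude) sub-policy.
import random
from PIL import Image, ImageOps, ImageEnhance

def rotate(img, magnitude):          # magnitude in [0, 9] mapped to [0, 30] degrees
    return img.rotate(magnitude / 9 * 30)

def solarize(img, magnitude):        # magnitude mapped to a solarize threshold
    return ImageOps.solarize(img, int(256 - magnitude / 9 * 256))

def sharpness(img, magnitude):       # magnitude mapped to an enhancement factor
    return ImageEnhance.Sharpness(img).enhance(0.1 + magnitude / 9 * 1.8)

# One sub-policy = two (operation, probability, magnitude) triples
sub_policy = [(rotate, 0.7, 2), (solarize, 0.3, 8)]

def apply_sub_policy(img, sub_policy):
    for op, prob, magnitude in sub_policy:
        if random.random() < prob:   # each operation fires with its probability
            img = op(img, magnitude)
    return img

img = Image.new("RGB", (32, 32), color=(128, 64, 32))
augmented = apply_sub_policy(img, sub_policy)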

Moving on…

  • Optimal DataAug Policies via Bayesian Optimization(2019) [Bayesian Methods — Type 1]

Bayesian optimization type 1 follows a similar approach to AutoAugment. However, a few important differences need to be listed:

  1. AutoAugment uses a Recurrent Neural Network & Reinforcement Learning to explore the search space, whereas the Bayesian method uses a global Bayesian optimization algorithm.
  2. AutoAugment uses 5 sub-policies whereas the Bayesian method uses 3 sub-policies.
  3. Each sub-policy in the Bayesian method uses a group of 2 consecutive operations [Example: (rotation, scaling) or (zoom, translation)… These are random combinations, but the important point is that they are always 2 consecutive operations] instead of a single operation as in AutoAugment.
  4. AutoAugment requires up to 15,000 GPU hours on ImageNet, whereas the Bayesian method requires 800 GPU hours.

Bayesian Optimization implementation

  • Optimal DataAug Policies via Bayesian Optimization(2019) [Bayesian Methods — Type 2]

Type 2 is a very different case from type 1. Do you know about GANs? If you do, good, then you also know they are not easy to train. If you don’t know about GANs, do not worry. For now, just understand that they are a kind of black box that can learn to produce new images, or generate similar images based on an input image.

So… coming back to the Type 2 method. Observe the following image.

Fig 22: Bayesian Optimization-based Augmentation — Type 2

In Fig 22, we have a training dataset (the observed training set) and a synthesized training set produced by the generator network (GAN). We concatenate these 2 sets to form an augmented training set, which is then fed to the classifier network. Now we have more training images, and we can use them to train our model on more varied samples, so the variance of the dataset distribution increases. Let’s see example images produced by the GAN.

Fig 23: Images produced by GAN

In Fig 23, the first row contains the original images and the remaining rows are produced by the GAN. Fig 23 a) is the simple MNIST dataset, b) is the CIFAR-10 dataset with the classes airplane, automobile, bird, and ship, and c) is the CIFAR-100 dataset with the classes apple, fish, rose, and crab. We can see that the GAN produces similar but slightly varied images based on the originals. This enlarges the dataset and increases its variance.

Bayesian Optimization implementation

  • Population-based Augmentation(2019)

People with extra resources have got to do something crazy, and this augmentation is a prime example. Population-based augmentation (PBA) follows a unique idea for training our models. Let’s understand why and how.

The general norm when training a model is to select a set of augmentation techniques, let the model train for hundreds of epochs, and collect some metrics. So once we choose a set of augmentation strategies, we randomly shuffle and apply them to our dataset throughout the training phase. But PBA is from another planet. What PBA proposes is that we make a bag of strategies and, based on the epoch number, choose different strategies to apply. So after some number of epochs, we change the augmentation strategy.

Fig 24: PBA implementation [Source]

From Fig 24, we can see that no augmentation strategies were picked for epochs 0–11, and epochs 12–33 use several strategies but with low magnitude. This consumes a lot of resources (though fewer than AutoAugment) but can be very effective if it is within your power to implement it in your project. Look at some state-of-the-art results produced by PBA.

Fig 25: PBA results [Source]

PBA example of the “Car” class in the CIFAR-10 dataset.

Fig 26: PBA example [Source]
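The core idea of an epoch-dependent augmentation schedule can be sketched in a few lines; the epoch ranges and (operation, probability, magnitude) values below are made up for illustration and are not PBA’s learned schedule.

# A minimal sketch of an epoch-dependent augmentation schedule (not the PBA
# search itself): the active operations change as training progresses.
schedule = {
    (0, 11):   [],                                        # no augmentation early on
    (12, 33):  [("rotate", 0.2, 2), ("cutout", 0.3, 4)],  # mild augmentation
    (34, 100): [("rotate", 0.6, 6), ("cutout", 0.7, 8),
                ("translate_x", 0.5, 5)],                 # more aggressive later
}

def policy_for_epoch(epoch):
    """Return the (operation, probability, magnitude) triples active at this epoch."""
    for (start, end), ops in schedule.items():
        if start <= epoch <= end:
            return ops
    return []

for epoch in (5, 20, 60):
    print(epoch, policy_for_epoch(epoch))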
  • Patch-based Gaussian augmentation(2019)

Patch-based Gaussian augmentation (PGA) rightfully points out that previous techniques such as cutout increase clean accuracy but not robustness. By that we mean: adding cutout allows the model to learn more features, helping it classify the object more cleanly, but a cutout-trained model is not robust to variations in the image. PGA overcomes the tradeoff between clean accuracy and robustness: by adding a patch, we increase clean accuracy, and by adding Gaussian noise, we improve robustness. This produces state-of-the-art performance on CIFAR-10 & ImageNet. The following image shows robustness and clean accuracy stats for a ResNet architecture on ImageNet.

Fig 27: PGA results [Source]

Let’s see an example of PGA.

Fig 28: PGA example [Source]

Keras Implementation of Gaussian Augmentation
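Here is a minimal NumPy sketch of the Patch Gaussian idea, independent of the Keras implementation linked above: Gaussian noise is added only inside one random square patch, leaving the rest of the image clean. The patch size and noise scale are illustrative.

# A minimal NumPy sketch of Patch Gaussian: noise only inside a random patch.
import numpy as np

def patch_gaussian(image, patch_size=16, sigma=0.3, seed=None):
    """image: (H, W, C) float array in [0, 1]."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]

    # Random patch centre, clipped to the image borders
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x1, x2 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)

    out = image.copy()
    noise = rng.normal(0.0, sigma, size=out[y1:y2, x1:x2, :].shape)
    out[y1:y2, x1:x2, :] = np.clip(out[y1:y2, x1:x2, :] + noise, 0.0, 1.0)
    return out

augmented = patch_gaussian(np.random.rand(32, 32, 3))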

Now you are free to fly off.

This was by far the most extensive article among all the previous ones. It was tiring, but we hope it helps you understand the right strategies and techniques you need to improve your project. We hope you enjoyed it. See you soon!

NOTE: We are starting a new Telegram group to tackle questions and queries of any sort. You can openly discuss concepts with other participants and get more insights, which will be more helpful as we move further along the publication. [Follow this LINK to join]
