Recently, I started learning about Artificial Intelligence, as it is creating a lot of buzz in the industry. Among the diverse fields of AI applications, the vision-based domain has attracted me the most. I have been experimenting with deep learning, primarily with Convolutional Neural Networks (CNNs). The main lesson from all the experiments I have done to date is that the data used during training plays the most important role. In fact, it would not be wrong to state that AI has emerged again (after several AI winters) only because of the availability of huge computing power (GPUs) and the vast amount of data on the Internet. Luckily for me, much of that data is available in the form of images and videos.
In spite of all this data, fetching the right type of data that matches the exact use-case of our experiment is a daunting task. Moreover, the data needs good diversity: the object of interest has to be present in varying sizes, lighting conditions and poses if we want our network to generalize well during the testing (or deployment) phase. To overcome this problem of limited quantity and limited diversity of data, we generate (manufacture) our own data from the existing data we have. This methodology of generating our own data is known as data augmentation.
I have been experimenting with various deep learning frameworks, and all these frameworks provide built-in packages for data augmentation. To name a few, Keras has ImageDataGenerator (needs the least amount of work from us), TensorFlow has TFLearn’s DataAugmentation and MXNet has Augmenter classes.
In this article, let us explore a few of the most commonly used image augmentation techniques, with code examples and visualisations of the images after augmentation. From here onwards, data will be referred to as images. We will be using TensorFlow or OpenCV with Python in all our examples. Here is the index of techniques we will be using in this article:
- Scaling
- Translation
- Rotation (at 90 degrees)
- Rotation (at finer angles)
- Flipping
- Adding Salt and Pepper noise
- Lighting condition
- Perspective transform
But before any technique, Image Resizing:
Images gathered from the Internet will be of varying sizes. Because most neural networks contain fully connected layers, the images fed to the network must be of a fixed size (unless you are using Spatial Pyramid Pooling before passing them to the dense layers). For this reason, before any image augmentation happens, let us preprocess the images to the size our network needs. With fixed-size images, we also get the benefit of processing them in batches.
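As a rough illustration of this preprocessing step, here is a minimal NumPy-only sketch (not the code from the repository) that resizes images of different sizes to one fixed shape with nearest-neighbour sampling and stacks them into a batch. The function name `resize_nearest` and the 224 x 224 target size are my assumptions for the example; in practice you would likely use your framework's own resize routine with better interpolation.

```python
import numpy as np


def resize_nearest(image, out_h, out_w):
    """Resize an H x W (x C) image to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = image.shape[:2]
    # Map every output row/column back to a source row/column.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols]


# Stand-in images of different sizes (random pixels, assumed for the example).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)
          for h, w in [(300, 400), (128, 96), (640, 480)]]

# Once every image has the same shape, they can be stacked into one batch.
batch = np.stack([resize_nearest(img, 224, 224) for img in images])
print(batch.shape)
```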
Scaling:
Having the object of interest at different scales in the images is the most important aspect of image diversity. When your network is in the hands of real users, the object in the image can be tiny or large. Sometimes the object can also cover the entire image and still not be fully present in it (i.e. cropped at the edges of the object). The code shows central scaling of the image.
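One way central scaling could look is sketched below with NumPy only (this is not the repository's code): keep the central fraction of the image and resize that crop back to the original size, so the object appears larger and may be cropped at its edges. The name `central_scale` and the scale values 0.9, 0.75 and 0.6 are assumptions chosen for the example.

```python
import numpy as np


def central_scale(image, scale):
    """Keep the central `scale` fraction of the image and resize it back to full size."""
    h, w = image.shape[:2]
    crop_h = max(1, int(round(h * scale)))
    crop_w = max(1, int(round(w * scale)))
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    crop = image[top:top + crop_h, left:left + crop_w]
    # Nearest-neighbour resize of the crop back to h x w.
    rows = np.arange(h) * crop_h // h
    cols = np.arange(w) * crop_w // w
    return crop[rows[:, None], cols]


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image

# Smaller fractions zoom in more strongly on the centre.
scaled_images = [central_scale(image, s) for s in (0.9, 0.75, 0.6)]
```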
Translation:
We would like our network to recognize the object in any part of the image. The object can also be partially present at a corner or an edge of the image. For this reason, we shift the object to various parts of the image. This may also add background noise. The code snippet shows the image translated towards each of the four sides, retaining 80 percent of the base image.
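The four translations can be sketched with plain NumPy slicing as below. This is only an illustration under my own assumptions (the name `translate`, a black fill for the uncovered strip, and the direction labels), not the article's original snippet. Each output keeps 80 percent of the base image and fills the rest with the background value.

```python
import numpy as np


def translate(image, direction, keep=0.8, fill=0):
    """Shift the image towards one side, keeping `keep` of it and filling the rest."""
    h, w = image.shape[:2]
    keep_h, keep_w = int(round(h * keep)), int(round(w * keep))
    out = np.full_like(image, fill)
    if direction == "left":
        out[:, :keep_w] = image[:, w - keep_w:]
    elif direction == "right":
        out[:, w - keep_w:] = image[:, :keep_w]
    elif direction == "up":
        out[:keep_h] = image[h - keep_h:]
    elif direction == "down":
        out[h - keep_h:] = image[:keep_h]
    else:
        raise ValueError("direction must be left, right, up or down")
    return out


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
translated = {d: translate(image, d) for d in ("left", "right", "up", "down")}
```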
Rotation (at 90 degrees):
The network has to recognize the object in any orientation. Assuming the image is square, rotating the image by 90 degrees will not add any background noise to the image.
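For right-angle rotations, NumPy's `np.rot90` is enough for a quick sketch (the random stand-in image is an assumption). Since the image is square, every rotated copy has the same shape as the original and no new background pixels appear.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in square image

# k = 0, 1, 2, 3 gives rotations by 0, 90, 180 and 270 degrees (counter-clockwise).
rotations = [np.rot90(image, k) for k in range(4)]
```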
Rotation (at finer angles):
Depending upon the requirement, there may be a need to orient the object at finer angles. The problem with this approach, however, is that it adds background noise. If the background of the image is of a fixed color (say white or black), the newly added background can blend with the image. However, if the newly added background color doesn’t blend, the network may treat it as a feature and learn unnecessary features.
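To make the background issue concrete, here is a small NumPy-only sketch of rotation about the image centre using inverse mapping with nearest-neighbour sampling. The function name `rotate`, the `fill` parameter and the angles are my assumptions for illustration; a library routine would normally do this with proper interpolation. Pixels that have no source in the original image receive the `fill` value, which is exactly the newly added background described above.

```python
import numpy as np


def rotate(image, angle_deg, fill=0):
    """Rotate counter-clockwise about the centre; uncovered pixels get `fill`."""
    h, w = image.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    # Inverse mapping: for each output pixel, find the source pixel it comes from.
    src_x = np.rint(np.cos(theta) * dx - np.sin(theta) * dy + cx).astype(int)
    src_y = np.rint(np.sin(theta) * dx + np.cos(theta) * dy + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(image, fill)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out


image = np.full((224, 224, 3), 100, dtype=np.uint8)  # stand-in uniform grey image

# With a white fill on a grey image, the added corners are clearly visible.
rotated = [rotate(image, angle, fill=255) for angle in (15, 30, 45)]
```

Choosing `fill` to match the dataset's background color is what lets the added corners blend in.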
Flipping:
This scenario is important for removing the network's bias of assuming that certain features of the object are available on only one particular side. Consider the case shown in the image example. You don’t want the network to learn that the tilt of the banana happens only on the right side, as observed in the base image. Also notice that flipping produces a different set of images from rotation at multiples of 90 degrees.
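Flips are simple array reversals, so a NumPy sketch is short. The tiny `arange` image below is an assumed stand-in with all pixel values distinct, which makes it easy to check that a flip is not the same as any right-angle rotation.

```python
import numpy as np

# Tiny stand-in image with distinct values (assumed for the example).
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)

flipped_lr = image[:, ::-1]            # left-right (horizontal) flip
flipped_ud = image[::-1]               # up-down (vertical) flip
transposed = image.transpose(1, 0, 2)  # flip along the main diagonal

# A mirror image cannot be reproduced by rotating the original.
rotations = [np.rot90(image, k) for k in range(4)]
flip_is_a_rotation = any(np.array_equal(flipped_lr, r) for r in rotations)
```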
Adding Salt and Pepper noise:
Salt and Pepper noise refers to the addition of white and black dots to the image. Though this may seem unnecessary, it is important to remember that a general user who takes an image to feed into your network may not be a professional photographer. Their camera can produce blurry images with lots of white and black dots. This augmentation helps the network cope with images from such users.
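A minimal NumPy sketch of this noise is given below. The function name `add_salt_and_pepper`, the 2 percent noise amount and the equal split between salt and pepper are assumptions for illustration, not values from the article.

```python
import numpy as np


def add_salt_and_pepper(image, amount=0.02, salt_ratio=0.5, rng=None):
    """Set a random `amount` of pixels to white (salt) or black (pepper)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    # One random draw per pixel position, shared by all channels.
    u = rng.random(image.shape[:2])
    salt = u < amount * salt_ratio
    pepper = (u >= amount * salt_ratio) & (u < amount)
    noisy[salt] = 255
    noisy[pepper] = 0
    return noisy


image = np.full((200, 200, 3), 128, dtype=np.uint8)  # stand-in uniform grey image
noisy_image = add_salt_and_pepper(image, rng=np.random.default_rng(0))
```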
Lighting condition:
This is a very important type of diversity in the image dataset, needed not only for the network to properly learn the object of interest but also to simulate the practical scenario of images being taken by the user. The lighting condition of the images is varied by adding Gaussian noise to the image.
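Taking the description literally, a NumPy sketch of adding Gaussian noise could look like the following. The name `add_gaussian_noise`, the standard deviation of 25 and the choice to share one noise value across the colour channels of a pixel are my assumptions, and the repository's code may blend the noise differently.

```python
import numpy as np


def add_gaussian_noise(image, sigma=25.0, rng=None):
    """Add zero-mean Gaussian noise, using the same value for every channel of a pixel."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=image.shape[:2] + (1,))
    noisy = image.astype(np.float32) + noise
    # Round and clip back into the valid 8-bit range.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


image = np.full((200, 200, 3), 128, dtype=np.uint8)  # stand-in uniform grey image
noisy_image = add_gaussian_noise(image, rng=np.random.default_rng(0))
```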
Perspective transform:
In a perspective transform, we try to project the image from a different point of view. For this, the position of the object should be known in advance. Merely computing a perspective transform without knowing the position of the object can degrade the dataset. Hence, this type of augmentation has to be performed selectively. The greatest advantage of this augmentation is that it can emphasize the parts of the object in the image which the network needs to learn.
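The sketch below shows the idea with NumPy only: solve for the 3 x 3 projective matrix from four point correspondences, then warp the image by inverse mapping. The helper names `perspective_matrix` and `warp_perspective` and all the coordinates are hypothetical; the source points stand for the assumed known corners of the object. OpenCV offers `cv2.getPerspectiveTransform` and `cv2.warpPerspective` for the same purpose.

```python
import numpy as np


def perspective_matrix(src, dst):
    """Solve for the 3x3 matrix H that maps four (x, y) src points onto dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    params = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(params, 1.0).reshape(3, 3)


def warp_perspective(image, H, fill=0):
    """Warp by inverse mapping with nearest-neighbour sampling.

    Assumes a mild distortion, so the projective denominator stays positive.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ pts
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full_like(image, fill).reshape(h * w, -1)
    out[valid] = image.reshape(h * w, -1)[sy[valid] * w + sx[valid]]
    return out.reshape(image.shape)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image

# Hypothetical object corners (src) and where they should land (dst).
src_pts = [(50, 50), (170, 50), (170, 170), (50, 170)]
dst_pts = [(60, 70), (160, 40), (180, 180), (40, 150)]
H = perspective_matrix(src_pts, dst_pts)
warped = warp_perspective(image, H)
```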
Though the above list of image augmentation methods is not exhaustive, it comprises many widely used methods. Depending on the use-case of the problem you are trying to solve and the type of dataset you already have, you may use only those types of augmentation which add value to your dataset. You can also combine these augmentations to produce an even larger number of images.
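As a small example of combining augmentations, the NumPy sketch below pairs the four right-angle rotations with a horizontal flip, which turns one square image into eight variants. The tiny `arange` image is an assumed stand-in whose pixel values are all distinct.

```python
import numpy as np

# Tiny stand-in square image with distinct values (assumed for the example).
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)

# Every rotation of the original and of its mirror image.
variants = [np.rot90(base, k)
            for base in (image, image[:, ::-1])
            for k in range(4)]

# Count how many of the variants are actually different from each other.
distinct = len({v.tobytes() for v in variants})
print(len(variants), distinct)
```

Noise, lighting or translation can then be applied on top of each variant to grow the dataset further.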
I would like to conclude by saying that, starting from a dataset of limited quantity and limited diversity, we have produced an adequate number of images with enough variation for our network to learn meaningful features from the image dataset. You can check the code used in this article directly in the GitHub repository.
Do let me know through the comments if you use some other simple and widely used type of image augmentation. Also feel free to share any suggestions or point out any mistakes you find in my approach.