How to artificially augment the amount of data to improve the performance of a computer vision algorithm?

By Théo Dupuis, Data Scientist at LittleBigCode

Over the last few years, artificial intelligence has gained immense momentum in diverse areas, and neural networks have become the tool of choice for many applications: classification, image generation, natural language processing or object detection, to name a few fields where they have shown unprecedented performance. Nevertheless, these algorithms often require a large amount of data to be effective, which can be an issue for real-world problems. Fortunately, there is a way to artificially augment the amount of data to improve the performance of a computer vision algorithm.

Without a sufficient amount of data, these algorithms are indeed prone to overfitting and will not generalise well to unseen data. This is a major issue for their development in domains where data are hard and expensive to collect. For example, in the medical realm, most neural networks currently have to be trained on fewer than a thousand samples, which is considered extremely small in the deep learning field.

How to tackle the issue of overfitting

Different techniques have been developed within the neural network community to address this. Among them, batch normalisation, dropout and regularisation have all shown interesting improvements in performance and are now used in almost every neural network.

Another interesting idea to handle overfitting due to a lack of data is to artificially increase the amount of training data, a technique called data augmentation. The most classic ways to perform data augmentation in computer vision are geometric transformations of the image, such as random flipping, cropping, translation and rotation. There are also techniques that modify the brightness of the image or add some noise to it.

Finally, it is also possible to use a subset of training images to create new ones using different interpolation techniques. Most of these data augmentation methods are easy to implement using existing libraries and enable us not only to increase the size of the dataset but also to create images that are quite different from the original ones while preserving their interesting features. Bringing more diversity to the dataset in this way can ultimately help prevent overfitting.
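One such interpolation technique, mentioned here purely as an illustration (it is not one of the methods evaluated later in this article), is mixup, which blends two images and their labels. A minimal sketch, assuming images are float arrays and labels are one-hot encoded:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a new sample by linearly interpolating two images and their
    one-hot labels (mixup-style augmentation). `alpha` controls how close
    the mix stays to one of the two originals."""
    lam = np.random.beta(alpha, alpha)
    x_new = lam * x1 + (1.0 - lam) * x2
    y_new = lam * y1 + (1.0 - lam) * y2
    return x_new, y_new
```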

The neural network’s approach

More complex approaches have also emerged, such as using neural networks themselves to perform data augmentation. The idea is that instead of limiting the augmentation to predefined geometric modifications, we can use a neural network to create new, unseen data. For instance, some authors propose using a GAN, a generative adversarial network, to create new instances that look a lot like the original ones.

This network does not directly modify images but instead learns how to create new ones entirely from scratch. These images are supposed to look so real that they can fool another neural network, or a human, into believing that the fake image is a real one. The term DAGAN is sometimes used in the literature to specify that the GAN is used for data augmentation.

In this article, I’ll focus on understanding the impact of geometric transformations on the accuracy of a model. To this end, I’ll present a vanilla classifier that will be trained on the original dataset and use its accuracy as a reference. I’ll then train the same classifier on datasets augmented with different geometric transformations and compare their accuracy to this reference.

The global background of the test

Our aim

In this report, we will focus on showing the particular strengths and pitfalls of different data augmentation techniques on an image classification task. We will look at a simple problem where we have a dataset of medical images to classify into two categories, and we will measure the increase in classification accuracy we can get by applying the previously described data augmentation techniques.

In addition, we will try to assess the maximum amount of data we can create from the original data to maximise the performance of our vanilla classifier. The report is split into several parts corresponding to the different augmentation methods.

About the Data

We will conduct our experiments with the following publicly available Kaggle dataset: https://www.kaggle.com/andrewmvd/pediatric-pneumonia-chest-xray.

It contains 5,856 chest X-rays labelled as either pneumonia or normal. This dataset had previously been divided into a training set consisting of about 90% of the original dataset (around 5,000 images) and a test set with the remaining images.

Examples of Chest X-rays used as inputs

The original dataset wasn’t balanced, so we decided to drop some of the images to rebalance the training and test sets, leading to a training set of about 2,500 images and a test set of 500 images.

From these 2,500 images, we created subsets of 250, 500, 1,000 and 2,500 images to evaluate the impact of the amount of data on the accuracy of the classifier.

It is also important to note that, to better understand the outcomes of the experiment, each subset is included in the next larger one. That is to say, the 250 images of the smallest subset are all present in the 500-image subset, which in turn is included in the 1,000-image subset, and so on.
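As a minimal illustration of how such nested subsets can be built (the helper and the seed below are hypothetical, not taken from the experiments):

```python
import random

# Hypothetical helper returning the paths of the 2,500 balanced training images.
train_paths = sorted(collect_training_paths())
random.seed(42)               # fixed seed so every run uses the same subsets
random.shuffle(train_paths)

# Each subset is a prefix of the same shuffled list, so the 250-image subset
# is contained in the 500-image one, which is contained in the 1,000-image
# one, and so on.
subsets = {size: train_paths[:size] for size in (250, 500, 1000, 2500)}
```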

Dividing the training set into these subsets serves several purposes; it will enable us to:

  • Quantify the impact of the training set’s size on accuracy;
  • Analyse the impact of the training set’s size on generalization;
  • Find a relation between the amount of augmented data that best improves the performance of the classifier and the amount of original data in the corresponding training set;
  • Compare the performance of a classifier trained on one of the original subsets with that of a classifier trained on a smaller subset augmented so that the training sets have the same size in both experiments.

This last point will allow us to see, for instance, whether the classifier performs better with 500 images made up of 250 original images and 250 augmented ones, or directly with 500 original images.

It is also important to state that data augmentation is always performed on the training set only. We want our model to learn from more data, but we absolutely do not want to modify the test set. Hence, data augmentation always takes place after separating the two sets and is applied only to the training set.

The classifier

The vanilla classifier used in this report is a pre-existing convolutional network pre-trained on the ImageNet dataset: VGG16, which is widely used in image classification [3]. The head layer has been removed and replaced with a dense layer with two outputs in order to fit our binary classification problem. We will use this classifier for all the tests without changing its structure except when required. For example, we might add one or more data augmentation layers at the beginning, but strictly speaking they do not modify the structure of the classifier.

VGG16 architecture
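As a minimal sketch of this setup in Keras (the input resolution and the optimiser below are assumptions for illustration, not values taken from the experiments):

```python
import tensorflow as tf

# Pre-trained VGG16 backbone without its classification head.
base_model = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # "frozen" variant: only the new head is trained

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),  # two outputs for the binary task
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```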

Using a pre-trained model can seem strange, as it means that our model wasn’t trained on our particular data. In fact, pre-trained models are widely used because they are able to extract a lot of features common to every image (corners and edges, for example, but also deeper features that humans cannot always identify). The top dense layer that we train then learns how to use these general features to handle this specific problem.

If we want a model that is more specific to the problem at hand, we can unfreeze a few of the last layers of the original model so they can also be trained. This results in a neural network that is more specific to the considered situation but requires more training to be efficient. We will refer to this method in the different sections as the “unfrozen algorithm”, in opposition to the “frozen algorithm”.
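A sketch of this “unfrozen” variant, assuming we release only the last convolutional block of VGG16 (the number of unfrozen layers and the learning rate are assumptions):

```python
# "Unfrozen" variant: let the last convolutional block of VGG16 be trained as
# well, so the extracted features can adapt to chest X-rays.
base_model.trainable = True
for layer in base_model.layers:
    if not layer.name.startswith("block5"):  # keep everything before block5 frozen
        layer.trainable = False

# A lower learning rate is usually advisable when fine-tuning pre-trained layers.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```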

We know that unfreezing more layers will lead to more overfitting, especially with such small datasets, which will be interesting to observe in our analysis. Without going into more detail about the classifier, which is not the main interest of this report, let’s just remember that the two approaches (with and without unfreezing) will be used in order to build a robust model and to see whether we can overcome the overfitting issue of the “unfrozen” model.

Focus on the parameters

For all the experiments, we used the same hyperparameters in order to compare similar experiments. This means that the hyperparameters are not optimised for each dataset to find the best accuracy. However, we used a learning rate scheduler to make sure that we always converge to a meaningful value of accuracy. It is also important to note that the maximum number of epochs was set to 100 but is never reached with early stopping in place.

Early stopping enables us to get the best out of each experiment without artificially restraining its learning capacity. Indeed, setting a fixed number of epochs would be meaningless, as some experiments would then overfit and report an accuracy that does not correspond to the best they can achieve. Early stopping therefore lets us compare the best outcome of each experiment under common parameters and thus gives us a solid common base to compare the algorithms.
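In Keras, this combination of a learning rate scheduler and early stopping can be wired in with callbacks; a sketch with hypothetical patience values and hypothetical train_ds / test_ds datasets:

```python
callbacks = [
    # Stop once validation accuracy has stopped improving and restore the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                     restore_best_weights=True),
    # Learning-rate scheduler: shrink the learning rate when the validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]

history = model.fit(train_ds, validation_data=test_ds,
                    epochs=100, callbacks=callbacks)
```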

The results of the test

This section presents the results on the raw data (i.e. without any augmentation) for each dataset size, on both the training and test sets, and will serve as the reference for the rest of the report.

The following table gives the accuracy on the training and test sets for the frozen and unfrozen algorithms for each dataset size. To avoid biasing the results with small variations in accuracy due to the randomness of the initialisation and the possibility of stopping in a local minimum, all values are the mean of ten identical and independent experiments.

As we can observe, the accuracy on the test set increases with the number of training examples available. We gained 5% of accuracy by doubling the 250-image dataset, which means gaining 5% of accuracy for only 250 newly labelled images. However, increasing the size of the dataset from 500 to 1,000 images and then from 1,000 to 2,500 leads to a performance improvement of only 1% each time. This shows the importance of having a minimum viable amount of data to train the algorithm on.

Turning to the “unfrozen” algorithm, it performs worse than the “frozen” version. Giving the weights more freedom leads to strong overfitting in this case, which explains the performance on the test set. Nevertheless, this algorithm might find its use later on, as it might provide better accuracy if data augmentation enables us to overcome overfitting.

In the next section, we will see one way to perform data augmentation and its impact on the previously found accuracies.

Classic Data Augmentation Techniques

The most common data augmentation techniques use geometric transformations along with some modifications to the brightness of the image. These techniques are relatively easy to implement and can quickly allow us to double the size of the training set. However, they require a close look at the data to make sure that their distribution is not modified by these transformations [1, 4]. For example, a wheel is invariant under any rotation, but if you rotate a cat by 180° the algorithm might take it for something else. Also, if you translate or crop an image, its key features might disappear. Therefore, these techniques do not fit all problems and could even negatively impact your accuracy depending on the problem.

It is also interesting to note that these techniques are independent of the data labels: a wrongly labelled data point will simply be duplicated, it won’t affect the augmentation process itself, and the proportion of outliers will remain the same after these transformations.

Rotation

Example of rotation

Each image is randomly rotated by an angle between two values; for these experiments I chose -70° and 70°, giving rise to the images on the right.
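With Keras preprocessing layers (one possible implementation, reusing the tensorflow import from the classifier sketch above), this rotation could be written as follows; the factor is expressed as a fraction of a full turn:

```python
# Random rotation between -70° and +70°: 70° / 360° ≈ 0.194 of a full turn.
rotate = tf.keras.layers.RandomRotation(factor=70 / 360)
```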

Flipping

Example of flipping

This transformation applies a mirror-like flip to the image in one direction. The default and most commonly used direction (also the one chosen for these experiments) is the horizontal one. Here the image on the left has been flipped, but the one on the right has not, because of the randomness of the operation.
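The same layer family offers a random horizontal flip; a sketch:

```python
# Random horizontal flip: each image is mirrored with 50% probability,
# which is why some images come out unchanged.
flip = tf.keras.layers.RandomFlip(mode="horizontal")
```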

Contrast

Example of contrast modification

This transformation increases or decreases the contrast of the image by a random factor, for this experiment between -30% and +30%.
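A possible Keras equivalent of this contrast jitter:

```python
# Random contrast adjustment of up to ±30%.
contrast = tf.keras.layers.RandomContrast(factor=0.3)
```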

Translation

Example of translation

Each image is translated by a random number of pixels, and the missing pixels are then set to 0. For these experiments, I chose a translation of up to 10% of the size of the image. We must be careful with this transformation because it can easily make key features vanish, especially if the image is not centred on the region of interest.
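A possible Keras equivalent, filling the revealed pixels with 0 as described above:

```python
# Random shift of up to 10% of the image size in each direction;
# pixels revealed by the shift are filled with 0 (black).
translate = tf.keras.layers.RandomTranslation(
    height_factor=0.1, width_factor=0.1,
    fill_mode="constant", fill_value=0.0)
```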

What results do we get?

We will now compare the accuracy obtained with each of these transformations on each of our datasets. Once again, the values are the mean accuracy over 10 independent experiments. The goal is to analyse the efficiency of these techniques and to draw conclusions on how the size of the dataset impacts the gain from data augmentation. The following tables give the mean accuracy of the frozen and unfrozen algorithms for each transformation on both the training and test sets. The accuracy of the reference algorithm (without any data augmentation) is also given for comparison.

For now, we limit the number of images generated through data augmentation so that we double the size of the original training set. The aim here is to find which techniques work well; in a second step, we will vary the amount of augmented data in order to find how many new samples it is best to generate.
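One way to achieve this doubling offline is to generate exactly one augmented copy of each training image and append it to the originals. A sketch using the translation layer defined above, where train_ds is a hypothetical tf.data dataset of image/label pairs:

```python
# Wrap a single transformation (translation here) so it is applied in training mode.
augmenter = tf.keras.Sequential([translate])

# Create exactly one augmented copy of every training image and append it to
# the originals, which doubles the size of the training set.
augmented_ds = train_ds.map(lambda x, y: (augmenter(x, training=True), y))
doubled_ds = train_ds.concatenate(augmented_ds).shuffle(buffer_size=1000)
```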

Overall, the techniques increase the accuracy of both algorithms (frozen and unfrozen), but not by a substantial amount. The only truly effective transformation is translation, which increases the accuracy of both algorithms by 3 to 5 points depending on the dataset. Rotation and contrast, on the other hand, reduce performance on some of our datasets, showing that the perturbation shifts the data away from the original distribution.

Nevertheless, these experiments show that, depending on the dataset and the problem, classic data augmentation techniques can genuinely increase the performance of the algorithm. Considering that they are rather easy to implement, it seems worth it in terms of cost versus performance to try them and see if they help the problem at hand.

Dataset of size 250
Dataset of size 500
Dataset of size 1000
Dataset of size 2500

Overall, when the augmentation techniques triggered an improvement in the performance of the algorithm, creating a dataset with a ratio of 3/4 of original images seems to be the best option. It enables us to gain 4 points of accuracy with the most effective of the techniques tested here. That may seem a small gain with no real impact in certain fields, but in the medical realm, for example, pushing the metric as high as possible is paramount. Imagine that we want to detect potentially cancerous lesions in the bones: missing some of them puts the life of the patient at risk. Another application, an industrial one this time, could be detecting defective parts; once again, you don’t want to miss any flaw that could compromise a bigger project. Therefore, data augmentation is essential for applications where accuracy must reach an extremely high level of confidence.

Consult all the articles of LittleBigCode by clicking here: https://medium.com/hub-by-littlebigcode

Follow us on LinkedIn & YouTube: https://LittleBigCode.fr/en
