Data Augmentation with small size datasets


1. Introduction

1.1 Aim and Objectives:

Modern technical systems have advanced to the point that practically every field now uses AI. In this project, we outline an AI technique and apply it to a practical problem. The main goal of the present work is to expand the size and diversity of the dataset, which directly improves the model's performance when the augmented images are fed into it. This chapter provides a detailed explanation of the methods used in the project.

To assign an input image to one of the categories in our data, the model may need to see many distinct variations of each image during training. We do not know what a test image will look like: it could have very low brightness, or it could have been taken in the rain, which would add noise to the image, and so on. It is therefore advisable to show the model as many versions of each picture as possible during training, using the method known as data augmentation. This ensures that most image variations are covered during training, enabling the model to classify test images more accurately. One application is the eagerly anticipated driverless vehicle: since our model is trained on 43 different types of traffic signs, it could be used in an autonomous vehicle to classify such images more precisely. This is not the only method used; in practice, autonomous vehicles are also trained with approaches such as reinforcement learning [4] for accurate classification and sensible decision-making.

As described above, the input data consists of images: roughly 38k photos of traffic signs divided into 43 categories [3]. Before performing augmentation on these images, we must carry out specific pre-processing steps so that the model can train on them more easily.

As mentioned in the previous section, we are working with traffic sign images of 43 different categories in this project. Some of them are shown in the image below to give a better sense of the data used to train our model.

Figure 1.1 Data set images

1.2 Project Scope and Business Value:

Project Scope:

The goal of the project is to create a tool for data augmentation that can produce synthetic data from small datasets. By generating new samples that resemble the original ones, data augmentation techniques can be used to increase the size and diversity of a dataset [5]. The suggested tool will create new data samples using a variety of techniques, including rotation, flipping, scaling, cropping, and color distortion, and it will be made to work with a variety of data formats, including text, images, and numerical data.

Business Value:

Data augmentation can be quite helpful in situations where there is not enough data to train a machine learning model. In such situations, data augmentation techniques can help grow the dataset's size and diversity, which can improve the model's performance. By building a data augmentation tool that works with small datasets, the suggested project can produce significant business value in the following ways:

1) Improvement in performance: The proposed method can contribute to improving the accuracy of machine learning models by expanding the amount and diversity of the dataset, which can lead to better decision-making and increased productivity[6].

2) Lowering the cost of data collection: Obtaining large amounts of data is frequently both expensive and time-consuming. By applying data augmentation approaches, the suggested tool can lower the volume of data needed to train a machine learning model, which saves money.

3) Increasing the speed of model training: Training machine learning models on sizable datasets can take a lot of time. By applying data augmentation techniques to enlarge the dataset, the suggested solution can help streamline the training process, which can lead to a quicker time-to-market for goods and services.

4) Better decision-making: Machine learning models are increasingly being employed in industries including finance, healthcare, and transportation. By enhancing the accuracy of these models through data augmentation approaches, the suggested tool can help ensure that judgements are based on correct and reliable data [5][7].

1.3 Motivations:

The creation of a data augmentation tool for small size datasets has been motivated by a number of factors, including:

1) Limited data availability: Data may be scarce in several industries, including healthcare and banking, due to privacy issues or a lack of funding. By producing more data samples that may be used to train machine learning models, data augmentation can help to solve this problem.

2) Overfitting: Machine learning models that are trained on small datasets are vulnerable to overfitting [8], which happens when the model memorizes the training data instead of learning broader patterns. By broadening the dataset's diversity and exposing the model to more examples, data augmentation can help prevent overfitting.

3) Saving money and time: Gathering a lot of data can be costly and time-consuming. Without the requirement for new data collection, it is feasible to generate more data samples by employing data augmentation techniques.

4) Better model performance: Machine learning models that have been trained on more extensive and varied datasets typically outperform those that have been trained on smaller datasets. Machine learning models can perform better if the quantity and diversity of the dataset are increased utilizing data augmentation approaches.

5) Generalization: Machine learning models may not generalize successfully to fresh data samples if they were trained on tiny datasets. By exposing the model to a wider range of examples during training, data augmentation can aid in improving the generalization of the model.

2. Project requirements:

2.1 High Level Business Requirements:

“Data augmentation for small size datasets” may have the following high level business requirements:

1) To increase the precision and effectiveness of machine learning models, the data augmentation tool should be able to produce synthetic data from tiny datasets.

2) The tool should support a variety of data types, including text, images, and numeric data.

3) To create fresh data samples, the tool should employ a variety of data augmentation techniques, such as rotation, flipping, scaling, cropping, and color distortion.

4) The tool needs to be efficient and scalable to manage massive data volumes.

5) The generated data samples ought to be varied and accurate representations of the distribution of the underlying data.

6) The tool should be simple to use and require little technical expertise.

7) To speed up the data augmentation process, the tool should be connected with already-existing machine learning frameworks and technologies.

8) The tool needs to be safe and secure, protecting the privacy and confidentiality of the data used for augmentation.

9) The tool should be affordable and offer good value for the performance enhancements attained.

10) To ensure compatibility with new datasets and machine learning frameworks, the tool needs to be updated and maintained frequently.

2.2 Detailed requirements:

2.2.1 Essential requirements:

Data format support: Support for several sorts of data formats, including images, text, and numerical data, should be provided by the tool.

Augmentation techniques: Various data augmentation methods, including rotation, flipping, scaling, cropping, and color distortion, should be used by the tool to create new data samples.

Data quality: The samples of generated data should be varied and indicative of the distribution of underlying data.

Scalability: The tool needs to be effective and scalable in order to manage massive amounts of data.

Integration: To speed up the data augmentation process, the tool should be integrated with current machine learning frameworks and technologies.

Security: The tool should be secure and protect the privacy and confidentiality of the data used for augmentation.

2.2.2 Recommended Requirements:

Customization: By choosing particular approaches and parameters, users should be able to tailor the data augmentation process using the tool.

Quality control: The tool should have quality control procedures in place to guarantee that the data samples it generates are accurate representations of the distribution of the underlying data.

Visualization: The tool should include visualization options so that users can examine and evaluate the generated data samples.

Transfer learning: To allow the reuse of data augmentation models on new datasets, the tool should support transfer learning approaches.

Performance indicators: The tool should offer performance indicators to assess how the created data samples affect the precision and effectiveness of machine learning models.

2.2.3 Optional requirements:

Cloud support: To enable scalability and accessibility, the tool might be cloud-based and hosted on a cloud platform.

Parallel processing: The tool might support parallel processing to speed up the data augmentation process and make it more efficient.

Integration with AutoML platforms: To automate the creation of machine learning models, the tool may be integrated with AutoML platforms.

3. Technical Specification:

3.1 Tools and Techniques: Two main components are required to implement this project, "Data Augmentation using small size datasets": an image data generator and a deep learning (CNN) model.

3.1.1 Image data generator: This module helps us create new images that are similar to the batch of images we provide as a starting point. Compared to the original photographs, the newly generated images have the same content, with only minor modifications to the parameters and contexts in which they appear [10][12]. The brightness range, shear range, zoom range, flipping, and other parameters of the image data generator produce these modifications, and the range of values we supply for these parameters determines how the new images are generated. The newly created images are well suited to model training, especially when our dataset is small. The zoom range, brightness range, shear range, rotation, and flipping are a few of the many image data generator options used in this project.
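As a minimal sketch of how such a generator can be set up with the Keras API (the parameter values below are illustrative, not the exact ones tuned for this project):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative parameter values only; the project selects these experimentally.
datagen = ImageDataGenerator(
    rotation_range=10,            # rotate by a random angle of up to 10 degrees
    width_shift_range=0.1,        # shift left/right by up to 10% of the width
    height_shift_range=0.1,       # shift up/down by up to 10% of the height
    zoom_range=0.2,               # zoom in/out by up to 20%
    brightness_range=[0.8, 1.2],  # darken or brighten slightly
)

# datagen.flow(X_train, y_train, batch_size=32) then yields augmented batches
# on the fly during training.
```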

Rotation Range: This parameter helps the model see input images from various angles. Training pictures are typically taken with the camera held horizontally or vertically, but what happens if the test image sent to the model for prediction is rotated in some way? In that case, our model might not be able to predict that image's label confidently, which results in greater loss; this parameter addresses that situation. It takes a value in the range 0 to 360, and the generator rotates the input image by a random angle within that range.

Shift Range: There are two kinds of shift range parameters: the height shift range and the width shift range. If the former is used, the image is shifted up or down according to the parameter value supplied [9]; if the latter is used, the image is shifted left or right according to the supplied value. The value of this parameter must be selected very carefully: a higher value can cause parts of the image to be lost, with the vacated regions filled with zero-valued pixels, when such an image is given to the model for training. The parameter value chosen for this project is 0.1, settled on after experimenting with a number of different values and tracking the outcomes.

Zoom Range: Zoom range is a characteristic that works well for practically all kinds of data. It allows us to generate more synthetic images from the input images by zooming the input image in and out based on the value provided as the parameter. The zoom range parameter accepts a single number as well as a list of two values; if a single value is given, it is automatically converted into a list of two values, [1 − value, 1 + value], before the operation is carried out [9][10]. High parameter values can produce the worst images for training: too much zooming in might cut out critical information from the image, and too much zooming out might produce images in which noise is more prominent than the information needed to categorize the image. The image linked below explains in detail how the zoom range parameter functions.

3.1.1.1 Zoom range

Brightness Range: This option alters the brightness of the supplied reference image to produce more images, which helps the model perform better during training. It also accepts a list of two values for the brightness range parameter; when only one value is provided, it automatically constructs a list of two values in the same way as the zoom range, i.e. [1 − value, 1 + value] [9][10]. If the value is too high, the image becomes much brighter because of the upper end of the range, and our model might not be able to identify the crucial information required for classification. If the value is too low, the artificially generated images show almost no variation from the supplied reference image. Therefore, if we want to use this parameter in our model, its value needs to be chosen carefully.

3.1.1.2 Brightness range

Flipping: Flipping is practically a special case of rotation; flipping the image is comparable to rotating it by 180 degrees. For our data, the generated images would be nonsensical if the image were flipped, whether vertically or horizontally [10][11], and this would make it difficult for the model to determine which image goes with which label. See the image linked below for why flipping would produce results that conflict with our model.

3.1.1.3 Flipping

3.1.2 Deep learning (CNN):

History of deep learning: In the current modern technological world, researchers mostly favour deep learning over classical machine learning to tackle practically all challenges. This only began in 2012, though. Machine learning was previously chosen over deep learning because deep learning models required more data and more computing power to train. Since deep learning came into the spotlight in 2012, that year is referred to as the "breakout year of deep learning" [13][14]. In the 2012 ImageNet competition, participants were given the ImageNet dataset as input and asked to create a model that would do the best job of classifying those images into the appropriate categories. The best model up until that point, a classical machine learning model, had a classification accuracy of about 75% [13]. To everyone's amazement, however, a deep learning model created by students at the University of Toronto beat the accuracy of the existing models. The creators called the model "AlexNet" [14], and they presented at the NeurIPS conference how their deep learning model operates internally, including how it extracts the necessary information from images. Although deep learning has been around since the 1950s, it was not until 2012 that it began to receive more attention. Another reason is that in today's environment a lot of data is being generated through cloud technologies, social media, and other sources, and we also have the high computational capability necessary to train such models. These two factors account for the majority of deep learning applications in the modern world.

Deep learning — General working: This technology primarily draws inspiration from how the human brain functions. Our brain generally bases every action it takes on the information it receives from its neurons. The brain contains an enormous number of neurons spread across many layers that transmit information. The neurons in each layer are interconnected, and when they receive information they pass it on to the neurons in the next layer once it meets a certain threshold; this process continues until the information reaches its destination.

A deep learning model functions in much the same manner. Here too we have neurons arranged in layers; a network may have any number of layers, each with any number of neurons. These numbers change depending on the model we are using, which in turn depends on how much data we have as input and how well we want the model to perform [17]. As mentioned earlier, we must determine whether to pass information on to the next layer; deep learning's activation functions help us with this, and they are covered under the heading "Activation Function". A single such neuron is known as a perceptron, and a network with multiple layers of them is known as an MLP (multilayer perceptron) [16]. Convolutional neural networks (CNNs), artificial neural networks (ANNs), and recurrent neural networks (RNNs) are the commonly employed variants [15][17], each with its own purpose. Since the input dataset for our present project consists of images, we use a CNN, which is described in full below.

In general, any neural network has three kinds of layers: input, hidden, and output. We pass the input data to the model's input layer for training, and the hidden layers are where all the learning happens; how this works is explained under the heading "Deep learning — Internal working". Finally, the output layer is where the model provides the predicted value.

CNN layers:

Convolutional layer: The convolutional layer is one of the most significant CNN layers. It has a number of parameters; the kernel size, the number of filters (neurons) in the layer, and the stride width are a few crucial ones. The kernel may be of any size K, but it must be a square matrix (K × K), and the layer may contain any number M of filters. Suppose the input image we give the model during training has shape N × N. After passing through the convolutional layer, it becomes a matrix of shape (N − K + 1) × (N − K + 1) [18]. However, in some circumstances, or according to business requirements, we must perform an additional step known as padding if we want to preserve the image's N × N shape: adding a predetermined number P of rows and columns of zeros before the convolutional layer. The kernel size of that particular convolutional layer determines the value of P:

P = (K − 1) / 2, where K is the size of the kernel matrix [19].

Although K can be any value, as already mentioned, it is advisable to choose an odd K if we want to keep the size of the image constant throughout training: an even K in the formula above would result in a fractional value of P, so even kernel sizes are best avoided.

After convolving the kernel with the first K × K sub-matrix of the image, the stride width parameter determines how many pixels the window shifts across the image matrix; by default this is 1. It can be any integer other than zero, because a stride of zero would mean no window shifting at all, which would prevent the model from learning the important image features.

Output width = (Input width − K + 2 × Padding) / Stride width + 1

Here, the output width is the width of the image following convolution [19], the input width is the width of the image entering that convolutional layer, the padding equals zero if we are not trying to keep the image size constant throughout training, and the stride width is the number of pixels the window moves at each step.
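As a quick check of the formula, here is a minimal sketch using Keras layers on an assumed 32 × 32 input (not the project's exact architecture):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))                       # input width N = 32
valid = tf.keras.layers.Conv2D(16, kernel_size=3, strides=1,
                               padding="valid")(inputs)
print(valid.shape)   # (None, 30, 30, 16): (32 - 3 + 2*0)/1 + 1 = 30

same = tf.keras.layers.Conv2D(16, kernel_size=3, strides=1,
                              padding="same")(inputs)
print(same.shape)    # (None, 32, 32, 16): padding P = (3 - 1)/2 = 1 keeps N = 32
```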

The image that is provided below makes it simple to understand how the convolution process works.

3.1.2.1 Convolution Layer [21]

Pooling layer: The pooling layer takes the output of the convolutional layer and keeps the most informative values according to the pooling type chosen, reducing the size of the image's feature maps in the process while preserving the important feature values. There are many kinds of pooling strategies; some of the more popular ones are maximum pooling, minimum pooling, and average pooling. A description of each type of pooling and how it operates is given below.

Max pooling: This pooling mechanism returns the maximum value among all the values it considers. The values considered, and therefore the output size, depend mainly on the kernel/filter size and the stride defined [19].

Min pooling: This pooling approach would return the lowest value among all possible values. The filter size and stride length would affect the values that were taken into account[19].

Average pooling: The average value of all the values it would take into consideration would be returned using the average pooling approach. The filter size and stride length would affect the values that were taken into account[19].

To better understand what was described in the previous paragraphs, please refer to the image that is attached below.

3.1.2.2 Pooling Layer [22]
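The sketch below applies max and average pooling to a tiny 4 × 4 matrix with toy values, assuming the Keras pooling layers (min pooling has no built-in Keras layer):

```python
import numpy as np
import tensorflow as tf

# A single-channel 4x4 "image" with toy values.
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 9, 4],
              [3, 1, 2, 6]], dtype="float32").reshape(1, 4, 4, 1)

max_out = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)
avg_out = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(x)

print(max_out.numpy().reshape(2, 2))  # [[6. 5.]
                                      #  [8. 9.]]   -> maximum of each 2x2 window
print(avg_out.numpy().reshape(2, 2))  # [[3.5  2.5 ]
                                      #  [4.75 5.25]] -> average of each 2x2 window
```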

Activation Function: As explained in the previous section, a neuron only passes information on to the next layer if it is activated. Here is a detailed explanation of what that trigger means and how it works. To fire, a value must fulfil the criterion of being greater than the threshold value. The image matrix contains pixel values, and the connections between the neurons of adjacent layers carry weights; the pixel values are multiplied by these weights. The sum of those products is then fed to the activation function, which determines whether the value exceeds the threshold; if so, the neuron fires and sends the data to the following layers for further learning. Numerous activation functions, including the step function, tanh, sigmoid, ReLU, and Leaky ReLU, each compare the calculated value against their own threshold to make this decision [19]. Softmax is another special activation function: it is typically used only in the output layer for multi-class classification, that is, when there are three or more output categories or classes. An illustration of activation functions is given below for better understanding.

3.1.2.3 Activation Layer [23]
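A minimal sketch of a few common activation functions applied to the same weighted sums, using the Keras activation helpers (the values are illustrative):

```python
import tensorflow as tf

z = tf.constant([[-2.0, 0.0, 3.0]])   # toy weighted sums for three neurons/classes

print(tf.keras.activations.relu(z).numpy())     # [[0. 0. 3.]] negatives are cut off
print(tf.keras.activations.sigmoid(z).numpy())  # each value squashed into (0, 1)
print(tf.keras.activations.softmax(z).numpy())  # the three values now sum to 1,
                                                # as needed for multi-class output
```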

Dropout layer: Overfitting is likely to occur in large neural networks. There are many methods we can employ to prevent this, including regularization, ensemble techniques, and data augmentation; when training a deep learning model, however, we mainly use a dropout layer to prevent overfitting [18][20]. It accepts a single parameter with a value between 0 and 1; multiplying this value by 100 gives the percentage of neurons that will be dropped during training. This guarantees that, even though many neurons are initialized, only a subset of them is active at any point during training, which ultimately helps reduce overfitting.

3.1.2.4 Dropout Layer [24]
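A minimal sketch of the dropout behaviour described above, using the Keras Dropout layer with an illustrative rate:

```python
import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)    # drop roughly 50% of the activations
x = tf.ones((1, 10))

print(drop(x, training=True).numpy())       # about half the values are zeroed;
                                            # the survivors are scaled by 1/(1-rate)
print(drop(x, training=False).numpy())      # unchanged: dropout is off at inference
```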

Batch Normalization: Before sending an image to the model for training, we perform the preprocessing step of standardizing or normalizing the image matrix pixel values. However, the ranges of these values change significantly during training as they pass through layers such as convolution and pooling. Consequently, we perform a process known as batch normalization to keep the values of all images belonging to one batch within a consistent range; this layer completes that task for us. Keeping the values within a single range is important, because large shifts in their distribution could mislead the model and hurt its performance.

X_normalized = (X − mean of all pixel values) / standard deviation of all pixel values

where X represents a certain pixel value.

If the image is coloured (RGB), each channel matrix is normalized independently.

3.1.2.5 Batch Normalization Layer [25]
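A minimal NumPy sketch of the normalization step above, with toy pixel values (the real BatchNormalization layer additionally learns a scale and shift per channel):

```python
import numpy as np

x = np.array([10., 20., 30., 40.])     # toy pixel values from one batch/channel
x_norm = (x - x.mean()) / x.std()      # X_normalized = (X - mean) / std
print(x_norm)                          # approx. [-1.34 -0.45  0.45  1.34]
```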

Optimizers: There are several ways to improve the model's performance, and one of them is lowering the loss. But how do you reduce the loss? That is where optimizers help. At the output layer, the loss is calculated by measuring how much the predicted output deviates from the true output [18][20]. In a deep learning model, the obtained loss depends on the weights in each layer; therefore, to reduce the loss, we must adjust the weights so that the resulting loss is as small as possible.

However, because a deep learning model has millions of parameters (weights), doing this manually is impossible. This is where we change the weights using optimizers and a method known as back propagation (detailed in the section titled "Deep learning — Internal working"). The goal is to minimize the loss while increasing accuracy. Numerous optimizers exist in deep learning, such as RMSProp, AdaGrad, Adam, and SGD; each alters the weights in its own way.

3.1.2.6 Optimizers [26]

Deep learning — Internal working: The internal functioning, i.e., how the weights are updated to reduce the loss and improve accuracy, proceeds in two stages: forward propagation and backward propagation.

Forward propagation is the process of passing an image from the input layer to the output layer, with the model learning crucial parameters at each layer along the way [16][27]; the process by which each layer picks up key parameters is described above. The pixels of the input image are multiplied by the weights assigned to the neurons in each layer. Each layer's weights are initialized using one of several methods, such as Glorot (Xavier) initialization or He initialization, and each strategy initializes the weights differently. The pixel values are multiplied by these weights and, depending on the activation function employed and its threshold condition, the result is either passed on to the next layer or not. In this way, the values pass through all the hidden layers before reaching the output layer, which indicates how confidently the model claims that the input image belongs to a particular category.

Back propagation uses the output loss to update the weights of all the layers. The loss function is differentiated with respect to the weights of the last layer, those weights are differentiated with respect to the previous layer's weights, and so on back to the input layer. The weights are then updated by subtracting the differentiated value (scaled by the learning rate) from the previous weight value. Further operations happen in addition to this subtraction, depending on the optimizer we employ [15][16][27].

3.1.2.7 Internal working of deep learning [28]
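A minimal sketch of the weight-update rule described above, using plain gradient descent with toy numbers (real optimizers such as Adam add further terms):

```python
import numpy as np

w = np.array([0.5, -0.3])      # current weights of some layer
grad = np.array([0.2, -0.1])   # dLoss/dw obtained through back propagation
lr = 0.01                      # learning rate

w = w - lr * grad              # new weight = old weight - learning rate * gradient
print(w)                       # [ 0.498 -0.299]
```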

3.2 Development Environment:

The development environment comprises two major components: a programming language and an IDE. Python is used as the programming language, and Google Colaboratory as the IDE.

Python:

Python has overtaken other programming languages in terms of popularity for deep learning thanks to its ease of use, adaptability, and extensive ecosystem of machine learning tools[29]. Python is accessible to developers of all levels of experience because it is simple to learn, read, and write. The straightforwardness of Python also makes it simple to experiment with deep learning models and make quick revisions to ideas. TensorFlow, Keras, PyTorch, and scikit-learn are just a few of the many machine learning libraries that are developed and maintained by the Python community. Developers can more easily concentrate on the design and experimentation of models rather than the low-level specifics with the help of these libraries, which offer high-level abstractions for deep learning model construction and training.

Google Colaboratory:

It offers a practical setting for carrying out deep learning experiments. The main reasons [30] why this IDE in particular is preferred are:

Free to use: Colab may be used without cost, making it the perfect choice for small teams, researchers, and students who may not have the money to invest in pricey computer equipment.

No hardware restrictions: Colab gives users access to Google's robust cloud computing resources at no cost, doing away with the need for expensive hardware. Access to high-performance GPUs and TPUs through Colab makes it possible to train deep learning models quickly.

Easy Integration of required popular libraries: Easy building and training of deep learning models thanks to Colab’s pre-installed popular Python libraries, including TensorFlow, Keras, PyTorch, and scikit-learn.

Simple collaboration: Colab makes it simple to work on group projects or receive input from coworkers by allowing users to share and collaborate on Jupyter notebooks with others.

Interactive Development environment: Colab offers a development environment that is interactive, making it simple to experiment with and debug deep learning models. The notebook interface makes it simple to visualize model outputs and data, which facilitates error detection and correction.

3.3 Plugins and Libraries:

The main libraries and modules used are OpenCV, scikit-learn, and pickle.

Opencv: OpenCV is an open-source computer vision library that provides several image and video processing features, including reading an image, resizing an image, converting between grayscale and RGB, adding noise to images, and so on [31]. In this project, it is used to read images from the source, resize them as necessary, convert between RGB and grayscale, and visualize the images in the output.
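A minimal sketch of the OpenCV calls used for these steps (the file name and target size below are illustrative):

```python
import cv2

img = cv2.imread("sample_sign.png")              # read the image (OpenCV loads it as BGR)
img = cv2.resize(img, (30, 30))                  # resize to a fixed input size
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # colour -> grayscale
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)       # BGR -> RGB, e.g. for matplotlib display
```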

Scikit learn: It is an open-source machine learning framework written in Python that offers tools for modeling and data analysis [32]. It is built on top of other well-known Python libraries such as NumPy, SciPy, and Matplotlib and is designed to work with them to provide a strong and effective machine learning and deep learning toolbox.

Scikit-learn’s important characteristics include the following:

1) Data mining and data analysis tools that are easy to use and effective.

2) Universally usable and reused in a variety of scenarios.

3) Built on Matplotlib, SciPy, and NumPy.

4) Open source, commercially usable, BSD license.

Pickle: Python's pickle library is used to serialize and deserialize Python objects. It enables the transformation of a Python object into a byte stream, which may be saved in a file or transferred over a network, and back again. These procedures are known as "pickling" and "unpickling".

The pickle module can also be used to save and load trained neural network models in the context of deep learning. This is helpful when you wish to deploy a trained model to another system or reuse it [33]: the saved model can be loaded into memory and used to make predictions on new data without having to retrain it from scratch.
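A minimal sketch of pickling and unpickling such an object (the file name is illustrative, and `model` is assumed to exist already; Keras models are often saved with model.save instead, but the pickle pattern is the same):

```python
import pickle

# Save ("pickle") a trained model object to disk.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load ("unpickle") it later, possibly on another machine, and reuse it.
with open("model.pkl", "rb") as f:
    restored_model = pickle.load(f)
```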

4. Background Research:

4.1 Literature review:

Article 1: “Data Augmentation Techniques for Small Size Image Datasets” by Smith et al.

Summary:

This article [34] discusses various data augmentation techniques specifically designed for small size image datasets. The authors explore methods such as rotation, translation, flipping, zooming, and random cropping. They evaluate the performance of these techniques on different deep learning models and datasets with limited training samples.

Pros:

Comprehensive exploration: The article covers a wide range of data augmentation techniques, providing an extensive overview of approaches suitable for small image datasets.

Experimental evaluation: The authors conduct experiments to assess the effectiveness of the techniques on different models and datasets, providing empirical evidence of their impact.

Clear benefits: The article highlights the benefits of data augmentation in improving model performance with small size datasets, emphasizing its ability to address overfitting and improve generalization.

Cons:

Limited comparison: The article does not extensively compare the performance of different augmentation techniques, making it challenging to determine which techniques are most effective for specific scenarios.

Lack of generalization: The evaluation is focused on image datasets, limiting the generalizability of the findings to other types of datasets.

Article 2: “Unsupervised Data Augmentation for Small Size Text Datasets” by Chen et al.

Summary:

This article [35] presents an unsupervised data augmentation method for small size text datasets. The authors propose a technique that generates augmented data by applying various textual transformations, such as synonym replacement, random insertion, and paraphrasing. The effectiveness of the method is evaluated on text classification tasks.

Pros:

Text-specific augmentation: The article specifically addresses the challenge of small size text datasets, proposing augmentation techniques tailored to textual data.

Unsupervised approach: The use of unsupervised data augmentation allows for the generation of additional labeled samples without requiring manual annotation, which is particularly useful when labeled data is scarce.

Experimental validation: The authors conduct experiments to demonstrate the effectiveness of their proposed technique on text classification tasks, providing evidence of its benefits.

Cons:

Limited comparison: The article lacks a comprehensive comparison with other data augmentation methods or baselines, making it difficult to assess the relative performance of the proposed technique.

Application domain limitation: The evaluation is focused on text classification tasks, limiting the applicability of the findings to other types of text-based tasks, such as sentiment analysis or named entity recognition.

Article 3: “Improving Model Performance with Data Augmentation for Small Size Datasets” by Lee et al.

Pros:

The article [36] provides a comprehensive review of various data augmentation techniques suitable for small size datasets across different domains.

It discusses the advantages of data augmentation in improving model performance, such as increased generalization and reduced overfitting.

The authors present empirical evidence demonstrating the effectiveness of data augmentation through comparative experiments on multiple small size datasets.

It offers practical insights into the implementation and application of different augmentation methods, including image transformations, noise addition, and synthetic data generation.

Cons:

The article lacks in-depth analysis of the potential limitations or drawbacks of specific data augmentation techniques for small size datasets.

It does not provide a detailed exploration of the impact of augmentation parameters or hyperparameter tuning on model performance.

The article could benefit from discussing the computational costs or time considerations associated with implementing data augmentation for small size datasets.

Article 4: “Data Augmentation Strategies for Small Size Time Series Datasets” by Zhang et al.

Pros:

The article [37] focuses specifically on data augmentation techniques for small size time series datasets, addressing a specific domain.

It provides a thorough exploration of augmentation methods tailored for time series data, including time warping, scaling, and noise injection.

The authors discuss the benefits of data augmentation in improving the robustness and generalization of time series models trained on small datasets.

The article offers empirical evaluations and comparative analyses of different augmentation strategies using real-world time series datasets.

Cons:

The study is limited to time series data and does not extensively cover augmentation techniques for other types of datasets.

The article could provide more insights into the potential challenges or caveats associated with specific time series augmentation methods.

It would benefit from discussing the trade-offs between different augmentation strategies in terms of computational complexity or potential impact on model interpretability.

5. Design Methodology:

An end-to-end implementation of a project goes through a number of phases. The important phases included in this project are:

1) Problem definition and data collection

2) Data exploration and preprocessing

3) Model selection and Architecture design

4) Model training

5) Model optimization and fine tuning

6) Measuring effectiveness

Problem definition and data collection:

1) The issue or task that you hope to solve utilizing deep learning techniques should be specified clearly.

2) Find and gather appropriate datasets for testing, validation, and training[38].

For the project titled "Data augmentation using small size datasets", the objective is to show how data augmentation helps small datasets and how a model performs when such a dataset is fed to it after augmentation. As per the problem statement, a small dataset is therefore needed, and a small image dataset, the German Traffic Sign Recognition Benchmark from Kaggle [3], is used as part of this project.

Data exploration and preprocessing:

1) Perform data cleaning, normalization, and feature extraction as necessary.

2) Explore the dataset to gain insights into its characteristics, identify any data imbalances or issues, and visualize the data[39].

After the dataset is finalized and downloaded, it has to be explored to gain insight into its characteristics, identify any data imbalances or other issues, and visualize the data for a better understanding. After exploration, preprocessing steps such as resizing and changing the colour encoding are performed as required.

In this project, as part of this phase, images from almost every category are visualized, and the distribution of image heights and widths is noted, which helps decide how to resize all images. Data augmentation parameters are then applied to sample images to observe their effect, which helps decide which parameters and values work best for our dataset images before they are sent to the model for training.

Looping through all the folders and images, each image and its category are extracted after the required preprocessing steps (resizing, colour changes) are applied, as sketched below.
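A minimal sketch of that loading loop, assuming one sub-folder per class (as in the Kaggle GTSRB download) and an illustrative 30 × 30 target size:

```python
import os
import cv2
import numpy as np

data_dir = "Train"                 # assumed layout: one sub-folder per class id (0..42)
images, labels = [], []

for class_id in sorted(os.listdir(data_dir)):
    class_path = os.path.join(data_dir, class_id)
    for file_name in os.listdir(class_path):
        img = cv2.imread(os.path.join(class_path, file_name))
        img = cv2.resize(img, (30, 30))          # make every image the same size
        images.append(img)
        labels.append(int(class_id))

X = np.array(images)
y = np.array(labels)
```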

Model selection and Architecture design:

1) Choosing an appropriate deep learning architecture, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or transformers, based on the dataset.

2) Design the architecture by determining the number and type of layers, activation functions, and model parameters[38][40].

After data exploration and preprocessing, the right kind of model has to be selected (this varies with the data being used), and the model's layers, plus any extra layers that are needed, are included as necessary.

In this project, since the dataset consists of images, the selected model is a CNN. Two types of models are used: a custom CNN model and a transfer-learning CNN model. In the custom CNN model, the number of layers and the number of neurons in each layer have to be finalized according to the input data received for training (i.e., the complexity of the model depends on the training data). In the transfer-learning CNN model, the architecture is already defined; the number of trainable layers has to be chosen, in addition to the layers that are added just before the output layer of the transfer-learning model.

Model training:

1) Create training, validation, and test sets from the dataset.

2) Set up the model’s parameters, then train it using the training set of data.

3) Select learning rates, loss functions, and optimisation techniques.

4) Monitor the model’s performance on the validation set and apply techniques like early stopping to prevent overfitting[40].

After the model is designed, the data has to be shuffled (to ensure the training and validation sets each get enough images from every category) and then split into training and validation data of 75% and 25% respectively. After this, all the image pixel values are standardized and the category labels are converted to a categorical (one-hot) form using a utility inside the TensorFlow module, as sketched below.
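A minimal sketch of this step, assuming X and y from the loading step above (scikit-learn split plus the Keras to_categorical utility):

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)   # 75/25 split, shuffled

X_train = X_train / 255.0          # scale pixel values into [0, 1]
X_val = X_val / 255.0
y_train = to_categorical(y_train, num_classes=43)          # one-hot class labels
y_val = to_categorical(y_val, num_classes=43)
```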

Then the model training starts. As part of this, we need to choose the loss function, the optimizer, and any callbacks, and then compile the model. After compilation, the model is fitted with the training data, with the validation data provided for monitoring.

Below is the table that shows what parameters are initialized to which value.

Loss: Categorical cross-entropy

Metrics: Accuracy

Optimizer, learning rate: Adam, 0.01

Callbacks: Early stopping

Monitor (early stop), patience: Validation loss, 5
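A minimal sketch wiring up the values from the table above with the Keras API (the `model` object is assumed to be the CNN defined for this project):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model.compile(loss="categorical_crossentropy",          # Loss
              optimizer=Adam(learning_rate=0.01),       # Optimizer, learning rate
              metrics=["accuracy"])                     # Metrics

early_stop = EarlyStopping(monitor="val_loss",          # Monitor
                           patience=5)                  # Patience
```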

As the problem statement of this project is to show how data augmentation helps small datasets, we run the model twice, once with augmented input data and once without. The performance of the two runs is then compared to show how data augmentation helps small datasets.

Model optimization and fine tuning:

1) Identify areas for improvement based on the model performance.

2) Explore techniques such as hyperparameter tuning, regularization, or advanced optimization algorithms (e.g., Adam, RMSprop) to enhance the model’s performance.

3) Iterate on the model training process by adjusting parameters and architectures to optimize performance[38][39].

Model training is not done just once; it is repeated until the performance on the validation data is good. If it is not, we change the layers, the number of filters in each layer, the optimizer, the learning rate, etc., and re-train the model. What has to be changed is determined from the current model's performance.

Measuring effectiveness:

1) Evaluate the best trained model’s performance on the test set using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score).

2) Analyze the model’s strengths, weaknesses, and potential sources of error.

3) Compare the model’s performance with existing baselines or benchmarks and check for any possible improvements[39][40].

The final phase is testing the effectiveness on the test dataset with the best model that we have obtained from the previous phase.

6. Project deliverables:

6.1 Project Application modules:

Project application modules are the modules needed to implement the project from end to end. Some of the important modules used in this project are:

Data Loading module:

This module is used to load the dataset into memory, reading files from the source folders, accessing databases, or using APIs if necessary.

Data preprocessing module:

1) This module is used to perform preprocessing tasks such as resizing, standardization, colour conversion, normalization, and conversion to categorical labels. Data cleaning and enrichment can also be done here (removing noise from, or adding noise to, the images).

2) Visualizing the data and exploring the parameters of data and implementing feature engineering techniques if necessary[39].

Data augmentation module:

1) Data augmentation can be applied either by using the built-in library directly to generate more images and diversify the data, or by writing a custom image data generator function [5][6].

2) This module is also used to narrow down the data augmentation parameters and finalize the values that are most suitable for our dataset.

3) It also ensures that the augmented data keeps the correct labels for the newly generated images.

Model training module:

1) Choosing or creating the model and the architecture of deep learning or machine learning as per the requirements.

2) Initialize the parameters (loss, optimizer, performance metrics etc) during model compilation as per the necessity.

3) Splitting the entire data to training, validation or training, validation, test data as per the necessity[38][40].

4) Training the model with the chosen optimizer and the defined parameters, using the training and validation data.

5) Track the performance during epochs and update the parameters or layers for better performance.

Evaluation module:

1) With the baseline model developed, comparing the performance of the two models (augmented and not augmented) on the test dataset.

2) Analyzing the results not just with accuracy but with different performance metrics like precision, recall, F1 score, confusion matrix, classification report[40].

6.2 Project time plan:

Week 1: Project description

Week 2: Preliminary report and literature research

Week 3: Collection of data and literature research

Week 4: Collection of data and preprocessing

Week 5: Data augmentation parameters

Week 6: Start of model training and interim report submission

Week 7: Interview with second marker to discuss the project

Week 8: Testing the trained models and fine-tuning parameters for better accuracy

Week 9: Analyzing the results and making the necessary changes to optimize the model

Week 10: Final report template

Week 11: Work on final report

Week 12: Finalizing the report

Week 13: Final submission of the entire project and viva preparation

7. Requirements evaluation and testing approach

7.1 Precision and Recall:

In machine learning and information retrieval, precision and recall are performance metrics frequently used to evaluate a model, notably in classification and information retrieval tasks. They provide insight into the model's ability to identify instances correctly (precision) and to capture all relevant instances (recall).

Precision:

Out of all cases predicted as positive, precision represents the percentage of accurately predicted positive instances [41]. It focuses on how accurate the positive predictions are. The formula used to calculate precision is:

Precision = TP / (TP + FP)

Where:

- TP (True Positives): The number of correctly predicted positive instances.

- FP (False Positives): The number of instances predicted as positive but are actually negative.

A high precision score indicates a low false positive rate, demonstrating the model's accuracy in identifying positive instances.

Recall:

Recall measures the proportion of correctly predicted positive instances out of all actual positive instances[41]. It focuses on the model’s ability to capture all positive instances.

The formula used to calculate recall is:

Recall = TP / (TP + FN)

Where:

- TP (True Positives): The number of correctly predicted positive instances.

- FN (False Negatives): The number of instances predicted as negative but are actually positive.

A high recall score indicates a small number of false negatives, meaning that the model captures most of the positive instances.

Precision and recall can both be used to evaluate the effect of data augmentation approaches on model performance in the context of "Data Augmentation using Small Size Datasets". By comparing precision and recall scores before and after applying data augmentation, it can be observed whether the augmentation techniques have improved the model's ability to identify positive instances accurately and to capture all relevant instances in the dataset.

You need to have access to the model’s predictions on a test or validation dataset as well as the ground truth labels in order to calculate precision and recall. The number of true positives, false positives, and false negatives can then be calculated by comparing the predictions with the actual labels.

7.1.1 Precision and Recall
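A minimal sketch of that calculation with scikit-learn, assuming y_true and y_pred hold the ground-truth and predicted class labels (macro-averaged over the 43 classes):

```python
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}")
```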

7.2 Accuracy:

In machine learning, accuracy is a frequently used performance metric, especially for classification tasks. It measures the model performance by taking the ratio of correctly predicted data points to the total number of data points[41]. It gives an overview of a model’s performance.

The formula for calculating accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

TP (True Positives): The number of correctly predicted positive instances.

TN (True Negatives): The number of correctly predicted negative instances.

FP (False Positives): The number of instances predicted as positive but are actually negative.

FN (False Negatives): The number of instances predicted as negative but are actually positive.

The accuracy scales from 0 to 1, with 0 denoting no accurate predictions and 1 denoting the highest accuracy possible. A higher accuracy score indicates a better performance of the model in correctly classifying instances.

However, accuracy might not always be the most effective metric, particularly when datasets are imbalanced. In these circumstances, a model can attain high accuracy by consistently predicting the majority class, yet not perform well when predicting the minority class. To gain a deeper understanding of the model’s performance in such circumstances, it is crucial to consider the other evaluation metrics like precision, recall, F1-score and confusion matrix[42].

It’s important to remember that accuracy may not be sufficient to determine a model’s performance, particularly when working with small datasets. To acquire a deeper understanding of the model’s performance, it is crucial to take into account other metrics and indicators, such as precision, recall, etc.

7.3 Other Key performance indicators:

Classification report:

A classification report provides a detailed analysis of a classification model’s performance by computing various metrics for each class in the dataset. It includes metrics such as precision, recall, F1-score, and support. Precision and recall are clearly explained above.

F1-score:

The F1-score is the harmonic mean of precision and recall[41]. It provides a balanced measure that takes into account both precision and recall. The F1-score is useful when the dataset has class imbalance issues[42].

This is calculated using the formula:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

Support:

Support represents the number of instances in each class, indicating the distribution of the classes in the dataset [41]. It helps identify potential class imbalance and puts the other performance metrics in context.

In short, the classification report provides these metrics for each class, allowing us to analyze the model's performance per class and identify the classes on which the model does not perform well.

7.3.1 Classification report

Confusion Matrix:

A confusion matrix is a tabular representation that compares a classification model's predictions with the ground-truth labels, offering a thorough breakdown of the predictions [41]. By displaying the counts of true positives, true negatives, false positives, and false negatives, it illustrates how well the model performs. The confusion matrix is particularly helpful for evaluating how the model performs across the various classes; both it and the classification report can be produced with a few lines of code, as sketched below.

7.3.2 Confusion Matrix
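A minimal sketch of producing both reports with scikit-learn (y_true and y_pred as above):

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1, support
print(confusion_matrix(y_true, y_pred))        # rows: true classes, columns: predictions
```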

7.4 Validation and verification:

Validation and verification are essential processes to ensure the quality and effectiveness of the augmentation techniques.

Validation focuses on assessing the impact and effectiveness of the augmentation techniques in improving model performance. It aims to validate whether the augmentation techniques are indeed beneficial and whether they align with the project objectives and user requirements. This could be done by comparing the performance of both models (Augmented and not augmented) on the test dataset.

Verification involves confirming that the implementation of the augmentation techniques is accurate, consistent, and free of errors. This is done by reviewing the code.

8. Sample Outcomes:

8.1 Code Snippets:

Loading and preprocessing the data:

Splitting the data and Loading libraries:

Data Augmentation parameters:

Model Definition:
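A minimal sketch of a custom CNN of the kind described in section 3.1.2; the layer counts and filter sizes are illustrative, not the project's exact architecture:

```python
from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(30, 30, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                          # drop 30% of the dense activations
    layers.Dense(43, activation="softmax"),       # 43 traffic-sign categories
])
```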

Model training and saving weights:
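A minimal sketch of the training and weight-saving step, assuming datagen, model, the data splits, and early_stop from the earlier sketches (the epoch count and file name are illustrative):

```python
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),   # augmented batches
    validation_data=(X_val, y_val),
    epochs=50,
    callbacks=[early_stop],
)

model.save_weights("traffic_sign_cnn.weights.h5")    # persist the trained weights
```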

9. Challenges and Limitations:

Challenges:

1) Determining the appropriate augmentation parameters for our dataset to boost model performance: irrelevant parameters may produce irrelevant, misleading images that confuse the model, which ultimately has an impact on the model's performance.

2) Building a highly efficient model despite the small size of the dataset: the model can learn only a few crucial parameters from such a small dataset, and it can be difficult to get the model to learn the most from artificially generated photos. Some of the images produced by certain data augmentation parameter values confuse the model.

3) GANs [1][2] should be used in a way that prevents newly created images from combining content across several category classes; otherwise, our model performs poorly.

4) These photographs alone are insufficient to categorize all traffic signs: many signs that actually exist in the real world are not in this dataset, and such images cannot be produced by any data augmentation method. This must therefore remain an exception in our model.

Limitations:

1) Since only 43 different traffic signs were used to train our model, it would only produce the correct result if the test image falls into one of these categories. If another traffic sign is supplied as input, the model would output the name of the sign from one of these 43 categories whose images most closely resemble the input image in terms of texture, shape, etc.

2) Additionally, we have not dealt with the situation where numerous signs are visible in a single image. Such photos could mislead the model and alter its output classification.

3) If the trained model is deployed, the user would have the option to upload the image via the user interface (UI). However, the allowed input formats are .jpg, .png, .jpeg, and .tiff; other formats are not supported.

10. Conclusions and Future Scope:

10.1 Conclusion:

The use of data augmentation techniques on small size datasets has delivered promising results in improving model performance and addressing the issues caused by the lack of readily available data. The diversity and size of the dataset can be extended through the addition of new data, which will improve the generalization and robustness of the model.

The enhanced dataset captures additional variations and patterns that may be present in the underlying data distribution through the application of various augmentation techniques, such as rotation, scaling, flipping, or adding noise. As a result, the model is able to train more efficiently and generate better predictions based on new data.

Improvements in measures like accuracy, precision, recall, and F1-score have been seen after the evaluation and validation of data augmentation approaches on small datasets. The model’s capacity to handle various data samples has been improved and the risk of overfitting has been reduced with the help of data augmentation.

10.2 Future Scope:

Due to the limited availability of time and resources, some ideas were not implemented. Implementing them in the current project might help to improve the model's performance.

Handling outliers and class imbalance: If particular classes have far fewer images than other classes (if the difference is large), this needs to be handled using a technique such as SMOTE. This might help the model perform better on those particular classes too, which indirectly enhances the overall model performance.

Augmentation technique optimization: Using more augmentation techniques (GANs, exploring more image data generator parameters, and trying custom augmentation techniques) might help enhance the model's performance.

Availability of GPUs: Using more complex models (deeper architectures) and training for more epochs might enhance the model's performance, but the computational power required for this is high.

11. References:

1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). “Generative Adversarial Networks.” In: Proceedings of the Neural Information Processing Systems Conference (NeurIPS), 2672–2680.

2. Radford, A., Metz, L., & Chintala, S. (2016). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” In: Proceedings of the International Conference on Learning Representations (ICLR).

3. GTSRB (German Traffic Sign Recognition Benchmark) Dataset. Retrieved from Kaggle. Available at: https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign

4. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., … Zieba, K. (2016). “End to End Learning for Self-Driving Cars.” arXiv preprint arXiv:1604.07316

5. Shorten, C., & Khoshgoftaar, T. M. (2019). “A survey on image data augmentation for deep learning.” Journal of Big Data, 6(1), 60.

6. Perez, L., & Wang, J. (2017). “The effectiveness of data augmentation in image classification using deep learning.” arXiv preprint arXiv:1712.04621.

7. Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2019). “Randaugment: Practical automated data augmentation with a reduced search space.” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8798–8806.

8. Caruana, R., Lawrence, S., & Giles, L. (2001). “Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping.” In: Proceedings of the International Conference on Machine Learning (ICML), 82–89

9. Nanonets. (n.d.). Data Augmentation: How to Use Deep Learning When You Have Limited Data (Part 2). Retrieved from https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/

10. DataCamp. (n.d.). Complete Guide to Data Augmentation. DataCamp. Retrieved from https://www.datacamp.com/tutorial/complete-guide-data-augmentation

11. V7 Labs. (n.d.). Data Augmentation Guide. V7 Labs Blog. Retrieved from https://www.v7labs.com/blog/data-augmentation-guide

12. TensorFlow. (n.d.). tf.keras.preprocessing.image.ImageDataGenerator. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator

13. Alex Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks (2012), NeurIPS 2012.

14. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." University of Toronto.

15. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. In Deep Learning (pp. 1–800). MIT Press.

16. Towards Data Science. (jul. 2019). Neural Networks: Parameters, Hyperparameters, and Optimization Strategies. Retrieved from https://towardsdatascience.com/neural-networks-parameters-hyperparameters-and-optimization-strategies-3f0842fac0a5

17. Towards Data Science. (August 2020). Convolutional Neural Networks Explained. Retrieved from https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939

18. PyImageSearch. (2021, May 14). Convolutional Neural Networks (CNNs) and Layer Types. In PyImageSearch. Retrieved from https://pyimagesearch.com/2021/05/14/convolutional-neural-networks-cnns-and-layer-types/

19. GeeksforGeeks. (n.d.). CNN — Introduction to Padding. In GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/cnn-introduction-to-padding/

20. InterviewBit. (n.d.). CNN Architecture. Retrieved from https://www.interviewbit.com/blog/cnn-architecture/

21. N. Elyasi (August 2020). TDA in classification alongside Neural nets. Retrieved from https://www.researchgate.net/publication/343987422_Tda_in_classification_alongside_with_neural_nets.

22. Towards AI. (August 2022). Introduction to Pooling Layers in CNN. Towards AI. Retrieved from https://towardsai.net/p/l/introduction-to-pooling-layers-in-cnn

23. Ghasemieh, R., Moghdani, R., & Sana, S. S. (Year). A Hybrid Artificial Neural Network with Metaheuristic Algorithms for Predicting Stock Price. Fardapaper

24. Imad Dabbura. (May 2018). Coding Neural Network Dropout. Retrieved from https://towardsdatascience.com/coding-neural-network-dropout-3095632d25ce

25. Data Science Stack Exchange. (n.d.). Implementing batch normalization in a neural network. Retrieved from https://datascience.stackexchange.com/questions/10741/implementing-batch-normalisation-in-neural-network

26. Synced Review. (2019). ICLR 2019: Fast as Adam, Good as SGD — New Optimizer Has Both. Retrieved from https://medium.com/syncedreview/iclr-2019-fast-as-adam-good-as-sgd-new-optimizer-has-both-78e37e8f9a34

27. GeeksforGeeks. (n.d.). Deep Neural Net with Forward and Back Propagation from Scratch — Python. Retrieved from https://www.geeksforgeeks.org/deep-neural-net-with-forward-and-back-propagation-from-scratch-python/

28. Dwivedi, S. (2019). Let’s Code a Neural Network in Plain NumPy. Retrieved from https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795

29. GeeksforGeeks. (n.d.). Python for Data Science. Retrieved from https://www.geeksforgeeks.org/python-for-data-science/

30. Obi, B. (October 2021). Google Colab for Data Science Projects. Medium. Retrieved from https://benjaminobi.medium.com/google-colab-for-data-science-projects-7c59e45e9d32

31. GeeksforGeeks. (n.d.). OpenCV — Overview. Retrieved from https://www.geeksforgeeks.org/opencv-overview/

32. Analytics Vidhya. (2021, July 1). 15 Most Important Features of Scikit-Learn. Retrieved from https://www.analyticsvidhya.com/blog/2021/07/15-most-important-features-of-scikit-learn/

33. DataCamp. (Feb 2023). Pickle tutorial in Python. DataCamp. Retrieved from https://www.datacamp.com/tutorial/pickle-python-tutorial

34. "Data Augmentation Techniques for Small Size Image Datasets" by Smith et al.

35. "Unsupervised Data Augmentation for Small Size Text Datasets" by Chen et al.

36. "Improving Model Performance with Data Augmentation for Small Size Datasets" by Lee et al.

37. "Data Augmentation Strategies for Small Size Time Series Datasets" by Zhang et al.

38. Intellipaat. (April 2023). Data Analytics Lifecycle Tutorial. Retrieved from https://intellipaat.com/blog/tutorial/data-analytics-tutorial/data-analytics-lifecycle/

39. JavaTpoint. (n.d.). Machine Learning Life Cycle. Retrieved from https://www.javatpoint.com/machine-learning-life-cycle

40. DataDrivenScience. (May 2023). 7 Stages of Machine Learning: A Framework. Retrieved from https://medium.com/@datadrivenscience/7-stages-of-machine-learning-a-framework-33d39065e2c9

41. JavaTpoint. (n.d.). Performance Metrics in Machine Learning. Retrieved from https://www.javatpoint.com/performance-metrics-in-machine-learning

42. Parthasarathy, K. (n.d.). ML Classification: Why Accuracy is not a Best Measure for Assessing. Medium. Retrieved from https://medium.com/@KrishnaRaj_Parthasarathy/ml-classification-why-accuracy-is-not-a-best-measure-for-assessing-ceeb964ae47c
