# Improving AI models through Automatic Data Augmentation using Tuun

# Authors: Petuum CASL Team

Machine Learning (ML) and Deep Learning models improve in accuracy and generalizability when trained with more data. However, finding sufficient data for your ML task can be challenging — data may be restricted because of security and privacy concerns, or it may be expensive and time-consuming to acquire and label at scale.

**Data augmentation** can address these challenges by making better use of existing data: it synthesizes new training examples from existing ones. At Petuum, we use data augmentation to improve our AI applications, such as multi-lingual chatbots, industrial process optimization systems, and visual defect detection tools. Data augmentation helps us reduce classification error by 4.5% in a defect detection system for images of steel sheets, in which shortages of data for certain class labels are a major challenge.

To apply data augmentation to an ML task, we create a **data augmentation policy**: a sequence of data augmentation steps. In the figure above, new image training examples are generated by applying different operations, such as translation, rotation, or colorization. Data augmentation is not limited to image data: with time-series data, we can apply interpolation, embeddings, and Fourier transforms; with natural language text data, we can translate sentences into different languages or paraphrase them.

However, finding optimal augmentation policies is time-consuming. Not every policy will work well, and new datasets often require different policies, costing even more time. What if we could solve this problem by automatically searching for data augmentation policies? In this blog post, we’ll show how to use Tuun (an open-source tool in the CASL OSS ecosystem) to automatically search for data augmentation policies. Tuun uses Bayesian Optimization to efficiently plan and execute a sequence of trials using different data augmentation policies. By modeling the outcomes of these trials, Tuun can automatically skip unpromising trials and quickly zero in on a data augmentation policy that achieves a high validation accuracy on the prediction task at hand.

# How Tuun Data Augmentation Works

Finding data augmentation policies can be expressed as a **Hyperparameter Optimization (HPO)** problem, which Tuun solves using Bayesian Optimization. As illustrated above, Tuun starts by selecting a data augmentation policy P to train the model, yielding accuracy R on the validation dataset. Tuun uses R as feedback to improve its selection of future trials, and steadily converges to an optimal augmentation policy.
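The propose, evaluate, and update loop described above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Tuun's implementation: a kernel-weighted average of past results plus an exploration bonus takes the place of the Gaussian-process model that Bayesian Optimization would normally use, the "policy" is reduced to a single magnitude in [0, 1], and `toy_validation_accuracy` is a hypothetical function standing in for training a model and measuring R.

```python
import math
import random


def surrogate(x, history, length_scale=0.2):
    """Return (predicted accuracy, uncertainty score) at x from past trials."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * length_scale ** 2))
               for xi, _ in history]
    total = sum(weights)  # strictly positive for x in [0, 1]
    mean = sum(w * r for w, (_, r) in zip(weights, history)) / total
    uncertainty = 1.0 / (1.0 + total)  # close to 1 far from every tried point
    return mean, uncertainty


def propose(history, candidates, explore=0.5):
    """Pick the untried candidate with the best optimistic score."""
    tried = {x for x, _ in history}
    untried = [x for x in candidates if x not in tried]

    def score(x):
        mean, uncertainty = surrogate(x, history)
        return mean + explore * uncertainty

    return max(untried, key=score)


def toy_validation_accuracy(magnitude):
    """Hypothetical stand-in for: train with policy P, return accuracy R."""
    return 0.9 - (magnitude - 0.3) ** 2


def search(n_trials=15, seed=0):
    """Run the loop: select P, observe R, feed R back into the next choice."""
    rng = random.Random(seed)
    candidates = [i / 50 for i in range(51)]
    first = rng.choice(candidates)
    history = [(first, toy_validation_accuracy(first))]
    while len(history) < n_trials:
        policy = propose(history, candidates)
        history.append((policy, toy_validation_accuracy(policy)))
    best_policy, best_accuracy = max(history, key=lambda trial: trial[1])
    return best_policy, best_accuracy, history


if __name__ == "__main__":
    best_policy, best_accuracy, _ = search()
    print(f"best magnitude: {best_policy:.2f}, accuracy: {best_accuracy:.3f}")
```

The point of the sketch is the feedback structure: every new trial is chosen using all previous (P, R) pairs, so regions that scored poorly are visited less often than regions that look promising or unexplored.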

# Getting Started with Tuun

Tuun takes advantage of Microsoft NNI (Neural Network Intelligence) to easily launch data augmentation jobs. To apply Tuun to an existing model’s training script, follow these instructions:

**Install Tuun**

Clone and `cd` into Tuun’s git repository:

`git clone git@github.com:petuum/tuun.git && cd tuun`

Install required Python packages by running:

`python -m pip install -r requirements/requirements_dev.txt`

Run the following command to compile a required Stan model:

`python tuun/probo/models/stan/compile_models.py -m gp_distmat_fixedsig`

**Install NNI**

`python -m pip install --upgrade nni`

**Turn training code into an NNI trial**

Create a search space file in `json` format according to the experimentation setup:
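As a minimal sketch, a search space of this shape could be generated with a short Python script. The `_type`/`_value` entries follow NNI's search space format; the key names (such as `sub0_op0_type`), the list of operation types, and the value ranges are illustrative assumptions, and the file used in Tuun's own example may be organized differently.

```python
import json

# Illustrative subset of image operations; the real list is an assumption.
OP_TYPES = ["rotate", "contrast", "invert"]


def build_search_space(num_sub_policies=5, ops_per_sub_policy=2):
    """Build one (type, probability, magnitude) entry per operation."""
    space = {}
    for s in range(num_sub_policies):
        for o in range(ops_per_sub_policy):
            prefix = f"sub{s}_op{o}"  # hypothetical naming scheme
            space[f"{prefix}_type"] = {"_type": "choice", "_value": OP_TYPES}
            space[f"{prefix}_probability"] = {"_type": "uniform", "_value": [0.0, 1.0]}
            space[f"{prefix}_magnitude"] = {"_type": "uniform", "_value": [0.0, 1.0]}
    return space


if __name__ == "__main__":
    # Print the JSON; redirect it to a file to use it as a search space.
    print(json.dumps(build_search_space(), indent=2))
```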

The search space above will guide the search for image augmentation policies. In this example, an image augmentation policy is made up of 5 sub-policies, with two operations per sub-policy:

Each operation is associated with a type, magnitude, and probability of being applied. Here are some examples of type and magnitude that Tuun might come up with while searching:

If you are interested in the details: each operation is defined by a triple (*type*, *probability*, *magnitude*), where *type* is a commonly used image processing operation (e.g. rotate, contrast, invert), *probability* is the chance of applying the operation, and *magnitude* controls how strongly it is applied. In total, the data augmentation policy contains 5 sub-policies × 2 operations = 10 triples (*type*, *probability*, *magnitude*). During model training, at the beginning of each mini-batch, one of the five sub-policies is selected uniformly at random (20% chance each) and applied to the entire mini-batch of data. For each data sample in the mini-batch, each of the two operations in the selected sub-policy is applied independently with the chance given by its *probability*. In other words, some data samples might experience no operations (i.e. they remain unchanged), while others experience one or even two operations.
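The sampling procedure described above can be sketched in plain Python. The toy operations `brighten` and `invert`, and the use of lists of floats as "images", are hypothetical stand-ins chosen to keep the example self-contained; this is not Tuun's `Policy` class.

```python
import random


def brighten(img, magnitude):
    """Toy operation: raise every pixel, clipped to 1.0."""
    return [min(1.0, px + 0.1 * magnitude) for px in img]


def invert(img, magnitude):
    """Toy operation: invert pixel values (magnitude is unused)."""
    return [1.0 - px for px in img]


OPS = {"brighten": brighten, "invert": invert}


def apply_policy(batch, policy, rng):
    """Pick one sub-policy for the mini-batch, then apply its operations
    to each sample independently with their own probabilities."""
    sub_policy = rng.choice(policy)  # uniform: 20% each with five sub-policies
    augmented = []
    for img in batch:
        for op_type, probability, magnitude in sub_policy:
            if rng.random() < probability:
                img = OPS[op_type](img, magnitude)
        augmented.append(img)
    return augmented


def op_count_probabilities(p1, p2):
    """Chance that a sample receives zero, one, or two operations."""
    return {
        0: (1 - p1) * (1 - p2),
        1: p1 * (1 - p2) + (1 - p1) * p2,
        2: p1 * p2,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Five sub-policies, two (type, probability, magnitude) triples each.
    policy = [[("invert", 0.5, 0.0), ("brighten", 0.5, 3.0)] for _ in range(5)]
    batch = [[0.0, 0.25, 0.5], [1.0, 0.5, 0.25]]
    print(apply_policy(batch, policy, rng))
    print(op_count_probabilities(0.5, 0.5))
```

With both probabilities at 0.5, a sample is left unchanged a quarter of the time, receives exactly one operation half of the time, and receives both a quarter of the time.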

**Modify training loop to report results to NNI:**
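A minimal sketch of such a loop is shown below. The reporting functions are passed in as arguments so that the example runs on its own; inside a real NNI trial they would be `nni.report_intermediate_result` and `nni.report_final_result`. The helper names `run_trial`, `train_one_epoch`, and `evaluate` are hypothetical, and reporting the best epoch (rather than the last) as the final result is an assumption about the desired metric.

```python
def run_trial(train_one_epoch, evaluate, num_epochs,
              report_intermediate, report_final):
    """Train for num_epochs, reporting validation accuracy after each epoch
    and the best accuracy seen once training finishes."""
    best_accuracy = 0.0
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        accuracy = evaluate()
        report_intermediate(accuracy)  # lets the tuner see learning curves
        best_accuracy = max(best_accuracy, accuracy)
    report_final(best_accuracy)  # the value the tuner optimizes
    return best_accuracy


# Inside an NNI trial, the call would look like:
#
#   import nni
#   run_trial(train_one_epoch, evaluate, 50,
#             nni.report_intermediate_result, nni.report_final_result)

if __name__ == "__main__":
    # Stand-alone demo with fake accuracies and print-based reporting.
    fake_accuracies = iter([0.50, 0.70, 0.65])
    run_trial(lambda epoch: None, lambda: next(fake_accuracies), 3,
              lambda a: print("intermediate:", a),
              lambda a: print("final:", a))
```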

**Replace the fixed augmentation policy with a searchable one, whose configuration is sampled from NNI’s search space:**

An implementation of the `Policy` class can be found here.
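To illustrate the idea, the sketch below converts a flat parameter dictionary, of the kind returned by `nni.get_next_parameter()`, into a nested list of sub-policies that a policy object could consume. The key naming scheme (`sub0_op0_type` and so on) and the function `params_to_policy` are assumptions made for this example; the actual `Policy` class may expect a different layout.

```python
def params_to_policy(params, num_sub_policies=5, ops_per_sub_policy=2):
    """Group flat tuner parameters into sub-policies of
    (type, probability, magnitude) triples."""
    policy = []
    for s in range(num_sub_policies):
        sub_policy = []
        for o in range(ops_per_sub_policy):
            prefix = f"sub{s}_op{o}"  # hypothetical naming scheme
            sub_policy.append((
                params[f"{prefix}_type"],
                float(params[f"{prefix}_probability"]),
                float(params[f"{prefix}_magnitude"]),
            ))
        policy.append(sub_policy)
    return policy


# Inside an NNI trial, the parameters would come from the tuner:
#
#   import nni
#   policy = params_to_policy(nni.get_next_parameter())

if __name__ == "__main__":
    # Stand-alone demo with hand-written parameters.
    demo_params = {}
    for s in range(5):
        for o in range(2):
            demo_params[f"sub{s}_op{o}_type"] = "rotate"
            demo_params[f"sub{s}_op{o}_probability"] = 0.5
            demo_params[f"sub{s}_op{o}_magnitude"] = 0.3
    print(params_to_policy(demo_params)[0])
```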

**Set up NNI to use Tuun as a tuner by creating an NNI config in yaml format:**

Only the tuner section is shown above; follow this link to view the complete configuration file.

**Start an NNI experiment:**

`nnictl create --config nni_config.yml`

**Experimentation and results**

Following the steps above, we were able to quickly set up and run experiments to find effective data augmentation policies for three datasets: SVHN, CIFAR10, and CIFAR100.

## Model

The model we chose is a modified version of ResNet18 based on the BaseNet framework. The advantage of this implementation is that it converges to 94% accuracy on CIFAR10 much faster than the regular ResNet18 model. This is critical because the experiments involve training the model from scratch many times.

## Experiments

Recall the HPO problem illustrated earlier: a complete search process corresponds to one NNI Experiment, and each iteration of the search corresponds to one NNI Trial. Specifically, we ran 60 trials in each experiment, and each trial trains the model from scratch for 50 epochs. The performance of each experiment is determined by the trial with the best test accuracy.

## Expert Tuned Policy (ETP)

The model comes with an expert tuned data augmentation policy (ETP), which was tuned by ML experts to optimize model performance on CIFAR10/100. In the 12 years since its release in 2009, CIFAR10/100 has been studied by tens of thousands of researchers, and many areas, including data augmentation policies, have been exhaustively explored. It is fair to say that ETP is close to the best hand-tuned solution, boosting the model’s performance almost as much as a data augmentation policy can.

## Results

The table above shows the test accuracies for all the experiments. The “No Aug” column contains the baseline performance of the model without any data augmentation steps. “No Aug” always yields the worst test accuracy. “ETP” stands for the setting where only the Expert Tuned Policy is applied. Figures under “Tuun” are the accuracy scores of the best data augmentation policies generated by Tuun (the result of the NNI experiments mentioned above). Similarly, “Tuun + ETP” means that the data augmentation policies generated by Tuun are appended to ETP as extra data augmentation steps.

The following conclusions can be drawn from the table:

- Data augmentation is an effective method to improve model performance, given that “No Aug” scores are always the lowest.
- Comparing “Tuun” and “ETP”, one may conclude that Tuun is able to generate data augmentation policies nearly as good as the optimal hand-tuned solution.
- The “Tuun + ETP” column shows that Tuun is able to further improve model performance on top of an existing good data augmentation policy.
- In the case where no suitable ETP is provided (in our case, ETP is not tuned for SVHN), Tuun is able to generate policies that beat the non-optimal policy.

**Other Applications of Tuun**

Tuun is a general-purpose black-box optimization system for various applications including and beyond data augmentation. At Petuum, we’ve used Tuun for AutoML tasks such as hyperparameter optimization (HPO) and neural architecture search (NAS). Tuun provides an interface to specify details about the search space and Bayesian optimization routines for each of these optimization tasks. See here for some additional examples of Tuun in action.

Tuun also supports distributed tuning! If you want to speed up the trials, consider distributed tuning with AdaptDL, another CASL open-source project that’s available as a convenient NNI plug-in.

**Looking Forward**

There are a number of interesting directions we plan to pursue, and we invite potential collaborators and contributors to help us develop the following:

- **Parallel data augmentation search:** Perform data augmentation with high trial parallelism for efficient data augmentation policy discovery in a large distributed system.
- **Transfer learning for data augmentation:** Warm start the models used in our tuning procedure with results from previous data augmentation searches and use past data to guide (and increase the speed of) the search procedure.
- **Integrate with hyperparameter optimization:** Combine data augmentation search with methods of hyperparameter optimization, and implement methods for search over this rich product of search spaces.

# About CASL

CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that were built to work in unison or to be used individually for specific tasks, providing flexibility and ease of use.

Thanks for reading! Please visit the CASL website to stay up to date on upcoming CASL and Forte announcements: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!