Refurbish Your Training Data for Faster DNN Training


by Sue Hyun Park

Figure by Pawara et al.

As a deep learning (DL) task becomes more complex, a DL model needs more weights to boost accuracy. Through deep neural network (DNN) training, the model weights are repeatedly adjusted with respect to the given set of training data. One concern is that state-of-the-art DL models have millions of weights while the number of training samples is typically much smaller. It is therefore critical to enlarge the training dataset so that trained DL models achieve generalization, the ability to properly process unseen data.

Data augmentation is widely used to do the trick. It applies random transformations to existing training samples, producing additional distinct training samples. Take a look at the DNN training pipeline below. Until the target validation accuracy is met, two steps, data preparation and gradient computation, are repeated for multiple epochs. The data augmentation pipeline operates inside the data preparation step. After training data pass through two RandAugment layers, each of which randomly applies one of 14 distortions (e.g., shear, rotate, and solarize), followed by a random crop and random flip layer, the number of distinct augmented samples the model can see grows enormously.

An epoch is a complete pass over the entire training set. A step indicates the processing of a subset of samples, called a mini-batch, within an epoch.

DNN training pipeline including the data augmentation pipeline. Two RandAugment layers are used.
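
To make the pipeline concrete, here is a minimal sketch of such an augmentation pipeline in torchvision, which provides a RandAugment transform; the crop size and the exact RandAugment settings are assumptions for illustration, not the paper's configuration:

```python
import torchvision.transforms as T

# A sketch of the augmentation pipeline above (layer choices follow the
# text; crop size and RandAugment settings are assumptions):
augmentation_pipeline = T.Compose([
    T.RandAugment(num_ops=1),     # first RandAugment layer: one random distortion
    T.RandAugment(num_ops=1),     # second RandAugment layer
    T.RandomResizedCrop(224),     # random crop (assumed ImageNet-style size)
    T.RandomHorizontalFlip(),     # random flip
    T.ToTensor(),                 # convert the PIL image to a tensor
])
```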

Data augmentation gives greater variation in the training set and helps train a more generalized DL model. However, the multi-layered computations in data augmentation often heavily burden the CPU and can be responsible for degrading the training speed.

In this post, we outline our recent publication about a new intermediate data augmentation technique. We propose data refurbishing, a novel sample reuse mechanism that accelerates DNN training while preserving model generalization. We also design and implement a new data loading system, Revamper, to realize data refurbishing.

Our research paper appeared at the 2021 USENIX Annual Technical Conference. We will introduce how we analyzed the training speed bottleneck and derived the idea of data refurbishing, and then show Revamper’s architecture and its performance advantages.

Overhead of Data Augmentation

The data preparation step is generally performed on the CPU. On the other hand, gradient computation requires computationally expensive forward computation and backward propagation, necessitating accelerators like GPUs and TPUs. Thanks to the recent development of specialized hardware accelerators such as NVIDIA A100 and Google TPU v3, gradient computation has gained a dramatic speedup. Meanwhile, the augmentation pipeline performs random transformations through multiple layers, which is computationally burdensome to the CPU. The resulting heavy CPU overhead has become the bottleneck of DNN training.

To measure the impact, we analyze the training throughput of ResNet-50 trained on ImageNet with a varying number of RandAugment layers. Without any RandAugment layer, the training throughput reaches its maximum. As the number of RandAugment layers increases, training throughput decreases notably.

ResNet-50 training speed on ImageNet, varying the number of RandAugment layers. The horizontal line indicates the gradient computation speed on GPU.

Limitations of Existing Approaches and Our Questions

How can we reduce CPU overhead from data augmentation?

There were efforts to reduce computation overhead, but the stochastic nature of data augmentation jeopardized such approaches.

The first approach is to use hardware accelerators with massive parallelism, such as GPUs and FPGAs; recent examples include NVIDIA DALI and TrainBox. However, leveraging massive parallelism is at odds with the stochastic nature of augmentation pipelines, which apply different random transformations to each sample.

The second approach is data echoing from Google, which attempts to cut down the amount of computation by reusing training samples. For better understanding, we first illustrate the standard training with an augmentation pipeline.

A high-level illustration of standard training.

In the standard pipeline, the stochastic augmentation is applied independently in each epoch, so every augmented image produced is unique.

A high-level illustration of data echoing.

In contrast, data echoing caches the augmented samples and then reuses them in successive epochs.

On the one hand, the CPU-heavy augmentation step is executed only a limited number of times, so the overhead decreases significantly. This method has also proven useful for mitigating slow I/O. On the other hand, using the same augmented sample for multiple epochs undermines the stochastic nature of augmentation. This severely hampers the generalization that augmentation was meant to provide, as we will show later in the evaluation.
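
As a rough sketch of the idea (not Google's actual implementation; the class below is purely illustrative), data echoing amounts to caching the fully augmented output and returning it unchanged until its reuse budget runs out:

```python
# A rough sketch of data echoing (illustrative, not Google's implementation):
# each fully augmented sample is cached and returned unchanged until it has
# been used `reuse_factor` times. Labels are omitted for brevity.
class EchoedDataset:
    def __init__(self, dataset, augment, reuse_factor):
        self.dataset = dataset          # raw training samples
        self.augment = augment          # full augmentation pipeline
        self.reuse_factor = reuse_factor
        self.cache = {}                 # index -> (augmented sample, uses left)

    def __getitem__(self, idx):
        if idx in self.cache:
            sample, remaining = self.cache[idx]
        else:                           # cache miss: run the full pipeline
            sample, remaining = self.augment(self.dataset[idx]), self.reuse_factor
        remaining -= 1
        if remaining > 0:
            self.cache[idx] = (sample, remaining)
        else:                           # reuse budget exhausted: evict
            self.cache.pop(idx, None)
        return sample

    def __len__(self):
        return len(self.dataset)
```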

Here we notice that such reuse-based techniques improve training throughput only at the cost of reduced training sample diversity. Therefore, we refine our research question:

How can we reduce the CPU overhead of data augmentation while preserving the generalization that data augmentation provides?

Data Refurbishing

We propose data refurbishing, a simple and effective sample reuse mechanism that keeps the good and throws away the bad of sample reuse. Contrary to data echoing, which treats the whole augmentation pipeline as a black-box operation, data refurbishing splits the pipeline into two parts: partial augmentation and final augmentation. Partially augmented samples are cached once and then "refurbished" by the stochastic final augmentation in every epoch. As a result, this method still produces unique augmented samples, maintaining sample diversity while reducing computation overhead.

A high-level illustration of data refurbishing.

We now explain visually how, under a specific condition, data refurbishing preserves the sample diversity produced by standard training. Taking sample diversity as the performance indicator, there are two hyperparameters to configure:

  • The reuse factor r, which represents how many times to reuse each cached sample.
  • The split strategy, which determines how to split the full augmentation pipeline into the partial and final augmentations.

Through mathematical problem formulation, we plot the relation among the reuse factor, the split strategy, and sample diversity in three dimensions:

The x-axis represents the reuse factor; the y-axis, the ratio of the number of transformations in the final augmentation to that in the full augmentation pipeline; the z-axis, the ratio of the normalized expected number of unique samples to that of standard data augmentation.

We notice that the configurations of standard data augmentation and data echoing can both be represented in our notation. As seen in the plots, standard data augmentation maximizes sample diversity but does not improve throughput at all, while data echoing improves throughput by reusing samples but loses sample diversity. Among the combinations of reuse factors and split strategies, therefore, there must exist a "sweet spot" that provides high throughput along with high sample diversity. Focusing on how sample diversity changes along the y-axis, the split the user should look for is one where the final augmentation consists of transformations that provide sufficient sample diversity with little computation. A combination of a random crop layer and a random horizontal flip layer is a great example: it produces 7 times more augmented samples than one RandAugment layer does, with far fewer CPU cycles.
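
As a minimal sketch under these assumptions (the layer choices follow the ImageNet example above, not necessarily the paper's exact configuration, and `load_sample` is an illustrative helper), the split could look like this:

```python
import torchvision.transforms as T

# Partial augmentation: expensive RandAugment layers, applied only when a
# sample is (re)inserted into the cache.
partial_augment = T.Compose([
    T.RandAugment(num_ops=1),
    T.RandAugment(num_ops=1),
])

# Final augmentation: cheap but stochastic layers, re-applied every epoch so
# that each epoch still sees a distinct augmented image.
final_augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def load_sample(idx, cache, dataset):
    """Sketch of the cache-then-refurbish lookup (eviction handled elsewhere)."""
    if idx not in cache:                              # cache miss: pay full cost
        cache[idx] = partial_augment(dataset[idx])
    return final_augment(cache[idx])                  # cache hit: cheap refresh
```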

Revamper

Given these findings, we design a new data loading system, Revamper, that implements data refurbishing with its optimal configuration. Revamper incorporates data refurbishing into the existing data preparation procedures of DL frameworks such as PyTorch and TensorFlow: users simply replace existing data loading systems such as the PyTorch dataloader or tf.data with Revamper.

A naive implementation of data refurbishing, however, suffers from inconsistent batch processing time. During data loading, cached samples require only the final augmentation, whereas non-cached samples require both the partial and the final augmentation. As a result, the CPU processing time of each step fluctuates with the number of cache misses, in contrast to the consistent gradient computation time on DL accelerators. This discrepancy results in poor computation overlap between the CPU and DL accelerators.

Challenge: inconsistent batch processing time

We design Revamper to maximize the computation overlap by keeping the number of cache misses constant both across epochs and within each epoch. There are two techniques:

  • balanced eviction to address the inter-epoch computation skew, and
  • cache-aware shuffle to address the intra-epoch computation skew.

To support these two techniques, the architecture of Revamper differs from traditional data loading systems in the following ways:

  • The cache store is added to store partially augmented samples.
  • The evict shuffler is added in the main process to select the indices to be evicted from the cache store according to balanced eviction.
  • The batch shuffler is modified to sample mini-batch indices according to the cache-aware shuffle.

The architecture of Revamper and its end-to-end data preparation procedures. For further details on the procedure (steps 1~9), please refer to our paper.

We will now explain how balanced eviction and cache-aware shuffle are adopted in Revamper's modules to keep the CPU processing time of each step constant.

Balanced Eviction

Before each training epoch starts, the evict shuffler samples N/r indices to be evicted, where N denotes the number of training samples and r denotes the reuse factor. The evict shuffler samples these indices without replacement and repeats the same sampling order until the end of training. In effect, the same number of partially augmented (i.e., cached) samples are evicted in each epoch, in contrast to a naive reference count algorithm. The computational overhead is thus evenly distributed across epochs after the first epoch, and all N cached samples are reused exactly r times after the first r epochs.

An example distribution of cache misses with a naive reference count algorithm and the balanced eviction.
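
A minimal sketch of this eviction schedule (simplified; `balanced_eviction_schedule` is an illustrative helper, not Revamper's actual code):

```python
import random

def balanced_eviction_schedule(num_samples, reuse_factor):
    """Yield the set of cache indices to evict before each epoch.

    A single random permutation is cycled through in chunks of
    num_samples // reuse_factor, so the number of cache misses per epoch
    stays constant and, after the first r epochs, every cached sample has
    been reused exactly `reuse_factor` times.
    """
    order = list(range(num_samples))
    random.shuffle(order)                       # fixed sampling order
    chunk = num_samples // reuse_factor         # N / r evictions per epoch
    pos = 0
    while True:
        evicted = {order[(pos + i) % num_samples] for i in range(chunk)}
        pos = (pos + chunk) % num_samples
        yield evicted
```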

Cache-aware Shuffle

After the eviction instructed by the evict shuffler, the main process allocates to each worker mini-batch indices sampled by the batch shuffler. Since Revamper already knows the indices of evicted samples before each training epoch, the batch shuffler distributes samples so that each mini-batch has the same ratio of cached to non-cached samples. Within each group, both non-cached and cached indices are randomly sampled. By doing so, unnecessary waiting between the CPU and the DL accelerator is eliminated.

An example illustration of CPU and DL accelerator utilization with and without the cache-aware shuffle.
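
A minimal sketch of this batch composition (an illustrative helper, not Revamper's actual batch shuffler; `evicted_indices` is assumed to be the set yielded by the eviction schedule above):

```python
import random

def cache_aware_shuffle(all_indices, evicted_indices, batch_size, reuse_factor):
    """Compose mini-batches so that each one contains roughly
    batch_size / reuse_factor non-cached (evicted) samples, keeping the
    per-batch CPU work roughly constant."""
    evicted = [i for i in all_indices if i in evicted_indices]
    cached = [i for i in all_indices if i not in evicted_indices]
    random.shuffle(evicted)
    random.shuffle(cached)

    misses_per_batch = max(1, batch_size // reuse_factor)
    hits_per_batch = batch_size - misses_per_batch
    batches = []
    while evicted or cached:
        batch = evicted[:misses_per_batch] + cached[:hits_per_batch]
        evicted = evicted[misses_per_batch:]
        cached = cached[hits_per_batch:]
        random.shuffle(batch)            # mix cached and non-cached samples
        batches.append(batch)
    return batches
```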

Evaluation

We implement Revamper in 2000+ lines of Python code based on PyTorch 1.6. It replaces the existing PyTorch dataloader with an identical interface, except for additional parameters such as the reuse factor and the split strategy.

We evaluate data refurbishing implemented in Revamper against the following baselines:

  • Standard: DNN training with full augmentation without any reuse mechanism
  • Data echoing: caching and reusing fully augmented samples
  • Simplified: A standard setting where one or more transformation layers are removed to reduce the computation overhead of data augmentation

We use identical model hyperparameters (e.g., initial learning rate per batch size, learning rate scheduling, and the number of total training epochs) for each setting.

We train ResNet-50 on ImageNet with RandAugment, using Revamper and the other baselines. Compared to the standard setting, Revamper shows better training throughput while maintaining comparable accuracy. Data echoing and the simplified setting gain throughput but suffer a significant drop in accuracy, which is a poor trade-off.

Training throughput and model validation accuracy of ResNet-50 trained on ImageNet with diverse settings using RandAugment.

Revamper shows the largest throughput improvement when CPU resources are scarce. As the performance of DL accelerators rapidly increases, we expect that more training jobs will benefit from Revamper in the near future.

Training throughput of ResNet-50 on ImageNet for varying CPU-GPU ratios.

Conclusion

Research on accelerating DL training will not cease anytime soon. Model sizes, dataset sizes, and the amount of computation are scaling up to unprecedented magnitudes, and reducing the time, money, and effort spent on training has become a major challenge for practitioners. Given the massive carbon footprint of training AI, it is also imperative to make machine learning cleaner and greener by employing efficient algorithms. (See Green AI)

So, as much as we value higher model accuracy and generalization, we aim to resolve the training bottleneck introduced by data augmentation. We design a novel sample reuse mechanism, data refurbishing, to achieve this goal, and the associated data loading system, Revamper, improves the training throughput of DNN models by 1.03x-2.04x while maintaining comparable accuracy.

We hope Revamper will speed up training jobs that incorporate data augmentation. We also hope this work will encourage further research that rethinks well-studied systems topics, such as caching, in the new context of deep learning.

Acknowledgment

This blog post is based on the following paper:

  • Gyewon Lee, Irene Lee, Hyeonmin Ha, Kyunggeun Lee, Hwarim Hyun, Ahnjae Shin and Byung-Gon Chun. “Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training.” USENIX ATC. 2021. (slides and paper)

We would like to thank Gyewon Lee for providing valuable insights to this blog post.

This post was originally published on our Notion blog on July 31, 2021.
