How Adobe Stock Accelerated Deep Learning Model Training using a Multi-GPU Approach

Landmarks in images are the building blocks of image recognition — identifying and flagging these landmarks, in a service like Adobe Stock, is key to properly categorizing them and identifying any key intellectual property issues with them. As part of the Stock team, we set out to build a content service, using Adobe Sensei, our artificial intelligence and machine learning technology, that would quickly and effectively detect landmarks in the hundreds of thousands of images submitted by Stock contributors, every week.

To avoid any bottlenecks in the process, we needed to move away from a single-GPU machine to a parallel computing approach with multi-GPU machines. We also leveraged the Adobe Sensei Content Framework, which is our internal platform built to accelerate the creation of AI and ML services at Adobe and to foster collaboration and reuse of these intelligent services amongst product teams. The Content Framework leverages best-in-class open source technologies such as TensorFlow in order to train new services using Adobe's own data.

Our big goal was to build a classifier service that flags images with landmark-related intellectual property (IP) issues, and we managed to do this using multi-GPU training in TensorFlow. While TensorFlow is our preferred ML framework, the workflow below applies to any framework. Read on to learn what we did and how we solved the key challenges that deep learning parallelism presents.

Eliminating the bottleneck in our landmark-flagging service

Adobe Stock’s landmark dataset contains more than two million assets, and data augmentation makes it even bigger. With a single-GPU machine, the training job took around 7–8 days, and that caused a major bottleneck that made it difficult to align our deliverables with product releases. At the same time, we wanted to run multiple experiments to tune the model hyper-parameters. To train the model in a reasonable time, it became clear we needed to explore parallel computing with multi-GPU machines.

Deep learning parallelism

In general, a distributed training process runs on multiple GPUs or on multiple nodes/machines. There are several techniques to perform distributed training, two of the most popular being data parallelism and model parallelism. Both techniques require dividing computation across parallel resources, and we explored both of them when deciding how to implement multi-GPU ML training.

Data parallelism

The first technique is data parallelism, where the same model is replicated on every worker but each replica is given a different subset of the data. For neural networks this means using the same weights but a different mini-batch for each worker, and the gradients need to be synchronized (averaged) after each pass through a mini-batch. After every few iterations, all replicas synchronize, either with one another (all-reduce) or via a central server (parameter server). This usually scales up nicely and yields a substantial speedup.

Figure 1: Data Parallelism
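To make this concrete, here is a minimal sketch of data parallelism using TensorFlow's built-in tf.distribute.MirroredStrategy, which replicates the model onto every visible GPU and averages the gradients with an all-reduce after each step. The tiny model and random data are placeholders rather than our landmark network; this is just to illustrate the pattern.

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and
# synchronizes (averages) the gradients after each step: data parallelism.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The global batch is split evenly across the replicas.
global_batch_size = 256

with strategy.scope():
    # A tiny classifier stands in for the real network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Placeholder data; in practice this would be an image pipeline.
x = tf.random.normal((4096, 2048))
y = tf.random.uniform((4096,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch_size)

model.fit(dataset, epochs=2)
```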

Model parallelism

Model parallelism, on the other hand, uses the same data for every worker but splits the model itself across them for parallel training. The weights of the net are divided among the workers, and all workers operate on the same mini-batch; the outputs after each split layer therefore have to be synchronized (stacked) before they can be fed to the next layer.

This approach typically helps with models that have a large memory footprint: we can split the model across multiple GPUs and have each GPU compute a part of it. It requires a careful understanding of the model and its computational dependencies.

Figure 2: Model Parallelism
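For contrast, here is a rough sketch of what model parallelism can look like in TensorFlow, with different blocks of layers pinned to different GPUs via tf.device. The layer sizes are arbitrary and the device names assume a machine with at least two GPUs; this is not an approach we ended up using.

```python
import tensorflow as tf

# Model parallelism: the layers of a single network are split across devices,
# and every device works on the same mini-batch.
inputs = tf.keras.Input(shape=(2048,))

with tf.device("/GPU:0"):
    # The first half of the network lives on GPU 0 ...
    x = tf.keras.layers.Dense(4096, activation="relu")(inputs)
    x = tf.keras.layers.Dense(4096, activation="relu")(x)

with tf.device("/GPU:1"):
    # ... and the second half on GPU 1. The activations produced on GPU 0
    # are copied over between the two blocks, which is the synchronization
    # point described above.
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```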

Due to the challenges inherent in model parallelism, we chose to use data parallelism to speed up model training.

How to implement data parallelism at your organization

Let me share our approach and the advantages it offered. The idea of data parallelism is simple: start with multiple copies of the model, train them on different subsets of the data, and then synchronize the gradients applied to their weights and biases. A replica of the entire model is kept on each worker machine or GPU, with each replica processing a different shard of the data. Finally, we need a way to combine the results and synchronize the model weights across the workers.

While there are numerous techniques for implementing data parallelism, parameter concatenation is a simple approach that works well.

Parameter concatenation

Parameter concatenation (often described as parameter averaging) is the conceptually simplest approach to data parallelism. To execute training, follow the steps below (see Figure 1 above and the sketch after the list):

  1. Initialize the network parameters.
  2. Distribute a copy of the current parameters to each worker/GPU.
  3. Split the data into as many shards as there are workers/GPUs.
  4. Train each worker on a subset of the data.
  5. Average the parameters from each worker/GPU.
  6. Set the global parameters to this averaged value.
  7. Go to step 2 if there is more data to process.
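The loop below is a framework-agnostic sketch of these steps in plain NumPy. The worker_gradient function and the synthetic least-squares problem are stand-ins for a real network's forward/backward pass, not part of our actual pipeline.

```python
import numpy as np

def worker_gradient(params, x_shard, y_shard):
    # Steps 2 and 4: one worker's forward/backward pass on its shard,
    # using its own copy of the current parameters (a least-squares
    # gradient stands in for a full network's backprop).
    error = x_shard @ params - y_shard
    return 2.0 * x_shard.T @ error / len(x_shard)

def train_data_parallel(x, y, num_gpus=4, steps=500, batch_size=256, lr=0.05):
    params = np.zeros(x.shape[1])                    # step 1: initialize the parameters
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.choice(len(x), batch_size, replace=False)
        x_shards = np.array_split(x[idx], num_gpus)  # step 3: one shard per worker/GPU
        y_shards = np.array_split(y[idx], num_gpus)
        grads = [worker_gradient(params.copy(), xs, ys)   # steps 2 and 4
                 for xs, ys in zip(x_shards, y_shards)]
        params -= lr * np.mean(grads, axis=0)        # steps 5 and 6: average, then update globally
        # step 7: loop back while there is more data to process
    return params

# Usage on synthetic data: the learned parameters approach true_w.
rng = np.random.default_rng(1)
x = rng.normal(size=(10_000, 16))
true_w = rng.normal(size=16)
params = train_data_parallel(x, x @ true_w)
```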

So, for example, if we have eight GPUs and a mini-batch with 256 examples, we will:

  • Split the data into eight batches of 32 examples per GPU.
  • Feed each batch through the net and obtain gradients for it.
  • Average the gradients from all GPUs and update the parameters.

The challenges of implementing data parallelism

The biggest challenge with the data parallelism approach is that, during the backward pass, we have to pass the whole gradient to all other GPUs. Because of this, data parallelism does not scale linearly with the number of GPUs.

One way to reduce the size of the gradients is to reduce the number of parameters, for example with max pooling, ‘maxout’ units, or convolutional layers instead of fully connected ones. Another way to improve scalability is to increase the ratio of computation time to communication time, for example by using optimization techniques like RMSProp: the gradients still take the same time to exchange, but more time is spent on computation, which increases the utilization of the GPUs.

Despite these challenges, if you understand the bottlenecks of the model, data parallelism can significantly improve your training time. One notable experiment by Alex Krizhevsky achieved a speedup of 3.74x using four GPUs and 6.25x using eight GPUs. His system featured two CPUs and eight GPUs in one node; he could use the full PCIe speed for the two sets of four GPUs, plus a relatively fast PCIe connection between the CPUs, to distribute the data among all eight GPUs efficiently.

Data parallelism support in Keras/TensorFlow

In a multi-GPU setup, it’s often best to update the model synchronously, storing the weights in CPU DRAM. In a multi-machine setup, by contrast, we often use a separate ‘parameter server’ that stores and propagates the weight updates. To run TensorFlow on multiple GPUs, you can construct the model in a multi-tower fashion, where each tower is assigned to a different GPU.

In general, training on multiple GPUs in a single machine is much more efficient: empirically, it can take more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine. Distributed training, however, lets you scale to even larger numbers of GPUs and harness more CPU power.
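As an illustration of that multi-tower idea, the sketch below keeps one set of Keras weights in CPU memory, computes gradients for a different slice of each batch on each GPU, and averages them before applying a single synchronous update. It assumes a machine with two visible GPUs and is a simplified stand-in, not the exact code we ran.

```python
import tensorflow as tf

towers = ["/GPU:0", "/GPU:1"]  # one tower per GPU; assumes two visible GPUs

# Keep the shared weights in host (CPU) memory; every tower reads them and
# contributes gradients toward one synchronous update.
with tf.device("/CPU:0"):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
optimizer = tf.keras.optimizers.RMSprop()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(x, y):
    # Split the incoming batch into one slice per tower.
    x_slices = tf.split(x, len(towers))
    y_slices = tf.split(y, len(towers))
    tower_grads = []
    for device, xs, ys in zip(towers, x_slices, y_slices):
        with tf.device(device):
            # Forward and backward pass for this tower's slice,
            # against the shared weights.
            with tf.GradientTape() as tape:
                loss = loss_fn(ys, model(xs, training=True))
            tower_grads.append(tape.gradient(loss, model.trainable_variables))
    # Synchronous update: average the per-tower gradients and apply them once.
    mean_grads = [tf.add_n(list(grads)) / len(towers) for grads in zip(*tower_grads)]
    optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
    return loss
```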

Multi-GPU training on Keras/TensorFlow

To create a multi-GPU model in Keras/TensorFlow, you divide the inputs and replicate the model onto each GPU, as I mentioned above, and then use the CPU to combine the results from each GPU into a single model. Note that the incoming batch is split into equal chunks, one per GPU, so the batch size should be divisible by the number of GPUs.
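In the Keras of that era, the quickest way to get this wrapping was the multi_gpu_model utility, which inserts the batch-slicing Lambda layers and the CPU-side concatenation you can see in Figures 3.1 and 3.2 below. The sketch here uses one of our baseline architectures but placeholder settings, and note that multi_gpu_model has since been removed from recent TensorFlow releases in favor of tf.distribute.MirroredStrategy.

```python
import tensorflow as tf
from tensorflow.keras.applications import Xception
from tensorflow.keras.utils import multi_gpu_model  # Keras 2.x / TF <= 2.3 only

num_gpus = 8  # illustrative; use however many GPUs your machine has

# Build the base model on the CPU so its weights live in host memory.
with tf.device("/cpu:0"):
    base_model = Xception(weights=None, input_shape=(299, 299, 3),
                          classes=2)  # placeholder class count

# The wrapper adds Lambda layers that slice each incoming batch into num_gpus
# chunks, run one chunk per GPU, and concatenate the per-GPU outputs on the CPU.
parallel_model = multi_gpu_model(base_model, gpus=num_gpus)
parallel_model.compile(optimizer="rmsprop",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])

# Train with a batch size that is a multiple of num_gpus, e.g. 256 = 8 x 32.
# parallel_model.fit(train_data, epochs=10)
```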

Figure 3.1: Multi-GPU model

Figure 3.1 shows the multi-GPU model, while Figure 3.2 shows the extracted Lambda layer in the multi-GPU model.

Figure 3.2: Multi-GPU model with extracted Lambda layer

The end results of multi-GPU training

The end results show a dramatic improvement in training time with our baseline Xception and VGG19 models. Here are the experiment details:

Figure 4: Xception model training performance
Figure 5: VGG19 model training performance

Currently, it takes one or two days for us to train our models, compared to a whopping seven or eight days before. This is a massive 80 percent drop in training time, with no impact on model accuracy.

We also learned an important lesson along the way: it’s important to keep all the GPUs busy for successful multi-GPU training. If you don’t plan carefully, a multi-GPU setup can even take more time than a single-GPU machine, since GPU idle time will offset the performance gains and gradient sharing will slow training down. To keep the GPUs busy, we used Keras’s multiprocessing support and batch-generator queues to make sure that batches, with all of Keras’s data-augmentation transformations already applied, are always ready to feed.
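For example, with the Keras 2-era generator API, prefetching augmented batches looks roughly like the sketch below. The path and parameter values are illustrative, parallel_model refers to a multi-GPU model like the one sketched earlier, and the workers, use_multiprocessing, and max_queue_size arguments have been removed in newer Keras versions in favor of tf.data pipelines.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# CPU-side data augmentation, applied while the GPUs are busy training.
datagen = ImageDataGenerator(rescale=1.0 / 255,
                             rotation_range=15,
                             horizontal_flip=True)

train_gen = datagen.flow_from_directory("data/train",  # illustrative path
                                        target_size=(299, 299),
                                        batch_size=256)

# workers and use_multiprocessing keep a queue of already-augmented batches
# ready to feed, so the GPUs never sit idle waiting for input.
parallel_model.fit(train_gen,
                   epochs=10,
                   workers=8,
                   use_multiprocessing=True,
                   max_queue_size=16)
```

You can use similar approaches with other frameworks, of course. We hope that our experience with multi-GPU training is helpful to you as you think about approaches to future AI/ML projects!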
