TensorFlow 2.0 Tutorial: Optimizing Training Time Performance

Raphaël Meudec · Published in Sicara's blog · Jan 30, 2020

This tutorial explores how you can improve the training time of your TensorFlow 2.0 model in three areas:

  • tf.data
  • Mixed Precision Training
  • Multi-GPU Training Strategy

I applied all these tricks to a custom image deblurring project, and the results are impressive: depending on your current pipeline, you can get a 2–10x training time speed-up.

Use case: Improving TensorFlow training time of an image deblurring CNN

Two years ago, I published a blog post on Image Deblurring with GANs in Keras. I thought porting the repository to TF 2.0 would be a nice way to understand what has changed and what the implications are for my code. In this article, I’ll train a simpler version of the model (the CNN part only).

The model is a convolutional network which takes a (256, 256, 3) blurred patch and predicts the corresponding (256, 256, 3) sharp patch. It is based on the ResNet architecture and is fully convolutional.
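
To make the setup concrete, here is a minimal sketch of such a fully convolutional, ResNet-style network. It is a simplified stand-in (the layer counts and filter sizes are placeholders), not the exact architecture from the original repository.

import tensorflow as tf
from tensorflow.keras import layers

def res_block(x, filters=64):
    # Basic residual block: two 3x3 convolutions plus a skip connection.
    shortcut = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([shortcut, x]))

def build_deblur_cnn(num_blocks=4):
    # Fully convolutional: (256, 256, 3) blurred patch in, (256, 256, 3) sharp patch out.
    inputs = tf.keras.Input(shape=(256, 256, 3))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    for _ in range(num_blocks):
        x = res_block(x)
    outputs = layers.Conv2D(3, 3, padding="same")(x)
    return tf.keras.Model(inputs, outputs)

model = build_deblur_cnn()
model.compile(optimizer="adam", loss="mean_squared_error")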

Step 1: Identify bottlenecks

To optimize training speed, you want your GPUs to be running at 100% utilization. nvidia-smi is fine for checking that your process is actually running on the GPU, but when it comes to GPU monitoring, there are smarter tools out there. Hence, the first step of this TensorFlow tutorial is to explore these better options.

nvtop

If you’re using an Nvidia card, the simplest way to monitor GPU utilization over time is probably nvtop. Its visualization is friendlier than nvidia-smi’s, and you can track metrics over time.

nvtop screenshot

TensorBoard Profiler

TensorBoard Profiler screenshot

By simply setting profile_batch={BATCH_INDEX_TO_MONITOR} inside the TensorBoard callback, TF adds a full report of the operations performed by the CPU and the GPU for the given batch. This can help identify whether your GPU is stalled at some point for lack of data.
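
As a quick illustration, the callback can be wired up as follows. The batch index 10 is an arbitrary choice, and `dataset` is a placeholder for your training tf.data.Dataset (one is sketched in Step 2).

import tensorflow as tf

# Profile the operations executed during batch 10 of training (arbitrary index).
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profiling",
    profile_batch=10,
)

# `model` is the compiled Keras model from above; `dataset` yields (blurred, sharp) batches.
model.fit(dataset, epochs=5, callbacks=[tensorboard_callback])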

RAPIDS NVDashboard

This is a JupyterLab extension which gives access to various metrics. Along with your GPU, you can also monitor elements of your machine (CPU, disks, ...). The advantage is that you don’t have to monitor a specific batch, but can instead look at performance over the whole training run.

Here, we can easily spot that the GPU is at 40% utilization most of the time. I have activated only one of the two GPUs on the machine, so total utilization is around 20%.

Step 2: Optimize your tf.data pipeline

The first objective is to make the GPU busy 100% of the time. To do so, we want to reduce the data loading bottleneck. If you are using a Python generator or a Keras Sequence, your data loading is probably sub-optimal. Even if you’re using tf.data, data loading can still be an issue. In my article, I initially used Keras Sequences to load the images.

You can easily spot this phenomenon using the TensorBoard profiler: GPUs tend to sit idle while CPUs are performing multiple operations related to data loading.

Making the switch from the original Keras Sequences to tf.data was fairly easy. Most data loading operations are well supported; the only tricky part is taking the same patch from the blurred image and the sharp one (see the sketch below).
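
To give a concrete idea, here is a sketch of such a pipeline. The file paths and the decoding helper are hypothetical, and stacking the blurred and sharp images before tf.image.random_crop is one way to guarantee that the same patch is extracted from both.

import tensorflow as tf

PATCH_SIZE = 256
AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_image_pair(blur_path, sharp_path):
    # Read and decode both images; they are assumed to have identical dimensions.
    blur = tf.image.decode_png(tf.io.read_file(blur_path), channels=3)
    sharp = tf.image.decode_png(tf.io.read_file(sharp_path), channels=3)
    return tf.cast(blur, tf.float32) / 255.0, tf.cast(sharp, tf.float32) / 255.0

def same_random_patch(blur, sharp):
    # Stack the two images so a single random crop selects the same region in both.
    stacked = tf.stack([blur, sharp], axis=0)  # shape (2, H, W, 3)
    patches = tf.image.random_crop(stacked, size=(2, PATCH_SIZE, PATCH_SIZE, 3))
    return patches[0], patches[1]

# Hypothetical placeholder paths; replace with your actual dataset file lists.
blur_paths = ["data/blur/0001.png", "data/blur/0002.png"]
sharp_paths = ["data/sharp/0001.png", "data/sharp/0002.png"]

dataset = (
    tf.data.Dataset.from_tensor_slices((blur_paths, sharp_paths))
    .map(load_image_pair, num_parallel_calls=AUTOTUNE)
    .map(same_random_patch, num_parallel_calls=AUTOTUNE)
    .batch(4)
    .prefetch(AUTOTUNE)
)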
