
Distributed Transfer Learning with 4th Gen Intel Xeon Scalable Processor

Leverage Hardware and Software Optimizations for TensorFlow to More Efficiently Train Deep Learning Models

Lakshmi Arunachalam
5 min read · Oct 6, 2023


Lakshmi Arunachalam, Fahim Mohammad, and Vrushabh Sanghavi, Intel Corporation

Imagine how kids learn to color with crayons. It may take a few days for them to learn how to hold the crayon, stay inside the lines, and so on. They go through lots of crayons and coloring books before they get the hang of it. Then, they can easily apply those skills to colored pencils, pastels, paint, etc. They don’t have to start from scratch because they already have a foundation from coloring with crayons. This is what transfer learning is about: instead of starting from scratch and needing more time and resources, we can use skills already learned and fine-tune them for a similar task.

In the world of artificial intelligence, transfer learning has emerged as a powerful technique. In this article, we explore how transfer learning, coupled with Intel Xeon Scalable Processors, specifically the 4th Gen Intel Xeon Scalable Processor, defies the conventional belief that training is best done on a GPU. We present a case study that achieves near state-of-the-art accuracy on a publicly available TensorFlow image classification dataset using Intel Advanced Matrix Extensions (Intel AMX) and distributed training with Horovod.

Image Classification with Transfer Learning

To illustrate the power of transfer learning, let’s consider a case study using image classification to identify colorectal cancer. We started with pretrained ResNet50 v1.5 weights and fine-tuned only the last classification layer using a TensorFlow dataset of 5,000 images, 4,000 of which were set aside for training (Figure 1). This approach allowed us to build on the knowledge acquired during pretraining and achieve close to state-of-the-art accuracy (94.5% for this dataset). Data augmentation was used as a preprocessing step, and an early stopping criterion with a patience of ten epochs was used to stop training once convergence was reached. The pipeline showed run-to-run variation of 6–7 epochs, converging in 45 epochs on average. The advantage of transfer learning lies in its ability to significantly reduce the time and resources needed for training while delivering impressive results. A minimal code sketch of this pipeline follows Figure 1.

Figure 1. Visual transfer learning pipeline
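
For readers who want to try a similar pipeline, below is a minimal sketch of the fine-tuning setup. It assumes the TFDS colorectal_histology dataset (5,000 images, eight classes), uses a Keras ResNet50 backbone as a stand-in for the ResNet50 v1.5 weights from the study, and picks illustrative hyperparameters (image size, batch size, learning rate) that are not taken from the article.

import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE, BATCH, NUM_CLASSES = 224, 32, 8

def preprocess(image, label):
    # Resize and apply the ResNet preprocessing expected by the pretrained weights
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return tf.keras.applications.resnet50.preprocess_input(image), label

# 4,000 images for training, the remaining 1,000 for validation (illustrative split)
train_ds = tfds.load("colorectal_histology", split="train[:80%]", as_supervised=True)
val_ds = tfds.load("colorectal_histology", split="train[80%:]", as_supervised=True)
train_ds = train_ds.map(preprocess).shuffle(1000).batch(BATCH).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.map(preprocess).batch(BATCH).prefetch(tf.data.AUTOTUNE)

# Simple data augmentation applied during training
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Pretrained backbone with frozen weights; only the new classification layer is trained
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling="avg")
base.trainable = False

inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = augment(inputs)
x = base(x, training=False)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Early stopping with a patience of ten epochs, as in the case study
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])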

Leveraging Intel Xeon Scalable Processors

Training deep learning models has typically been done on GPUs, but we see a paradigm shift with Intel Xeon Scalable CPUs. By utilizing Intel AMX with BF16 precision, we achieved a remarkable accuracy of 94.5%, with convergence in just 43 epochs. The entire training process took about five minutes on a single processor, showcasing the speed and efficiency of the TensorFlow optimizations from Intel, which are powered by the Intel oneAPI Deep Neural Network Library (oneDNN) and include vectorized convolution, normalization, activation, inner product, and other operations.

We recommend the following to maximize performance on Intel Xeon Scalable Processors:

  • Use mixed precision: Leverage the Intel AMX BF16 format by enabling auto-mixed precision in TensorFlow. BF16 preserves the dynamic range of FP32 (unlike FP16) while delivering much higher throughput than FP32. In our case study, we achieved similar accuracy with BF16 compared to FP32.
  • Use the numactl utility: On NUMA systems, accessing memory on the local socket is faster than accessing memory on a remote socket. To avoid performance penalties from remote memory access, bind the process and its memory to one socket with numactl. With hyperthreading enabled, a command such as numactl -C 0-55,112-167 -m 0 python train.py pins the process to socket 0’s logical CPUs and binds its memory to NUMA node 0.
  • Define run-time parameters: Inter-op parallelism controls how many independent operations TensorFlow executes concurrently, while intra-op parallelism controls how many threads are used to execute a single operation (for example, splitting one large matrix multiplication across cores). For this case study, both inter-op and intra-op parallelism were set to 56 threads (the number of cores per socket). Additionally, we set the following environment variables (a combined setup sketch follows this list):
KMP_SETTINGS = 1
KMP_BLOCKTIME = 1
OMP_NUM_THREADS = NUM_CORES (56)
KMP_AFFINITY = granularity=fine,compact,1,0
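
Putting these recommendations together, here is a minimal setup sketch for a single-socket run. The NUM_CORES value assumes a 56-core socket, and the numactl launch line is the one shown above; both should be adjusted to your system.

import os

NUM_CORES = 56  # cores per socket on the 4th Gen Xeon system used here

# OpenMP / KMP settings from the list above; set them before TensorFlow is imported
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["OMP_NUM_THREADS"] = str(NUM_CORES)
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# Auto-mixed precision lets oneDNN dispatch BF16 kernels to Intel AMX
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

# Inter-op and intra-op thread pools, both set to the core count as in the case study
tf.config.threading.set_inter_op_parallelism_threads(NUM_CORES)
tf.config.threading.set_intra_op_parallelism_threads(NUM_CORES)

# Launch bound to socket 0, e.g.:
#   numactl -C 0-55,112-167 -m 0 python train.py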

Empowering Multi-Socket Performance with Distributed Training

Our 4th Gen Intel Xeon Scalable server is a two-socket system with 56 cores per processor. To maximize performance, we employed distributed training with Horovod, using OpenMPI as the backend. Horovod, an open-source distributed training framework developed by Uber, supports popular deep learning frameworks like TensorFlow, PyTorch, and MXNet. By leveraging MPI, Horovod efficiently distributes training data and synchronizes model parameters across multiple devices, resulting in faster training times. Using all 112 cores with hyperthreading enabled, we achieved an impressive training time of around three minutes, comparable to out-of-the-box training on an NVIDIA A100 GPU in a DGX-A100 system (Figure 2).

Figure 2. Competitive performance results for the transfer learning workload (hardware and software configuration details below)

Our distributed training setup used weak scaling, keeping the per-worker batch size constant. Training was performed using Horovod with two workers per system, each worker mapped to one socket. The dataset was split in half, with each half assigned to one worker. To reduce communication overhead, gradients were averaged every five epochs instead of after every epoch. The training process used the Horovod distributed optimizer with a warmup period of three epochs, and the base learning rate was scaled by the number of workers to an effective rate of 0.002. To optimize intra-op parallelism, the number of threads was set to 54 (the number of cores per socket minus two).
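
A minimal sketch of the Horovod side of this setup is shown below. The build_model and train_ds names are placeholders for the pipeline sketched earlier, the base learning rate and warmup settings mirror the description above, and Horovod’s standard per-step gradient averaging is used rather than the every-five-epochs averaging, which would require custom synchronization logic.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Scale the base learning rate by the number of workers (e.g., 0.001 * 2 = 0.002)
base_lr = 0.001
scaled_lr = base_lr * hvd.size()
opt = tf.keras.optimizers.SGD(learning_rate=scaled_lr, momentum=0.9)

# Wrap the optimizer so gradients are averaged across workers via MPI allreduce
opt = hvd.DistributedOptimizer(opt)

model = build_model()  # placeholder: e.g., the frozen-ResNet model from the earlier sketch
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Make sure every worker starts from identical initial weights
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Ramp the learning rate up over the first three epochs
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
]

# Each worker trains on its own half of the dataset
shard = train_ds.shard(num_shards=hvd.size(), index=hvd.rank())
model.fit(shard, epochs=50, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

# Example launch with one worker per socket (exact binding flags depend on the MPI setup):
#   mpirun -np 2 --map-by ppr:1:socket python train_hvd.py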

Conclusion

Transfer learning has proven to be a game-changer in AI, enabling us to build on existing knowledge and achieve outstanding results with minimal time and resources. The successful application of transfer learning on Intel Xeon Scalable Processors challenges the GPU-centric training mindset and offers a compelling alternative for high-performance image classification.

Hardware and Software Configuration Details

3rd Gen Intel Xeon Scalable Processor (ICX): Tests performed by Intel on 10/21/2022. 1-node, 2x Intel Xeon Platinum 8380, 40 cores, HT On, Turbo On, total memory 1024 GB (16 slots/64 GB/3200 MHz), SE5C620.86B.01.01.0005.2202160810, 0xd000375, Ubuntu 22.04.1 LTS, 5.15.0-48-generic, n/a, Vision Transfer Learning Pipeline, Intel-tensorflow-avx512 2.10.0, resnet50v1_5, n/a.

4th Gen Intel Xeon Scalable Processor (SPR): Tests performed by Intel on 10/21/2022. 1-node, 2x Intel Xeon Platinum 8480+, 56 cores, HT On, Turbo On, total memory 1024 GB (16 slots/64 GB/4800 MHz), EGSDREL1.SYS.8612.P03.2208120629, 0x2b000041, Ubuntu 22.04.1 LTS, 5.15.0-48-generic, n/a, Vision Transfer Learning Pipeline, Intel-tensorflow-avx512 2.10.0, resnet50v1_5, n/a.

NVIDIA-A100: Tests performed by Intel on 10/26/2022. 1-node (DGX-A100), 2x AMD EPYC 7742 64-Core Processor, 64 cores, HT On, Turbo On, total memory 1024 GB (16 slots/64 GB/3200 MHz), NVIDIA A100 GPU, BIOS 1.1, 0x830104d, Ubuntu 20.04.2 LTS, 5.4.0-81-generic, n/a, Vision Transfer Learning Pipeline, TensorFlow 2.10, resnet50v1_5, n/a.
