
Accelerate AI Model Performance on the Alder Lake Platform

Faster AI Inference with Intel Optimization for TensorFlow

Vivek Kumar
4 min read · May 13, 2022


Vivek Kumar, AI Software Architect; AG Ramesh, Principal Engineer; Banikumar Maiti, Deep Learning Software Engineer; and Geetanjali Krishna, AI Software Engineering Manager; Intel Corporation

TensorFlow is a widely used deep learning (DL) framework. Intel has been collaborating with Google to optimize TensorFlow for Intel platforms using the Intel oneAPI Deep Neural Network Library (oneDNN), an open-source, cross-platform library for DL applications. The TensorFlow optimizations enabled via oneDNN accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.

If you’re using TensorFlow (version 2.8 or later) on a Microsoft Windows desktop or laptop, you can now get these performance benefits on Intel’s latest Alder Lake platform. In addition, we are adding support for low-precision inference using INT8 quantization and VNNI instructions for neural network calculations.

Intel Optimization for TensorFlow (version 2.8) provides up to 3.2x higher throughput than stock TensorFlow. Figures 1 and 2 show the performance improvements observed for common image recognition models from the Intel AI Model Zoo. Throughput and latency benchmarks were run on a 12th Gen Intel Core Alder Lake-S CPU (i9-12900K @ 3.2 GHz, 32 GB RAM, no GPU) running Microsoft Windows 11 Pro (version 10.0.22000). A batch size of 8 and FP32 precision were used for throughput (frames per second) measurements (Figure 1). A batch size of 1 and FP32 precision were used for latency (time per inference in milliseconds) measurements (Figure 2). Lower latency gives better responsiveness.
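For intuition on how the two metrics relate, a minimal sketch in plain Python (the timings below are made-up illustrative numbers, not the benchmarked results):

```python
# Throughput vs. latency: illustrative arithmetic only (hypothetical timings).
def throughput_fps(batch_size: int, seconds_per_batch: float) -> float:
    """Frames per second processed at a given batch size."""
    return batch_size / seconds_per_batch

def latency_ms(seconds_per_batch: float) -> float:
    """Time per inference in milliseconds (measured at batch size 1)."""
    return seconds_per_batch * 1000.0

# Hypothetical: a batch of 8 takes 40 ms; a single image takes 9 ms.
print(throughput_fps(8, 0.040))  # 200.0 FPS
print(latency_ms(0.009))         # 9.0 ms
```

Batching amortizes per-call overhead, which is why throughput is measured at batch size 8 while latency uses batch size 1.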

Figure 1. Throughput comparison for TensorFlow-2.8 vs. Intel-optimized-TF-2.8 on the Alder Lake CPU
Figure 2. Latency comparison for TensorFlow-2.8 vs. Intel-optimized-TF-2.8 on the Alder Lake CPU

Users can now also run INT8 precision models to benefit from low precision optimization using INT8 quantization and VNNI instructions. INT8 models provide additional performance gains in inference throughput (Figure 3). The same batch size (8) was used for both the FP32 and INT8 models.
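INT8 quantization maps FP32 values onto 8-bit integers with a scale factor, so VNNI can perform the multiply-accumulates on integers. A minimal symmetric-quantization sketch in plain Python (illustrative only; not the calibration scheme used to produce the Model Zoo INT8 models):

```python
# Symmetric INT8 quantization sketch: map floats into [-127, 127] with one scale.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
print(q)      # [50, -127, 2, 100]
print(scale)  # scale is about 0.01 here
restored = dequantize(q, scale)  # close to the original weights, within 0.5 * scale
```

The accuracy cost comes from the rounding step; the speedup comes from doing the heavy arithmetic in 8-bit integers instead of 32-bit floats.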

Figure 3. Throughput improvement of INT8 models over FP32 models using Intel-optimized-TF-2.8 on the Alder Lake CPU

Users can reproduce these benchmarks by getting the models from Intel AI Model Zoo and following the steps below.

First, set up a Python virtual environment. (If Python is not installed, download and install Python 3.8.x or 3.9.x for Windows from https://www.python.org.) Open a Windows command prompt and install virtualenv:

pip install virtualenv

Next, create and activate a Python virtual environment named venv3_tf:

mkdir C:\venv
cd C:\venv
virtualenv venv3_tf
C:\venv\venv3_tf\Scripts\activate

At this point, the venv3_tf virtual environment will be activated, as reflected by the command prompt:

(venv3_tf) C:\venv>

Install TensorFlow in the virtual environment:

(venv3_tf) C:\venv> pip install tensorflow==2.8.0
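In stock TensorFlow 2.8, the oneDNN optimizations can be toggled with TensorFlow's documented TF_ENABLE_ONEDNN_OPTS environment variable (whether it defaults on or off depends on the build). A quick way to enable it and confirm the installed version before benchmarking:

```shell
(venv3_tf) C:\venv> set TF_ENABLE_ONEDNN_OPTS=1
(venv3_tf) C:\venv> python -c "import tensorflow as tf; print(tf.__version__)"
```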

(Optional) Install GNU wget to download the pretrained models from the Windows command prompt.

Initialize the command window with the Intel Windows TensorFlow environment by running the batch file from https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat. Now you’re ready to run the benchmark scripts. Clone or download the Intel AI Model Zoo repository, e.g.:

(venv3_tf) C:\sources> git clone https://github.com/IntelAI/models

Navigate to the benchmarks under the base directory:

(venv3_tf) C:\sources> cd models\benchmarks

Set the PYTHONPATH environment variable in the Windows command prompt to the base directory:

(venv3_tf) C:\sources\models\benchmarks> set PYTHONPATH=.;C:\sources\models\models\common\tensorflow;C:\sources\models\models\image_recognition\tensorflow\resnet50;C:\sources\models\benchmarks
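What this does: entries on PYTHONPATH are prepended to the Python interpreter's module search path, so the benchmark scripts can import the Model Zoo helpers. A small cross-platform sketch (with hypothetical directory names standing in for the paths above) that shows the mechanism:

```python
# Show how PYTHONPATH entries end up on sys.path in a child interpreter.
import os
import subprocess
import sys

# Hypothetical directories standing in for the Model Zoo paths above.
extra_dirs = [os.path.abspath("demo_pkgs"), os.path.abspath("demo_models")]
env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(extra_dirs)  # ';' on Windows, ':' elsewhere

out = subprocess.run(
    [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
    env=env, capture_output=True, text=True,
).stdout

# Each PYTHONPATH entry now appears on the child interpreter's sys.path.
for d in extra_dirs:
    assert d in out.splitlines()
```

Note that PYTHONPATH entries are joined with `;` on Windows (as in the `set` command above) and `:` on other platforms, which is why the sketch uses `os.pathsep`.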

Set the PYTHON_EXE environment variable to the virtual environment in which the Intel Optimization for TensorFlow is installed:

(venv3_tf) C:\sources\models\benchmarks> set PYTHON_EXE=C:\venv\venv3_tf\Scripts\python.exe

To run the ResNet-50 benchmark, download the pretrained FP32 and INT8 models and save them in the benchmarks directory:

(venv3_tf) C:\sources\models\benchmarks> wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb

(venv3_tf) C:\sources\models\benchmarks> wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_int8_pretrained_model.pb

Finally, run the ResNet-50 benchmark with a batch size of 8 or 1 to measure throughput or latency, respectively, using the commands below. Specifying --num-cores=-1 lets the Intel AI Model Zoo set the correct number of cores automatically.

(venv3_tf) C:\sources\models\benchmarks> python common\tensorflow\run_tf_benchmark.py --framework=tensorflow --use-case=image_recognition --model-name=resnet50 --precision=fp32 --mode=inference --benchmark-dir=C:\sources\models\benchmarks --intelai-models=C:\sources\models\benchmarks\..\models\image_recognition\tensorflow\resnet50 --num-cores=-1 --socket-id=-1 --output-dir=C:\sources\models\benchmarks\common\tensorflow\logs --num-train-steps=1 --benchmark-only --in-graph=resnet50_fp32_pretrained_model.pb --disable-tcmalloc=True --batch-size=8 -v

(venv3_tf) C:\sources\models\benchmarks> python common\tensorflow\run_tf_benchmark.py --framework=tensorflow --use-case=image_recognition --model-name=resnet50 --precision=int8 --mode=inference --benchmark-dir=C:\sources\models\benchmarks --intelai-models=C:\sources\models\benchmarks\..\models\image_recognition\tensorflow\resnet50 --num-cores=-1 --socket-id=-1 --output-dir=C:\sources\models\benchmarks\common\tensorflow\logs --num-train-steps=1 --benchmark-only --in-graph=resnet50_int8_pretrained_model.pb --disable-tcmalloc=True --batch-size=8 -v
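The --num-cores=-1 auto-detection is conceptually similar to querying the machine's core count from Python (a sketch of the idea, not the Model Zoo's actual detection logic):

```python
import os

# -1 in the benchmark scripts means "detect automatically"; conceptually,
# this amounts to asking the OS how many logical cores are available.
num_cores = os.cpu_count()
assert num_cores is not None and num_cores >= 1
print(f"Detected {num_cores} logical cores")
```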
