Techniques for faster AI inference throughput with OpenVINO™ on Intel® discrete GPUs

OpenVINO™ toolkit · Jun 29, 2023 · 3 min read

Do you want to learn new techniques for faster AI inference throughput? Using the OpenVINO toolkit on Intel discrete GPUs lets you do just that. Intel's latest GPUs, including the Intel® Data Center GPU Flex Series and Intel® Arc™ GPUs, introduce a range of new hardware features that benefit AI workloads. We'll summarize two of them: XMX (Xe Matrix Extensions) and parallel stream execution.

XMX (Xe Matrix Extensions)

XMX is dedicated hardware acceleration for matrix multiplication, providing more multiplication throughput at the same precision than the general-purpose vector engines. OpenVINO can take advantage of XMX hardware to accelerate int8 and fp16 inference.

You can check whether your GPU hardware (and software stack) supports XMX with OpenVINO™'s hello_query_device sample. When you run the sample application, it lists all detected inference devices along with their properties. To check for XMX support, look at the OPTIMIZATION_CAPABILITIES property for the GPU_HW_MATMUL value.
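
The same check can be done programmatically. Below is a minimal sketch using the OpenVINO Python API; the device name "GPU" is an assumption, and on multi-GPU systems devices may appear as GPU.0, GPU.1, and so on.

```python
# Minimal sketch: query GPU devices for XMX support, similar to what
# the hello_query_device sample prints.
from openvino.runtime import Core

core = Core()
for device in core.available_devices:
    if device.startswith("GPU"):
        # OPTIMIZATION_CAPABILITIES lists supported acceleration features.
        caps = core.get_property(device, "OPTIMIZATION_CAPABILITIES")
        print(device, caps)
        if "GPU_HW_MATMUL" in caps:
            print(f"{device}: XMX matrix-multiplication hardware is available")
```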

Parallel Execution of Multiple Streams

Another improvement in Intel®'s discrete GPUs is the ability to process multiple compute streams in parallel, which can increase hardware utilization. Parallel stream execution can bring significant performance benefits, but only when the application uses it appropriately: it pays off when the application can run multiple independent inference requests in parallel, whether from a single process or multiple processes. If there is no opportunity to execute multiple inference requests in parallel, multi-stream hardware execution brings no gain.
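
As a sketch of how an application opts in, the THROUGHPUT performance hint lets the GPU plugin create multiple streams automatically; "model.xml" below is a placeholder path, and NUM_STREAMS is only needed to override the hint's choice.

```python
# Minimal sketch: compile a model for multi-stream GPU execution.
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path to an IR model

# The THROUGHPUT hint tells the GPU plugin to configure multiple streams.
compiled = core.compile_model(model, "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# Optionally pin an explicit stream count instead of the hint's choice:
# compiled = core.compile_model(model, "GPU", {"NUM_STREAMS": "2"})

# How many inference requests the plugin suggests running in parallel.
print(compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS"))
```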

Techniques using OpenVINO

Demand for AI inference is growing across applications, and with it the need for efficient, high-performance inference solutions. There are several techniques to optimize AI inference using OpenVINO and Intel GPUs. These include:

  1. Model Quantization: This involves reducing the precision of the model’s weights and activations, which leads to smaller model sizes and faster computations. OpenVINO’s quantization tooling (NNCF) can help quantize models effectively; see the first sketch after this list.
  2. Model Pruning: This aims to eliminate unnecessary connections and parameters from a model, reducing its size and computation requirements. OpenVINO supports various pruning algorithms to improve inference speed.
  3. Asynchronous Execution: OpenVINO allows for asynchronous execution, enabling concurrent processing of multiple inference requests. This can enhance GPU utilization and improve throughput; see the second sketch after this list.
  4. Batch Processing: This processes multiple inputs simultaneously, maximizing GPU parallelism and reducing inference time per input.
  5. Tensor Fusion: OpenVINO employs tensor fusion to merge consecutive layers of a model into a single layer, reducing memory transfers and accelerating computations.
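
For item 1, here is a minimal post-training int8 quantization sketch using NNCF, OpenVINO's neural-network compression framework. The model path, input shape (1×3×224×224), and random calibration data are assumptions; in practice you would feed a few hundred representative real inputs.

```python
# Minimal sketch: post-training int8 quantization with NNCF.
import numpy as np
import nncf
from openvino.runtime import Core, serialize

core = Core()
model = core.read_model("model.xml")  # placeholder path to an FP32 IR model

# Random data only keeps the sketch self-contained; use representative
# real inputs for meaningful accuracy. Shape is an assumption.
calibration_items = [
    np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)
]
calibration_dataset = nncf.Dataset(calibration_items, lambda item: item)

quantized_model = nncf.quantize(model, calibration_dataset)
serialize(quantized_model, "model_int8.xml")  # save the int8 model
```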
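
For item 3, here is a minimal sketch of asynchronous execution with OpenVINO's AsyncInferQueue helper, which keeps several inference requests in flight so that multiple GPU streams stay busy. Again, the model path and random inputs are placeholders.

```python
# Minimal sketch: concurrent inference requests via AsyncInferQueue.
import numpy as np
from openvino.runtime import Core, AsyncInferQueue

core = Core()
model = core.read_model("model.xml")  # placeholder path
compiled = core.compile_model(model, "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# Size the queue to the plugin's suggested degree of parallelism.
n_jobs = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
queue = AsyncInferQueue(compiled, n_jobs)

results = {}

def on_done(request, userdata):
    # Collect the first output tensor of each completed request.
    results[userdata] = request.get_output_tensor(0).data.copy()

queue.set_callback(on_done)

for i in range(32):  # submit 32 dummy inference jobs
    inputs = {0: np.random.rand(1, 3, 224, 224).astype(np.float32)}
    queue.start_async(inputs, userdata=i)

queue.wait_all()  # block until every request has finished
print(f"Completed {len(results)} inference requests")
```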

Read the full how-to explanation and the measured performance improvements on the OpenVINO blog.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
