GPU vs CPU for ML model inference

Nov 29, 2018

GPUs are designed for high throughput on massively parallelizable workloads. This makes them well suited to deep neural nets, which consist of a huge number of operators, each working on one or more input tensors that can easily be divided into smaller workloads and carried out in parallel, typically resulting in lower latency. In the best case, inference that was too slow on the CPU becomes fast enough on the GPU for real-time applications.
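
To make the "divide into smaller workloads" idea concrete, here is a minimal sketch in plain Python. It splits a matrix-vector product into independent row blocks and computes each block in a separate worker. The helper names (`matvec_rows`, `split_into_chunks`, `parallel_matvec`) and the toy data are made up for illustration. Python threads will not give a GPU-like speedup; the point is only that the blocks do not depend on each other, which is the property a GPU exploits across thousands of cores.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(rows, vector):
    """Multiply a block of matrix rows by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in rows]


def split_into_chunks(matrix, num_chunks):
    """Split the matrix rows into up to num_chunks contiguous blocks."""
    size = -(-len(matrix) // num_chunks)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def parallel_matvec(matrix, vector, num_workers=4):
    """Compute matrix @ vector by handing each row block to its own worker."""
    chunks = split_into_chunks(matrix, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each block needs only its own rows and the shared vector,
        # so the blocks can be processed in any order or all at once.
        partial_results = list(
            pool.map(matvec_rows, chunks, [vector] * len(chunks))
        )
    # Stitch the per-block outputs back together in row order.
    return [value for block in partial_results for value in block]


# Toy data (hypothetical): a 16x8 matrix and a vector of length 8.
matrix = [[float(i + j) for j in range(8)] for i in range(16)]
vector = [float(j) for j in range(8)]

sequential = matvec_rows(matrix, vector)
parallel = parallel_matvec(matrix, vector)
print(sequential == parallel)
```

The chunked version produces the same values as the sequential one because no output row depends on any other row.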

GPUs compute natively in 16-bit or 32-bit floating point, so, unlike CPUs, they do not depend on quantization for good performance. If quantizing your neural network was not an option because the lost precision lowered its accuracy, that concern goes away when you run the model on the GPU.
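
The precision loss mentioned above can be illustrated with a small sketch that uses only the Python standard library. It assumes a simple symmetric int8 scheme with one scale per tensor, which is one common approach and not necessarily what any particular framework does. The weight values are hypothetical, and the 16-bit float round trip uses the `struct` module's half-precision format code `"e"`.

```python
import struct


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def to_float16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Hypothetical weights, chosen only to illustrate the effect.
weights = [0.0123, -0.4567, 0.8912, -1.2345, 0.0007]

quantized, scale = quantize_int8(weights)
int8_roundtrip = dequantize(quantized, scale)
fp16_roundtrip = [to_float16(w) for w in weights]

int8_error = max(abs(a - b) for a, b in zip(weights, int8_roundtrip))
fp16_error = max(abs(a - b) for a, b in zip(weights, fp16_roundtrip))

print("max int8 round-trip error:", int8_error)
print("max fp16 round-trip error:", fp16_error)
```

For these sample values, the int8 round trip loses noticeably more precision than the 16-bit float round trip. Whether that difference affects a real model's accuracy depends on the model.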

Another benefit of GPU inference is power efficiency. Because a GPU carries out these parallel computations in a highly optimized way and finishes them quickly, it can use less energy and generate less heat than a CPU running the same task.
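
One way to reason about this is energy per inference, which is average power draw multiplied by the time the inference takes. The sketch below shows only that arithmetic. The wattage and latency figures are hypothetical placeholders, not measurements of any real CPU or GPU, and the function name is made up for illustration.

```python
def energy_per_inference_joules(average_power_watts, latency_seconds):
    """Energy (J) = average power (W) x time (s)."""
    return average_power_watts * latency_seconds


# Hypothetical numbers, chosen only to illustrate the arithmetic.
cpu_energy = energy_per_inference_joules(
    average_power_watts=100.0, latency_seconds=0.200
)
gpu_energy = energy_per_inference_joules(
    average_power_watts=250.0, latency_seconds=0.020
)

print("CPU energy per inference (J):", cpu_energy)
print("GPU energy per inference (J):", gpu_energy)
```

With these made-up figures, the GPU draws more power while it is running but finishes so much sooner that each inference costs less energy. Real numbers should come from measuring your own hardware and model.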

Written by Dharti Dhami