Quantizing a network means converting it to use a reduced-precision integer representation for its weights and activations (typically int8 instead of a float32 implementation).
Advantages of Quantization:
- Reduction in model size.
- Reduction in memory bandwidth.
- Faster inference due to savings in memory bandwidth and faster compute with int8 arithmetic (the exact speed up varies depending on the device, runtime, and the model operators).
When converting from floating point to integer values, you essentially multiply the floating-point value by a scale factor and round the result to a whole number.
The various quantization approaches differ in how they determine that scale factor.
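The scale-and-round idea can be sketched in a few lines of plain Python. This is an illustrative toy, not any framework's API; the function names and the choice of deriving the scale from the data's maximum absolute value are assumptions for the example.

```python
# Toy affine quantization to signed int8: hypothetical helper names,
# scale chosen from the max absolute value of the data (one common choice).

def quantize(values, num_bits=8):
    """Map floats to signed integers using a scale derived from the data range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0  # guard against all-zero input
    scale = max_abs / qmax                        # float units per integer step
    q = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats by multiplying back by the scale."""
    return [v * scale for v in q]

vals = [0.5, -1.2, 3.4]
q, scale = quantize(vals)        # q == [19, -45, 127]
approx = dequantize(q, scale)    # close to vals, within one scale step
```

Note that the largest-magnitude value maps exactly to the edge of the int8 range, while everything else incurs a rounding error of at most half a scale step.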
What makes it dynamic?
Static quantization quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations.
Static Quantization (Post Training Quantization) is typically used when both memory-bandwidth and compute savings are important. CNNs are a typical use case.
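The calibration step can be sketched as follows: activations from a representative dataset are observed once, offline, to fix the activation scale, which is then reused unchanged at inference. The helper names and the min/max-style range observer are assumptions for illustration; real frameworks such as PyTorch automate this with observer modules.

```python
# Hypothetical sketch of static-quantization calibration: the activation
# scale is computed once from representative data, then frozen.

def calibrate(activation_batches, num_bits=8):
    """Observe activations over a calibration set and fix the scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for batch in activation_batches for v in batch)
    return (max_abs or 1.0) / qmax  # frozen after calibration

def quantize_activation(batch, scale, num_bits=8):
    """Quantize an inference-time activation with the precomputed scale."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in batch]

# Calibration happens once, offline, on representative inputs.
scale = calibrate([[0.1, -2.0, 1.5], [0.7, 3.0, -0.4]])
# At inference the same scale is reused; no runtime min/max pass is needed.
q = quantize_activation([1.4, -0.4], scale)
```

Because the scale is fixed ahead of time, inference pays no per-call cost to inspect the activations, which is why static quantization can also save compute. The trade-off is that inputs outside the calibrated range get clipped.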
In dynamic quantization, the weights are quantized ahead of time, but the activations are quantized dynamically during inference (on the fly). Hence, dynamic.
As mentioned above, dynamic quantization has the run-time overhead of quantizing activations on the fly. It is therefore most beneficial when model execution time is dominated by loading weights from memory rather than by compute, so the added overhead is small relative to the bandwidth savings. This is typically true for LSTM and Transformer models run with small batch sizes.
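The split between ahead-of-time weight quantization and on-the-fly activation quantization can be sketched for a single linear layer. This is a conceptual toy in plain Python, not PyTorch's implementation; the function names and the max-absolute-value scale rule are assumptions.

```python
# Hypothetical sketch of a dynamically quantized matrix-vector product:
# weights are quantized once, the activation scale is found per call.

QMAX = 127  # signed int8 positive limit

def quant(values, scale):
    """Quantize a list of floats to int8 with the given scale."""
    return [max(-128, min(QMAX, round(v / scale))) for v in values]

# Weights: quantized ahead of time with a fixed scale.
weights = [[0.2, -0.5], [1.0, 0.3]]
w_scale = max(abs(w) for row in weights for w in row) / QMAX
q_weights = [quant(row, w_scale) for row in weights]

def dynamic_linear(x):
    # Activation scale computed on the fly from this input's range --
    # this per-call min/max pass is the run-time overhead of dynamic mode.
    x_scale = (max(abs(v) for v in x) or 1.0) / QMAX
    qx = quant(x, x_scale)
    # Integer accumulation, then one rescale back to float per output.
    return [sum(qw * qv for qw, qv in zip(row, qx)) * w_scale * x_scale
            for row in q_weights]

y = dynamic_linear([2.0, -1.0])  # close to the float result [0.9, 1.7]
```

Since the matrix-vector product touches every weight exactly once, the memory traffic for the int8 weights is roughly a quarter of the float32 version, which is where the speedup comes from for bandwidth-bound models.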