Model Optimization Strategies

Balaji Kulkarni · MbeddedWithAI · Mar 28, 2021

What makes the difference between running a neural network on a full-blown server and running one on an 8-bit microcontroller?

Well, you might have guessed it: the limited memory, compute capacity, and power budget. All of these have tight upper limits that must be honored when running anything on these tiny devices.

In this series of articles, we will look into how to fit regular neural-network models onto tiny embedded devices such as microcontrollers.

1. Defining the scope of the neural-network model:

This might sound naive, but standard models like MobileNet are far more capable than what is typically needed on devices such as microcontrollers; for example, when trained on ImageNet they can recognize up to 1,000 classes.

  • We often won’t need to recognize that many different classes in microcontroller-based applications. Narrowing the application scope reduces the complexity of the model architecture, meaning fewer layers and relatively less training data.
  • A similar thought holds for model accuracy. Does the end application require accuracy on par with the full-sized neural network? If a lower target accuracy is acceptable, that too can contribute to reducing the model size.
  • Combining the two points above: a smaller network implicitly has less state (weights and biases) to store in the new model architecture.
  • Reducing model-size also implies a shorter download time for updates.

2. Neural Network Optimization Strategies:

As we saw in the previous article, model optimization is itself a larger topic that plays a crucial role in shrinking model size while, in most cases, not compromising accuracy or latency. It is a hot topic in embedded machine learning, and there are numerous papers and proven techniques related to optimization. One such method is quantization.

Quantization refers to the process of mapping input values from a large set to output values in a smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes
(source: https://en.wikipedia.org/wiki/Quantization_(signal_processing))

To complement this with an example: if the weights of a neural-network model are represented as floating-point values, representing those same weights with integers on a much smaller scale (-127 to 127) corresponds to quantization.

(source — https://blog.tensorflow.org/2020/04/quantization-aware-training-with-tensorflow-model-optimization-toolkit.html)
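To make that mapping concrete, here is a minimal NumPy sketch of symmetric 8-bit quantization. The weight values are made up purely for illustration, and the scale is derived from the largest absolute weight; this is a simplification of what real converters do.

```python
import numpy as np

# Hypothetical float32 weights of one layer
weights = np.array([0.81, -0.12, 0.05, -0.93, 0.44], dtype=np.float32)

# Symmetric quantization: map the largest absolute value to 127
scale = np.abs(weights).max() / 127.0

# Quantize: float32 -> int8 in the range [-127, 127]
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see the (lossy) reconstruction
reconstructed = q_weights.astype(np.float32) * scale

print(q_weights)      # [ 111  -16    7 -127   60]
print(reconstructed)  # close to, but not exactly, the original weights
```

Storing each weight as one byte instead of four is where the roughly 4x size reduction comes from; the small reconstruction error is the accuracy cost being traded away.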

Quantization for faster inference falls into two main buckets:

  1. Post-training Quantization: This is the easier technique to use. The model's weights, or both weights and activations, can be quantized with a minimal amount of data to reduce the model size and improve latency, with only a minor drop in accuracy.
  • Weight-only quantization: Reducing the precision of the network's weights from float to 8-bit is the simplest approach in this category and does not require any validation data. 8-bit quantization is preferred for CPU execution, and 16-bit for GPU execution environments.
  • Quantizing weights and activations: The difference compared with the above is that the activations are quantized along with the weights, so calibration (representative) data is required to calibrate the dynamic ranges of the activations. This is also known as “full integer quantization”.

Converting the whole model to the nearest 8-bit fixed-point integers results in a much smaller model and faster inference on devices such as microcontrollers [1].

Source: https://arxiv.org/pdf/1806.08342.pdf
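As a sketch of what both post-training variants look like with the TensorFlow Lite converter: weight-only (dynamic range) quantization needs just the optimization flag, while full integer quantization additionally needs representative data. The SavedModel path, input shape, and random calibration data below are placeholders for illustration.

```python
import numpy as np
import tensorflow as tf

# Stand-in calibration data: replace with a few hundred real input samples
calibration_samples = np.random.rand(100, 96, 96, 1).astype(np.float32)

def representative_data_gen():
    # Yield one sample at a time, batched to shape (1, 96, 96, 1)
    for sample in calibration_samples:
        yield [np.expand_dims(sample, axis=0)]

# Weight-only (dynamic range) quantization: no calibration data needed
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
weight_quant_model = converter.convert()

# Full integer quantization: weights AND activations, so representative
# data is needed to calibrate the dynamic ranges of the activations
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # integer-only I/O for microcontrollers
converter.inference_output_type = tf.int8
int8_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(int8_model)
```

The fully integer-quantized flatbuffer is the variant typically deployed to microcontrollers, since many of them have no floating-point unit at all.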

2. Quantization Aware Training: In this scheme, quantization is simulated during training, which can yield higher accuracy than post-training quantization techniques. Quantized operations are emulated on weights and activations during the forward pass, so the quantization error introduces noise during training; the model learns to minimize this noise and becomes more robust to quantization.

The trained model can then be converted to a supported format (such as .tflite for TensorFlow models) and used for inference.

P.S.: This is only applicable if training (or retraining) the model is an option.

Source: https://arxiv.org/pdf/1806.08342.pdf
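A minimal sketch of how this looks with the TensorFlow Model Optimization Toolkit is shown below. The tiny model, random training data, and hyperparameters are placeholders standing in for a real trained network and dataset.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data standing in for the real trained network
base_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(96, 96, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
train_images = np.random.rand(32, 96, 96, 1).astype(np.float32)
train_labels = np.random.randint(0, 10, size=32)

# Wrap the model with fake-quantization ops so quantization error is
# simulated on weights and activations during the forward pass
qat_model = tfmot.quantization.keras.quantize_model(base_model)

# Fine-tune: the model learns to minimize the noise introduced by quantization
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_images, train_labels, epochs=2, batch_size=8)

# Convert the quantization-aware model to an int8 .tflite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()
```

In practice you would start from an already-trained model and fine-tune it for a few epochs after wrapping it, rather than training from scratch.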

Beyond the above, there are other device-specific optimizations that need to be factored in, depending on the type of hardware selected. The optimizations available on one platform might not be available on another, though this gap is slowly closing as major HW vendors add support for deep-learning acceleration.

More on optimization techniques in upcoming articles!

Please feel free to share your feedback for improvement, and 👏 if the article was informative!

