Rapid reads — Optimizing Machine Learning for Edge Devices: Techniques Unveiled

This article is the last in our series on edge computing for ML; if you need more context, please refer to the previous articles here

In our previous article, we explored the rationale and significance of edge computing for machine learning. Now, let’s delve into the specific techniques for optimizing models to be edge device-ready.

Quantization

Think of quantization as a memory- and computation-saving technique: it reduces the amount of data each node in the model has to store and manipulate. In other words, quantization shrinks the ‘quantity’ of data the model works with at every step.

In more technical terms, Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). — Huggingface

Some might argue that this reduces the accuracy of the model. Well, yes and no. Quantization does reduce the numerical precision of the weights and activations, but in the overall picture the accuracy usually drops only slightly. And when you weigh that small drop against the gains in speed and memory, the loss of precision is rarely a real issue : )
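To make this concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. The framework choice, the toy model, and the int8 target are all assumptions made purely for illustration, not a prescription for any particular edge deployment:

```python
import torch
import torch.nn as nn

# A small example model; any nn.Module with Linear layers works similarly.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time, so no calibration data is needed.
quantized_model = torch.quantization.quantize_dynamic(
    model,              # model to quantize
    {nn.Linear},        # layer types to convert
    dtype=torch.qint8,  # target low-precision type
)

# The quantized model is used exactly like the original one.
x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x))
```

Other flavours exist too (static quantization with calibration data, quantization-aware training), but the dynamic variant above is the simplest way to see the size and speed benefits on a trained model.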

Pruning

Pruning is exactly what it sounds like: cutting away the parts of the model that can be trimmed. Think of it as shaping the tree in your backyard. You don’t need every branch, so you prune some of them to give it a good shape. In the same spirit, carefully pruning a machine learning model means discarding the weights or nodes that contribute little to its performance. You can remove them without noticeably hurting the model’s overall accuracy.

The main downside is that the pruned model may no longer handle some of the outlier cases it could handle before. But as mentioned earlier, ML optimization for edge devices is about overall efficiency, and it is acceptable to sacrifice a little precision to achieve it.
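Below is a minimal sketch of magnitude-based (L1) unstructured pruning using PyTorch’s torch.nn.utils.prune utilities. The toy model and the 30% pruning amount are assumptions chosen only to show the mechanics:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Same toy model as before; in practice this would be your trained network.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Zero out the 30% of weights with the smallest magnitude in each Linear layer.
# The amount is a tunable trade-off between model size and accuracy.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent by removing the re-parametrization hooks.
        prune.remove(module, "weight")

# Roughly 30% of the Linear layers' weights are now exactly zero.
zeros = sum((m.weight == 0).sum().item()
            for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel()
            for m in model.modules() if isinstance(m, nn.Linear))
print(f"Weight sparsity: {zeros / total:.1%}")
```

In practice you would prune a trained model, fine-tune it for a few epochs to recover accuracy, and only then deploy it to the edge device.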

Model Compression

Model compression shrinks the model itself through techniques such as knowledge distillation and weight sharing.

  • Knowledge Distillation: This technique transfers the essential knowledge from a larger ‘teacher’ model to a smaller ‘student’ model, reducing computational and memory requirements while retaining most of the performance (see the sketch after this list).
  • Weight Sharing: By reducing the number of distinct weights used in a neural network, weight sharing significantly decreases the number of parameters that must be stored, improving model efficiency. The idea was popularized by convolutional neural networks (CNNs), where the same filter weights are reused at every position of the input, leading to a substantial reduction in parameters.
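To make knowledge distillation concrete, here is a minimal PyTorch sketch of one distillation training step. The teacher and student architectures, the temperature, and the loss weighting are illustrative assumptions, not settings from any specific system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher/student pair: a larger trained model and a much smaller one.
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend the usual label loss with a term that pushes the student's
    softened output distribution towards the teacher's."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean")
    ce_loss = F.cross_entropy(student_logits, labels)
    # The temperature**2 factor keeps gradient scales comparable across temperatures.
    return alpha * ce_loss + (1 - alpha) * kd_loss * temperature ** 2

# One illustrative training step with random stand-in data.
x = torch.randn(16, 128)
labels = torch.randint(0, 10, (16,))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

with torch.no_grad():
    teacher_logits = teacher(x)          # teacher is frozen during distillation
student_logits = student(x)
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
optimizer.step()
```

Only the small student model ships to the edge device; the teacher is used solely during training.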

These techniques enable the implementation of machine learning models optimized for edge computing. In the next article, we will explore data preprocessing pipelining techniques to complement these strategies. Stay tuned for more insights!
