TinyML models — what happens behind the scenes

Francisco Costa · Published in Marionete · Sep 23, 2021

How can a Deep Learning model be shrunk enough to fit in a small microprocessor?

Photo by Vishnu Mohanan on Unsplash

If you’ve ever worked with Machine Learning, you probably know how complex and heavy these models are to train: tens of GBs of data, processed for hours or even days in huge cloud data centers. The result is a model file that can easily reach hundreds of MBs. So how can we deploy these Machine Learning models on mobile or even smaller embedded devices?

Increasingly, Deep Learning applications are moving into more resource-constrained environments, from smartphones to agricultural sensors and medical instruments. This shift has led to efforts towards smaller and more efficient model architectures, as well as an increased emphasis on model optimisation techniques.

Source: https://blog.tensorflow.org/

An embedded system is a combination of computer hardware and software designed for a specific function. These devices are very cheap, with the downside of having low processing power, RAM, and storage space. So how can we make it possible for a Machine Learning model to run on such a tiny and limited device?

Credits: HarvardX TinyML course. Smallest processor (ARM-powered MCU), the Kinetis KL03, with a surface area of 3.2 mm²

(yes, that is a golf ball in the background!)

To understand how we can shrink a machine learning model, first, we need to understand how the model file is composed. A Deep Learning model is composed of layers of neurons with connections between them. Each neuron contains a ‘bias’ value, and each connection contains a ‘weight’ value. In the training phase, the model tries to find the best values for weights and biases to minimize the loss function and achieve better results.

Image from Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic

I won’t go deeper into how a Neural Network (NN) is trained, since it’s beyond the scope of this article, but you can find a great explanation here. Long story short, a big slice of the model size comes from the arrays containing the weights of the connections of each layer across the Neural Network.

Let’s assume we want to train a model for image classification using a Convolutional Neural Network (CNN). As we can see in the example below, a CNN can easily contain 2.6 million parameters. We can also observe that the weight values have decimals, so their data type is ‘float’. Why is this so important?

A float32 variable occupies 32 bits (4 bytes) of memory. If we have to store all of these values in our model file, it takes 9.9 MB just to store the weights. Although this is considered a small model for Machine Learning, how can we fit it inside a microcontroller with less than 256 KB of RAM?

Image from author
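For reference, here’s a quick way to count parameters and estimate the float32 storage cost in Keras. The small CNN below is just illustrative, not the exact network from the figure:

```python
import tensorflow as tf

# An illustrative small CNN (not the exact architecture from the figure above).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(96, 96, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

n_params = model.count_params()
print(f"parameters: {n_params:,}")
# Each float32 parameter costs 4 bytes.
print(f"float32 storage: {n_params * 4 / 1e6:.1f} MB")
```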

In order to shrink a regular Deep Learning model into a tiny model that fits in an embedded device, there are several frameworks available.

To further explain these shrinking techniques, I’ll use TensorFlow’s framework. In my personal opinion, it has the best documentation and the most community support, it has strong integration with Arduino and SparkFun microcontrollers, and it’s the one I feel most comfortable using :)

TensorFlow developed a package called tensorflow_model_optimization. This toolkit is a Python package with a suite of techniques for developers to optimize their models for latency, size, or a balance of the two. Each technique has its own API built on top of Keras, which makes it incredibly straightforward to use.
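The toolkit is installed separately from TensorFlow itself. The snippet below just shows the package name and the namespaces used in the sketches later in this article:

```python
# The optimisation toolkit ships separately from TensorFlow:
#   pip install tensorflow-model-optimization
import tensorflow_model_optimization as tfmot

# The main technique families each live under their own Keras namespace:
#   tfmot.sparsity.keras      -> pruning
#   tfmot.quantization.keras  -> quantisation-aware training
#   tfmot.clustering.keras    -> weight clustering
```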

Shrinking Techniques

When we talk about model optimisation or size reduction, we’re mainly talking about:

  • Reducing the number of weights
  • Reducing the number of bits per weight

Let’s take a deeper look into the main techniques to achieve this!

Credits: HarvardX TinyML certification

Compression or Distillation

After the model has been trained, it can be compressed with a very small accuracy decrease. The two most common compression techniques are Pruning and Knowledge Distillation.

Pruning — removes the connections (weights) in the NN that fall below a certain threshold, since these provide little or no utility to the output and may even lead to overfitting. The resulting sparse Neural Network needs to be retrained after this process.

Image from Tensorflow blog
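With the TensorFlow toolkit, magnitude-based pruning is applied by wrapping the model and retraining it for a short while. A minimal sketch, assuming `model`, `x_train`, and `y_train` come from your own training pipeline:

```python
import tensorflow_model_optimization as tfmot

# Gradually prune from 0% to 50% sparsity over the first 1000 training steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000,
)

# Wrap the trained Keras model with pruning logic.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# The UpdatePruningStep callback is required during the retraining pass.
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before exporting the final, sparse model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```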

Knowledge Distillation — neural networks often have many meaningful connections and others that are redundant. The knowledge in the final, trained model (the ‘teacher’) is transferred to a smaller network with fewer parameters (the ‘student’). This compresses part of the network’s knowledge into a smaller network, reducing the size of the model.
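As a rough illustration, here is a minimal distillation training step written with Keras and a GradientTape. The teacher/student architectures, the temperature, and the loss weighting are all placeholder choices for the sketch, not a recipe from TensorFlow’s toolkit:

```python
import tensorflow as tf

# Placeholder models: in practice the teacher is the large, already-trained
# network and the student is the smaller one you actually want to deploy.
teacher = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28)),
                               tf.keras.layers.Dense(256, activation="relu"),
                               tf.keras.layers.Dense(10)])
student = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28)),
                               tf.keras.layers.Dense(32, activation="relu"),
                               tf.keras.layers.Dense(10)])

temperature = 5.0   # softens the teacher's probability distribution
alpha = 0.1         # weight of the hard-label loss vs. the distillation loss

kl = tf.keras.losses.KLDivergence()
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def distillation_step(x, y):
    # Soft targets from the (frozen) teacher.
    teacher_probs = tf.nn.softmax(teacher(x, training=False) / temperature)
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        soft_loss = kl(teacher_probs,
                       tf.nn.softmax(student_logits / temperature)) * temperature ** 2
        hard_loss = ce(y, student_logits)
        loss = alpha * hard_loss + (1 - alpha) * soft_loss
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```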

Quantisation

Quantisation is the process of transforming an ML model into an equivalent representation that uses parameters and computations at lower precision. This improves the model’s execution performance and efficiency. As mentioned previously, each weight is originally stored as a 32-bit floating-point value. Quantisation squeezes a small range of floating-point values into a fixed number of information buckets (converting from continuous to discrete).

Image from O’Reilly TinyML book

Quantisation Aware Training (QAT) — this technique applies quantisation during the training process. The core idea is that QAT simulates low-precision inference-time computation in the forward pass of training. This introduces the quantisation error as noise during training and as part of the overall loss, which the optimisation algorithm tries to minimize. Hence, the model learns parameters that are more robust to quantisation.

Image from Author
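In the TensorFlow toolkit this is a one-line wrap followed by a short fine-tuning run. A minimal sketch, again assuming `model`, `x_train`, and `y_train` already exist in your pipeline:

```python
import tensorflow_model_optimization as tfmot

# Wrap the Keras model with fake-quantisation nodes that simulate int8 inference.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

# Fine-tune so the weights adapt to the simulated quantisation noise.
q_aware_model.fit(x_train, y_train, epochs=1, validation_split=0.1)
```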

Post-Training Quantisation (PTQ) — takes an already-trained neural network and quantises it, reducing the 32-bit values to 8-bit signed integers (i.e. `int8`). With this quantisation scheme, we don’t need to retrain the model with quantisation-aware training. This not only reduces the size of the model by 4x but also makes it compatible with most microcontrollers, since most of them have optimised 8-bit arithmetic.

Image from Author
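With TensorFlow Lite, PTQ happens at conversion time. The sketch below forces full int8 quantisation; `calibration_samples` is a placeholder for a few hundred representative inputs from your own dataset:

```python
import tensorflow as tf

# Representative dataset used to calibrate the quantisation ranges.
def representative_data_gen():
    for sample in calibration_samples[:200]:
        yield [tf.expand_dims(tf.cast(sample, tf.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full integer quantisation so the model runs on int8-only hardware.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```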

A benchmark was run on some very well-known mobile models, and the results show a slight decrease in accuracy as a trade-off for a 4x smaller model with up to 2x faster inference. Also note that quantisation during training tends to preserve accuracy a little better, but it is more complex to use.

Image from Deep Compression paper

Weight Clustering

Weight clustering, or weight sharing, reduces the number of distinct parameter values the Neural Network has to store. It groups the weights of each layer into N clusters and then replaces each weight with the index of its cluster centroid, so all the weights in a cluster share a single value.

Image from TensorFlow documentation

For instance, if we choose to cluster a given layer into 8 clusters, each 32-bit weight will be replaced by the 3-bit (8 = 2³) index of its centroid.
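With the TensorFlow toolkit, that 8-cluster setup looks roughly like the sketch below (again, `model`, `x_train`, and `y_train` are assumed to come from your own pipeline):

```python
import tensorflow_model_optimization as tfmot

CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustering_params = {
    "number_of_clusters": 8,
    "cluster_centroids_init": CentroidInitialization.LINEAR,
}

# Wrap the trained model so each layer's weights are shared across 8 centroids.
clustered_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params)

clustered_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])

# A short fine-tuning pass lets the centroids adjust to the task.
clustered_model.fit(x_train, y_train, epochs=1)

# Remove the clustering wrappers before converting/exporting.
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
```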

An experiment was run on several models and datasets to test the benefits of the weight clustering technique. The results below show a 6x size reduction at the cost of 0.6% accuracy.

Image from Deep Compression paper

The cluster centroids can be fine-tuned to reduce the error. They can be initialized randomly, linearly (across the range of the weight values), or with a density-based scheme.

Image from Deep Compression paper

Although at first glance the density-based initialisation might look like the better approach, since it gives a lower variance within each cluster, the linear approach showed better results. According to the Deep Compression paper, because large weights play a more important role, a linear distribution gives these weights a better chance of forming a large centroid.

Encoding

The model is compressed using Huffman Encoding principles. This is an optional step: it simply encodes the weights with a binary code, assigning the codes with fewer bits to the most frequent weights. Let’s look at the following example:

We have the following list of already quantized weights:

Image from author

Each one of these weights occupies 5 bits (ranging from 0 to 31). If we use Huffman Encoding to encode the weights based on the frequency, we get the following ‘tree’:

Image from author

Each of the purple boxes represents the ‘weights’ and these will be encoded with the ‘binary path’. For instance, 17 is encoded with ‘11’, 22 is encoded with ‘001’, and so on.

Image from author

As we can observe, the weights were reduced from 5 bits to 2–3 bits each. At inference time, the codes are decoded back into the original weights.
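For intuition, here is a minimal, illustrative Huffman encoder in Python. It is not the implementation used by any particular framework, and the list of quantised weights is made up:

```python
import heapq
from collections import Counter

def huffman_codes(values):
    """Build a prefix code: more frequent values get shorter bit strings."""
    freq = Counter(values)
    # Each heap entry: (frequency, tie_breaker, {value: code_so_far})
    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prepending a 0/1 branch bit.
        merged = {v: "0" + c for v, c in codes1.items()}
        merged.update({v: "1" + c for v, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical list of already-quantised 5-bit weights (values 0..31).
weights = [17, 17, 17, 22, 22, 4, 4, 9, 30]
codes = huffman_codes(weights)
encoded = "".join(codes[w] for w in weights)
print(codes)
print(len(encoded), "bits instead of", 5 * len(weights))
```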

Compilation

Most microcontrollers run code written in C or C++, as it’s much more memory-efficient and faster than Python. In this step, the model is compiled into a format that can be interpreted and executed by runtimes such as the Neural Networks API (NNAPI) present on some Android devices and the TensorFlow Lite Micro interpreter present in some microcontrollers.
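On microcontrollers, the converted .tflite flatbuffer is usually embedded in the firmware as a C byte array (commonly generated with `xxd -i`). Here is a small Python sketch of that step, assuming the `model_int8.tflite` file produced in the PTQ example above:

```python
# Turn the .tflite flatbuffer into a C array that can be compiled into firmware.
with open("model_int8.tflite", "rb") as f:
    model_bytes = f.read()

lines = []
for i in range(0, len(model_bytes), 12):
    chunk = ", ".join(f"0x{b:02x}" for b in model_bytes[i:i + 12])
    lines.append("  " + chunk + ",")

with open("model_data.h", "w") as f:
    f.write("const unsigned char g_model[] = {\n")
    f.write("\n".join(lines) + "\n};\n")
    f.write(f"const unsigned int g_model_len = {len(model_bytes)};\n")
```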

Summary

These are the most used techniques in TensorFlow Lite for model shrinking. For more information, you can visit the roadmap and stay up to date with the latest news!

In the next article, we’ll go through a step-by-step practical guide on building a TF model, converting it to a tinyML model using the above techniques, and fitting it into a microcontroller for local inference.

Note: one thing I found interesting is that, according to a PitchBook study on Emerging Spaces, although TinyML is in a great position in terms of ‘Deal Count’, it is one of the areas getting the least capital invested.

In my opinion, this makes the TinyML field a great opportunity for tech companies to start exploring, since it will surely see huge hype in the coming years.
