How Do I Reduce the Memory Footprint of My Machine Learning Model?
DNN inference on embedded platforms
Deep neural networks (DNNs) are successful in many computer vision tasks. Obvious, duh! However, the most accurate DNNs require millions of parameters and billions of operations, making them energy-, computation- and memory-intensive. [GOEL]
VGG-16 needs 15 billion operations to perform image classification on a single image.
YOLOv3 performs 39 billion operations to process one image.
To deploy such DNNs on small embedded computers, more optimizations are necessary. Therefore, pursuing low-power improvements in deep learning for efficient inference is worthwhile and is a growing area of research [Alyamkin].
Similar to the ImageNet competition, there are now many competitions that aim to identify the best vision solutions, ones that achieve high accuracy in computer vision and energy efficiency at the same time. [LPIRC-WEB]
Suppose you want to perform accurate and fast image recognition on edge devices. It will require several steps.
- First, a neural network model needs to be built and trained to identify and classify images.
- Then, the model should run as accurately and as fast as possible.
- Most neural networks are trained as floating-point models and usually need to be converted to fixed-point to run efficiently on edge devices.
- Finally, power consumption has to be kept in check.
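To make the floating-point to fixed-point step a bit more concrete, here is a minimal sketch of one common way to do it: map each float weight to a signed 8-bit integer using a single scale and zero point. The function names `quantize_int8` and `dequantize` and the sample weights are made up for illustration; real toolchains do considerably more (per-channel scales, calibration of activations and so on).

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers with one scale and one zero point."""
    # Widen the range so that 0.0 is always exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - w_min / scale)
    quantized = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.1, 0.0, 0.27, 0.91]  # illustrative values, not from a real model
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, round(max_error, 5))
```

Each weight now takes 1 byte instead of 4, at the cost of a rounding error that stays within one quantization step (`scale`).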
Fascinatingly, a lot is already being done to get us to a point where AI models run in your microwave and whatnot. State-of-the-art solutions for deploying ML models in resource-constrained environments can be classified into the following broad categories [GOEL]:
- Parameter Quantization and Pruning: Lowers the memory and computation costs by reducing the number of bits used to store the parameters of DNN models and by removing redundant parameters.
- Compressed Convolutional Filters and Matrix Factorization: Decomposes large DNN layers into smaller layers to decrease the memory requirement and the number of redundant matrix operations.
- Network Architecture Search: Builds DNNs with different combinations of layers automatically to find a DNN architecture that achieves the desired performance.
- Knowledge Distillation: Trains a compact DNN that mimics the outputs, features, and activations of a more computation-heavy DNN.
Let’s dig deeper into each of these.
Parameter Quantization and Pruning
Let’s take an example. The ResNet-50 model, with 50 convolutional layers, needs over 95 MB of memory for storage and over 3.8 billion floating-point multiplications to process an image. After discarding some redundant weights, the network still works as usual while saving more than 75% of the parameters and 50% of the computation time. Techniques for doing this range from applying k-means scalar quantization to the parameter values, to weight sharing followed by Huffman coding of the quantized weights and the codebook [YUCHENG].
Taking this to the extreme is the 1-bit representation of each weight, that is, binary weight neural networks. The main idea is to directly learn binary weights or activations during model training.
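Here is a rough sketch of the k-means scalar quantization idea: cluster the weight values, keep only a small codebook of centroids, and store a short index per weight. The weights, the cluster count `k = 4` and the helper `kmeans_1d` are assumptions for illustration. The Huffman coding step is left out, and the bit counts ignore any file-format overhead.

```python
import math


def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on scalars; centroids start evenly spaced."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iterations):
        assignments = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


weights = [0.11, 0.09, -0.52, -0.48, 0.98, 1.02, 0.10, -0.50]  # illustrative
k = 4
codebook, indices = kmeans_1d(weights, k)
shared = [codebook[i] for i in indices]  # weights as the network would see them

bits_per_index = math.ceil(math.log2(k))
original_bits = len(weights) * 32
compressed_bits = len(weights) * bits_per_index + k * 32  # indices + codebook
print(original_bits, compressed_bits)
```

With only eight weights the codebook dominates the cost. On a layer with millions of weights the codebook becomes negligible and the saving approaches 32 bits down to `bits_per_index` bits per weight.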
THAT’S BINARY WITH A B.
There are several works that directly train CNNs with binary weights, for instance BinaryConnect [Courbariaux], BinaryNet and XNOR-Net [Rastegari].
Binary weights, i.e., weights constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations with simple accumulations, since multipliers are the most space- and power-hungry components of digital neural network implementations. The authors show that, like dropout schemes, BinaryConnect acts as a regularizer, and they obtain near state-of-the-art results with it on permutation-invariant MNIST, CIFAR-10 and SVHN.
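To see why binary weights remove the multiplications, here is a small sketch. It assumes deterministic binarization by sign (non-negative weights become +1, the rest -1) and shows only the forward pass; as I understand BinaryConnect, the real-valued weights are still kept around during training for the gradient updates. The function names and numbers are illustrative.

```python
def binarize(weights):
    """Deterministic binarization: +1 for non-negative weights, -1 otherwise."""
    return [1 if w >= 0 else -1 for w in weights]


def binary_dot(inputs, binary_weights):
    """Dot product with +/-1 weights using only additions and subtractions."""
    total = 0.0
    for x, b in zip(inputs, binary_weights):
        total = total + x if b == 1 else total - x
    return total


real_weights = [0.42, -0.13, 0.05, -0.88]  # illustrative values
inputs = [1.5, 2.0, -0.5, 1.0]

binary_weights = binarize(real_weights)
output = binary_dot(inputs, binary_weights)  # 1.5 - 2.0 + (-0.5) - 1.0
print(binary_weights, output)
```

Each weight also needs just 1 bit instead of 32, so the storage for the weights shrinks by a factor of 32.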
Compressed Convolutional Filters
These approaches are based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency. The filter weights are first converted to the frequency domain with a discrete cosine transform (DCT), and a low-cost hash function then randomly groups the frequency parameters into hash buckets. All parameters assigned to the same hash bucket share a single value learned with standard back-propagation.
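The bucket-sharing part can be sketched as follows. This is a toy under several assumptions: 3 x 3 filters, CRC32 as the "low-cost hash", and four hand-picked bucket values that would normally be learned with back-propagation. It only shows how a handful of stored values expands into many spatial filters through an inverse DCT; training is not shown.

```python
import math
import zlib

N = 3  # filter height and width (assumed for this sketch)


def dct_basis(k, n):
    """Orthonormal DCT-II basis value for frequency k at spatial position n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos(math.pi * (2 * n + 1) * k / (2 * N))


def inverse_dct_2d(freq):
    """Turn an N x N grid of frequency coefficients into a spatial filter."""
    return [[sum(freq[u][v] * dct_basis(u, i) * dct_basis(v, j)
                 for u in range(N) for v in range(N))
             for j in range(N)] for i in range(N)]


def bucket_of(filter_id, u, v, num_buckets):
    """Deterministic hash from a frequency position to a bucket index."""
    key = f"{filter_id}:{u}:{v}".encode()
    return zlib.crc32(key) % num_buckets


def build_filter(filter_id, bucket_values):
    """Look up each frequency coefficient in its bucket, then go back to space."""
    freq = [[bucket_values[bucket_of(filter_id, u, v, len(bucket_values))]
             for v in range(N)] for u in range(N)]
    return inverse_dct_2d(freq)


bucket_values = [0.8, -0.3, 0.1, 0.05]  # the only stored parameters
filters = [build_filter(f, bucket_values) for f in range(8)]

virtual_params = len(filters) * N * N  # weights the layer appears to have
stored_params = len(bucket_values)     # weights actually kept in memory
print(virtual_params, stored_params)
```

Because the hash is recomputed on the fly, nothing except the bucket values has to be stored.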
Network Architecture Search
Well, the first thing you would have done when trying out a deep learning framework like Keras or Torch is specify the network architecture. And you would agree that it’s mostly arbitrary, or at least it feels very much so. How about we automate that as well? We will lose our jobs, sure, but would it not be exciting?
Neural Architecture Search (NAS), the process of automating architecture engineering, is thus a logical next step in automating machine learning. Already by now, NAS methods have outperformed manually designed architectures on some tasks such as image classification (Zoph et al., 2018; Real et al., 2019), object detection (Zoph et al., 2018) or semantic segmentation (Chen et al., 2018). NAS can be seen as a subfield of AutoML (Hutter et al., 2019) and has significant overlap with hyperparameter optimization (Feurer and Hutter, 2019) and meta-learning (Vanschoren, 2019). [ELSKEN]
Early incarnations of NAS trained each candidate neural architecture from scratch during the architecture search phase, leading to a surge in computation. ENAS proposes to accelerate the architecture search process using a parameter sharing strategy.
People have tried a lot of different methods, from Reinforcement Learning to Evolutionary Algorithms, to build these NAS solutions. [Pengzhen Ren]
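To give a feel for the evolutionary flavour, here is a deliberately tiny sketch: candidates are tuples of hidden-layer widths, mutation changes one width, and the better half of the population survives each generation. Everything here is an assumption for illustration, especially `proxy_score`, which is a HYPOTHETICAL stand-in for validation accuracy. A real search would train and evaluate each candidate (or share parameters, as ENAS does).

```python
import random

WIDTH_CHOICES = [8, 16, 32, 64]     # assumed search space
MAX_PARAMS = 6000                   # assumed memory budget, in parameters
INPUT_SIZE, OUTPUT_SIZE = 16, 4     # assumed problem dimensions


def count_params(widths):
    """Weights plus biases of a fully connected net with these hidden widths."""
    sizes = [INPUT_SIZE] + list(widths) + [OUTPUT_SIZE]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def proxy_score(widths):
    """HYPOTHETICAL fitness: rewards capacity, penalises size, rejects over-budget nets."""
    if count_params(widths) > MAX_PARAMS:
        return float("-inf")
    return sum(w ** 0.5 for w in widths) - 0.0005 * count_params(widths)


def mutate(widths, rng):
    """Replace one randomly chosen layer width with another choice."""
    child = list(widths)
    child[rng.randrange(len(child))] = rng.choice(WIDTH_CHOICES)
    return tuple(child)


def evolve(generations=30, population_size=8, seed=0):
    rng = random.Random(seed)
    # Start from the smallest architecture (always within budget) plus random ones.
    population = [(8, 8, 8)] + [
        tuple(rng.choice(WIDTH_CHOICES) for _ in range(3))
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=proxy_score, reverse=True)
        parents = ranked[: population_size // 2]  # survivors are kept unchanged
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=proxy_score)


best = evolve()
print(best, count_params(best))
```

The parameter budget is what ties this back to memory footprint: the search is only allowed to return architectures that fit.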
Surely you have heard of Auto-PyTorch? You haven’t? Go check it out, later obviously.
Knowledge Distillation
In knowledge distillation, a small student model is generally supervised by a large teacher model. The main idea is that the student model mimics the teacher model in order to obtain competitive or even superior performance.
A question that comes to mind when we talk about knowledge distillation is how we can compress the model when the room to play with the learnt parameters is limited; it’s hard to see how we can change the form of the model but keep the same knowledge. To address this, God himself, Geoffrey Hinton, puts it very succinctly: a more abstract view of the knowledge is needed, one that frees it from any particular instantiation. The knowledge should be seen as a learned mapping from input vectors to output vectors.
Soft Targets as Regularizers
An obvious way to transfer the generalization ability of the teacher model to a student model is to use the class probabilities produced by the teacher model as “soft targets” for training the small model. For this transfer stage, we could use the same training set or a separate “transfer” set. When the teacher model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the student model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.[HINTON]
One of their main claims about using soft targets instead of hard targets is that a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target.
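A small numerical sketch of soft targets follows. It assumes the temperature-scaled softmax from Hinton's distillation paper (the temperature `T` is not discussed above, so treat it as an assumption), and the logits are made up. It shows that raising the temperature gives higher-entropy targets, and computes the cross-entropy between the teacher's soft targets and the student's softened predictions, which is the quantity the student would minimise.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher T gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(targets, predictions):
    return -sum(t * math.log(p) for t, p in zip(targets, predictions) if t > 0)


teacher_logits = [6.0, 2.5, 1.0]  # illustrative values
student_logits = [3.0, 2.0, 0.5]
T = 4.0                           # assumed temperature

sharp_targets = softmax(teacher_logits)      # close to a hard, one-hot target
soft_targets = softmax(teacher_logits, T)    # keeps the class ranking, spreads the mass
student_soft = softmax(student_logits, T)

distill_loss = cross_entropy(soft_targets, student_soft)
print(round(entropy(sharp_targets), 3), round(entropy(soft_targets), 3))
```

The soft targets still rank the classes the same way as the teacher, but the relative probabilities of the "wrong" classes are now large enough to carry information to the student.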
Oh boy, this turned out to be quite a long article. But it’s interesting how various approaches are being pursued, leading to a very desirable state where we not only save our limited resources but are also able to deploy our models in situations that can further accelerate the growth and spread of AI systems.
Let me know what you think about this here
[Alyamkin] Alyamkin, S., Ardi, M., Brighton, A., Berg, A. C., Chen, B., Chen, Y., … Zhuo, S. (2019). Low-Power Computer Vision: Status, Challenges, Opportunities. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 1–1. doi:10.1109/jetcas.2019.2911899