Published in Geek Culture

Decoding Efficient Deep Learning: Path to Smaller, Faster, and Better Models

In today’s tech-savvy world, it is often said that predicting the future isn’t magic, it’s Artificial Intelligence, and that data is the science behind it. The field of Machine Learning, and Artificial Intelligence at large, is evolving at a tremendous pace, with researchers continuously trying to build better models and beat benchmarks.

In the real world, however, once a model has to be deployed, attention shifts to whether deep learning models can be scaled efficiently for people who do not have millions of dollars to train them or gigantic machines to serve them. Deep Learning and Artificial Intelligence have empowered humans to find a needle in a haystack; let’s dive deeper into how this science can be made more accessible by making it more efficient.

The general trend in Deep Learning is that, given enough data, larger models deliver better performance. This typically holds within a single architecture family: for ResNet, for instance, deeper networks perform better, and the same is true of Inception models. Natural Language Processing follows a similar trend. Deep Learning’s rise to prominence is often attributed to the 2012 ImageNet competition, where AlexNet (named after its lead developer, Alex Krizhevsky) performed 41% better than the next best submission. This led to a race to create ever more powerful neural networks with a higher number of parameters and greater complexity.

Comparison of various CNN architectures (Photo Credit)

Deep Learning with neural networks has been one of the most talked about and dominant technologies in natural language understanding, speech recognition, computer vision, information retrieval and more. As deep learning models have continuously improved, their number of parameters, their latency, and the resources required to train them have all increased significantly. As a result, it has become essential to pay attention to a model’s footprint metrics and not just its quality.

This article aims to provide an overview of Efficient Deep Learning techniques surveyed in the research paper Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better.

What are the challenges being faced while training or deploying a model?

A deep learning practitioner typically faces the following challenges on a day-to-day basis:

  1. Sustainable Server-Side Scaling: Training and deploying large deep learning models is expensive. Training may be a one-time cost, or even free if one uses a pre-trained model, but once the model is deployed, inference has to run over a long period of time, which consumes server-side RAM, CPU, and other resources.
  2. Enabling On-Device Deployment: Some deep learning applications need to run in real time on IoT and smart devices, where inference happens directly on the device for reasons such as privacy, connectivity and responsiveness. Models therefore need to be optimized for these resource-constrained target devices.
  3. Privacy & Data Sensitivity: User data may be sensitive and not easily accessible. In such cases, models need to be trained efficiently on as little data as possible.
  4. New Applications: Some new applications impose model quality or footprint constraints that existing off-the-shelf models may not be able to meet.
  5. Explosion of Models: One might need to train and/or deploy multiple models for different applications to achieve the desired accuracy and performance. Running many models on the same infrastructure, however, might end up exhausting the available resources.

A Mental Model

Today’s technology has pushed intelligence beyond the cloud and onto edge devices, thanks to increased computational power, more efficient hardware and overall progress in every component involved. Enabling Deep Learning on an edge device is still hard, however, because of tight resource constraints: the memory on a microcontroller, for example, may be limited to a few megabytes. To have on-device AI on a cell phone or a microcontroller, we need more efficient deep learning models. For a broader perspective, let’s start by building a mental model of the approach to efficient deep learning: models that are smaller, faster and better.

A mental model for thinking about algorithms, techniques, and tools related to efficiency in Deep Learning (Photo Credit)

The main focus areas of model efficiency and optimization in the landscape of Efficient Deep Learning are explained below:

Compression Techniques: Compression techniques look at a model from the point of view of optimizing its architecture. The goal is to convert a big model into a smaller one, typically so that it can be deployed on an edge device. A great example is quantization, which shrinks the weight matrix of a layer by reducing the precision of its values, for instance by storing 32-bit floating point weights as 8-bit integers. Quantization doesn’t blindly round each floating point number to the nearest integer; the values are mapped onto the integer range by an algorithm chosen to keep the loss of quality minimal. The benefits are a smaller, faster model and the ability to run deep learning models on edge devices.
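To make the quantization idea a little more concrete, here is a minimal sketch in plain Python. It assumes a simple affine (min/max) scheme and uses a short, made-up list of weights in place of a real layer's weight matrix; the function names are illustrative and do not come from any particular library.

```python
def quantize(weights, num_bits=8):
    """Map float weights onto integer codes in [0, 2**num_bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    # The scale is the float width of one integer step.
    scale = (w_max - w_min) / levels
    if scale == 0:
        scale = 1.0  # all weights identical; any non-zero scale works
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights, for illustration only.
weights = [-1.20, -0.35, 0.0, 0.48, 0.97, 2.10]
codes, scale, w_min = quantize(weights)
restored = dequantize(codes, scale, w_min)

# The round trip is lossy, but the error is bounded by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 5))
```

Each weight now needs 8 bits instead of 32, at the cost of a small, bounded rounding error. Production toolchains use more careful schemes than this, but the mapping between a float range and an integer range is the core of it.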

Illustration of pruning weights (connections) and neurons (nodes) in a neural network (Photo Credit)

Learning Techniques: Learning techniques can be thought of as alternatives to the traditional supervised learning recipe. They train a model differently in order to achieve better quality metrics (accuracy, F1 score, precision, recall, etc.) without impacting inference in any way. Where the footprint matters more, the improved quality can be exchanged for a smaller model with fewer parameters or layers that still matches the baseline quality. Distillation is an example: a smaller student model is trained to match a larger teacher model, and knowledge is transferred from the teacher to the student by minimizing a loss function. In other words, a larger model with more parameters is used to label data for a lower-capacity model. This helps increase inference speed and reduce storage size, making the model usable where less compute is available.

Distillation of a smaller student model from a larger pre-trained teacher model (Photo Credit)
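As a rough illustration of the teacher-student idea, the sketch below computes one common choice of distillation loss: the cross-entropy between the teacher's and the student's temperature-softened output distributions. The logits and the temperature value are invented for the example, and a real training loop would minimize this loss over many examples, usually alongside the ordinary label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical logits for a single three-class example.
teacher_logits = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.1]   # roughly agrees with the teacher
far_student = [0.1, 3.0, 2.0]     # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
print(loss_close < loss_far)
```

The student that agrees with the teacher gets the lower loss, so minimizing this quantity pulls the student's predictions towards the teacher's.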

Automation: Another route to optimization is tuning hyperparameters to improve accuracy, which can then be exchanged for a model with fewer parameters. Take, for example, the iris dataset, where the task is to predict the type of flower from the length and width of its petals and sepals. Even after selecting a model for the prediction, there are many hyperparameters to choose from. To optimize the model further we can use hyperparameter tuning, the process of choosing the best hyperparameter values once the model has been selected. A train-test split can be used to train the model with each candidate setting and observe its score/accuracy on the held-out data. The scores can then be used to exchange the larger model for one with fewer parameters and similar accuracy. Win-Win!
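Here is a small sketch of that tuning loop in plain Python. To stay self-contained it does not load the real iris data: it generates a synthetic, iris-like dataset with made-up class centres, and it tunes a single hyperparameter, the k of a hand-written k-nearest-neighbours classifier. All names and values are assumptions for illustration.

```python
import random


def make_toy_flowers(n_per_class=40, seed=0):
    """Synthetic (petal_length, petal_width) points for three made-up classes."""
    rng = random.Random(seed)
    centers = {0: (1.5, 0.3), 1: (4.3, 1.3), 2: (5.6, 2.0)}
    data = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 0.4), rng.gauss(cy, 0.2)), label))
    rng.shuffle(data)
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    def squared_distance(item):
        (x, y), _ = item
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, test, k):
    """Fraction of held-out points classified correctly for a given k."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


data = make_toy_flowers()
split = int(0.75 * len(data))          # 75% train, 25% held out
train, test = data[:split], data[split:]

# Try each candidate value and keep the one with the best held-out score.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice one would tune on a separate validation split (or with cross-validation) and keep the test set for the final check, so that the reported score is not biased by the search itself.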

Efficient Architectures: These are fundamental building blocks designed from scratch, such as convolutional layers and attention layers, that are a significant leap over the baseline methods used before them. An example is the convolutional layer, which introduced parameter sharing for image classification and so avoids the need for a separate weight for each input pixel. Such architectures yield efficiency gains directly. Designing them takes the effort of going back to the drawing board with insights from the baseline model and coming up with layers and models that are more efficient by design, all with the goal of enabling deep learning to be smaller, faster and better!
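A quick back-of-the-envelope calculation shows how much parameter sharing saves. The sketch below counts the parameters of a fully connected layer and of a convolutional layer applied to the same image; the 224x224 RGB input, the 64 outputs and the 3x3 kernel are assumed values picked for illustration.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Fully connected layer: one weight per (input value, output unit), plus biases."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel_h, kernel_w, in_c, out_c):
    """Convolution: each filter's weights are reused at every spatial position."""
    return kernel_h * kernel_w * in_c * out_c + out_c


dense = dense_params(224, 224, 3, 64)   # 9,633,856 parameters
conv = conv_params(3, 3, 3, 64)         # 1,792 parameters
print(dense, conv, dense // conv)
```

The two layers are not interchangeable (the dense layer emits 64 numbers, the convolution emits 64 feature maps), but the comparison shows why the parameter count of a convolution does not depend on the image size at all.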

Infrastructure: Training models and running inference efficiently requires a robust combination of good software and hardware. There are two parts to it: model training and model inference. Let’s focus on model inference here, since the goal is to make deep learning efficient on edge devices, and look at the components that make up an efficient infrastructure. For inference, the framework is TensorFlow or PyTorch on the server side, and TensorFlow Lite or PyTorch Mobile on-device. The low-level optimization libraries include XLA, Glow and Tensor Comprehensions.

A visualization of hardware and software infrastructure with emphasis on efficiency (Photo Credit)

Hardware such as GPUs and TPUs can be used to speed up linear algebra operations. GPUs have become the standard hardware for deep learning, and Nvidia has released several generations of GPUs with an increasing focus on it. They have also introduced Tensor Cores, which are designed specifically for deep learning workloads. The speedup in Tensor Cores comes from doing the expensive matrix multiplications at lower precision.

TPUs, designed by Google, are built specifically to accelerate deep learning applications with TensorFlow. Their core architecture uses a systolic array design, in which a large computation is split across a mesh-like grid of cells. Each cell computes a partial result and passes it on to the next cell with each clock tick. The advantage of this design is that there is no need to fetch intermediate results from memory, so once the required data has been fetched the computation is not memory bound.

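The figure below outlines a systolic mesh implementing multiply-accumulate (MAC) operations for a matrix multiplication: the elements of A are fed horizontally into the array and the elements of B are pushed in vertically with each clock tick. Each cell multiplies the a(ij) and b(jk) it receives, adds the product to its running sum, and passes both values on to the neighboring cells on the next tick.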

Systolic Arrays in TPUs (Photo Credit)
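The data flow through a systolic array can be simulated in a few lines of plain Python. The sketch below assumes an output-stationary layout, in which the cell at row i and column k keeps the running sum for C[i][k]; this is one simple variant chosen for readability and is not meant as a description of the actual TPU hardware. Rows of A enter from the left and columns of B from the top, each delayed by one extra tick per row or column so that matching elements meet in the right cell.

```python
def systolic_matmul(A, B):
    """Multiply A (n x m) by B (m x p) on a simulated n x p systolic grid."""
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * p for _ in range(n)]  # A value currently in each cell
    b_reg = [[0] * p for _ in range(n)]  # B value currently in each cell

    for t in range(m + n + p - 2):       # ticks until the last pair meets
        # A values move one cell to the right; row i is fed i ticks late.
        for i in range(n):
            for k in range(p - 1, 0, -1):
                a_reg[i][k] = a_reg[i][k - 1]
            j = t - i
            a_reg[i][0] = A[i][j] if 0 <= j < m else 0
        # B values move one cell down; column k is fed k ticks late.
        for k in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][k] = b_reg[i - 1][k]
            j = t - k
            b_reg[0][k] = B[j][k] if 0 <= j < m else 0
        # Every cell performs one multiply-accumulate per tick.
        for i in range(n):
            for k in range(p):
                acc[i][k] += a_reg[i][k] * b_reg[i][k]
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Note that no cell ever reads a partial sum from anywhere but its own accumulator: inputs are fetched once at the edges and then only handed from neighbor to neighbor.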

A Practitioner’s guide to Efficiency

So far we have seen the tools and techniques, but how does one use them to implement Efficient Deep Learning? The key to efficiency in deep learning is finding pareto-optimal models: models that achieve the best possible result in one dimension while the other dimension is held constant. Typically, one of these dimensions is Quality and the other is Footprint. Quality-related metrics include Accuracy, F1, Precision, Recall, AUC, etc., while Footprint-related metrics include Model Size, Latency, RAM, etc. The relationship between the two is a trade-off: a model with higher size and latency is likely to achieve higher accuracy, while a model with lower size and capacity is likely to deliver poorer accuracy-related metrics. This relationship is summarized in the figure below:

The trade-off can be navigated in two ways: Shrink and Grow. We can improve the footprint of a model by compressing its capacity, exchanging some quality for the smaller footprint. This process is called Shrinking. On the other hand, we can improve model quality by adding more capacity to the model. This process is called Growing.

Trade off between Model Quality and Footprint (Photo Credit)
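To see what pareto-optimal means in practice, the sketch below picks the pareto frontier out of a handful of candidate models, using accuracy as the quality metric and model size as the footprint metric. The model names and numbers are invented for the example.

```python
def pareto_frontier(models):
    """Names of models that no other model beats on both accuracy and size."""
    frontier = []
    for name, acc, size in models:
        dominated = any(
            # The other model is at least as good on both axes and strictly better on one.
            o_acc >= acc and o_size <= size and (o_acc > acc or o_size < size)
            for _, o_acc, o_size in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical candidates: (name, accuracy, size in MB).
candidates = [
    ("tiny", 0.71, 2.0),
    ("small", 0.78, 5.0),
    ("bloated", 0.77, 9.0),   # worse and bigger than "small"
    ("large", 0.84, 20.0),
    ("huge", 0.84, 45.0),     # same accuracy as "large", but bigger
]
frontier = pareto_frontier(candidates)
print(frontier)  # ['tiny', 'small', 'large']
```

Any model off the frontier can be swapped for one that is better on at least one axis without being worse on the other, which is exactly the kind of move that Shrinking and Growing aim for.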

This gives two efficiency strategies for moving closer to a pareto-optimal target model:

Shrink-and-Improve for Footprint-Sensitive Models: This strategy applies when one wants to reduce the footprint while keeping the quality the same, as in on-device deployments and server-side model optimization. Shrinking can be achieved via learned compression techniques, architecture search, etc., and should ideally result in minimal loss of model quality; any quality that is traded off is then recovered in the Improve step.

Grow-Improve-and-Shrink for Quality-Sensitive Models: This strategy applies when one wants to deploy models of better quality while keeping the same footprint. We start by adding capacity to the model and then improve it using learning techniques, automation, etc., before shrinking it back. The model could also be shrunk back directly after growing it.

Examples of techniques to use in the Grow, Shrink, and Improve phases (Photo Credit)


To conclude, implementing efficiency in deep learning models starts with reaching a new pareto-frontier using the efficiency techniques discussed in this article. One can use these techniques in isolation or combine several of them, depending on the desired outcome. Having narrowed down the techniques, the survey demonstrates the trade-offs for both the ‘Shrink-and-Improve’ and the ‘Grow-Improve-and-Shrink’ strategies.

In other words, the survey provides empirical evidence that it is possible either to reduce model capacity to bring down the footprint (shrink) and then recover the quality that was traded off (improve), or to increase model capacity to improve quality (grow) and then compress the model (shrink) to improve its footprint.

This article offers an overview of the vast landscape of Efficient Deep Learning, to equip practitioners with the information needed to move from a sub-optimal model to one that meets both the quality and the footprint requirements.

I would like to thank the authors and the team at Google Research for their tremendous work and research in the area of model efficiency.

I hope this article inspires you to further explore the area of Efficient Deep Learning and helps you make the right decisions about efficiency while training and deploying your models!

Happy Learning! :)

References: Menghani, G. (2021). Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. arXiv preprint arXiv:2106.08962.




Somya Mishra

MS, Data Science | MBA, Finance
