Data Science for edge deployment is different!

Data science covers a lot of ground, from data gathering to model deployment, and the entire process changes significantly depending on the type of deployment.

To be clear, there are two high-level types of deployment:

Server Level and Edge Level

Assuming readers are already familiar with the steps involved in the server-level deployment cycle, this article expands on the new aspects that come up during the edge model development cycle.

Data

This is similar to any other data pipeline you would build to develop a deep learning model. The one extra thing to concentrate on is low-resolution imagery.

If your original images are high resolution, you have to check whether your object of interest will still be visible at all after you hard-resize the imagery to a low resolution. The decision on how small the image can be should be based on the visibility of the object of interest after the resize.
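As an illustration, a minimal sketch like the one below (using Pillow; the file names and target size are placeholder assumptions) can help with that visibility check before committing to a resolution.

```python
# A minimal sketch for eyeballing whether the object of interest survives a
# hard resize to the edge model's input size. File names are placeholders.
from PIL import Image

EDGE_INPUT_SIZE = (224, 224)  # assumed target resolution of the edge model

original = Image.open("sample_high_res.jpg")          # hypothetical source image
resized = original.resize(EDGE_INPUT_SIZE, Image.BILINEAR)

# Save the low-resolution version and inspect it manually: if the object is
# no longer distinguishable, the target resolution is too small.
resized.save("sample_low_res.jpg")
```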

Modeling

This is similar to any other modeling pipeline you would follow to develop a deep learning model. The one extra thing to concentrate on is the most lightweight base model you can get away with for your task.

You can take this in a two-step format. First, use a state-of-the-art model for your task and see whether it meets your accuracy expectations. Since it is impractical to run most state-of-the-art models on edge devices because of their heavy weights and inference times, the second step is to replicate the same experiment with lightweight architectures and lower-resolution imagery.

The reasons to follow the two-step format are:

  • If you jump directly to the lightweight model and it doesn’t work out, it becomes difficult to understand what is going wrong, because there are too many new variables in play.
  • You will have a baseline to evaluate how much accuracy you are compromising compared to the state-of-the-art method at the server level.
  • Knowledge distillation can make the lightweight model perform almost as well as the heavy one; I will write a separate blog post explaining this in the next few days.
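To make the two-step idea concrete, here is a minimal sketch assuming an image-classification task, with Keras application models standing in for the "heavy" and "light" candidates; the specific architectures, resolutions, and width multiplier are illustrative choices, not a recipe.

```python
# A minimal sketch of the two-step comparison: a heavy baseline first,
# then a lightweight candidate at a lower resolution.
import tensorflow as tf

# Step 1: a heavy, accurate baseline at full resolution.
heavy = tf.keras.applications.ResNet50(weights="imagenet", input_shape=(224, 224, 3))

# Step 2: a lightweight candidate, optionally at a lower resolution.
light = tf.keras.applications.MobileNetV2(
    weights="imagenet", input_shape=(160, 160, 3), alpha=0.5
)

# Comparing parameter counts gives a first feel for the size trade-off;
# the accuracy gap comes from evaluating both on the same validation set.
print("heavy params:", heavy.count_params())
print("light params:", light.count_params())
```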

Model Conversion

This is a new step for people who have only worked on server-level deployment so far. Since the deployment happens on an edge device, the regular deep learning libraries (TensorFlow, PyTorch) we use during development are not compatible with edge architectures, so we end up using lighter versions of them (TFLite, Core ML, …). Generally, the conversion involves multiple steps based on the requirement.

Here, we will discuss two primary steps:

Reserialisation

The idea here is that models trained in regular deep learning libraries are saved in a serialization format that isn’t compatible with the edge libraries. For example, in the case of TensorFlow, the development library uses the protobuf format to save weights and graphs, whereas TFLite expects the weights and graph in the FlatBuffer format. Most of the standard libraries have tools for this kind of conversion that work out of the box.
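For TensorFlow, a minimal reserialisation sketch looks like the one below, assuming a trained model exported as a SavedModel at a placeholder path.

```python
# A minimal sketch of the reserialisation step for TensorFlow:
# protobuf-based SavedModel -> FlatBuffer-based TFLite file.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```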

Optimizations

Whether this step is needed depends on your trade-off between space, accuracy, and processing speed. There are different types of optimizations, and all of them aim to reduce model size and inference time at the cost of a minimal drop in accuracy. There are various innovative techniques people use; here we will just touch briefly on three of the common practices.

  • Quantization

Generally, when we train with TensorFlow or any other major library, the weights are saved as 32-bit floats, which consume a good amount of space. The idea of quantization is to convert these float values to 8-bit integers, so that the model size gets reduced roughly four times with minimal loss of accuracy. Most of the primary libraries have tools for this kind of optimized conversion of the trained model.

I found Pete Warden’s articles on this topic very helpful at the time; I would suggest skimming through them before trying out the tools available in the market now.

https://petewarden.com/2017/06/22/what-ive-learned-about-neural-network-quantization/

A few months back, TensorFlow also officially released supporting code for this:

https://www.tensorflow.org/performance/quantization

Apple’s Core ML has already started supporting 16-bit precision instead of 32-bit precision for the same reason.

https://developer.apple.com/documentation/coreml/reducing_the_size_of_your_core_ml_app
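With the current TFLite tooling, a minimal post-training quantization sketch looks like this (the SavedModel path is a placeholder; the exact flags depend on your TensorFlow version and target hardware).

```python
# A minimal sketch of post-training quantization with the TFLite converter.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantizes weights to 8-bit
quantized_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(quantized_model)
```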

  • Pruning

Pruning is the process of dropping neurons or weights in the network that are not very important for the task. This can be done in multiple ways: during training, or by running a set of validation images through the network to understand which neurons actually add value to the final softmax.

https://jacobgil.github.io/deeplearning/pruning-deep-learning

https://arxiv.org/pdf/1611.06440.pdf
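To illustrate the core idea, here is a minimal sketch of magnitude-based pruning on a single dense layer; the function name and sparsity level are illustrative assumptions, and a real pipeline would normally fine-tune after pruning (for example with the TensorFlow Model Optimization Toolkit).

```python
# A minimal sketch of magnitude-based weight pruning: weights whose absolute
# value falls below a percentile threshold are zeroed out.
import numpy as np
import tensorflow as tf

def prune_dense_layer(layer: tf.keras.layers.Dense, sparsity: float = 0.5):
    weights, bias = layer.get_weights()
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) >= threshold      # keep only the larger weights
    layer.set_weights([weights * mask, bias])
```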

  • Knowledge Distillation Training

The concept here is that you pick the best model you have in terms of accuracy and use that model’s outputs to train your smaller base model, so that the smaller model ends up performing almost as well as the high-end trained model. I will be writing a detailed blog post about this in the coming days. In the meanwhile, you can follow the link below to understand it well.

https://arxiv.org/pdf/1503.02531.pdf
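As a preview of the idea, here is a minimal sketch of a distillation loss, assuming the teacher and student are classification models that output logits; the temperature and mixing weight are illustrative values, not a fixed recipe.

```python
# A minimal sketch of a knowledge-distillation loss: the student matches the
# teacher's softened probabilities plus the usual loss on the true labels.
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: match the teacher's softened distribution.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_loss = tf.keras.losses.categorical_crossentropy(
        soft_targets, tf.nn.softmax(student_logits / temperature))
    # Hard targets: the usual cross-entropy on the true labels.
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.nn.softmax(student_logits))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```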

Model Verification

This is an important step in the entire process; it is essentially a testing step to make sure everything still behaves the same. During model conversion, a lot of things change: the weights, the serialization format, or the way the graph itself is structured.

This step tells us whether the converted model still performs similarly to the trained model. For this, one has to build a small framework that runs inference with both the converted model and the trained model, passes the same set of images through each, and compares the outputs using some distance metric to quantify the difference between the two.
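Here is a minimal sketch of such a check for a TensorFlow-to-TFLite conversion, assuming a Keras model, its converted "model.tflite" file, and a small batch of already preprocessed images (all placeholders); it uses maximum absolute difference as the distance metric.

```python
# A minimal sketch of conversion verification: run the same images through the
# Keras model and the TFLite interpreter and compare the outputs.
import numpy as np
import tensorflow as tf

def max_abs_diff(keras_model, tflite_path, images):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    worst = 0.0
    for img in images:                          # one image at a time
        ref = keras_model(img[None, ...]).numpy()
        interpreter.set_tensor(inp["index"], img[None, ...].astype(np.float32))
        interpreter.invoke()
        lite = interpreter.get_tensor(out["index"])
        worst = max(worst, float(np.max(np.abs(ref - lite))))
    return worst  # a large value means the conversion changed behaviour
```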

Model Profiling

During edge deployment, there comes a time when one wants to understand how the model performs on a specific device in terms of timing, memory, and various other things.

This step becomes very important, as it determines the type of model optimization you need, the image resolution that will work for you, which base model to consider, and a lot of other things.

As an example, if your model is too big to load, you end up doing quantization and switching to a much lighter base network. If you see it is too slow, techniques like pruning come into the picture, apart from any architectural changes you might end up making.

To understand it in a more practical way, you can follow the blog post I published a year back on how to profile TFLite models on Android.

https://heartbeat.fritz.ai/profiling-tensorflow-lite-models-for-android-a2bc53199682
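Before going on-device, a rough desktop-side sketch like the one below can give a first latency number (assuming a converted "model.tflite" file as a placeholder); keep in mind these numbers only approximate what you will see on the actual device, which is what the post above measures.

```python
# A minimal sketch for rough desktop-side latency numbers with the TFLite
# interpreter, using random data shaped like the model's input.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.random.random_sample(inp["shape"]).astype(np.float32)

timings = []
for _ in range(50):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    timings.append(time.perf_counter() - start)

print("median latency (ms):", 1000 * sorted(timings)[len(timings) // 2])
```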

Model Deployment

The final step of the process is deployment. It has no major complicated steps; things are as simple as loading the converted model with the corresponding library on the edge device.

In terms of modeling it does look simple, but one has to take care of implementing the exact preprocessing steps the image needs before passing it in for inference. There can be other complications if you work on live-stream inference: you have to handle things like skipping frames, the streaming pixel-buffer format, and other minor issues that are difficult to figure out until you face them.
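As a reminder of how specific that preprocessing contract is, here is a minimal sketch; the target size and normalization are illustrative assumptions and must be replaced by whatever was actually used during training.

```python
# A minimal sketch of on-device-style preprocessing: it must reproduce the
# training-time preprocessing exactly, or the model's accuracy will drop.
import numpy as np
from PIL import Image

def preprocess(image_path, size=(224, 224)):
    img = Image.open(image_path).convert("RGB").resize(size, Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32)
    arr = arr / 127.5 - 1.0          # scale to [-1, 1], assumed to match training
    return arr[None, ...]            # add the batch dimension
```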

These opinions are based entirely on my experiments and experiences over the last few years and might not apply to your case. So please feel free to object if the content deviates from the facts; I will be more than happy to edit it accordingly.
