EEoI for Efficient ML in Edge Computing

Eric Muccino
Mindboard
Jul 1, 2020 · 6 min read

Edge Computing

A recent trend in computer networking is the utilization of edge devices, intermediate computing devices that bridge the gap between end devices and the cloud. End devices (e.g. phones, tablets, smart speakers) are restricted by their lack of computational resources, limiting the employment of resource-intensive functions such as machine learning. Cloud computing has provided an alternative way of executing tasks that would otherwise be infeasible on an end device. Devices can offload processes to remote servers, which perform the computation faster and return the results to the end device. However, not all computing tasks benefit from the cloud. Sending data to the cloud can introduce latency caused by network unreliability. If possible, it is more efficient to perform computations locally.

An alternative to the cloud, edge devices are locally positioned servers that can communicate with end devices at lower latency. However, edge devices offer fewer resources than the cloud. One basic strategy for edge computing is to perform tasks as close to the place of request as possible. Vertical collaboration does this by restricting communication so that only adjacent devices can offload tasks. For example, if a calculation is requested on a mobile device, the process will first be attempted on the device itself. If the device’s resources are insufficient, the task will be offloaded to an edge node. If the edge node is unable to complete the job, the process will be moved to the cloud, where it can be executed and the result returned to the end device. More complex computing architectures can split a process into multiple sub-processes that are executed across each layer in the network hierarchy. Networks may also incorporate multiple edge nodes for parallel computation.

Early Exit of Inference (EEoI)

EEoI is an application of vertical collaboration within an edge network that provides an efficient method for running neural network inference on end devices with scarce resources. A typical neural network uses a series of computational layers to sequentially transform input data. The output of one layer feeds into the next, with the output of the last layer producing the final model output. The number of layers in a model is fixed before the model is trained and cannot be adjusted without training an entirely new model. Having more layers in a neural network allows more complex distributions to be modeled, so there need to be enough layers to accommodate the most complex portion of the training data set. However, there is usually a trade-off between model accuracy and efficiency. Models with more layers have the capacity to achieve higher inference accuracy on more complex data, but they also have a larger memory footprint and slower inference.

A drawback of the typical feed-forward network architecture is that the model size is fixed. Every inference call uses the entire stack of layers to produce an output. As a result, the complexity requirements of many neural network models make them incapable of directly servicing end devices. Larger models are relegated to edge or cloud devices, increasing inference latency. However, most data sources contain samples of varying levels of complexity. Within a data set, some samples are well behaved, strictly adhering to a primary distribution, while others contain more unique elements or noise. We can use this fact to design a neural network architecture that only uses as many layers as it needs for inferring any given sample. To do this, we will implement Early Exit of Inference.

EEoI incorporates multiple output layers at various depths within a neural network. Each output layer learns to produce accurate classification predictions based only upon the information processed by the layers that come before it. Output layers deeper in the network have the opportunity to learn more complex distributions, leading to better inference accuracy than the preceding output layers. During inference, each successive output layer is observed. Inference continues through subsequent layers until an output layer makes a prediction with a sufficiently high probability, or until the final output layer is reached. The prediction probability threshold for EEoI is a pre-set hyper-parameter with a trade-off between overall inference accuracy and inference speed: a higher threshold produces more accurate, but slower, inference on average.
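The exit decision itself is simple. Here is a minimal, framework-agnostic sketch in Python; early_exit_predict and exit_heads are illustrative names (not part of any library), and each head is assumed to return a softmax probability vector.

```python
import numpy as np

def early_exit_predict(sample, exit_heads, threshold=0.9):
    """Return (prediction, exit_index) from the first sufficiently confident head.

    exit_heads: callables ordered shallowest to deepest, each mapping a
    sample to a softmax probability vector. Names here are illustrative.
    """
    for i, head in enumerate(exit_heads):
        probs = head(sample)
        # Exit early once the top class probability clears the threshold,
        # or fall through to the final (deepest) head.
        if np.max(probs) >= threshold or i == len(exit_heads) - 1:
            return int(np.argmax(probs)), i
```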

Applying EEoI to Edge Computing

EEoI can be applied to edge computing, allowing a model to be separated across multiple devices. Initial layers can be placed on an end device, giving the device the chance to perform inference on well-behaved samples. If the desired output confidence is not achieved, the end device can offload the output of the last layer of its portion of the model to an edge device. This output is effectively an encoded version of the original input sample, which provides the added benefit of a smaller data size for more efficient offloading. The edge device will continue the inference process, offloading to the cloud if need be.

Let’s take a look at an example of EEoI on the MNIST digits data set in Keras (TF 2.1). First, we will load the data set.
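A minimal sketch of the loading step, assuming the standard tf.keras.datasets.mnist loader and scaling pixel values to [0, 1]:

```python
import tensorflow as tf

# Load MNIST and normalize pixel values; add a channel axis for Conv2D layers.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
```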

Next, we create a classification model in 3 portions, one for each device. Each device has a classification prediction output layer. The end and edge device models also need offloading output layers that supply the encoded data representations that will be offloaded to the next device if need be.
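A sketch of the three portions might look like the following. The layer counts and sizes are illustrative placeholders rather than the exact architecture used in the original experiment; the property worth preserving is that each offload output is smaller than the data the device received.

```python
from tensorflow.keras import layers, Model

# End-device portion: its own prediction head plus a compressed "offload" output.
end_in = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(8, 3, padding="same", activation="relu")(end_in)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(4, 3, padding="same", activation="relu")(x)
end_offload = layers.MaxPooling2D(2)(x)                      # 7x7x4 encoding sent onward
end_pred = layers.Dense(10, activation="softmax")(layers.Flatten()(end_offload))
end_model = Model(end_in, [end_pred, end_offload], name="end_device")

# Edge-device portion: consumes the end device's encoding.
edge_in = layers.Input(shape=(7, 7, 4))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(edge_in)
edge_offload = layers.MaxPooling2D(2)(x)                     # 3x3x16 encoding sent onward
edge_pred = layers.Dense(10, activation="softmax")(layers.Flatten()(edge_offload))
edge_model = Model(edge_in, [edge_pred, edge_offload], name="edge_device")

# Cloud portion: the deepest, final classifier.
cloud_in = layers.Input(shape=(3, 3, 16))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(cloud_in)
x = layers.Dense(64, activation="relu")(layers.Flatten()(x))
cloud_pred = layers.Dense(10, activation="softmax")(x)
cloud_model = Model(cloud_in, cloud_pred, name="cloud_device")
```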

Putting all 3 models together, we have a complete model that we can train. The target labels need to be supplied to all 3 output layers.
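Chaining the three portions, a sketch of the combined model (reusing the end_model, edge_model, and cloud_model defined above) could look like this; the optimizer and loss choices are assumptions for illustration:

```python
from tensorflow.keras import layers, Model

# Chain the three device models into one trainable network with three output heads.
full_in = layers.Input(shape=(28, 28, 1))
end_pred_out, end_enc = end_model(full_in)
edge_pred_out, edge_enc = edge_model(end_enc)
cloud_pred_out = cloud_model(edge_enc)

full_model = Model(full_in, [end_pred_out, edge_pred_out, cloud_pred_out])
full_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",   # integer labels fed to all three heads
    metrics=["accuracy"],
)
```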

Now we are ready to train the full model. The EarlyStopping callback is used to terminate training before the model over-fits, monitoring loss on the test data, which is used here as the validation set.
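A sketch of the training call, with illustrative epoch, batch-size, and patience values; note that the same integer labels are supplied once per output head:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss stops improving and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

full_model.fit(
    x_train,
    [y_train, y_train, y_train],                  # one copy of the labels per head
    validation_data=(x_test, [y_test, y_test, y_test]),
    epochs=50,
    batch_size=128,
    callbacks=[early_stop],
)
```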

With our trained model, we can test the performance of EEoI by running inference over the test data set with each device model. Using different confidence thresholds, we can observe the overall accuracy and how often each device is used for inference.
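One way to simulate this is to precompute every head's predictions on the test set and then apply the threshold per sample. The eeoi_evaluate helper and the threshold values below are illustrative, not part of any library; in a real deployment the deeper models would only run when a sample is actually offloaded.

```python
import numpy as np

def eeoi_evaluate(x, y, threshold):
    """Simulate EEoI at a given confidence threshold.

    Returns overall accuracy and the fraction of samples handled by each device.
    All heads are evaluated up front purely to make the simulation simple.
    """
    end_probs, end_enc = end_model.predict(x)
    edge_probs, edge_enc = edge_model.predict(end_enc)
    cloud_probs = cloud_model.predict(edge_enc)

    preds, device_used = [], []
    for i in range(len(x)):
        if np.max(end_probs[i]) >= threshold:
            preds.append(np.argmax(end_probs[i])); device_used.append("end")
        elif np.max(edge_probs[i]) >= threshold:
            preds.append(np.argmax(edge_probs[i])); device_used.append("edge")
        else:
            preds.append(np.argmax(cloud_probs[i])); device_used.append("cloud")

    accuracy = np.mean(np.array(preds) == y)
    usage = {d: device_used.count(d) / len(x) for d in ("end", "edge", "cloud")}
    return accuracy, usage

for threshold in (0.5, 0.9, 0.99):
    acc, usage = eeoi_evaluate(x_test, y_test, threshold)
    print(f"threshold={threshold}: accuracy={acc:.4f}, usage={usage}")
```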

Results

Just as we had hoped, increasing the confidence threshold improves our overall accuracy in exchange for increased utilization of edge and cloud devices.

Now just for fun, let’s take a look at some of the compressed samples that are offloaded to the edge and cloud devices.

As a sample is offloaded from end to edge and edge to cloud, it is compressed and encoded.
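A small matplotlib sketch for viewing one sample's raw pixels alongside its offloaded encodings; averaging over channels to get a single 2D map is purely a display choice, not part of the method.

```python
import matplotlib.pyplot as plt
import numpy as np

# Take one test sample and compute the encodings that would be offloaded.
sample = x_test[:1]
_, end_enc = end_model.predict(sample)
_, edge_enc = edge_model.predict(end_enc)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].imshow(sample[0, :, :, 0], cmap="gray");              axes[0].set_title("end device input")
axes[1].imshow(np.mean(end_enc[0], axis=-1), cmap="gray");    axes[1].set_title("offloaded to edge")
axes[2].imshow(np.mean(edge_enc[0], axis=-1), cmap="gray");   axes[2].set_title("offloaded to cloud")
plt.show()
```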

Pixel representations of a single sample, offloaded from end device to edge and cloud devices.

Conclusion

EEoI provides an elegant method for efficiently determining when to offload neural network inference to edge and cloud devices. EEoI is one of many innovative techniques that can help us to better utilize the edge for machine learning.

