The Challenges of Implementing ML in the IoT Space — Part 2

Jamie Bamforth
7 min read · Mar 2, 2022


This is the second part of a two-part series discussing the challenges for Machine Learning (ML) in the Internet of Things (IoT) space. The first part can be found here and should be read first!

The Internet of Things (IoT) space is a growing area for utilising Machine Learning (ML) solutions. When it comes to ML, working in the IoT space presents some unique challenges. We previously defined these challenges in the first part of the series. In this second part of the series, we will explore the data and ML engineering solutions that can be implemented to overcome them.

The Solutions

Like the story of Perseus and Medusa, we can see that lots of small and slippery challenges have a common solution at the source. Unlike in the case of Perseus and Medusa, the solution involves further integration rather than separation. (Source)

For context on the challenges these solutions address, please refer to the previous article.

The three challenges in executing the ML life-cycle in the IoT space we discussed in the previous article.

Getting Data to the Cloud

In order to get data into the cloud whilst maintaining the low profile and minimalism of the embedded tech, live data can be streamed over Bluetooth to a central device that is connected to the internet. This central device caches the data in local storage until it has the opportunity to upload it to the cloud.
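As a rough sketch of how such a cache-and-forward gateway might work (the names `GatewayCache` and `upload_fn` are illustrative, not from any specific library):

```python
import json
import time
from collections import deque

class GatewayCache:
    """Buffers sensor readings on the central device until the cloud is reachable."""

    def __init__(self, upload_fn, max_buffer=10_000):
        self.buffer = deque(maxlen=max_buffer)  # bounded local cache
        self.upload_fn = upload_fn              # e.g. an HTTPS or MQTT publish call

    def on_bluetooth_reading(self, device_id, payload):
        # Cache every incoming reading locally, stamped with its arrival time.
        self.buffer.append({"device": device_id, "ts": time.time(), "data": payload})

    def flush(self, connected):
        # When a connection is available, drain the cache in batches.
        uploaded = 0
        while connected and self.buffer:
            batch = [self.buffer.popleft() for _ in range(min(100, len(self.buffer)))]
            self.upload_fn(json.dumps(batch))
            uploaded += len(batch)
        return uploaded
```

The bounded buffer means the gateway degrades gracefully (dropping the oldest readings) if connectivity is lost for an extended period; whether that trade-off is acceptable depends on the product.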

Data Labelling

A semi-supervised approach can be taken to labelling. An initial approach such as a convolution-based algorithm designed by a subject matter expert can be used to label the data, where each label has an associated certainty score. Labels that pass the certainty threshold are committed as labels to train an initial model against. Labels below the threshold are reviewed by a human expert in a labelling tool specifically designed and implemented for this purpose. An example of such a tool can be seen below.

An example of a tool that could be used to label time series signals.

There is an initial time investment by the subject matter expert to design the algorithm, but after this point it minimises the demand on the expert's time for labelling. It allows much of the data to be labelled automatically whilst highlighting the edge cases to a human expert for review.

This is a risk-based and iterative approach:

  • The higher the certainty threshold, the safer the auto-labelling, but the more time the expert must devote to labelling and vice versa.
  • As a better understanding of the data evolves, so should this approach. For example, as models are trained, samples that were just below the model’s threshold for labelling as positive should be reviewed to: a) see if they may actually be positive samples, then b) determine how this information can be incorporated into the semi-supervised method. Otherwise, you risk high precision but low recall. That is, every point the model assigns a specific label has a very high probability of being correct (true positive), but it may only capture a very small number of the points that should be assigned a label, and so produce lots of false negatives.
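The thresholding step above can be sketched as follows, assuming a hypothetical `auto_labeller` that returns a label and a certainty score for each sample:

```python
def split_by_certainty(samples, auto_labeller, threshold=0.95):
    """Auto-commit labels above the certainty threshold; queue the rest for review.

    `auto_labeller` is assumed to return (label, certainty) for a sample,
    e.g. from a convolution-based heuristic designed by a domain expert.
    """
    committed, for_review = [], []
    for sample in samples:
        label, certainty = auto_labeller(sample)
        if certainty >= threshold:
            committed.append((sample, label))              # safe to train on
        else:
            for_review.append((sample, label, certainty))  # human expert reviews
    return committed, for_review
```

Raising `threshold` shifts samples from `committed` to `for_review`, which is exactly the risk/effort dial described in the first bullet.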

Model training & deployment to embedded tech

A managed ML platform like Amazon SageMaker is a flexible and powerful tool that provides the features to create a training and deployment solution that integrates well with the IoT space requirements. Within the platform, SageMaker Pipelines specifically provides the ability to create and manage ML life-cycle workflows, with the flexibility to address the challenges of working with ML in the IoT space whilst also providing the scalability, repeatability and automation expected of a modern ML life-cycle.

a. Matching the environment in training

When deploying trained models, below are some generalised approaches:

  1. Deploy the model file to a device running a code environment that is the same environment that it was trained in (e.g. Python 3.9, TensorFlow 2.8).
  2. Deploy a Docker image of the model to the device.
  3. Deploy just the trained parameters to a custom implementation of the model specific to the device (e.g. a lightweight C implementation).

Options 1 and 2 are common and applied in many ML deployments. However, option 3 is required in many IoT ML solutions because compute limitations demand that any superfluous processing be removed from prediction serving.

You will notice that the title of this subsection is “Matching the environment in training”, and using option 3 we are not really doing that. In fact, this reveals the issue that the best tools for training at big-data scale and the best tools for deploying at embedded IoT device scale are likely to be too inherently different to achieve the same software environment.

As a result of using different environments for training and inference, many more sources of possible error are introduced. This risk can be minimised (but not eliminated) by more robust processes to ensure the custom implementation of the model, with pre-trained parameters applied, gives the same results as the trained version with the same parameters. At the very least, tests should run a range of inputs through both models and compare the outputs for equality (within numerical tolerance). Processes should also be put in place to monitor the model in the production environment; metrics such as prediction speed in the resource-limited environment are also important in the decision of whether a model is appropriate to deploy.
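A minimal parity check along these lines might look like the following sketch, assuming both implementations expose a simple predict function over NumPy arrays (the function names here are illustrative):

```python
import numpy as np

def check_parity(reference_predict, custom_predict, input_shape,
                 n_cases=100, atol=1e-5, rng_seed=0):
    """Run the same inputs through both implementations and compare outputs.

    `reference_predict` is the trained model (e.g. wrapping a TensorFlow
    predict call) and `custom_predict` the lightweight re-implementation;
    both are assumed to map an input array to an output array.
    """
    rng = np.random.default_rng(rng_seed)
    for _ in range(n_cases):
        x = rng.standard_normal(input_shape)
        if not np.allclose(reference_predict(x), custom_predict(x), atol=atol):
            return False  # flag the custom implementation as unsafe to deploy
    return True
```

In practice you would also want to include real samples from the training and validation sets, not just random inputs, so that the comparison covers the distribution the model will actually see.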

Let’s take the use case of voice-commanded robot assistants as an example of how this may be implemented.

In this case we may find a Long Short-Term Memory (LSTM) based Deep Neural Network (DNN) is likely an appropriate architecture for the solution as it allows us to handle arbitrarily long time series input with information persistence in both the long and the short term (hence the name). This is especially useful for NLP (Natural Language Processing) models as it means the model can handle variable length sequences (e.g. multiple sentences of multiple words) and consider information from words that have just occurred as well as words or concepts that may have occurred a sentence or two before, but are still crucial to overall understanding.

Matrix algebra representation of an LSTM node. Here `t` represents the time step in the sequence, sigma represents the sigmoid function. (LSTM node diagram adapted from source)

Conceptually, an LSTM node is one of the more involved node types that can be found in a DNN. However, when we break it down, we find the operations in each node can be mathematically represented by the matrix algebra seen above. The only difference between the different nodes is then the trainable parameters: the matrix of weights (W) and the matrix of biases (b). By stripping away the parts of the implementation that are required for a package such as TensorFlow to enable functionality such as back-propagation for training, we are left with just these essential matrix operations for inference. It is then clear how this satisfies the requirement in the IoT space for a lightweight model implementation on a compute-limited embedded device.
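As an illustration, a single LSTM time step can be written with nothing but matrix operations (shown here in NumPy, though a C implementation would be equally direct). The gate equations follow the standard LSTM formulation, with `W` and `b` holding the trained weights and biases for each gate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step for inference, using only matrix operations.

    W maps the concatenated [h_prev, x_t] vector to each of the four gates,
    and b holds the corresponding biases; both come from the trained model.
    """
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c = f * c_prev + i * c_hat            # new cell state
    o = sigmoid(W["o"] @ z + b["o"])      # output gate
    h = o * np.tanh(c)                    # new hidden state
    return h, c
```

Iterating `lstm_step` over a sequence, carrying `h` and `c` forward, is the entire inference loop; no back-propagation machinery, computation graphs or framework runtime is needed on the device.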

b. Version control and automated deployment

Sticking with the LSTM example for a moment, we can see how the matrix implementation can minimise infrastructure requirements for deployment. Most of the time only the trained parameters need to be deployed, which minimises the data transferred. This advantage is significant when deployment may be required across millions of IoT devices, where minimising data transferred simplifies the infrastructure requirements. Even deploying a whole model with an updated architecture is significantly cheaper with this minimal-functionality implementation than with a package like TensorFlow or PyTorch. Versioning is still critical, however.

Deploying only trained parameters to a lightweight custom implementation of an LSTM based network.

Whatever versioning is implemented should identify the use-case/context of each IoT device, along with the current version of parameters and model architecture deployed. This then allows the correct set of new model parameters, or the correct new model architecture, to be deployed when required. The models trained in the cloud should also have similar metadata associated: the architecture version they are compatible with and the data context they were trained in.

This is an example where having the right data architecture in place can significantly streamline the process. With this data architecture in place, parameters or model architectures can be deployed by a couple of simple scripts running in some sort of lightweight automated process executor (e.g. a Lambda function): an orchestration script and a deployment pipe script. By simply supplying the version of the weights to be deployed to the orchestrator, it can check which devices should be updated, trigger other executors to update those devices, and then update the versioning status information in the database if successful.
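A hypothetical orchestration step along these lines could look like the sketch below. The registry and database interfaces here are purely illustrative (not a real AWS API), standing in for whatever metadata store the solution uses:

```python
def orchestrate_deployment(param_version, device_registry, trigger_update, db):
    """Given a parameter version, find compatible devices and trigger
    a deployment executor (e.g. another Lambda) for each one.

    `device_registry` maps device_id -> {"context": ..., "arch_version": ...};
    the parameter version's metadata states which architecture and data
    context it targets.
    """
    target = db.get_parameter_metadata(param_version)
    updated = []
    for device_id, info in device_registry.items():
        # Only deploy to devices whose architecture and context match.
        if (info["arch_version"] == target["arch_version"]
                and info["context"] == target["context"]):
            if trigger_update(device_id, param_version):
                # Record the new version only on a successful update.
                db.record_deployment(device_id, param_version)
                updated.append(device_id)
    return updated
```

The key point is that the orchestrator itself stays trivial: all the intelligence lives in the versioning metadata, which is why getting that data architecture right pays off.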

Final Words

The IoT space presents its own challenges for ML, related to the limitations imposed by embedded hardware, a high output of unique data, and the possibility of limited or no direct connection to the cloud. We have demonstrated how a solution with appropriate architecture can leverage existing ML platforms to implement pragmatic, scalable processes for labelling and deployment, building a fit-for-purpose solution for the complete IoT ML life-cycle. Here we have seen a general approach to the solution, but the specific technologies utilised and approaches taken in any solution will depend on the business context of the IoT product in question.
