How we brought our IoT platform to production

Niklas Hofer
Published in grandcentrix
6 min read · Mar 19, 2019

As developers we face problems every day — some easy to master, others harder, some seemingly impossible. More often than not this classification is based on the technology used to address these issues.

The Challenge

Zap, there it was! In 2018 we faced the task of connecting devices to their end users to enhance their functionality. A purely local connection had already been implemented, so we focused on remote control and notifications over the internet. We are a company centered around IoT, after all.

Photo by rawpixel on Pexels

United as expected by the #bestteam, we aim for a holistic approach by embracing all parts of the development process. The synergy of embedded hardware, a service-based backend and efficient DevOps is only possible when all units work together and remain adaptable, even on short notice.

To become a Hero, one must first master being the Sidekick

Reinforcing our capabilities on the thriving IoT market, we had become a Gold Certified Partner with a focus on Microsoft Azure IoT. Thanks to this enormous technological and organizational advantage, instead of implementing our solution as a monolith solely in-house, we chose a service-oriented architecture. With the services provided through this partnership, it was easy to divide our task into several independent segments. Borrowing from famous strategists, we could conquer these segments individually by developing, testing and optimizing them separately from each other.

(To keep this article at a reasonable length, we won’t cover the embedded part or the end-user interface here, but describe the experiences we had in the backend team.)

We have a Plan

Electronic components of our customer’s devices send their data to the Microsoft Azure IoT Hub through an MQTT connection, authenticated by SSL client certificates provided by a PKI that is managed by our customer. A properly configured IoT Hub separates incoming messages by concern and forwards them internally to the Microsoft Azure Event Hub. For example, seldom-sent provisioning messages can be processed separately from frequently delivered telemetry updates, which makes scaling more precise.
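To give a feeling for what that connection looks like from a device’s point of view, here is a minimal Elixir sketch using the Tortoise MQTT client. The real devices run embedded firmware, so this is purely illustrative; the client library, host name and certificate paths are assumptions on our side.

```elixir
# Minimal sketch: connect to an Azure IoT Hub over MQTT 3.1.1 with an
# X.509 client certificate. Hub name, device id and file paths are placeholders.
defmodule Example.HubConnection do
  @hub_host "example-hub.azure-devices.net"

  def connect(device_id) do
    Tortoise.Supervisor.start_child(
      client_id: device_id,
      # IoT Hub expects the username in this form; there is no password with X.509 auth
      user_name: "#{@hub_host}/#{device_id}/?api-version=2018-06-30",
      handler: {Tortoise.Handler.Logger, []},
      server:
        {Tortoise.Transport.SSL,
         # the underlying :ssl module expects a charlist for the host
         host: String.to_charlist(@hub_host),
         port: 8883,
         # client certificate issued by the customer's PKI
         certfile: "certs/#{device_id}.pem",
         keyfile: "certs/#{device_id}.key",
         cacertfile: "certs/azure-roots.pem",
         verify: :verify_peer}
    )
  end
end
```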

Our Elixir-powered backend listens to a selection of these Event Hubs, processes the incoming messages, applies business logic, and stores them depending on the requirements. It also provides a REST API for consuming clients, protected by an external authentication service and authorized through previously established, physically proven ownership of a device.
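The shape of that per-message processing step can be sketched roughly like this. The Event Hub consumer feeding it and the storage layer are omitted, and the message format shown here is a simplified stand-in, not our actual schema:

```elixir
# Rough sketch: decode a message, branch on its concern, then persist or
# trigger business logic. The Event Hub transport is not shown here.
defmodule Example.EventProcessor do
  require Logger

  # Telemetry updates arrive frequently; here we would persist them.
  def handle_event(%{"type" => "telemetry", "device_id" => id, "payload" => payload}) do
    Logger.debug("storing telemetry for #{id}: #{inspect(payload)}")
    {:ok, :stored}
  end

  # Provisioning messages are rare and trigger business logic instead.
  def handle_event(%{"type" => "provisioning", "device_id" => id}) do
    Logger.info("registering device #{id}")
    {:ok, :registered}
  end

  def handle_event(event) do
    Logger.warning("dropping unknown event: #{inspect(event)}")
    :error
  end
end
```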

We simulated the mobile app during development with a collection of shell scripts, making it easy to test and debug our part of the project without having to wait for the mobile app team to implement functionality for us first. To prepare for future use of the data sent by devices, we configured the Event Hub to store telemetry data in the Microsoft Azure Data Lake Store via the so-called “Capture” feature, see below.
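Each of those scripts boils down to little more than an authenticated HTTP call against our REST API. Translated from shell into Elixir for illustration, it could look like the sketch below; the endpoint path, base URL and token handling are made up, and HTTPoison only serves as an example HTTP client:

```elixir
# Sketch of the "fake mobile app": call the backend's REST API with a bearer
# token. URL, path and token are placeholders for illustration only.
defmodule Example.FakeApp do
  @base_url "https://backend.example.com/api/v1"

  def fetch_device(device_id, token) do
    "#{@base_url}/devices/#{device_id}"
    |> HTTPoison.get!([{"authorization", "Bearer #{token}"}])
    |> Map.fetch!(:body)
    |> Jason.decode!()
  end
end
```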

Our Azure IoT Infrastructure by Anne Koep-Lehmann

Special care was taken to make sure that no private or personal data was persisted that way, in order to comply with the GDPR. Our backend controls and monitors the devices by sending commands as Direct Methods through the Microsoft Azure IoT Hub.
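Invoking such a Direct Method from a backend boils down to a single authenticated call against the IoT Hub REST API. Here is a minimal sketch of that call, including the shared access signature it needs; hub name, policy and method name are placeholders, and this is not a copy of our production code:

```elixir
# Sketch: invoke a Direct Method on a device through the IoT Hub REST API.
# Hub name, shared access key, policy and method name are placeholders.
defmodule Example.DirectMethod do
  @hub "example-hub.azure-devices.net"
  @policy "service"

  def invoke(device_id, method, payload, shared_access_key) do
    body =
      Jason.encode!(%{
        "methodName" => method,
        "responseTimeoutInSeconds" => 30,
        "payload" => payload
      })

    HTTPoison.post!(
      "https://#{@hub}/twins/#{device_id}/methods?api-version=2018-06-30",
      body,
      [
        {"authorization", sas_token(shared_access_key)},
        {"content-type", "application/json"}
      ]
    )
  end

  # Shared access signature for a hub-level policy, valid for one hour.
  defp sas_token(key) do
    expiry = System.system_time(:second) + 3600
    resource = URI.encode_www_form(@hub)

    signature =
      :crypto.mac(:hmac, :sha256, Base.decode64!(key), "#{resource}\n#{expiry}")
      |> Base.encode64()
      |> URI.encode_www_form()

    "SharedAccessSignature sr=#{resource}&sig=#{signature}&se=#{expiry}&skn=#{@policy}"
  end
end
```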

Deploy (to) the Azure Cloud

Having an easily described and repeatably deployable infrastructure, with the different staging environments being nothing more than parameters, is a high priority for us. That’s why we used Terraform, a tool to orchestrate (initialize, update, destroy) your services in the cloud.

It is a great and flexible tool and makes it really simple to orchestrate all services needed.

Photo by Chris Leipelt on Unsplash

Terraform, and consequently automation, unfortunately has its limits when used with Azure. We got stuck in an endless loop when we tried to initialize multiple Event Hubs at once; it seems the cloud itself is overwhelmed by the number of parallel requests. To overcome this issue, it was helpful to include a dependency chain that forces the sequential initialization of our Event Hubs, one after the other. There was also no Terraform resource to initialize the Azure Device Provisioning Service (DPS) yet, the service our infrastructure uses to provision the devices. Our workaround was a null_resource with a local-exec provisioner that calls the Azure Command Line Interface. We used the same workaround to connect the Azure DPS to the Azure IoT Hub and to create an Azure Data Lake Store folder.
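The commands behind that local-exec workaround are plain Azure CLI calls. Sketched here as a small Elixir deploy helper rather than the original HCL, just to show the shape of the calls; all names are placeholders and the exact flags may differ between azure-cli versions:

```elixir
# Sketch of the CLI calls our local-exec provisioners run: create the DPS,
# link it to the IoT Hub and create a Data Lake Store folder. In reality these
# live inside a Terraform null_resource, not in Elixir.
defmodule Example.AzureWorkarounds do
  def provision(resource_group, dps_name, hub_connection_string, adls_account) do
    az(~w(iot dps create --name #{dps_name} --resource-group #{resource_group}))

    az([
      "iot", "dps", "linked-hub", "create",
      "--dps-name", dps_name,
      "--resource-group", resource_group,
      "--connection-string", hub_connection_string,
      "--location", "westeurope"
    ])

    az(~w(dls fs create --account #{adls_account} --path /telemetry --folder))
  end

  defp az(args) do
    # crash loudly if any az invocation fails
    {output, 0} = System.cmd("az", args)
    output
  end
end
```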

On top of this, it is not yet possible to prepare the Azure Data Lake Store for the Capture feature of the Azure Event Hub using Terraform. The Capture feature pipes messages from an Event Hub directly into the Data Lake Store without any intervention on our part. These steps still had to be done manually. It also means that Terraform cannot hold the state of the Event Hub’s Capture feature: whenever you apply Terraform again, the Capture configuration is gone and has to be set up all over again. This and a few other missing features, like enabling the Fallback Route of the IoT Hub (which will come in the next Terraform version), made us question the whole point of automation.

Optimization

Following the rough estimation given by our customer, between 20k and 80k devices were expected to go online over the course of the coming year. During development, our embedded development team only had one real physical device available. While this enabled us to test the full integration from time to time, real benchmarks were unattainable with a single physical endpoint.

There was little we could do except what every developer would have done: we wrote a device simulator. It started out as a collection of Python scripts, but over a few weeks it grew into an easy-to-run, standalone Elixir/Phoenix web application with an Elm frontend. The application mimics the behavior of our physical devices over their whole life cycle. Some parts we peeked from the already implemented embedded firmware, newer features we developed alongside our backend application. This allowed us to adjust minor details of the protocols we had designed beforehand, reducing the number of round-trips between the two teams and letting the embedded developers work in peace.
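At its core, each simulated device is a single process walking through the same states as the real firmware. Stripped of all details, the skeleton looks roughly like this; the state names are ours, everything else is simplified:

```elixir
# Skeleton of a simulated device: one GenServer per device, stepping through
# the same life cycle as the physical hardware. The actual factory,
# provisioning and connection logic is omitted.
defmodule Example.SimulatedDevice do
  use GenServer

  def start_link(device_id),
    do: GenServer.start_link(__MODULE__, device_id)

  @impl true
  def init(device_id) do
    # drive the life cycle asynchronously so many devices can start at once
    send(self(), :factory_setup)
    {:ok, %{id: device_id, state: :new}}
  end

  @impl true
  def handle_info(:factory_setup, device) do
    # generate a device id and certificate (see the factory sketch below)
    send(self(), :provision)
    {:noreply, %{device | state: :factory_done}}
  end

  def handle_info(:provision, device) do
    # enroll and register with the Device Provisioning Service
    send(self(), :connect)
    {:noreply, %{device | state: :provisioned}}
  end

  def handle_info(:connect, device) do
    # open the MQTT connection to the IoT Hub and start sending telemetry
    {:noreply, %{device | state: :connected}}
  end
end
```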

Photo by Fares Nimri on Unsplash

Eventually we were able to simulate the whole life cycle of a device. As a first step, the factory setup generates a valid-looking device ID and includes it in the certificate infrastructure, which was designed identically to the one our customer provided to us. The second phase provisions the device at the Microsoft Azure Device Provisioning Service, which consists of the so-called enrollment and registration flows, as comprehensively documented by our partner.
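The factory step is essentially nothing more than creating a key pair and having it signed by the test CA that mirrors our customer’s PKI. A rough sketch, shelling out to openssl; all paths and the id scheme are placeholders:

```elixir
# Sketch of the simulator's "factory": generate a device id, create a key and
# a client certificate signed by our test CA. Paths are placeholders.
defmodule Example.Factory do
  def build_device do
    File.mkdir_p!("certs")
    device_id = "sim-" <> Base.encode16(:crypto.strong_rand_bytes(6), case: :lower)
    key = "certs/#{device_id}.key"
    csr = "certs/#{device_id}.csr"
    crt = "certs/#{device_id}.pem"

    {_, 0} = System.cmd("openssl", ~w(genrsa -out #{key} 2048))
    # the certificate's common name has to match the registration id
    {_, 0} = System.cmd("openssl", ~w(req -new -key #{key} -subj /CN=#{device_id} -out #{csr}))

    {_, 0} =
      System.cmd("openssl", ~w(x509 -req -in #{csr} -CA certs/test-ca.pem
                               -CAkey certs/test-ca.key -CAcreateserial
                               -days 365 -out #{crt}))

    {device_id, crt, key}
  end
end
```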

Establishing a connection to the Microsoft Azure IoT Hub enabled us to send, for example, telemetry messages analogous to the ones emitted by the physical devices. It also gave us the ability to receive Direct Methods and customize the responses to them on the fly.
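On the wire this is plain MQTT against the IoT Hub’s well-known topics. A sketch of the publishing side and of a handler answering Direct Methods, again using Tortoise as an example client and with trivial payloads; the handler state is assumed to hold the client id (set up in an init callback that is not shown):

```elixir
# Telemetry goes out on the device-to-cloud topic; Direct Methods arrive on
# $iothub/methods/POST/... and are answered on $iothub/methods/res/<status>.
defmodule Example.DeviceMessaging do
  def publish_telemetry(device_id, measurements) do
    Tortoise.publish(
      device_id,
      "devices/#{device_id}/messages/events/",
      Jason.encode!(measurements),
      qos: 1
    )
  end
end

defmodule Example.MethodHandler do
  use Tortoise.Handler

  # assumes a subscription to "$iothub/methods/POST/#" was made on connect
  def handle_message(["$iothub", "methods", "POST", _method, props], _payload, state) do
    # props looks like "?$rid=<request id>"; echo it back with a 200 status
    rid = String.replace_prefix(props, "?$rid=", "")
    response_topic = "$iothub/methods/res/200/?$rid=#{rid}"
    Tortoise.publish(state.device_id, response_topic, Jason.encode!(%{ok: true}), qos: 0)
    {:ok, state}
  end

  def handle_message(_topic, _payload, state), do: {:ok, state}
end
```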

Fortunately, the life cycle described above succeeded right away most of the time, but we had to make sure that the infrastructure could handle the estimated number of devices. We extended the device simulator with the ability to create and handle multiple devices at once. To avoid confusion with the Microsoft Azure Device Provisioning Service feature named “group enrollment”, we baptized it “swarms”.
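Under the hood, a swarm is little more than a supervisor starting one simulated device process per generated id. Conceptually, building on the SimulatedDevice sketch above:

```elixir
# Sketch of a "swarm": start N simulated devices under a DynamicSupervisor.
defmodule Example.Swarm do
  def start(size) do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    for n <- 1..size do
      child = %{
        id: {:device, n},
        start: {Example.SimulatedDevice, :start_link, ["swarm-device-#{n}"]}
      }

      {:ok, _pid} = DynamicSupervisor.start_child(sup, child)
    end

    sup
  end
end
```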

On our first attempt, we tried to register 1,000 devices at once, but only 20–100 requests were successful; the remainder was dropped with a throttling error. We solved this conundrum by distributing the registration across several machines, each running our device simulator behind a different IP address, which also enhanced the warm feeling of cooperation within our team.
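On each of those machines the simulator still has to be gentle with the provisioning endpoints. The sketch below shows one way to cap the number of concurrent registrations per node; this is only a local rate-limiting illustration, not a replacement for the distribution across IP addresses described above, and the concurrency value is an arbitrary example:

```elixir
# Sketch: register devices with a bounded number of concurrent requests so a
# single simulator node stays below the DPS throttling limits.
defmodule Example.ThrottledRegistration do
  # register_fun is a 1-arity function doing the actual DPS registration and
  # returning {:ok, result} on success.
  def register_all(device_ids, register_fun) do
    device_ids
    |> Task.async_stream(register_fun, max_concurrency: 20, timeout: 60_000)
    |> Enum.reduce(%{ok: 0, error: 0}, fn
      {:ok, {:ok, _}}, acc -> %{acc | ok: acc.ok + 1}
      _other, acc -> %{acc | error: acc.error + 1}
    end)
  end
end
```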

Once this mountain was conquered, we could emit messages at ten times the planned frequency in order to estimate more accurately the impact of the number of devices planned to go live. After the team effort led to huge success, we could run our telemetry benchmarks using Stormforger, which revealed the IoT Hub as a bottleneck. After scaling it up by increasing the number of units Azure runs in parallel, our system could handle the load of several thousand devices without any hassle.

Leaving that last hurdle behind us, we are confident that we will make yet another customer a happy participant in the new and exciting Internet of Things.

Special thanks go to the co-author of this article, Anne Koep-Lehmann.
