Serverless transformation of IoT data-in-motion with OpenWhisk

Alex Glikson
Published in Apache OpenWhisk
Jan 24, 2017 · 5 min read

There are many interesting things one can do with a ‘serverless’ platform such as OpenWhisk. One of the more exciting use-cases is in-flight transformation of data ingested from IoT devices, as part of a larger pipeline involving additional data services. For example, in IBM Bluemix one can build such a pipeline using Watson IoT Platform, Message Hub, Object Storage and the Data Science Experience services (as illustrated in the following diagram).

Example of IoT pipeline in IBM Bluemix

In a nutshell, the role of each of these services is as follows:

  • IBM Watson IoT Platform (or IoTP in short) interacts with the IoT devices and serves as an MQTT broker for ingestion of events sent by devices. Furthermore, it contains a bridge (‘Historical data storage extension’) that can automatically publish messages to Kafka topics hosted on IBM’s Message Hub (e.g., configured per device type and/or event type). The Watson IoT Platform provides many other excellent capabilities (device management, rules engine, visualization, etc.) that are not covered in this post.
  • IBM Message Hub serves as the backbone for exchange of messages between the different cloud services involved in the ingest flow. Moreover, it contains a dedicated “bridge” that can automatically archive batches of messages into Object Storage.
  • IBM Cloud Object Storage is used for long-term storage, as well as for batch analytics with tools such as Apache Spark (e.g., hosted on IBM’s Data Science Experience platform).

So, why do we need OpenWhisk here? In many cases we don’t: the combination of the above services and bridges comprises a powerful and flexible pipeline, spanning from IoT devices up to cloud storage and an analytics platform. However, in some cases the pre-configured bridges lack the programmability required to implement and fine-tune a particular solution architecture. Given the huge variety of possible pipelines and the lack of standardization around data fusion in general, and in IoT in particular, there are many cases where the pipeline requires custom data transformation (for format conversion, filtering, augmentation, etc.). OpenWhisk is an excellent tool to implement such a transformation in a ‘serverless’ manner, where the custom logic is hosted on a fully managed and elastic cloud platform.

Let’s look at a specific example.

As mentioned above, the historical data storage extension of the Watson IoT Platform can publish batches of device events to Message Hub topics. It also augments the events with metadata (in JSON format), including timestamps as well as properties identifying the device that sent the event. This metadata is encoded in the message key, while the message body comprises the event payload received from the device, as-is. The payload may be in an arbitrary format, potentially compressed and/or binary (e.g., to save uplink bandwidth), and is often specific to a particular application domain or setup. However, to deliver a general-purpose solution compliant with Hadoop toolchain conventions, the Object Storage bridge in Message Hub requires messages to be in JSON format, following a particular schema (e.g., a single event per message, with a top-level JSON field holding the event timestamp). Hence, there is a need to convert the messages between the two formats/schemas.
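The exact layouts are defined by the respective services; purely for illustration, the mismatch could look roughly like this (all field names here are made up):

```
# Incoming record on the 'iotp' topic (illustrative):
#   metadata in the Kafka message key, raw device payload in the body
key:   {"deviceType": "vehicle", "deviceId": "bus-42",
        "eventType": "telemetry", "timestamp": "2017-01-24T10:15:00Z"}
value: <compressed binary batch of sensor readings, as sent by the device>

# Outgoing record on the 'transformed' topic (illustrative):
#   one self-contained JSON event per message, with a top-level timestamp
value: {"ts": "2017-01-24T10:15:00Z", "deviceId": "bus-42",
        "location": [32.07, 34.78], "velocity": 42.5}
```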

This conversion can be implemented with OpenWhisk as follows. In a nutshell, the idea is to use two Kafka topics: one for IoTP messages (as published by the bridge between Watson IoT Platform and Message Hub) and one for transformed messages (consumed by the bridge between Message Hub and Object Storage). The transformation itself combines an OpenWhisk trigger associated with a Kafka feed (on the IoTP topic), an OpenWhisk action that performs the actual transformation, and an action that publishes the results to the second (‘transformed’) Kafka topic.

Architecture of message transformation with OpenWhisk
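As a rough sketch, this wiring could be expressed with the OpenWhisk CLI along the following lines (the messageHubFeed feed and messageHubProduce action come with the Message Hub integration package available in Bluemix; all names, topics, and credentials below are placeholders):

```sh
# Bind the Message Hub package with the service credentials (placeholders)
wsk package bind /whisk.system/messaging myMessageHub \
  -p kafka_brokers_sasl "$BROKERS" -p user "$USER" -p password "$PASSWORD"

# Trigger that fires for each batch of messages arriving on the IoTP topic
wsk trigger create iotp-events \
  --feed myMessageHub/messageHubFeed -p topic iotp

# The transformation action (see the sketch below), chained with the
# publish step into a sequence
wsk action create transform transform.py
wsk action create transform-and-publish \
  --sequence transform,myMessageHub/messageHubProduce

# Rule connecting the trigger to the sequence
wsk rule create iotp-transform iotp-events transform-and-publish
```

Note that the stock produce action publishes a single message per invocation, so a pipeline that emits a batch of transformed events may call for a small custom publish action instead.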

In this implementation, the only code the developer needs to write is a simple conversion of incoming messages (consumed from Kafka) from a ‘custom’ format generated by the IoT device (see example below) into a more standardized one that can be understood by the bridge that organizes the data in object storage in an analytics-ready fashion.

For example, in order to ship sensor data to the cloud more efficiently, the IoT device (or gateway) may aggregate sensor readings in small batches and compress them before sending them to the IoT Platform over MQTT. Then, on the cloud side, there needs to be a bridge that performs the opposite transformation, from a custom (often binary) format into a standard JSON-based one (later on in the pipeline, this JSON could be converted to yet another format, such as Parquet). The following figure shows an example of an OpenWhisk action that does just that.

Example OpenWhisk action performing message transformation (format conversion)
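The original post shows this action as an image; a minimal Python sketch in the same spirit follows. The batch layout, compression, and field names are assumptions for illustration; the Kafka feed delivers the consumed records in a messages array, each entry carrying the record’s key and value:

```python
import base64
import gzip
import json

def main(params):
    """Convert raw IoTP device batches into analytics-ready JSON events."""
    events = []
    for msg in params.get('messages', []):
        # Device metadata travels in the Kafka message key (JSON)
        meta = json.loads(msg['key'])
        # The body is a compressed batch of readings; assumed here to be
        # gzip-compressed JSON, base64-encoded in transit
        raw = gzip.decompress(base64.b64decode(msg['value']))
        for reading in json.loads(raw.decode('utf-8')):
            events.append({
                'ts': reading['ts'],            # top-level timestamp field
                'deviceId': meta['deviceId'],   # identity taken from the key
                'location': reading['location'],
                'velocity': reading['velocity'],
            })
    # The next action in the sequence publishes the transformed events
    # (one Kafka message per event) on the 'transformed' topic
    return {'topic': 'transformed', 'events': events}
```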

That’s it! The only other task is to properly configure and wire the involved services, and then observe the entire pipeline in action, without the need to maintain a single runtime or VM, and with literally just a few lines of custom code hosted on OpenWhisk!

To better understand the motivation behind serverless message transformation in IoT pipelines, let’s take a closer look at the main data flow steps, spanning the IoT edge and the Cloud, as illustrated in the following figure:

Example of end-to-end data flow, spanning IoT edge and the Cloud, involving OpenWhisk-based transformation

The flow starts from the IoT device, which generates pairs of {location, velocity} readings (e.g., collected every second). They are aggregated into short batches (e.g., of one minute), compressed, accompanied by appropriate metadata (such as device details), and sent via MQTT to the Cloud (Watson IoT Platform in this case). The same logic could run in parallel on thousands of devices, so making the ingestion as efficient as possible is important. On the cloud side, events from all the devices (of a particular application, group, etc.) are forwarded (in small batches) to Kafka (Message Hub), triggering our OpenWhisk action (which also receives messages in batches). The transformation action is followed by an action that writes the result back to another Kafka topic, from where a dedicated bridge consumes the data and writes it in (larger) batches to Object Storage, following a pre-configured analytics-ready layout (e.g., partitioned by event type and by date).
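For concreteness, the device-side batching step could look roughly like the following Python sketch. The client id, host, and topic follow Watson IoT Platform’s MQTT conventions; the org, token, and sensor-reading functions are placeholders:

```python
import gzip
import json
import time

import paho.mqtt.client as mqtt  # assumes the paho-mqtt package

ORG, DEV_TYPE, DEV_ID = 'myorg', 'vehicle', 'bus-42'  # placeholders

def read_gps():
    return [32.07, 34.78]  # stub: replace with a real GPS reading

def read_speed():
    return 42.5            # stub: replace with a real speedometer reading

client = mqtt.Client(client_id='d:{}:{}:{}'.format(ORG, DEV_TYPE, DEV_ID))
client.username_pw_set('use-token-auth', 'DEVICE_TOKEN')  # placeholder token
client.connect('{}.messaging.internetofthings.ibmcloud.com'.format(ORG), 1883)
client.loop_start()

batch = []
while True:
    # One {location, velocity} reading per second
    batch.append({'ts': time.time(), 'location': read_gps(),
                  'velocity': read_speed()})
    if len(batch) >= 60:  # roughly one minute of readings
        payload = gzip.compress(json.dumps(batch).encode('utf-8'))
        # 'fmt/bin' marks the payload as opaque binary for IoTP
        client.publish('iot-2/evt/telemetry/fmt/bin', payload)
        batch = []
    time.sleep(1)
```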

Once the data is stored (and properly organized) in object storage, tools such as Apache Spark can be used for analytics, visualization, etc. Here is a simple example of a visualization that one could build with a Jupyter Notebook (also involving just a few tens of lines of code), processing data collected with the above pipeline into object storage:

Simple visualization of IoT data with Jupyter Notebook and Apache Spark
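As a minimal sketch of the kind of processing behind such a visualization (the bucket path, URL scheme, and field names are placeholders that follow the earlier examples; the actual connector and URL scheme depend on how the notebook is configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iot-analytics').getOrCreate()

# Load the analytics-ready JSON events archived by the bridge; the URL
# scheme depends on the object-storage connector configured in the notebook
events = spark.read.json('swift://iot-archive.keystone/transformed/*')

# For example: average velocity per device, feeding a notebook plot
events.groupBy('deviceId').avg('velocity').show()
```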

The following blog post elaborates on the technical details of building such a pipeline end-to-end in IBM Bluemix, with a specific example (called ‘TRANSIT’) that you can run yourself. Check it out!
