Best Practice for an IoT analytics platform
Originally published at jugsi.blogspot.com.
As shown in our book Hands-On Industrial Internet of Things, to build an analytics platform we have to consider seven principles:
- Data Availability
- Consuming Data
- Execution Partitioning
- Time Ordering Principle
- Stateless vs Stateful
- Additional information (such as Asset information)
- Ubiquity of Analytics
Data Availability
In IoT, when we speak about “data” we usually refer to time series or similar. Time series feed the platform through different approaches:
- streaming: continuous ingestion of data
- micro-batch processing: a small portion of data every few minutes
- macro-batch processing: a large portion of data every hour
- on demand: when a particular event occurs
- on data change: only when the value of the data changes
Fig. 1: data availability
Implementation
Data streaming is normally implemented using queues: RabbitMQ, Kafka, or MQTT.
Micro-batch is normally implemented using a scheduler (e.g. a cron expression).
Macro-batch is very similar to micro-batch, but leverages Big Data technologies.
On demand and on data change require a mechanism that triggers the execution of the analytics when data is available.
These modalities imply different approaches to consuming data.
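The streaming and micro-batch modalities above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: an in-memory `queue.Queue` stands in for a real broker (RabbitMQ, Kafka, MQTT), and all names and window lengths are illustrative.

```python
import queue
import time
from datetime import datetime, timezone

# In-memory queue standing in for a broker such as RabbitMQ, Kafka or MQTT.
broker = queue.Queue()

def ingest(sensor_id, value):
    """A device publishes one time-series data point to the broker."""
    broker.put({"sensor": sensor_id, "value": value,
                "ts": datetime.now(timezone.utc).isoformat()})

def consume_streaming(handler, max_points):
    """Streaming: process each data point as soon as it arrives."""
    for _ in range(max_points):
        handler(broker.get(timeout=1))

def consume_micro_batch(handler, batch_window_s=0.1):
    """Micro-batch: drain whatever accumulated during the window."""
    time.sleep(batch_window_s)
    batch = []
    while not broker.empty():
        batch.append(broker.get_nowait())
    handler(batch)

# Example: two points consumed in streaming mode, two more as one micro-batch.
seen = []
ingest("t1", 21.5)
ingest("t1", 21.7)
consume_streaming(seen.append, max_points=2)
ingest("t1", 21.9)
ingest("t1", 22.1)
consume_micro_batch(seen.append)
print(len(seen))  # 3 entries: two single points plus one batch of two
```

Macro-batch follows the same drain-the-buffer shape, only with a longer window and a data lake instead of a queue.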
Consuming Data
In the book Hands-On Industrial Internet of Things we differentiated between:
- Hot Path: data is immediately processed
- Cold Path: data is stored and organised in a low-latency database (e.g. a time-series database)
- Big Data Path: data is stored in a data lake without any preprocessing (e.g. S3, raw data, HDFS, Parquet, Hive, …)
Fig. 2: consuming data
Implementation
Hot Path is normally used for data-streaming analytics, such as simple threshold rules and/or anomaly detection. These analytics do not require large amounts of historical data, only the last few data points. Azure implements a very smart mechanism called “windowed data processing” (see also Hands-On Industrial Internet of Things).
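The idea behind windowed processing can be sketched generically; this is not Azure's implementation, just a plain sliding-window threshold rule with an illustrative window size and limit.

```python
from collections import deque

class SlidingWindowRule:
    """Hot-path rule: keep only the last N points and alert on their mean."""
    def __init__(self, size, threshold):
        self.window = deque(maxlen=size)   # old points fall out automatically
        self.threshold = threshold

    def on_point(self, value):
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        return mean > self.threshold       # True -> raise an alert

rule = SlidingWindowRule(size=3, threshold=100.0)
alerts = [rule.on_point(v) for v in [90, 95, 98, 120, 130]]
print(alerts)  # [False, False, False, True, True]
```

Because the window is bounded, the rule never needs historical storage, which is exactly why it fits the hot path.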
Cold Path is normally used for analytics that require a small batch of data (10 minutes to 5 hours) from a specific piece of equipment (Asset Performance Management) but have to process it with low latency.
Big Data Path is normally applied to analytics working “at fleet level” (e.g. comparing the performance of different pieces of equipment) that do not require pseudo-real-time results. In other words, we need to wait until the data from all monitored equipment is available before triggering the execution of the analytic.
These paths imply different ways to execute the analytics in parallel.
Execution Partitioning
Let’s introduce the concept of “asset”. An asset is
“something valuable”
… ok 😏 I know that is not very useful… Let’s introduce the “IoT asset”:
“An asset is the equipment or system we need to monitor to evaluate its health, efficiency and performance.”
Assets are normally organised hierarchically:
- Company/Municipality: ACME Ltd, Florence
- System: airplane, car, plant, industry, train, truck, house
- Sub-System (optional): line, train, …
- Equipment: jet engine, wind turbine engine, pump, lube oil
- Signal (or Measure or Tag): furnace temperature, car speed
A more concrete example:
- ACME Refinery :: company
  - Production Train #1 :: system
    - Valve #1 :: valve extends equipment
    - Valve #2 :: valve extends equipment
    - Furnace #1 :: furnace
      - Inlet-Temperature :: measure
      - …
    - Power Generation :: subsystem
      - Turbine LT200 :: turbine extends equipment
        - Rotation :: measure
        - …
      - Power Generator #2 :: power extends equipment
  - Production Train #2 :: system
    - …
  - Storage :: system
Normally, analytics work at the equipment or system level, so we can analyse the measures of Valve #1 and Valve #2 in parallel because they work independently. In other words, we can attach the same analytic to Valve #1 and Valve #2, and it can run on both in parallel.
Fig. 3: parallel map
Implementation
This simple assumption simplifies our architecture, because the same analytic can run in parallel over thousands of assets. We can identify three scenarios:
- Independence of assets: we can deploy multiple instances of the same analytic over multiple assets without side effects.
- Independence of measures (tags): similar to assets, but applied to tags. We can deploy multiple instances of the same rule over multiple tags (of the same asset) without side effects.
- Analytics for the full fleet: we cannot leverage the independence of assets/tags. In this case we need another parallelisation approach, such as Big Data MapReduce.
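The independence-of-assets scenario can be exploited with any worker pool. A minimal sketch with Python's `concurrent.futures` follows; the asset names and the analytic itself (a trivial max over the points) are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def analytic(asset_id, points):
    """The same analytic attached to every asset: here, the max of its measures."""
    return asset_id, max(points)

# Each asset's data is independent, so runs can be dispatched in parallel.
fleet = {
    "valve-1": [1.2, 3.4, 2.2],
    "valve-2": [0.5, 0.9, 0.7],
    "furnace-1": [400, 420, 415],
}

with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(lambda kv: analytic(*kv), fleet.items()))

print(results)
```

In a real platform the pool would be a cluster of containers rather than threads, but the partitioning logic is the same.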
Time Ordering Principle
The second assumption is time ordering: we cannot process a time series in the wrong order. An analytic processing the data of 2 January before the data of 1 January may raise a wrong alert or produce a wrong result.
Fig. 4: ordering
Implementation
The orchestrator of an IoT analytics platform should respect this principle. Big Data platforms such as Hadoop MapReduce, for instance, do not take this constraint into consideration, so we need to apply countermeasures. Queues, on the contrary, can preserve time ordering.
Note (my personal opinion): a standard Big Data platform may not be the right choice for IoT. We can leverage these platforms only for fleet-level analytics with relaxed latency requirements, or for exploratory data analysis.
There are a few cases in which this principle can be ignored, for instance simple rules or stateless analytics.
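One common countermeasure is a small reordering buffer in front of the analytic, sketched here with a heap; the `max_lag` parameter (how far out of order points may arrive) is an assumption, not something the source prescribes.

```python
import heapq

def reorder(points, max_lag=2):
    """Buffer up to `max_lag` points and emit them in timestamp order,
    absorbing small out-of-order arrivals from the transport layer."""
    heap, ordered = [], []
    for ts, value in points:
        heapq.heappush(heap, (ts, value))
        if len(heap) > max_lag:
            ordered.append(heapq.heappop(heap))
    while heap:                      # flush whatever remains at end of stream
        ordered.append(heapq.heappop(heap))
    return ordered

# Points arrive slightly out of order (e.g. 2 January before 1 January).
arrived = [(2, "b"), (1, "a"), (3, "c"), (5, "e"), (4, "d")]
print(reorder(arrived))  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

Points delayed by more than `max_lag` positions would still come out late, so the buffer size must match the transport's worst-case disorder.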
Stateless vs Stateful
Consider an analytic counting the number of shutdowns day by day. This analytic needs to know the status of the previous day to continue its work. In other words, an analytic may need to save the status of its previous run.
Fig. 5: stateful
Implementation
To implement a stateful mechanism, we can save the output of the previous run and pass it as an additional input to the next run. In other words, we pass the status as additional information.
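A minimal sketch of the shutdown-counting example, where the previous run's output is passed back in as input on the next run; the field names (`last`, `total`) are illustrative.

```python
def count_shutdowns(points, state=None):
    """Count running->stopped transitions; `state` is the output of the
    previous run, passed back in as an additional input."""
    state = state or {"last": "running", "total": 0}
    for status in points:
        if state["last"] == "running" and status == "stopped":
            state["total"] += 1
        state["last"] = status
    return state

# Day 1 and day 2 are two separate runs sharing state through the output.
day1 = count_shutdowns(["running", "stopped", "running"])
day2 = count_shutdowns(["stopped", "running", "stopped"], state=day1)
print(day2["total"])  # 3
```

Note the transition spanning the two runs (day 1 ends running, day 2 starts stopped) is counted correctly only because the state crossed the run boundary.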
Additional Information
Analytics may require additional information related to the asset, e.g. the installation date, the type of asset, the configuration of the asset, etc. This information is normally known as “asset metadata”.
Implementation
To provide this additional information, we can pass the asset’s metadata, acquired from a standard database, as an additional input. An example of an asset database is AnchorDB.
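A sketch of this pattern follows; the metadata store here is a plain dictionary standing in for a real asset database, and all asset names, fields, and thresholds are hypothetical.

```python
# Stand-in for an asset-metadata store (e.g. a standard relational database);
# keys and fields here are hypothetical.
ASSET_METADATA = {
    "turbine-LT200": {"type": "turbine", "max_rpm": 3600,
                      "installed": "2015-06-01"},
}

def check_overspeed(asset_id, rpm_points, metadata_db):
    """The analytic receives the asset's metadata as an additional input
    instead of hard-coding per-asset thresholds."""
    limit = metadata_db[asset_id]["max_rpm"]
    return [rpm for rpm in rpm_points if rpm > limit]

over = check_overspeed("turbine-LT200", [3500, 3700, 3600, 3800], ASSET_METADATA)
print(over)  # [3700, 3800]
```

Keeping the threshold in the metadata means the same analytic serves every turbine in the fleet, each with its own limit.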
Ubiquity of Analytics
Let’s talk about cloud, edge, and on-premise. Analytics can run in the cloud (leveraging maximum computational power), on premise (leveraging data-centre computational power), or on the edge (very close to the data, with low latency). In some circumstances we need to deploy analytics to the edge or on-premise from the cloud, or to call analytics running in the cloud from the edge.
Implementation
The fastest way to achieve this goal is to leverage container-based technologies (Docker) and microservices.
Final platform
Given our seven principles, we can implement an IoT platform taking into consideration the following components:
Fig. 6: the proposed platform
For instance, AWS and Azure allow orchestration using Lambda or Azure Functions. In Hands-On Industrial Internet of Things we proposed an example with Airflow. GCP, Azure, and AWS leverage streaming analytics to implement simple or complex rules. Azure supports a very interesting feature for windowing (see Chapter 12 of Hands-On Industrial Internet of Things). GCP, AWS, and Azure propose a microservices architecture for analytics based on Docker.