Micro Services pipeline for an Industrial Internet of Things (IIOT) Framework… towards Data Engineering

Omer Kalim Ansari
Coinmonks
Jul 15, 2018 · 10 min read


This is my first ever post on Medium, sharing an excerpt from my Master's Thesis: how a Microservices Architecture helped me develop a framework for the Industrial Internet of Things within a time frame of six months.

It is also written for students working towards their thesis (or final-year project), for new developers eager to learn how multiple small applications can play a bigger part in an enterprise-level application, and for my experienced/skilled audience, who can give feedback on how I could have done better in my thesis implementation. This thesis is part of my Master's degree at RWTH Aachen University, Germany 🇩🇪.

“The idea was to leverage the technologies of the Internet of Things and overlap them with the fourth industrial revolution, Industry 4.0.”

SuperC Building, RWTH Aachen University

By now almost everyone has heard the term microservices: a small application/service dedicated to performing one particular task, independent of any other software or service. It is not something too complicated for a newbie, just a concept of loosely coupled (almost no dependency) software services combined into one complete application.

Microservices Architecture (sometimes also referred to as Service-Oriented Architecture) is a set of services that together make up a complex application. The idea is that if any part of the application goes down, it should not affect the other features and functionality around it. Sounds reliable and robust, right?

Although there is no formal definition of Microservices Architecture (MSA), there are certain characteristics that identify it. Martin Fowler describes microservices as software applications with scalability, organization around business capability, automated deployment, intelligence in the endpoints, and decentralized control of languages and data.

*The alternative to a Microservices Architecture is a monolithic application, built on a single programming stack with each component highly dependent on the others.

Monoliths and Microservices

Combining all of the above: software applications that are independently deployable, modular in nature, each running a unique process with a well-defined scope, using lightweight mechanisms to serve a business goal, and loosely coupled with high cohesion end up forming a Microservices Architecture. Such architectures are well suited to handling different programming stacks and making them work together along an implementation pipeline.

That’s what I did… I made a pipeline of several small applications as the building blocks of my Master's Thesis implementation. These applications were based on Apache Kafka (as the messaging system), Apache Spark (for real-time data streaming and processing), the MEAN stack (Node.js and Angular for data ingestion and visualization), Python for the Raspberry Pi (along with sensor programming), MySQL (sample data), Java Spring (for map matching), and MongoDB as the data lake. With the total time frame of six months in mind, a Microservices Architecture was the right choice for me: it let me get hands-on with these technologies for the implementation while leaving time to write the detailed report required for a Master's Thesis.

Technology (Implementation) Pipeline

“A Master's Thesis usually runs for six months. In those six months every student has to develop a POC/MVP (proof of concept / minimum viable product) along with the research work, and defend it in front of the university professors (supervisors). The work can then become part of a research project, or it may become the base of an industrial project.”

Now, let's limit the scope to the breakdown of the thesis implementation: the different technology stacks involved, and how I ended up leveraging them with a Microservices Architecture within the span of six months.

At the time of my thesis proposal, the university supervisors raised two main concerns. The first was the variety of technologies involved; the second was the feasibility of implementing it all within a time frame of six months. The second concern is already addressed in the previous paragraph, and regarding the first: yes, the variety was deliberate, because the goal was to develop “An Extensible Framework for Sensor Data in the Context of the Industrial Internet of Things”. I needed a variety of sensors working in my pipeline with Node and Python; to handle such a variety of incoming sensor data, I needed a messaging system fast and robust enough to handle multiple streams at any point in time, so I chose Kafka. Likewise, the incoming data was processed on the fly with Spark's streaming engine and then dumped into a document-based database, MongoDB.

Master Thesis Phase (Flow) Diagram

So when you have this variety of technology, you can't rely on a monolithic architecture (monolithic applications are built as a single autonomous unit, where a change in one component affects the application as a whole). Hence, Microservices Architecture (MSA) to the rescue! MSA is built on the idea that if any part of the application goes down, it should not affect the other features and functionality around it. I molded it a bit and made a pipeline of such micro applications (services) for my thesis implementation.

Context of Industrial Internet of Things

The idea was to take the technologies of the Internet of Things and overlap them with the fourth industrial revolution, Industry 4.0. We are all aware of the idea behind the Internet of Things (IoT): every electronic device is connected to the others in a meaningful way to make our lives easier, providing convenience in our day-to-day life. Similarly, Industry 4.0 is about automation: making factories largely robot-based, where less human involvement leads to fewer human errors and fewer things to worry about. A simple example is the manufacturing industry: you supply the raw material and get the finished product with limited or no human involvement. Hence, the factories of the future.

Future Cities and now Future Factories

The concept was to develop a framework flexible enough to adapt to the nature of the incoming data. In the context of the Industrial Internet of Things, these incoming streams of data mostly come from machine sensors and industrial robots. The data is then processed and utilized by my framework on the fly, resulting in meaningful data that can become the input for machinery further down the pipeline of that particular industry.

In the actual implementation, the data ingested into the data lake serves as input to the next component of the industry pipeline.

The framework handles incoming streams of data from several different sources, analyzes them on the fly, and ingests the meaningful information into the data lake. The incoming data comes from multiple giant industrial robots working on different communication protocols, generating data in different types and formats at a very high velocity. The communication protocols range from OPC-UA to traditional WebSockets, while the data formats vary from low-level binaries to JSON and XML. Here's another flow diagram expressing the architecture and how data flows from the industrial robots to the data lake.

Master Thesis Architecture Diagram

This explains the “Framework for Sensor Data” part, but what about “Extensible”? The resulting framework could perform the above-mentioned tasks in the context of the Industrial Internet of Things, but it was also robust enough to handle incoming data from other domains: plug-and-play with slight configuration changes, without altering or updating any component of the application.

To defend the extensibility of the framework, I also developed a POC based on a Car-to-X application that helps monitor traffic congestion in real time using map matching.

Phase # 1 — Wrappers for the incoming Data from Machines

As industrial systems are based on heterogeneous data sources, we need software components that are dynamic enough to adapt to whatever changes might occur. The functionality from the message-queuing system onward, including the streaming engine, remains more or less the same, but the data sources change along with the incoming data and the sensors' associated protocols. Initially, the robots I worked on targeted WebSockets, TCP clients, and OPC-UA clients. They can be configured for the input expected from the sensors via their associated XML resource files, which help them communicate efficiently with the ecosystem built for this thesis. The wrapper application I wrote for these inputs handles data via configuration files, with which one can control the flow of data originating from sensors, robots, and industrial machines. A minimal sketch of such a wrapper follows.
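The thesis wrappers themselves are not shown here; below is a minimal sketch, assuming a hypothetical wrapper.xml config file and made-up endpoint/topic names, of how a config-driven wrapper could read its source endpoint from XML and forward WebSocket messages to Kafka (using the websocket-client and kafka-python libraries):

```python
# wrapper.py - hypothetical sketch of a configurable source wrapper
# (not the thesis code; it only illustrates the config-driven idea)
import xml.etree.ElementTree as ET

import websocket                      # pip install websocket-client
from kafka import KafkaProducer       # pip install kafka-python


def load_config(path="wrapper.xml"):
    """Read the source endpoint and target topic from an XML resource file."""
    root = ET.parse(path).getroot()
    return {
        "source_url": root.findtext("source/url"),       # e.g. ws://robot-01:9000
        "kafka_servers": root.findtext("kafka/servers"),  # e.g. kafka:9092
        "topic": root.findtext("kafka/topic"),            # e.g. robot-01-raw
    }


def run():
    cfg = load_config()
    producer = KafkaProducer(bootstrap_servers=cfg["kafka_servers"])
    ws = websocket.create_connection(cfg["source_url"])
    try:
        while True:
            message = ws.recv()       # raw sensor payload (binary / JSON / XML)
            payload = message.encode("utf-8") if isinstance(message, str) else message
            producer.send(cfg["topic"], payload)
    finally:
        ws.close()
        producer.flush()


if __name__ == "__main__":
    run()
```

Swapping the data source then comes down to pointing a new wrapper at a different endpoint and topic in its XML file, without touching the rest of the pipeline.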

Phase # 2 — Message Queuing System with Apache Kafka

An API based on Apache Kafka, a distributed messaging framework, was developed and deployed on one of the university's cloud servers (I got a dedicated cloud server for the messaging system, thanks to my professor 😃). Kafka's job was to handle the data generated by multiple sensors. I chose Apache Kafka because of its ability to replay data and its implementation of topic subscriptions. Kafka's topics themselves acted as the subscription framework, with separate producers and consumers handling the incoming data per source. Using Apache Kafka as a fault-tolerant, low-memory-consumption system greatly reduced the load on the streaming engine downstream. Here's a flow diagram representing the role Kafka played in my thesis, with a code sketch after it. Another option would have been Apache Flume, but yeah, Kafka wins!

Kafka Messaging System Implementation
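As a minimal sketch of the one-topic-per-source idea using kafka-python (broker address and topic names below are made up for illustration, not the thesis configuration):

```python
# Minimal kafka-python sketch of per-source topics (illustrative only).
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "kafka:9092"   # assumed broker address

# Producer side: each wrapper publishes to its own topic, e.g. "robot-01-raw".
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("robot-01-raw", {"sensor": "temperature", "value": 73.2})
producer.flush()

# Consumer side: the streaming engine (or any other service) subscribes per source.
consumer = KafkaConsumer(
    "robot-01-raw",
    bootstrap_servers=BROKERS,
    group_id="spark-ingest",           # consumer group for the processing stage
    auto_offset_reset="earliest",      # replay from the beginning if needed
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.value)
```

Because producers and consumers only share a topic name, each side stays an independent service, which is exactly what made Kafka fit the microservices pipeline.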

Phase # 3 — Real Time Streaming Engine

Apache Spark is used as the streaming engine to process the incoming stream of data in real time. This real-time processing exploits the data and eases further analysis. Apache Spark comes with a lot of support, such as windowing techniques, sliding steps, and data-stream mining. Apache Kafka can also act as a streaming engine, but we chose Apache Spark because it can be extended when machine learning is needed: Spark MLlib can play a major role in real-time prediction. Apache Spark provides an ecosystem that neither Apache Kafka nor any other streaming engine matches, including native support for a query language plus graph and machine learning libraries.

Just like the variety of data-collecting sources, the Apache Kafka and streaming-engine implementations were also standalone projects without any dependency on the other components in my thesis pipeline. The Spark implementation was Scala-based and can be configured and executed on any system with a Java Virtual Machine.
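The thesis code was Scala, but the same windowed Kafka-to-Spark flow looks roughly like this in a PySpark Structured Streaming sketch (the topic, broker, and message schema are assumptions, and the spark-sql-kafka package must be on the classpath):

```python
# Illustrative PySpark sketch of a windowed Kafka-to-Spark flow
# (the actual thesis implementation was Scala-based).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iiot-stream-sketch").getOrCreate()

# Hypothetical schema of the JSON messages produced by the wrappers.
schema = StructType([
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
       .option("subscribe", "robot-01-raw")                # assumed topic
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
          .select("m.*"))

# Sliding-window average per sensor: 30-second windows, sliding every 10 seconds.
windowed = (parsed
            .withWatermark("ts", "1 minute")
            .groupBy(F.window("ts", "30 seconds", "10 seconds"), "sensor")
            .agg(F.avg("value").alias("avg_value")))

# For the sketch, print to the console; the real pipeline wrote onward to the data lake.
query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```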

Phase # 4 — Ingestion to the Data Lake

The process that starts with the machines generating sensor data ends in the data lake. For my thesis, the document-based database MongoDB acted as the data lake for later consumption of the collected, meaningful data. There were several reasons why I chose MongoDB, but they are beyond the scope of this excerpt. Apart from MongoDB, the other candidates were Apache Cassandra and Apache HBase. A small ingestion sketch follows.
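In the thesis the writes happened at the end of the streaming pipeline; as a standalone illustration, assuming a hypothetical iiot database and readings collection, ingesting a processed record into MongoDB with pymongo looks like this:

```python
# Illustrative pymongo ingestion snippet (database and collection names are made up).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
readings = client["iiot"]["readings"]               # hypothetical data-lake collection

doc = {
    "source": "robot-01",
    "sensor": "temperature",
    "avg_value": 73.2,
    "window_end": datetime.now(timezone.utc),
}
result = readings.insert_one(doc)                    # one document per processed record
print("ingested with _id:", result.inserted_id)
```

Being schemaless, MongoDB happily accepts documents of different shapes, which matches the "adapt to the nature of incoming data" goal of the framework.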

In addition to this ingestion of data into its final location, I wrote a small MEAN stack application (MongoDB, Express, Angular, Node) to visually represent the data flowing into the data lake.

Visualized MongoDB was a simple application, an alternative to the MongoDB console, which shows the documents in the collections of a particular MongoDB database in textual form. With this application we can avoid typing console commands to check the data in our database. It helped me monitor the insertion of every set of data, which otherwise I would have had to do manually, and very frequently, to keep an eye on the ingestion.

This MEAN stack application wasn't part of my initial thesis proposal but an extra effort to show my dedication… and my professor acknowledged this small effort.

Connected Car, a Map Matching Implementation

Last but not least, to justify the extensibility of my thesis I came up with a use case based on connected cars. Without any change to my actual implementation, I just switched the data source: I swapped the incoming machine data for data generated by the sensors in a car and started getting the location of each car in real time. Implementing a map matching algorithm inside my Spark streaming engine let me prove the extensibility of the framework by detecting traffic congestion on the road. Here's a visual representation in the form of a GIF, followed by a small code sketch of the idea.

Connected Car — Map Matching Algorithm

“Map matching is the problem of how to match recorded geographic coordinates to a logical model of the real world, typically using some form of Geographic Information System.” via Wikipedia
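The thesis used a Java Spring service for the actual map matching; purely to illustrate the idea behind the definition above, here is a naive point-to-segment matcher in Python that snaps a GPS fix to the nearest of a few hypothetical road segments (names and coordinates are made up):

```python
# Naive map-matching sketch (illustrative only; not the thesis implementation).
# Snaps a GPS fix to the nearest road segment using a flat-earth approximation,
# which is acceptable over short distances.
import math

# Each segment is ((lat1, lon1), (lat2, lon2)); names and coordinates are made up.
ROADS = {
    "Templergraben": ((50.7776, 6.0786), (50.7785, 6.0830)),
    "Turmstrasse":   ((50.7800, 6.0700), (50.7768, 6.0741)),
}

def _to_xy(lat, lon, ref_lat):
    """Project lat/lon to approximate local metres around a reference latitude."""
    return (lon * 111_320 * math.cos(math.radians(ref_lat)), lat * 110_540)

def match(lat, lon):
    """Return (road name, distance in metres) of the closest segment."""
    best = (None, float("inf"))
    px, py = _to_xy(lat, lon, lat)
    for name, ((alat, alon), (blat, blon)) in ROADS.items():
        ax, ay = _to_xy(alat, alon, lat)
        bx, by = _to_xy(blat, blon, lat)
        dx, dy = bx - ax, by - ay
        t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))       # clamp the projection onto the segment
        dist = math.hypot(px - (ax + t * dx), py - (ay + t * dy))
        if dist < best[1]:
            best = (name, dist)
    return best

print(match(50.7780, 6.0800))   # e.g. ('Templergraben', <metres>)
```

Once each car position is matched to a road segment, counting matches per segment per time window is enough to flag congestion.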

That would be it; I hope this helps anyone who reads it.
Clap to appreciate: one, two, or lots of claps. Thank you 😃

You can find/follow me @ok_ansari. Cheers, Omer Kalim Ansari
