What You Need to Know About Real Time Architectures for IoT / Time Series Problem Sets

by Chris Herrera


In previous posts, I discussed how no single time series database or product rules all solutions. However, there are some common patterns that can be reused to allow a time series solution to scale to the problem set…with one key caveat: integration.

It would be a bit presumptuous to propose a single technology such as Phoenix/HBase or InfluxDB as the be-all and end-all for every time series use case, because as I discussed previously, they are not. The problem set is too broad.

Instead, what I would like to explore is leveraging architectural patterns and practices for time series. In this and the next few posts, I will focus on the lambda and kappa (event sourcing) architectures, specifically optimizing them for time series. These patterns use a combination of technologies that allows them to scale to the needs of the problem while also providing a level of flexibility and protection that one product on its own would have a tough time replicating.

The Lambda Architecture

Originally proposed by Nathan Marz and James Warren in Big Data: Principles and best practices of scalable real-time data systems, the Lambda Architecture focuses on three main components: the speed layer, the batch layer, and the serving layer.

A blog post does not do this architecture justice, so I ask that you go and check out Marz and Warren’s book or look at http://lambda-architecture.net/, a collection of good resources on the topic. However, I will attempt to give you a summary view and potential implementation technologies to consider for the various layers.

Where Do I Use a Lambda (or Kappa) Architecture?

The Lambda Architecture attempts to define a solution for a wide range of use cases that need…

1. Low latency reads and updates

2. Machine fault tolerance and human fault tolerance

Further, a multitude of industry use cases are well suited to a real-time, event-sourcing architecture; some examples are below:

Utilities: smart meters and smart grid. A single smart meter sending data at 15-minute intervals will generate roughly 400MB of data per year; for a utility with 1M customers, that is 400TB of data a year (see the back-of-the-envelope sketch after this list)

Oil & Gas: real-time drilling, connected wells, smart refineries

Healthcare: wearables and patient tracking (the internet of patients)

Manufacturing: smart factories

Communications: consumer behavior and new service deployment
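
To sanity-check the utilities numbers, here is a quick back-of-the-envelope calculation in Python. The implied per-reading payload size is derived from the 400MB/year figure above, not a measured value.

```python
# Back-of-the-envelope smart meter data volumes (figures from the article above;
# the implied per-reading payload size is derived, not measured).

READINGS_PER_YEAR = 4 * 24 * 365      # one reading every 15 minutes = 35,040/year
MB_PER_METER_YEAR = 400               # the article's figure for a single meter
METERS = 1_000_000                    # a utility with 1M customers

payload_kb = MB_PER_METER_YEAR * 1024 / READINGS_PER_YEAR
fleet_tb = MB_PER_METER_YEAR * METERS / 1_000_000    # MB -> TB in decimal units

print(f"~{payload_kb:.1f} KB per reading, including all storage overhead")  # ~11.7 KB
print(f"~{fleet_tb:,.0f} TB per year across the fleet")                     # ~400 TB
```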

A Quick Primer on the Architectural Layers for the Lambda Architecture

The Batch Layer is an immutable, append-only store (because it is fun to say the same thing two different ways). From this immutable data source, arbitrary views are computed as the result of a MapReduce process, and as new data is added, it is picked up and processed in the next iteration of that process.

Now, if this all sounds a bit fraught with latency, not to worry, that is where the Speed Layer comes in.

The Speed (Streaming) Layer computes views from the real-time data stream, with the major caveat that it is only computing the views for the data received between now and the last run of the MapReduce job.

The idea that the speed layer contains only the delta of information between real time and the last MapReduce job is what Marz and Warren call "complexity isolation," meaning that you push most of the complexity into the layer that produces only temporary results.

The Serving Layer is responsible for exposing the views that were created (both batch and real-time views) so that they can service incoming queries for data.
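
To make the three layers concrete before we get to specific technologies, here is a minimal, technology-agnostic sketch in Python. All names and data are illustrative: the batch view is recomputed from the immutable master dataset, the speed view covers only the events that arrived since the last batch run, and the serving layer merges the two at query time.

```python
from collections import defaultdict

# Immutable, append-only master dataset: (sensor_id, timestamp, value) events.
master = [("s1", 100, 2.0), ("s1", 160, 4.0), ("s2", 130, 1.0)]
# Events that arrived after the last batch run (the speed layer's delta).
recent = [("s1", 220, 6.0)]

def compute_view(events):
    """Aggregation logic: running sum per sensor."""
    view = defaultdict(float)
    for sensor_id, _, value in events:
        view[sensor_id] += value
    return view

batch_view = compute_view(master)   # rebuilt from scratch on every batch run
speed_view = compute_view(recent)   # temporary; superseded by the next batch run

def query(sensor_id):
    # Serving layer: merge the batch and real-time views at query time.
    return batch_view[sensor_id] + speed_view[sensor_id]

print(query("s1"))  # 12.0 = 6.0 (batch) + 6.0 (speed)
```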

Now that we have that quick introduction to the Lambda Architecture behind us, let’s look at some key technologies that would be involved in such a system.

New Data

For Time Series and IoT, it makes sense to ingest that data via NiFi (collection, curation, transmission) into Kafka (distributed messaging and streaming) and have both the batch and speed layers read from Kafka.
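
As a rough sketch of what the consuming side might look like, here is a minimal speed-layer consumer using the kafka-python client. The topic name, broker address, and message schema are assumptions, not prescriptions.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sensor-telemetry",                    # hypothetical topic populated by NiFi
    bootstrap_servers=["broker1:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    reading = message.value  # e.g. {"sensor_id": "s1", "ts": 1500000000, "value": 2.5}
    # Hand the reading off to the speed layer for windowed aggregation.
    print(reading["sensor_id"], reading["value"])
```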

Batch Layer

Hadoop and Hive generally serve as a standard choice for the batch layer. This is the least complex layer, as it is append-only and typically runs MapReduce jobs to generate pre-defined views on the data.

Spark should also be considered a strong possibility for this layer, though it is potentially costly for some time series data loads. I will talk more below about why Spark is worth considering.
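
A batch-layer job in this setup might look something like the following PySpark sketch: read the append-only master dataset and precompute a daily view per sensor. The paths and column names are assumptions.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("batch-views").getOrCreate()

# The immutable master dataset, landed from Kafka (hypothetical path and schema).
master = spark.read.parquet("hdfs:///data/telemetry/master")

daily_view = (
    master
    .withColumn("day", F.to_date("event_time"))
    .groupBy("sensor_id", "day")
    .agg(F.avg("value").alias("avg_value"),
         F.count("*").alias("reading_count"))
)

# Overwrite the pre-computed view; the speed layer covers anything newer.
daily_view.write.mode("overwrite").parquet("hdfs:///views/daily_avg")
```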

Speed (Streaming) Layer

This is where many a debate can be had, and often is. We will also explore this layer in much greater detail in future posts.

For the work being done on time series data, and specifically IoT and sensor telemetry type data, Spark is well suited to the task due to the various components that make up the Spark project.

Spark's flexibility comes from the fact that Spark Core, Spark Streaming, and Spark SQL all work well across both the batch and speed/streaming layers, not just in one or the other.
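
For example, a speed-layer job using the classic DStream-based Spark Streaming API (the Kafka connector shown here ships with Spark 1.x/2.x) could maintain a sliding one-minute sum per sensor. The topic, broker, and schema are assumptions.

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="speed-layer")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct stream of (key, value) message pairs from Kafka.
stream = KafkaUtils.createDirectStream(
    ssc, ["sensor-telemetry"], {"metadata.broker.list": "broker1:9092"})

windowed_sums = (
    stream
    .map(lambda kv: json.loads(kv[1]))                       # parse the JSON payload
    .map(lambda r: (r["sensor_id"], r["value"]))
    .reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 10)  # 60s window, 10s slide
)

windowed_sums.pprint()  # in practice, write to the serving layer instead
ssc.start()
ssc.awaitTermination()
```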

Flink, with its dual batch and streaming capabilities, also works well across layers and is starting to get strong consideration in a variety of use cases. Storm is another good choice, depending on the latency that can be tolerated; however, it has a downside that we will discuss.

Serving Layer

Druid, a high-performance, OLAP-capable data store, is an attractive choice for serving, as it can easily handle the output from both the speed and batch layers.

However, for time series and the use cases that surround it, the Phoenix/HBase combination makes a strong case for the serving layer, as it provides a great platform for OLTP and operational analytics (where low latency is required) that also integrates with existing BI solutions, tools, and applications.
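
For instance, an application could query a Phoenix-backed serving layer through the Phoenix Query Server using the phoenixdb driver. The URL, table, and column names below are assumptions.

```python
import phoenixdb  # Python driver for the Apache Phoenix Query Server

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

# Hypothetical table holding the pre-computed daily view.
cursor.execute(
    "SELECT sensor_id, avg_value FROM daily_avg WHERE day = CURRENT_DATE()")

for sensor_id, avg_value in cursor.fetchall():
    print(sensor_id, avg_value)
```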

Considerations

One of the potentially large downsides of the Lambda Architecture is having to develop and maintain two different sets of code for your batch and speed/streaming layers.

That is why Spark and Flink are both so attractive in their ability to cross the layer boundaries. For example, if I were to implement my logic in Hive for batch processing and Storm for real-time stream processing, I would not be able to reuse my aggregation logic, which is no fun from a maintainability and development standpoint. The sketch below shows the kind of reuse a single engine makes possible.
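
Here is a hedged illustration of that reuse using Spark Structured Streaming (available from Spark 2.x onward), where one DataFrame transformation serves both layers. The topic, paths, and schema are assumptions.

```python
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("shared-logic").getOrCreate()

def hourly_average(df: DataFrame) -> DataFrame:
    """The aggregation logic, written once and used by both layers."""
    return (df.groupBy("sensor_id", F.window("event_time", "1 hour"))
              .agg(F.avg("value").alias("avg_value")))

# Batch layer: apply the logic to the immutable master dataset.
batch_view = hourly_average(spark.read.parquet("hdfs:///data/telemetry/master"))
batch_view.write.mode("overwrite").parquet("hdfs:///views/hourly_avg")

# Speed layer: apply the *same* function to a streaming DataFrame from Kafka.
schema = "sensor_id STRING, event_time TIMESTAMP, value DOUBLE"
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "sensor-telemetry")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

speed_query = (hourly_average(stream)
               .writeStream.outputMode("update")
               .format("console")   # in practice, write to the serving layer
               .start())
speed_query.awaitTermination()
```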

In my next post, I will discuss how the Kappa Architecture aims to simplify this problem by shifting all of our efforts toward streaming in a more event-sourcing pattern.

Feel free to share this blog on other channels and be sure and keep up with all new content from Hashmap at https://medium.com/hashmapinc.

Chris Herrera is a Senior Enterprise Architect at Hashmap working across industries with a group of innovative technologists and domain experts accelerating the value of connected data for the open source community and customers. Feel free to tweet @cherrera2001 and connect with him on LinkedIn at linkedin.com/in/cherrera2001.
