Processing real-time IoT data

Vinicius Souza
5 min read · Oct 5, 2016

--

With Go, Apache Spark, AWS Kinesis and a lot of bottles of wine.

Hi! My name is Vinicius Souza and I’m a Brazilian software engineer. After a few weeks of vacation, I received an invite from an old friend to work in Chile at his startup.

So, after a few Skype calls, I got on a flight to Santiago, Chile.

Plaza de Baquedano — Santiago, Chile.

The challenge?

Process real-time data from grandparents.

Let me explain: a Brazilian startup called LinCare, running its project in Chile, developed a wristband to collect body data and monitor grandparents: their sleep, their daily activity and their accidents. It keeps their children or close relatives in the loop through push notifications, SMS or phone calls.

The mobile app collects data from the IoT device every 5 seconds over a Bluetooth Low Energy connection and sends it to an API. In fact, this is where the process starts for me.

Let’s put the cards on the table!

created with cloudcraft.co

There is no secret here: the mobile app sends data to an API, the API collects the activities, sanitizes the data and puts it on a stream. After that, the stream consumer (which I call the “processor”) handles the data and applies rules, business logic, A.I. or whatever else is needed. In the end, the processed data is consolidated in a data warehouse.

The stack is made of an entry API, three Spark jobs, S3 for raw data storage and DynamoDB for storing sensitive user data.

The API: Golang

I’m a Go developer, so there was not much doubt here. Some months ago I read this article: http://marcio.io/2015/07/handling-1-million-requests-per-minute-with-golang/ and, after a lot of research on the subject, I implemented a simple and powerful API to receive the data and stream it.

So, after the mobile app sends data to the entry API, the data ends up on AWS Kinesis. At this point I have a powerful entry point in Go and an auto-scaling stream: no problem dealing with a million requests a day.
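The entry API itself is written in Go; purely as an illustration, here is what the equivalent Kinesis call looks like in Python with boto3. The stream name and field names are assumptions, not the real ones:

```python
import json

import boto3

# Hypothetical stream and region; the real entry API is written in Go.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_activity(activity):
    """Push one sanitized activity record onto the Kinesis stream."""
    kinesis.put_record(
        StreamName="lincare-activities",     # assumed stream name
        Data=json.dumps(activity),
        PartitionKey=activity["device_id"],  # spreads devices across shards
    )
```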

The Stream: AWS Kinesis

https://aws.amazon.com/kinesis/streams/?nc1=h_ls

“Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.”

So, the mobile app sends data to the API, which streams it. Kinesis has a retention period (24 hours by default, extendable up to 7 days), it’s fault tolerant and it scales easily.

The processor: Apache Spark on AWS EMR

"Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing."

With Apache Spark we can consume the AWS Kinesis stream and process the data. It is an amazing and powerful tool.

Spark supports Scala, Java, Python and R. Given my background, I went with the Python API.

The configuration is simple, nothing new ahead.
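A minimal sketch of that setup, assuming PySpark with a 10-second batch interval (the application name and interval are made up):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Plain Spark setup: one SparkContext plus a StreamingContext that
# produces a micro-batch every 10 seconds (assumed interval).
conf = SparkConf().setAppName("lincare-processor")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)
```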

PS: AWS Kinesis has the concept of a shard iterator, which determines the position in the stream from which you start consuming data:

LATEST: start right after the most recent record, so you only read data added after the consumer starts.

TRIM_HORIZON: start from the oldest record still available in the stream, bounded by the retention period.

AT_TIMESTAMP: start from records at or after a specific point in time.

Remember, the Spark Streaming API runs in a loop, so every time you fetch data from the stream you have to save a checkpoint. Fortunately the Kinesis connector (built on the Kinesis Client Library) does this for us, but unfortunately it does so in AWS DynamoDB (more costs).

You need to create a SparkContext and a stream. Through `py_rdd` we receive the data from Kinesis, in this case JSON data.
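Here is a rough sketch of that using the Kinesis connector (it requires the spark-streaming-kinesis-asl package on the classpath). I reuse the stream name from before, and the application name doubles as the name of the DynamoDB checkpoint table; both are assumptions:

```python
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# DStream of raw JSON strings coming out of Kinesis. The application name is
# also used by the Kinesis Client Library as its DynamoDB checkpoint table.
py_rdd = KinesisUtils.createStream(
    ssc,
    kinesisAppName="lincare-processor",
    streamName="lincare-activities",                         # assumed stream name
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,  # only new records
    checkpointInterval=10,                                    # seconds between checkpoints
)
```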

With the `py_rdd` data we can build DataFrames and run SQL statements with the Spark SQL API. Keep in mind that you need to define an output for the data.
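For instance, a sketch of one micro-batch turned into a DataFrame and queried with Spark SQL; the table name, columns and the aggregation are hypothetical:

```python
from pyspark.sql import SQLContext

def process(rdd):
    """Turn one micro-batch of JSON strings into a DataFrame and query it."""
    if rdd.isEmpty():
        return
    sql_context = SQLContext.getOrCreate(rdd.context)
    df = sql_context.read.json(rdd)  # schema inferred from the JSON payload
    df.registerTempTable("activities")
    # Hypothetical rule: average heart rate per device in this batch.
    summary = sql_context.sql(
        "SELECT device_id, AVG(heart_rate) AS avg_hr "
        "FROM activities GROUP BY device_id"
    )
    summary.show()  # the real output goes to Redshift, described below

py_rdd.foreachRDD(process)
```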

Amazon EMR supports Apache Spark, or, if you prefer, you can run it standalone on any server.

A simple tutorial to run Spark on EMR: http://spark.apache.org/news/run-spark-and-shark-on-amazon-emr.html

The raw data: AWS S3

It’s a good idea to save the raw data, for safety reasons.

And it doesn’t get much simpler than this.
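A minimal sketch, assuming we dump each micro-batch of raw JSON straight to S3 (the bucket name is made up):

```python
from datetime import datetime

def save_raw(rdd):
    """Archive the untouched JSON of each micro-batch to S3, keyed by time."""
    if rdd.isEmpty():
        return
    path = "s3n://lincare-raw-data/" + datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    rdd.saveAsTextFile(path)

py_rdd.foreachRDD(save_raw)
```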

The relational database: Amazon Redshift

"Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools."

The processed data stored in Redshift is then available to third-party apps.

Spark doesn’t have native support for Redshift, but I found this amazing project on GitHub: https://github.com/databricks/spark-redshift

Here is a simple implementation integrating Spark and Redshift.
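A minimal sketch with the spark-redshift data source; the JDBC URL, target table and temporary S3 directory are placeholders, and `summary` is the processed DataFrame from the consumer above:

```python
# Write a processed DataFrame to Redshift through databricks/spark-redshift.
# The connector unloads the rows to the S3 tempdir and then COPYs them into the table.
summary.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/lincare?user=user&password=pass") \
    .option("dbtable", "device_activity_summary") \
    .option("tempdir", "s3n://lincare-spark-temp/") \
    .mode("append") \
    .save()
```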

Once everything is in place, you only need a couple of basic commands for Spark to start your Kinesis consumer and then wait for the next batch interval.
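Assuming the StreamingContext from earlier, that boils down to:

```python
# Kick off the consumer loop; Spark polls Kinesis once per batch interval.
ssc.start()
ssc.awaitTermination()
```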

It’s a simple process, no big deal, right?

But it’s a very powerful one: after this, you can run a whole universe of algorithms on your pipeline. Some references:
