#Tech in 5 — Real-Time Architecture

featuring InfluxDB, Snowflake, and your favorite Cloud Storage

Kieran Healey
Hashmap, an NTT DATA Company
7 min read · Feb 20, 2020


Building a Data F-1 Car… Vroom!

Zoom! A Formula 1 car races around the track; data streams from on-board IoT devices to tablets at the same rate. Velocity, G-Force, Turn Angle, Transmission, and Oil Temperature are all updated to the millisecond. These metrics power human ingenuity and drive the optimization that lets F-1 teams know how to eke that last bit of performance out of their car. Real-Time Analysis allows us to make more insightful decisions around optimizations at the edge. This is what drives competitive advantage, not only in F-1 but across other industries as well. Knowing the outcome before your opponent, or taking a turn earlier to shave off a tenth of a second: these are examples of how and why businesses across a variety of sectors are investing heavily in Data & Analytics.

Unfortunately, many companies do not have the resources or developers to take on such a project. The regrettable truth is that most don’t know where to begin.

In this #TechIn5, we will be discussing InfluxDB: a real-time NoSQL database that supports streaming and real-time calculations. It allows for fast reads and writes thanks to its unique storage and indexing options. By indexing on time and tags, InfluxDB outpaces most other real-time offerings; it can scan millions of records in less than a second.

High-Level Technical Overview

InfluxDB is first and foremost a NoSQL database. Data is stored in a semi-structured format and can be loaded through HTTP requests, Kafka, or the native Python client. However, where it differs from other NoSQL stores is in how it actually stores the data; it demands a different way of thinking than other NoSQL databases do.

InfluxDB utilizes three key concepts that make its reads and writes so fast: tags, measurements, and series (with fields holding the actual values). These are the most important concepts when working with InfluxDB and time-series data, as they are indexed to enable fast retrieval.
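To make these concrete, here is a hedged illustration (my own example, not a schema from this article) of a single point in InfluxDB's line protocol:

# one point in line protocol: measurement,tags fields timestamp
# the names below (telemetry, car, oil_temp) are illustrative only
point = 'telemetry,car=44,track=monza oil_temp=92.3,velocity=301.2 1582200000000000000'
# "telemetry" is the measurement; car and track are indexed tags;
# oil_temp and velocity are fields; the measurement plus its unique
# tag set is what defines a series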

Before I go too deep into the details, however, I would like to touch on what the architecture would look like here. In an ideal world, you would create a data platform that can handle the present while also future-proofing for what comes next. You would create reusable components that can be interchanged later, allowing upgrades to your data platform without substantial rework. I am not arguing that rework is never necessary; occasional rework is essential to make sure each piece of your data platform is running at peak performance, much like an F-1 car.

A Modern Real-Time Data Platform Architecture

I am using four primary sources here; your situation might have more, but it's important to note that most of the data an engineer deals with daily will fall into one of these buckets. Streaming sources are streamed into InfluxDB using its Python package; essentially, this is a wrapper around the requests package, as it uses API requests to communicate from the source to the Influx sink (a minimal sketch of that call follows below). To go over the entire architecture in detail would take longer than the spirit of #TechIn5 allows; if there is enough interest, I will address the other stages in more detail in another article.
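This is roughly what the wrapped call looks like against a local InfluxDB 1.x instance; the /write endpoint is real, but the database and point below are assumptions for illustration:

import requests

# one line-protocol point posted to InfluxDB's HTTP /write endpoint;
# this is the kind of call the Python client wraps for you
line = 'tweets,user=carol followers=127i'
resp = requests.post('http://localhost:8086/write',
                     params={'db': 'twitter_data'},
                     data=line)
resp.raise_for_status()  # InfluxDB returns 204 No Content on success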

To summarize, this architecture utilizes Snowflake as the central data warehouse: it receives roll-ups of the real-time data from InfluxDB at regular intervals. This condenses the data for optimal storage and doesn't overwork or tax Snowflake, leaving the compute warehouse free for BI and Data Science queries.

After the retention policy set by the developer expires, real-time data will roll off InfluxDB completely; Snowflake acts as the historian for this data. Traditional BI and Data Science can utilize Snowflake as their source.
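Both sides of that hand-off can be expressed in InfluxQL. Here is a hedged sketch using the Python client introduced later in this article; the database, measurement, and interval names are illustrative assumptions:

# raw points expire after 14 days; Snowflake keeps the history beyond that
client.query('CREATE RETENTION POLICY "two_weeks" ON "twitter_data" '
             'DURATION 14d REPLICATION 1 DEFAULT')

# a continuous query condenses raw data into hourly roll-ups that a
# scheduled job can then bulk-load into Snowflake
client.query('CREATE CONTINUOUS QUERY "cq_hourly" ON "twitter_data" BEGIN '
             'SELECT count("text") AS tweet_count INTO "hourly_rollups" '
             'FROM "tweets" GROUP BY time(1h) END')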

Once data in Snowflake is deemed 'cold', a monitoring side-car service would offload it from Snowflake into blob storage exposed as external tables. Then, once it is not used at all, it would be rolled off to blob storage alone. That way, if the data is needed again, it can be reloaded back into external tables, and even into an internal table if the need requires.
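A hedged sketch of that offload using the Snowflake Python connector; the stage, table names, and file format are assumptions for illustration, not settings from this architecture:

import snowflake.connector

# connect with your own account details
conn = snowflake.connector.connect(user='<user>', password='<password>',
                                   account='<account>')
cur = conn.cursor()

# unload the cold internal table to blob storage behind a stage
cur.execute("COPY INTO @cold_stage/tweets/ FROM tweets_history "
            "FILE_FORMAT = (TYPE = PARQUET) HEADER = TRUE")

# expose the unloaded files back to Snowflake as an external table
cur.execute("CREATE OR REPLACE EXTERNAL TABLE tweets_history_ext "
            "LOCATION = @cold_stage/tweets/ "
            "FILE_FORMAT = (TYPE = PARQUET)")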

Coding Time: How do you Influx with the best?

Now back to the deeper dive into Influx. If you have read my previous articles, y'all know how much I love to create reusable connection classes; so, here is how to spin up your own InfluxDB and connect to it with Python:

docker run -p 8086:8086 \
-v $PWD:/var/lib/influxdb \
influxdb

This is a simple docker.sh file for you to play with that will spin up a local InfluxDB on port 8086. After you run it, you will need to initialize a connection to the database. Here is the code for that:

from influxdb import InfluxDBClient

class ConnectInflux:
    def __init__(self):
        self.client = None

    def connect(self, host='localhost', port=8086):
        self.client = InfluxDBClient(host, port)  # cnxn to Influx
        self.client.create_database('twitter_data')  # creates a database
        return self.client

This code initializes a connection through the InfluxDBClient; here I'm using the defaults, as I am running this locally on my computer at home. If you were using a version inside a Docker container in the cloud, or InfluxDB Cloud, you would swap the host and port out for the appropriate connection parameters.
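Usage then looks like this (the defaults assume the local Docker instance above):

# grab a connected client; swap host and port for a cloud deployment
influx = ConnectInflux()
client = influx.connect()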

Next, you will need some data to load into InfluxDB's native format. Ideally, you would have a data stream available to grab some data from. I will load some data from Twitter, as it is a nice streaming source that provides close to real-time data. There is a whole set of instructions on how to create a Twitter developer account and API keys here.

import twitter

# python-twitter client; fill in your own app credentials
api = twitter.Api(consumer_key=[consumer key],
                  consumer_secret=[consumer secret],
                  access_token_key=[access token],
                  access_token_secret=[access token secret])

Once we have the connection, we can query the API for the data we are interested in, and we need to format it into a time-series structure that InfluxDB accepts. Below is the sample format from the InfluxDB documentation showing the measurements, tags, and fields I mentioned earlier. A measurement is like a table in traditional SQL stores; it helps Influx narrow down the query. Tags are the metadata of the point; they allow quick searches on those fields using the Influx Query Language. Finally, fields are the actual data values.

[
    {
        "measurement": "brushEvents",
        "tags": {
            "user": "Carol",
            "brushId": "6c89f539-71c6-490d-a28d-6c5d84c0ee2f"
        },
        "time": "2018-03-28T08:01:00Z",
        "fields": {
            "duration": 127
        }
    }
]

Therefore, we need a way to transform the data from the format Twitter returns into this one. This could be done in several ways, since the Twitter API response can be pulled into Pandas. There is a cool, though not particularly fast, way to ingest data into InfluxDB: convert the data into a DataFrame and load the DataFrame directly. Here is how:

from influxdb import DataFrameClient
import pandas as pd

# build a DataFrame from the tweets pulled earlier, indexed by timestamp
df = pd.DataFrame(data=twitter_data.values,
                  index=twitter_data.timestamp, columns=['0'])

# DataFrameClient handles the DataFrame-to-line-protocol conversion
client = DataFrameClient(host='localhost', port=8086)

print("Create Database")
client.create_database('demo')

print("Write DataFrame")
client.write_points(df, 'demo', protocol='line')

print("Write DataFrame with Tags")
client.write_points(df, 'twitter-demo',
                    {'k1': 'v1', 'k2': 'v2'}, protocol='line')

print("Read DataFrame")
client.query("select * from demo")
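One detail worth noting (a behavior of the influxdb-python library, not spelled out above): DataFrameClient.query returns a dictionary of DataFrames keyed by measurement, so reading the data back looks like this:

result = client.query("select * from demo")
print(result['demo'].head())  # one DataFrame per measurement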

You Are On Your Way

Congrats! You have now successfully written data from the Twitter API to InfluxDB. This is only the first step in the architecture of the present, and I hope you tune in for the future articles that will explain the rest of the architecture outlined above.

A small disclaimer: architecture is hard. In an industry where many play the game of buzzword bingo, it can be hard to find what's good and what works. Many voices claim, "this is the definitive way to do this architecture." That is not the case. This is what I think a real-time architecture should look like in a more mature system.

If you think you could see yourself implementing this architecture at your company, by all means, go ahead. Remember that here at Hashmap, we help implement solutions like these every day. When you find yourself lost in a sea of data ambiguity, remember our motto: “Build Better, Together!”.

Ready to Accelerate Your Digital Transformation?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and would be glad to work through your specifics in this area.

How does Snowflake compare to other data platforms? Our technical experts have implemented over 250 cloud/data projects in the last 3 years and conducted unbiased, detailed analyses across 34 business and technical dimensions, ranking each cloud data platform.

Other Tools and Content For You

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.

Kieran Healey is a Cloud and Data Engineer with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
