Processing Time Series Data in Real-Time with InfluxDB and Structured Streaming

This article covers how to use the popular open source time series database InfluxDB together with Spark Structured Streaming to process, store, and visualize data in real time. We will go through how to set up a single-node instance of InfluxDB, how to extend Spark's ForeachWriter to write to InfluxDB, and what to keep in mind while designing an InfluxDB database.

vibhor nigam
Dec 16, 2018 · 6 min read

In the data world, one of the main things people want to see is how a metric progresses over time. This makes managing and handling time series data (data whose values are indexed by, and dependent on, time) a very important part of a data scientist's life.

A lot of tools and databases have been developed around the idea of handling time series data efficiently. During a recent project I got to explore one very popular open source database, InfluxDB, and this post is about how to process real-time data with InfluxDB and Spark.

InfluxDB

By way of definition, the official documentation describes it as follows:

InfluxDB is used as a data store for any use case involving large amounts of time-stamped data, including DevOps monitoring, log data, application metrics, IoT sensor data, and real-time analytics.

This article will not go into the details of how the database works internally or the algorithms it uses; those details can be found in the InfluxDB documentation.

Instead, I will focus mainly on installation, write and read capabilities, writing through Spark, and the behavior of InfluxDB as data volume grows.

Installation

InfluxDB comes in two editions: an open source edition, which can only be installed on a single instance, and a paid enterprise edition, which can be installed on a cluster.

For many use cases, the open source edition is perfectly adequate. A single-instance installation of InfluxDB is very simple. The steps I followed differ from those in the documentation (which I found a bit tricky); they are as follows:

  1. Download the InfluxDB RPM file
  2. Install the alien package, if not already installed, with "sudo apt-get install alien"
  3. Convert the RPM into a .deb file with "alien name.rpm"
  4. Install InfluxDB with "sudo dpkg -i name.deb"
  5. Start the InfluxDB server with "sudo influxd" or with "sudo service influxdb start"

Hardware Sizing Guidelines

InfluxDB has been generous enough to provide hardware sizing guidelines. The recommendations for a single-node instance, along with much more detail, can be found in the InfluxDB documentation.

InfluxDB Basic Concepts

There are a few important InfluxDB concepts to understand here:

1. Measurement: A measurement is loosely equivalent to a table in a relational database. Data is stored inside measurements, and a database can have multiple measurements. A measurement primarily consists of three types of columns: time, tags, and fields.

2. Time: The time column tracks the timestamp of each point so that time series operations can be performed efficiently. By default it holds the InfluxDB server time in nanoseconds, but it can be replaced with the event time.

3. Tags: A tag is similar to an indexed column in a relational database. An important point to remember is that operations such as GROUP BY can only be performed on columns marked as tags, and filtering in a WHERE clause is efficient only on tags, since fields are not indexed.

4. Fields: Fields are the columns on which mathematical operations such as sum, mean, and non-negative derivative can be performed. In recent versions, string values can also be stored as fields.

5. Series: A series is the most important concept in InfluxDB. A series is a unique combination of measurement, tag set, and retention policy (InfluxDB provides a default one). The performance of an InfluxDB database depends heavily on the number of unique series it contains, which in turn is the cardinality of the tag sets × the number of measurements × the number of retention policies.

It is therefore imperative to decide judiciously which values to store as tags and which as fields, since that choice determines both the kinds of operations that can be performed and the performance of the database itself.
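To make the series math concrete, here is a small illustrative Scala sketch (the object name and the numbers are hypothetical, not part of any InfluxDB client) estimating the worst-case series cardinality from per-tag cardinalities:

```scala
// Hypothetical helper: worst-case series cardinality is the product of
// the number of measurements, the number of unique values of each tag,
// and the number of retention policies. The true count can be lower if
// tag values are correlated (not every combination actually occurs).
object SeriesCardinality {
  def estimate(measurements: Int,
               tagCardinalities: Seq[Int],
               retentionPolicies: Int = 1): Long =
    measurements.toLong * tagCardinalities.map(_.toLong).product * retentionPolicies.toLong
}
```

For example, one measurement with a "host" tag taking 100 values and a "region" tag taking 5 values, under the default retention policy, yields up to 1 × 100 × 5 × 1 = 500 unique series.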

Writing Data From Spark

Spark is currently the most popular and efficient open source tool in the field of big data processing. At present, there are two open source InfluxDB sink implementations available for writing data through Structured Streaming: chronicler and reactive-influx.

Both are efficient. The only problem with chronicler is that, to write data through it, one first has to convert each record into an InfluxDB line protocol string, which becomes tricky with a large number of fields and string values. It is for this reason alone that I preferred reactive-influx.
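To illustrate why hand-rolling line protocol gets fiddly, here is a hedged Scala sketch (not chronicler's actual API; object and method names are my own) that assembles a line protocol string. Note the double-quoting of string fields and the `i` suffix on integers, which are exactly the details that are easy to get wrong:

```scala
// Illustrative only: InfluxDB line protocol has the shape
//   measurement,tag1=v1,tag2=v2 field1=val1,field2="str" <timestamp>
// String fields must be quoted and integer fields carry an 'i' suffix.
object LineProtocol {
  def fieldValue(v: Any): String = v match {
    case s: String  => "\"" + s + "\"" // strings are double-quoted
    case i: Int     => s"${i}i"        // integers need the 'i' suffix
    case l: Long    => s"${l}i"
    case d: Double  => d.toString      // floats are written as-is
    case b: Boolean => b.toString
  }

  def toLine(measurement: String,
             tags: Map[String, String],
             fields: Map[String, Any],
             timestampNs: Long): String = {
    val tagPart   = tags.map { case (k, v) => s"$k=$v" }.mkString(",")
    val fieldPart = fields.map { case (k, v) => s"$k=${fieldValue(v)}" }.mkString(",")
    s"$measurement,$tagPart $fieldPart $timestampNs"
  }
}
```

For instance, `LineProtocol.toLine("cpu", Map("host" -> "a"), Map("usage" -> 0.5), 10L)` produces `cpu,host=a usage=0.5 10`. A production encoder would also need to escape commas, spaces, and quotes in keys and values, which this sketch omits.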

To include reactive-influx in an sbt project, add the following dependencies:

libraryDependencies ++= Seq(
  "com.pygmalios" % "reactiveinflux-spark_2.11" % "1.4.0.10.0.5.1",
  "com.typesafe.netty" % "netty-http-pipelining" % "1.1.4"
)

Then make an entry in application.conf:

reactiveinflux {
  url = "localhost:8086/"
  spark {
    batchSize = 1000 // Number of records to be sent in each batch
  }
}

To enable a Structured Streaming query to write into InfluxDB, one needs to extend the ForeachWriter available in Spark Structured Streaming. Pseudo-code for this is given below:

import com.pygmalios.reactiveinflux._
import com.pygmalios.reactiveinflux.spark._
import com.pygmalios.reactiveinflux.sync.{SyncReactiveInflux, SyncReactiveInfluxDb}
import org.joda.time.DateTime
import scala.concurrent.duration._

class influxDBSink(dbName: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {

  var db: SyncReactiveInfluxDb = _
  implicit val awaitAtMost = 1.second

  // Open the database connection (called once per partition)
  def open(partitionId: Long, version: Long): Boolean = {
    val syncReactiveInflux = SyncReactiveInflux(ReactiveInfluxConfig(None))
    db = syncReactiveInflux.database(dbName)
    db.create() // create the database if it does not already exist
    true
  }

  // Convert each incoming row into a point and write it
  def process(value: org.apache.spark.sql.Row): Unit = {
    val point = Point(
      time = DateTime.now(), // system time; replace with the event time from the row if needed
      measurement = "measurement1",
      tags = Map(
        "t1" -> "A",
        "t2" -> "B"
      ),
      fields = Map(
        "f1" -> 10.3, // BigDecimal field
        "f2" -> "x",  // String field
        "f3" -> -1,   // Long field
        "f4" -> true  // Boolean field
      )
    )
    db.write(point)
  }

  // Close the connection here
  def close(errorOrNull: Throwable): Unit = {
  }
}

and then include it in the streaming query as follows:

val influxWriter = new influxDBSink("dbName")

val influxQuery = ifIndicatorData
  .writeStream
  .foreach(influxWriter)
  .outputMode("append")
  .start()

Visualization

Once the data is stored, visualizations can be built with various tools such as Grafana and Chronograf. A sample visualization looks something like this:

(Sample dashboard; image from influxdata.com)

There are many articles on Medium and other platforms about visualization, so I will not cover it in detail here.

Conclusion

In conclusion, I found InfluxDB to be highly efficient at data storage and very easy to use. Its compaction algorithms are very powerful and compress data to almost half its original size; on my own data, compression reduced roughly 67 GB to 35 GB.

Exactly what determines the scale and effect of compression, however, is outside the scope of this article.
