Realtime Data in Apache Druid — Choosing the Right Strategy

Why you should use Kafka Indexing instead of Tranquility

Kartik Khare
Towards Data Science
4 min read · Dec 12, 2019


Storing data from real-time streams has always been a challenge, and the solution depends on your use case. If you want to store data for daily or monthly analytics, you can use a distributed file system and run Hive or Presto on top of it. If you’re going to run some simple real-time analytics, you can store the recent data in Elasticsearch and use Kibana for charts.

Apache Druid was made to address both of the above use cases at the same time. It can serve as a durable data store for daily or monthly analytics. It can also serve as a fast, queryable data store that allows you to push and retrieve data in real time.

The issue with earlier versions of Apache Druid, however, was ingesting data from streams into the database. Let’s take a look at the challenges developers faced earlier.

Tranquility

Tranquility is a package provided by Apache Druid to ingest real-time data. It is not exactly, but roughly, equivalent to a JDBC or Cassandra driver: it handles partitioning, replication, service discovery, and schema rollover for you. The user only needs to be concerned with the data and the datasource to write to.

Ingesting realtime data using Tranquility
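Tranquility can be embedded as a library or run as a standalone HTTP server. As a rough illustration of the server mode, and assuming placeholder host, port, datasource, and event fields, pushing a single event looks something like this:

```python
import json
import requests

# Minimal sketch of sending one event to a Tranquility Server endpoint.
# Host, port, datasource name, and event fields are illustrative only.
event = {"timestamp": "2019-12-12T10:00:00Z", "page": "/home", "views": 1}

resp = requests.post(
    "http://tranquility-host:8200/v1/post/page-views",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # reports how many events were received and sent to Druid
```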

Tranquility addresses a lot of the problems which a user might face. However, it comes bundled with its own set of challenges.

Not Exactly-Once

Tranquility can create duplicate records in some instances; it doesn’t provide any exactly-once guarantee. In scenarios such as a timed-out POST request or a missing acknowledgment, Tranquility can produce duplicate records.

This leaves the user with the responsibility of de-duplicating the data. It can even lead to incorrect charts in Apache Superset, if you are using it.

Data drops

The most concerning issue with Tranquility is data drops. There are various circumstances under which Tranquility, either intentionally or due to an error, fails to insert the data. Some of the cases listed in the official documentation are:

  • Events with timestamps outside your configured windowPeriod will be dropped.
  • If you suffer more Druid MiddleManager failures than your configured replica count, some partially indexed data may be lost.
  • If there is a persistent issue that prevents communication with the Druid indexing service, and retry policies are exhausted during that period, or the period lasts longer than your windowPeriod, some events will be dropped.
  • If there is an issue that prevents Tranquility from receiving an acknowledgment from the indexing service, it will retry the batch, which can lead to duplicated events.

The worst part is that in most of these cases, you wouldn’t even know your data had been dropped until you query it.

Error Handling

Since the Tranquility daemon runs inside your application’s JVM, it is the application’s responsibility to handle errors such as timeouts. In applications such as Apache Flink jobs, failing to handle one of these errors effectively can lead to unwanted restarts.

On top of all these issues, Tranquility is built for Druid 0.9.2. Using it with the current Druid version, 0.16.0, can create unforeseen problems.

Kafka Indexer

To address all of the above problems, Apache Druid added the Kafka Indexer in version 0.9.1. The indexer remained experimental until version 0.14.

The Kafka Indexing Service first starts a supervisor according to the config you specify. The supervisor then periodically spawns new indexing tasks, which are responsible for consuming data from Kafka and publishing it to Druid.

Unlike Tranquility, Kafka indexing tasks can be long-running. Multiple segments can be published once a minimum number of rows or bytes has been reached, without starting a new task.

Kafka Indexer in Apache Druid
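As a rough sketch, a supervisor is created by POSTing a spec like the one below to the Overlord. The datasource, topic, broker, and column names here are placeholders, and the exact spec layout can differ between Druid versions:

```python
import json
import requests

# Hypothetical Kafka supervisor spec; all names and values are illustrative.
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "page-views",
        "timestampSpec": {"column": "timestamp", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["page", "country"]},
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE",
        },
    },
    "ioConfig": {
        "topic": "page-views",
        "inputFormat": {"type": "json"},
        "consumerProperties": {"bootstrap.servers": "kafka01:9092"},
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H",
    },
    "tuningConfig": {"type": "kafka"},
}

# Submitting the spec to the Overlord starts (or updates) the supervisor,
# which then spawns the indexing tasks that consume from Kafka.
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"id": "page-views"}
```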

Kafka Indexer aims to solve various challenges that its predecessors faced.

Exactly-Once Semantics

The Kafka indexer provides an exactly-once guarantee to the user. The indexing tasks read from well-defined ranges of Kafka partition offsets, and the consumed offsets are stored atomically along with the published segments’ metadata, so retries and task restarts do not produce duplicate records.

Publish delayed data

The Kafka indexer is designed to publish delayed data. It is not subject to the windowPeriod restrictions of Tranquility. This gives a user the freedom to backfill data from a particular offset in Kafka into Druid.
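One way to trigger such a re-read is a supervisor reset. The sketch below uses the supervisor reset endpoint with a placeholder host, port, and supervisor id; after the reset, the supervisor restarts consumption according to the useEarliestOffset flag in its spec:

```python
import requests

OVERLORD = "http://overlord:8090"   # illustrative host and port
SUPERVISOR_ID = "page-views"        # hypothetical supervisor/datasource id

# A hard reset clears the offsets Druid has stored for this supervisor, so it
# re-reads the topic (from earliest or latest, per useEarliestOffset), which
# can be used to re-ingest data still retained in Kafka.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset")
resp.raise_for_status()
```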

Schema Updates

Although Tranquility also supports schema updates, they are much easier to do with the Kafka Indexer. You just need to submit a POST request containing the new schema, and the supervisor will spawn new tasks with the updated schema; no changes are required on the producer end.
If you add new columns, older rows will show empty values in those columns, but those rows will still be queryable.
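Continuing the hypothetical setup from the supervisor sketch above, a schema update is just another POST of the modified spec. The new column name is illustrative, and the exact shape of the spec returned by the GET call can vary between Druid versions:

```python
import json
import requests

OVERLORD = "http://overlord:8090"   # illustrative host and port
SUPERVISOR_ID = "page-views"        # hypothetical supervisor/datasource id

# Fetch the currently running spec, add a new dimension, and re-submit it.
# The supervisor rolls over to new tasks that index the updated schema.
spec = requests.get(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}").json()
spec["dataSchema"]["dimensionsSpec"]["dimensions"].append("device_type")

resp = requests.post(
    f"{OVERLORD}/druid/indexer/v1/supervisor",
    data=json.dumps(spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```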

The Kafka indexing service solved a lot of the issues that developers faced when they were using Tranquility. If you want to get started with it, you can refer to the Apache Kafka Ingestion page in the official Druid documentation.
