Ramya Rengarajan
Ancestry Product & Technology
6 min read · Oct 17, 2023

Using Apache Druid to stream the validation results of business events in real time

Understanding the Requirements:

At Ancestry, we record more than 800 million business events every day, with a north-star goal of scaling to 1 billion events per day. We have already built pipelines to validate these events against our schemas, augment them, and ingest them into our data warehouse, data lake, and third-party distributors. We next needed a way to verify that the business events reached our systems and either passed or failed validation. We wanted the solution to be both developer and analyst friendly, so that developers could automate these validations as part of their CI/CD pipelines and analysts could run queries and get quick answers.

Brainstorming and designing the solution architecture:

We knew that for these validations to be part of CI/CD pipelines we needed a tight SLA of one minute or less, but we were wondering how to provide that SLA across a billion events. While brainstorming, we had our eureka moment: we don't need to validate all the events, just the events triggered by our test users or test devices. This drastically reduced the scale of the validation system, and we were now looking at just hundreds of thousands of events every day.

Choosing the technology to implement:

We already had an event schema registry, so it was an easy decision to leverage it to hold the lists of test users and devices to validate against. Considering both types of end users, we provided API calls for developers and a UI for analysts to add, edit, or remove a user or device in a test list in the schema registry.
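For illustration only, a developer call to register a test device might look something like the sketch below; the endpoint and payload fields are hypothetical, since the schema registry API is internal to Ancestry.

# Hypothetical endpoint and fields; the real schema registry API is internal.
curl -s -X POST "$schema_registry/testlist/devices" \
  -H 'Content-Type: application/json' \
  -d '{"deviceId": "test-device-123", "owner": "analytics-team", "expiresAfterDays": 30}'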

Our existing data ingestion pipelines are made up of a series of Java microservices that talk to each other by publishing to various Kafka topics and consuming Kafka streams. So it was logical for us to publish all the test-listed events to a dedicated Kafka topic. While we were debating various options to stream the events to a REST endpoint and query them in real time, Apache Druid came up as one of our top options. We also ran a parallel proof of concept with a few other technologies, but decided to go with Apache Druid as it fit our use case well and the implementation seemed straightforward.
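The dedicated topic can be created with the standard Kafka CLI; the topic name and sizing below are illustrative.

# Create a small topic dedicated to test-listed events (name and sizing are illustrative).
kafka-topics.sh --create \
  --bootstrap-server "$kafka_server" \
  --topic test-listed-events \
  --partitions 6 \
  --replication-factor 3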

Implementation and learnings:

We implemented the single-server deployment and ran Apache Druid as an AWS ECS task. Running Apache Druid locally and connecting it to a datasource can be done in literally a few minutes, and watching Kafka messages streamed and transformed in real time will leave you in awe in no time.
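As a rough sketch of how quick the local setup is (the version number is just an example, and the launcher script name varies by Druid release):

# After downloading a Druid release tarball:
tar -xzf apache-druid-27.0.0-bin.tar.gz
cd apache-druid-27.0.0
./bin/start-druid          # older releases use ./bin/start-micro-quickstart
# The web console, including the "Load data" wizard for Kafka, is at http://localhost:8888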

But soon after the honeymoon phase, when we started taking real traffic, we hit some roadblocks: frequent "unhealthy supervisor" issues that halted ingestion. A few times we got lucky and simply restarting the tasks fixed it; the newly spun-up tasks would ingest and index again. That phase didn't last long either, though, and eventually even the new tasks ended up in an unhealthy supervisor state right after trying to index.

It was time to reflect on our requirements, look at our POC solution, and tweak and fine-tune configurations as needed.

  • Though Druid is a data warehouse analytics tool, we were using it only as a validation-results streaming tool for our tests, which meant we didn't need to store the data long term at all.
  • We had "useEarliestOffset" set to "true" during our ingestion. Changing this to the default "false" paid off right away, as new tasks then had to ingest and index only from the latest offsets. "resetOffsetAutomatically" is another useful KafkaSupervisorTuningConfig setting that can help auto-correct offsets in case of glitches. There are a handful of other tuning configurations that may help depending on the use case.
  • We also changed retention.ms on these Kafka validation topics to 172800000 ms, which is 2 days. So even if we tried to ingest and index from the beginning, it would be at most two days' worth of test events.
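With the standard Kafka CLI, that change looks like the following (the topic name is illustrative):

# Cap retention on the validation topic at 2 days (172800000 ms).
kafka-configs.sh --alter \
  --bootstrap-server "$kafka_server" \
  --entity-type topics \
  --entity-name test-listed-events \
  --add-config retention.ms=172800000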

Sample spec:

response=$(curl -s -o /dev/null -w "%{http_code}" -X POST -H 'Content-Type: application/json' -d '
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "'$kafka_server'"
      },
      "topic": "'$topic'",
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": false,
      "taskDuration": "PT720H"
    },
    "tuningConfig": {
      "type": "kafka",
      "resetOffsetAutomatically": true,
      "intermediateHandoffPeriod": "PT1H",
      "pushTimeout": 3600000
    },
    "dataSchema": {
      "dataSource": "'$data_source'",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "day"
      },
      "dimensionsSpec": {
        "dimensionExclusions": []
      }
    }
  }
}' $supervisor)
echo "$response"
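When ingestion stalled with an unhealthy supervisor, the supervisor APIs were handy for checking and recovering before digging into configs. A minimal sketch, assuming the supervisor id equals the datasource and $druid_router points at the Druid Router:

# Check supervisor health; the payload reports whether ingestion has stalled.
curl -s "$druid_router/druid/indexer/v1/supervisor/$data_source/status"
# If stored offsets are the suspect, a hard reset clears them so ingestion can resume.
curl -s -X POST "$druid_router/druid/indexer/v1/supervisor/$data_source/reset"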
  • Druid retention rules came in handy to drop old segments. As one day's worth of data was good enough for our use case, we added a loadByPeriod rule with period P1D and includeFuture=true, followed by the dropForever rule.
add_rules=$(curl -s -o /dev/null -w "%{http_code}" -X POST -H 'Content-Type: application/json' -d '
[
  {
    "type": "loadByPeriod",
    "period": "P1D",
    "includeFuture": true,
    "tieredReplicants": {
      "_default_tier": 2
    }
  },
  {
    "type": "dropForever"
  }
]' $coordinator_rules)
echo "$add_rules"

The rule combination dropBeforeByPeriod + loadForever is the same as loadByPeriod (includeFuture = true) + dropForever. There are various combinations of these load and drop rules that can help depending on the specific use case.
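For reference, the equivalent combination posted to the same rules endpoint would look like this:

equiv_rules=$(curl -s -o /dev/null -w "%{http_code}" -X POST -H 'Content-Type: application/json' -d '
[
  {
    "type": "dropBeforeByPeriod",
    "period": "P1D"
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 2
    }
  }
]' $coordinator_rules)
echo "$equiv_rules"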

  • Druid only marks the segments dropped by retention rules as "unused", so we had to schedule a "kill task" periodically to clean up the unused segments from deep storage.
curl -X 'POST' -H 'Content-Type:application/json' -d '{
  "type": "kill",
  "dataSource": "'$dataSource'",
  "interval": "2023-08-01/2023-10-01"
}' http://localhost:8888/druid/indexer/v1/task
{"task":"kill_datasource_fkancnag_2023-08-01T00:00:00.000Z_2023-10-01T00:00:00.000Z_2023-10-03T22:32:59.359Z"}

The Tasks tab in the Druid console shows the various tasks (index, kill, compaction, etc.) and their status.
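The same information is available over the API, which is handy for automation (the Router address is assumed to be localhost:8888 here):

# List currently running tasks; state can also be "complete", "pending", or "waiting".
curl -s "http://localhost:8888/druid/indexer/v1/tasks?state=running"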

  • Compaction is another useful feature to reduce storage by merging smaller segments. We tried enabling automatic compaction; when a compaction task was submitted, the status changed to "awaiting compaction" and then to "Fully compacted (except the last P1D of data, 7 segments skipped)". Though compaction was cool and promising, for our use case we chose to drop the old segments instead of compacting them.
curl --location --request POST 'http://localhost:8088/druid/coordinator/v1/config/compaction' \
--header 'Content-Type: application/json' \
--data-raw '{
  "dataSource": "'$dataSource'",
  "granularitySpec": {
    "segmentGranularity": "DAY"
  }
}'
  • Druid also provides "query laning" (think of carpool lanes and regular lanes on a freeway), where you can prioritize and separate some queries from the rest, and "service tiering", where you can separate quicker, lighter queries from heavier, longer-running ones. For our use case, all the queries need a quick SLA, so we didn't implement these.
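We did not enable laning, but for the curious, a minimal "HiLo" laning setup only needs a couple of Broker properties; the config path below is an assumption and the percentage is illustrative.

# Append to the Broker's runtime.properties (path varies by deployment).
cat >> "$BROKER_CONF_DIR/runtime.properties" <<'EOF'
druid.query.scheduler.laning.strategy=hilo
druid.query.scheduler.laning.maxLowPercent=20
EOF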
  • We added New Relic agents on all five Druid processes (Broker, Router, Coordinator-Overlord, MiddleManager, Historical) to monitor them. We would like to migrate to a clustered deployment when we have some bandwidth, as monitoring the CPU and memory of five Java applications on a single server is just not ideal.

Conclusion:

After adding retention rules, deleting old segments, and ingesting from the latest offset, Druid is working well for our use case of streaming the Kafka topic events in real time, enabling our clients to validate events within seconds of them being triggered, either by running queries or by using the REST API. While an open source technology like Apache Druid comes in handy, it is crucial to go through its vast offerings and pick and choose based on individual requirements.
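For example, a validation check from a CI/CD pipeline can be as simple as a Druid SQL query over the REST API; the datasource and column names below are illustrative, not our actual schema.

# Fetch the last five minutes of events for a test user (names are illustrative).
curl -s -X POST "$druid_router/druid/v2/sql" \
  -H 'Content-Type: application/json' \
  -d "{\"query\": \"SELECT __time, eventType, validationStatus FROM test_event_validations WHERE userId = 'test-user-123' AND __time > CURRENT_TIMESTAMP - INTERVAL '5' MINUTE\"}"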

Steve Jobs’ quote “Simple can be harder than complex” resonates well as we are designing and implementing solutions for various software requirements and problems every day at Ancestry. Understanding and absorbing requirements, brainstorming solutions, and choosing the right technologies can help deliver the satisfaction and smiles that all developers wish for.

Happy designing and coding!

If you're interested in joining Ancestry, we're hiring! Feel free to check out our careers page for more info. Also, visit our Medium page to see what Ancestry is up to and read more articles like this.
