Fast Approximate Counting Using Druid and DataSketch

Supporting dynamic criteria across multiple big data sets

Photo by Jirsak on Adobe Stock
The system needed to answer ad hoc segment questions such as:

  • All the travelers who are searching for lodging in Paris, France in the next 2 weeks but have not booked yet
  • Travelers who booked a flight to a particular destination in the next month but have not booked any lodging at that destination

Druid and DataSketches to the Rescue

For time-series data, the Druid database was already being used at Expedia Group, so Druid was a good starting point. Druid does support sub-second query latency for a single data set with the proper configuration. However, we needed to support ad hoc queries across multiple data sets, and Druid has limited join support: it requires one data set to fit into memory, which was not an option for us.

For example, this query builds a theta sketch of the distinct travelers who searched for lodging in Paris:

SELECT DS_THETA(traveler_id) AS search_travelers
FROM search_event
WHERE __time >= '2020-04-01'
  AND product_line IN ('LODGING')
  AND destination = 'Paris,France'
  AND trip_start_date BETWEEN '2020-10-01' AND '2020-10-14'

The sketches returned by the search and booking queries can then be combined with a set difference (A-not-B) to count travelers who searched but did not book:

import org.apache.datasketches.theta.AnotB;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;

// Fetch one theta sketch per data set, then take the set difference
Sketch searchSketch = druidClient.sketchQuery(druidSQLSearch);
Sketch bookingSketch = druidClient.sketchQuery(druidSQLBooking);
AnotB anotB = SetOperation.builder().buildANotB();
anotB.update(searchSketch, bookingSketch);
log.info("Segment count: " + anotB.getResult().getEstimate());
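To build intuition for why sketches make this cheap, here is a minimal, self-contained toy version of the idea behind a theta (KMV-style) sketch: keep only the k smallest normalized hash values of the distinct keys, estimate cardinality from the k-th smallest, and compute set differences directly on the retained hashes. This is an illustration only, not the Apache DataSketches implementation, which adds proper sampling-probability handling, compact serialization, and error bounds:

```java
import java.util.TreeSet;

// Toy "k minimum values" (KMV) sketch -- the principle behind theta sketches.
// It keeps only the k smallest hash values of the distinct keys seen, so
// memory stays O(k) no matter how many events stream through.
class KmvSketch {
    final int k;
    final TreeSet<Double> mins = new TreeSet<>();

    KmvSketch(int k) { this.k = k; }

    // splitmix64 finalizer, mapped to a uniform double in [0, 1)
    static double hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return ((x ^ (x >>> 31)) >>> 11) * 0x1.0p-53;
    }

    void update(long key) {
        mins.add(hash(key));                   // TreeSet drops duplicate keys
        if (mins.size() > k) mins.pollLast();  // keep only the k smallest
    }

    // theta = the k-th smallest hash; all retained hashes lie below it
    double theta() { return mins.size() < k ? 1.0 : mins.last(); }

    // If the k-th smallest of N uniform draws in [0,1) is u, then N ~ (k-1)/u
    double estimate() {
        return mins.size() < k ? mins.size() : (k - 1) / mins.last();
    }

    // A-not-B: count A's retained hashes below the common theta that are
    // absent from B, then scale back up by 1 / theta
    static double estimateANotB(KmvSketch a, KmvSketch b) {
        double theta = Math.min(a.theta(), b.theta());
        long survivors = a.mins.headSet(theta).stream()
                               .filter(u -> !b.mins.contains(u))
                               .count();
        return survivors / theta;
    }

    public static void main(String[] args) {
        KmvSketch searched = new KmvSketch(1024);
        KmvSketch booked = new KmvSketch(1024);
        for (long id = 0; id < 100_000; id++) searched.update(id);
        for (long id = 0; id < 30_000; id++) booked.update(id);
        System.out.printf("searched ~ %.0f, searched-not-booked ~ %.0f%n",
                searched.estimate(),             // close to 100,000
                estimateANotB(searched, booked)); // close to 70,000
    }
}
```

With k = 1024 the relative error is a few percent, which is ample for segment sizing, while each sketch stays a few kilobytes regardless of traffic volume.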

Ingestion Optimization

As an optimization, rather than building the DataSketch object from raw data at query time, the sketch can be pre-computed as an additional column when the data is ingested into Druid. This speeds up query processing.

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "thetaSketch", "name": "traveler_id_sketch", "fieldName": "traveler_id" }
]
The query then aggregates the pre-computed sketch column instead of raw traveler IDs:

SELECT DS_THETA(traveler_id_sketch) AS search_travelers
FROM search_event
WHERE __time >= '2020-04-01'
  AND product_line IN ('LODGING')
  AND destination = 'Paris,France'
  AND trip_start_date BETWEEN '2020-10-01' AND '2020-10-14'
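When the client only needs the final number for a single data set, rather than a sketch to combine with other results, the estimate can also be computed inside Druid itself. A sketch of such a query, assuming the druid-datasketches extension is loaded (it provides the APPROX_COUNT_DISTINCT_DS_THETA SQL function; verify availability on your Druid version):

```sql
-- Numeric estimate computed directly from the pre-aggregated sketch column
SELECT APPROX_COUNT_DISTINCT_DS_THETA(traveler_id_sketch) AS search_travelers
FROM search_event
WHERE __time >= '2020-04-01'
  AND product_line IN ('LODGING')
  AND destination = 'Paris,France'
  AND trip_start_date BETWEEN '2020-10-01' AND '2020-10-14'
```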

Putting it together

Now that we know how to store the time-series data in Druid efficiently and run queries that return a DataSketch object in under 5 seconds, we can put all the pieces together.

Additional Druid Optimizations

Creating the sketch column at ingestion time improved query performance across all the data sources, but it was still not at the expected level. Eventually we found that the EC2 node type had to be changed: we were using EBS volumes without any local storage. Druid historical nodes respond to queries by reading prefetched segments from local disk, so segments that did not fit in memory had to be fetched across the network from EBS, causing considerable delays. We changed to a node type with local SSD storage, and performance improved significantly.


Approximate counting on big data is a potent, cost-effective tool that greatly reduces the amount of data required at query time. Combining a streaming approximation algorithm that supports set operations (DataSketches) with a fast time-series datastore (Druid) can answer in seconds questions that previously took hours.


