From Chaos to Control: Druid’s Autoscaling 🚀

Rahul Sharma
PlaySimple Engineering
4 min read · Sep 2, 2024

Let’s kick things off by getting to know Apache Druid — the superstar of real-time analytics databases. Picture this: you’re dealing with a colossal dataset, and you need answers fast — like, sub-second fast. Enter Druid! It’s designed to handle petabytes of data like it’s no big deal, giving you lightning-fast queries and real-time insights. Whether you’re analyzing user behavior, monitoring network performance, or hunting down anomalies, Druid’s got your back with its blend of OLAP, time-series magic, and search prowess. Plus, it’s got a scalable, distributed architecture that’s the core strength behind many data-driven applications.

For more info about Druid’s architecture, please refer to: Druid Architecture

Today, we’re zooming in on one of Druid’s behind-the-scenes heroes: the MiddleManager. This little engine is the muscle behind data ingestion in Druid. Our focus? How to scale MiddleManagers effectively without breaking the bank — or breaking anything, really.

The Challenge: Scaling MiddleManagers Without Losing Your Mind (or Your Money 💰)

Scaling Druid’s MiddleManagers is a bit tricky, especially when your workloads are all over the place. If you’ve ever used Horizontal Pod Autoscalers (HPA) and watched in horror as your ingestion jobs crashed because the autoscaler decided to downscale at the worst possible time, you know the struggle. We’ve been there too 🫂, and after enough iterations, we came up with our own solution — a custom autoscaler that’s as smart as it is cost-effective🥳. No more paying for MiddleManagers to sit around doing nothing, and no more surprise ingestion failures!🫡

The Problem: Inefficient Scaling and Ingestion Woes

We host Druid on Kubernetes and, in the beginning, had a couple of MiddleManager pods running 24/7, even when they had nothing to do. As our data needs grew, so did our MiddleManager count, and so did our costs.

To make things smarter, we turned to HPA, hoping it would scale MiddleManagers up and down based on CPU or memory usage. But soon, ingestions started failing. The problem? MiddleManagers run as a StatefulSet, and when Kubernetes scaled it down, it didn’t check whether a pod was mid-task, resulting in some sad, incomplete ingestions.

We tried changing Druid’s worker select strategy to fillCapacity to avoid these issues. It helped, but it wasn’t foolproof. Imagine four MiddleManagers with 15 task slots each: three of them finish their tasks while the fourth is still going. CPU/memory usage drops, HPA swoops in to downscale, and boom, another mid-flight task gets killed 🤯.
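
For context, that strategy switch is a single call to the Overlord’s dynamic configuration API. Here’s a minimal sketch, assuming the Overlord is reachable at http://overlord:8081 (check your Druid version’s docs for the full config schema):

import requests

# Switch the worker select strategy so tasks fill one MiddleManager
# before spilling onto the next, which keeps the "last" MiddleManager
# easier to drain and downscale.
requests.post(
    "http://overlord:8081/druid/indexer/v1/worker",
    json={"selectStrategy": {"type": "fillCapacity"}},
).raise_for_status()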

Introducing MIDAS: Middle Manager Intelligent Dynamic Autoscaler

After realising that existing solutions just weren’t cutting it, we rolled up our sleeves and built our own autoscaler.

How It Works: The Nitty-Gritty

Every 30 seconds, our scheduler checks in with the /druid/indexer/v1/pendingTasks endpoint to see if there are any tasks waiting. Based on what it finds, it adjusts the number of MiddleManagers. Here’s the logic in a nutshell:

WORKERS_PER_MM = 15   # task slots per MiddleManager
MAX_NO_MM = 4         # hard cap on MiddleManager replicas

# Keep what is already running and add one MiddleManager
# for every WORKERS_PER_MM pending tasks (rounding up)
required_mm_count = (pending_tasks // WORKERS_PER_MM) + current_running_mm_count
if pending_tasks % WORKERS_PER_MM > 0:
    required_mm_count += 1

required_mm_count = min(MAX_NO_MM, required_mm_count)
scale_replica_to(required_mm_count)

We keep tabs on the current MiddleManager count using the /druid/indexer/v1/workers endpoint, and we use the /druid/worker/v1/tasks endpoint to get the number of tasks running on a particular MiddleManager.
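
To make this concrete, here’s a minimal Python sketch of what one scheduler tick looks like. It uses the two Overlord endpoints above and the same arithmetic as the scale-up logic; the Overlord URL, the requests library, and the function name are illustrative assumptions, not the exact MIDAS code:

import requests

OVERLORD = "http://overlord:8081"  # assumed Overlord address for illustration
WORKERS_PER_MM = 15
MAX_NO_MM = 4

def required_middle_managers():
    # Tasks still waiting for a free slot
    pending_tasks = len(requests.get(f"{OVERLORD}/druid/indexer/v1/pendingTasks").json())
    # MiddleManagers currently registered with the Overlord
    current_running_mm_count = len(requests.get(f"{OVERLORD}/druid/indexer/v1/workers").json())

    # Same arithmetic as the scale-up pseudocode above
    required = current_running_mm_count + pending_tasks // WORKERS_PER_MM
    if pending_tasks % WORKERS_PER_MM > 0:
        required += 1
    return min(MAX_NO_MM, required)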

If a new MiddleManager is needed, we spin it up. But since a new pod takes a few minutes to come up, we persist the scaling state so the autoscaler doesn’t go overboard and keep requesting more replicas in the meantime.
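
Because the MiddleManagers run as a StatefulSet, spinning one up or down boils down to patching the replica count. Here’s a rough sketch with the official Kubernetes Python client; the StatefulSet name and namespace are assumptions, so adjust them to your setup:

from kubernetes import client, config

def scale_replica_to(replicas):
    # The autoscaler runs in the same cluster as Druid, so in-cluster config works
    config.load_incluster_config()
    apps = client.AppsV1Api()
    apps.patch_namespaced_stateful_set_scale(
        name="druid-middlemanager",   # assumed StatefulSet name
        namespace="druid",            # assumed namespace
        body={"spec": {"replicas": replicas}},
    )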

When it’s time to downscale, we do it smartly. We start by disabling the last MiddleManager in line, so it gets no new tasks. If the remaining MiddleManagers have enough spare capacity to take over, we wait for the tasks on the disabled one to finish, then scale it down.
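
Disabling and draining lean on the worker API that each MiddleManager exposes. A rough sketch of the two calls involved, assuming the default worker port 8091 (the host names are placeholders):

import requests

def disable_last_mm(worker_host):
    # e.g. worker_host = "druid-middlemanager-3:8091" (placeholder)
    # The Overlord stops assigning it new tasks; running tasks keep going.
    requests.post(f"http://{worker_host}/druid/worker/v1/disable").raise_for_status()

def running_task_count(worker_host):
    # Tasks still executing on this MiddleManager; once this hits zero,
    # the pod is safe to scale down.
    return len(requests.get(f"http://{worker_host}/druid/worker/v1/tasks").json())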

Here’s our weapon for aggressive downscaling:

# Task-slot capacity that would remain if we disable one more
# MiddleManager, with a 10% safety margin
cap_to_check_against = 0.9 * WORKERS_PER_MM * (no_of_running_mm - (disabled_mm_cnt + 1))
# Tasks the remaining MiddleManagers would have to absorb
no_of_running_tasks_to_check = total_running_tasks - last_mm_running_tasks

if cap_to_check_against > no_of_running_tasks_to_check:
    disable_last_mm()      # no new tasks for the last MiddleManager
    disabled_mm_cnt += 1

if last_mm_idle and disabled:
    scale_down_to(no_of_running_mm - 1)   # remove the drained pod
    disabled_mm_cnt -= 1

If new tasks roll in, we first check if we can re-enable any disabled MiddleManagers before spinning up a new one. Simple, smart, and effective.
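
Re-enabling is the mirror image of the disable call shown earlier (again a sketch, with the worker host as a placeholder):

import requests

def enable_mm(worker_host):
    # The Overlord starts assigning tasks to this MiddleManager again,
    # so we avoid the few-minute startup cost of a brand-new pod.
    requests.post(f"http://{worker_host}/druid/worker/v1/enable").raise_for_status()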

The Result: Smarter, Cheaper, and More Reliable Scaling

By implementing this custom autoscaler, we cut our Druid infrastructure costs by 30%. Now, we can scale Druid’s MiddleManagers without worrying about unexpected ingestion failures, and we’re paying only for the resources we actually use. It’s a win-win!🏆

Ready to Scale Your Druid Like Never Before?

We’ve got something special for you: our custom autoscaler is now open-sourced! 🎉 That’s right, we’re sharing the love. All you need to do is grab the code, configure the autoscaler’s settings to match your needs, deploy it in the same cluster as your Druid, and switch the Overlord’s worker select strategy from equalDistribution to fillCapacity, as shown earlier. It’s that simple. With just a few tweaks, you’ll be scaling your ingestions effortlessly and slashing your infrastructure costs. Happy scaling! 🚀

MIDAS REPOSITORY
