Ingesting data at “Bharat” Scale

Bhavesh Raheja · Disney+ Hotstar · Jul 18, 2019 · 7 min read

A while ago, we introduced data democratisation at Hotstar. Today, we would like to introduce Bifrost, our data management platform at Hotstar.

Bifrost is an attempt at democratising and abstracting the production and consumption of data at Hotstar. It is a platform that encourages best practices for ingesting data, complying with necessary data governance, and sharing data with internal and external services. It abstracts the underlying complexities in producing, consuming and processing data.

10,000 ft view of Bifrost — Data Platform at Hotstar

In this blog, we will focus on the ingestion API that collects a variety of data that is used for insights across the organization.

Using a thin SDK on the client

Hotstar has a massive installed base, and with it a wide swathe of heterogeneity across 11 client renditions. Each of these clients embeds a thin SDK that is responsible for sending analytics data and provides uniform resiliency in the client’s interactions with the analytics collector. Through the SDK, we control payload sizes, data push frequencies and batch sizes.

Traffic flows over HTTPS; wrapping the behaviour within the SDK allows us to perform batching. Failures, retries and timeouts are all managed by the SDK & are unaffected when the user changes view-port or moves to another activity.

The advantages of statelessness, coupled with delivery & retry mechanisms over the HTTP protocol, outweigh the disadvantages of session creation & handshakes.
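To make the batching behaviour concrete, here is a minimal sketch of what such a client-side batcher could look like, written in Go for brevity (the real SDKs are native to each of the 11 client renditions). The event shape, endpoint, buffer sizes and retry counts below are illustrative assumptions, not our production values.

package analytics

import (
    "bytes"
    "encoding/json"
    "net/http"
    "time"
)

// Event is a single analytics payload (hypothetical shape).
type Event map[string]interface{}

// Batcher accumulates events and flushes them to the collector when either
// the batch size or the flush interval is hit. All values are illustrative.
type Batcher struct {
    endpoint  string
    batchSize int
    interval  time.Duration
    events    chan Event
}

func NewBatcher(endpoint string, batchSize int, interval time.Duration) *Batcher {
    b := &Batcher{
        endpoint:  endpoint,
        batchSize: batchSize,
        interval:  interval,
        events:    make(chan Event, 10*batchSize),
    }
    go b.loop()
    return b
}

// Track enqueues an event without ever blocking the caller's UI path.
func (b *Batcher) Track(e Event) {
    select {
    case b.events <- e:
    default: // buffer full: drop rather than block the app
    }
}

func (b *Batcher) loop() {
    ticker := time.NewTicker(b.interval)
    defer ticker.Stop()
    batch := make([]Event, 0, b.batchSize)
    for {
        select {
        case e := <-b.events:
            batch = append(batch, e)
            if len(batch) >= b.batchSize {
                b.flush(batch)
                batch = batch[:0]
            }
        case <-ticker.C:
            if len(batch) > 0 {
                b.flush(batch)
                batch = batch[:0]
            }
        }
    }
}

// flush POSTs the batch over HTTPS with bounded retries and simple backoff.
func (b *Batcher) flush(batch []Event) {
    body, _ := json.Marshal(batch)
    for attempt := 0; attempt < 3; attempt++ {
        resp, err := http.Post(b.endpoint, "application/json", bytes.NewReader(body))
        if err == nil && resp.StatusCode == http.StatusOK {
            resp.Body.Close()
            return
        }
        if err == nil {
            resp.Body.Close()
        }
        time.Sleep(time.Duration(attempt+1) * time.Second)
    }
}

The app constructs a Batcher once and calls Track from anywhere; batching, retries and backoff all stay inside the SDK.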

As our scale increases down the line, we may consider leveraging our very own Pubsub system that can support over 50 million concurrent connections.

Asynchronous Produce

The goal of the ingestion API is to collect data at high scale. The API pushes data into the message bus (Kafka) and all downstream consumption happens from there.

A usual flow for a handler would be:

Response handleRequest(Request req) {
    // Validate data
    // Push to Kafka
    producer.produce(..)
    // Return 200 OK
}

The critical decision we made was the behaviour of the produce call. Should we wait till we have confirmed the produce or not? What if the produce fails? Should we return a 5xx and ask the client to retry?

We chose the asynchronous produce path. We leverage channels heavily: sending data over the produce channel is treated as sufficient for a successful produce. The produce may still fail, or the Kafka cluster may be down, but we will cover that in detail in the next section.

With a synchronous paradigm, if one of our brokers were down and/or we had a spike in Kafka produce latency, it would add too much back pressure on the servers, and surging retry requests would compound the effect further, causing cascading failures.

Keeping the asynchronous paradigm helps us scale by maintaining very low latency for our clients. Once a client receives a 200 OK, it can trust that the data will reach the downstream systems.
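A rough Go sketch of this channel-based asynchronous produce is shown below. The handler only validates and enqueues; background workers drain the channel and produce to Kafka. The KafkaProducer interface, channel buffer size and helper names are assumptions for illustration, not our exact implementation.

package ingest

import (
    "errors"
    "io"
    "net/http"
)

// KafkaProducer is a stand-in for whichever Kafka client library is used;
// the real producer sits behind an interface like this (an assumption for
// this sketch).
type KafkaProducer interface {
    Produce(topic string, payload []byte) error
}

// produceCh buffers events between the HTTP handlers and the Kafka workers.
// The buffer size is illustrative.
var produceCh = make(chan []byte, 100000)

// StartWorkers drains the channel in the background and produces to Kafka.
func StartWorkers(p KafkaProducer, topic string, workers int) {
    for i := 0; i < workers; i++ {
        go func() {
            for payload := range produceCh {
                // A failed produce is handled by the failover path
                // described in the next section.
                _ = p.Produce(topic, payload)
            }
        }()
    }
}

// handleRequest validates, enqueues and returns 200 OK without waiting
// for the Kafka acknowledgement.
func handleRequest(w http.ResponseWriter, r *http.Request) {
    payload, err := readAndValidate(r)
    if err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }
    produceCh <- payload // asynchronous: sending on the channel is "success"
    w.WriteHeader(http.StatusOK)
}

func readAndValidate(r *http.Request) ([]byte, error) {
    body, err := io.ReadAll(r.Body)
    if err != nil || len(body) == 0 {
        return nil, errors.New("empty or unreadable payload")
    }
    return body, nil
}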

Availability vs Latency

I would highly recommend reading Fronting: An Armoured Car for Kafka Ingestion while building a fronting layer for Kafka. We chose a similar path, i.e. trading latency for increased availability. The value this provides to users of Bifrost is that once a 200 OK is returned, the data is guaranteed to be delivered, albeit at a higher latency under special circumstances.

We built a failover mechanism using Hystrix, a circuit breaker library that provides built-in support for both automated and manually induced panic. For the Bifrost API, panic means the latency of data delivery increases, while delivery itself remains guaranteed.

Initially, we considered Redis for our failover store, but at an average ingestion rate of 250K events per second we would need very large Redis clusters just to absorb a few minutes’ worth of panic on our message bus. Instead, we decided to use a failover log producer that writes events locally to disk; these logs are periodically rotated & uploaded to S3.
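Putting the two ideas together, a simplified sketch using hystrix-go (the Go port of Hystrix) might look like the following. The command name, thresholds and file paths are illustrative, it reuses the hypothetical KafkaProducer interface from the earlier sketch, and the rotation & upload step is shown further below.

package ingest

import (
    "os"
    "sync"

    "github.com/afex/hystrix-go/hystrix"
)

// The failover log is appended to locally and rotated & uploaded to S3
// out-of-band (rotation and upload are elided here).
var (
    failoverMu     sync.Mutex
    failoverLog, _ = os.OpenFile("/var/log/bifrost/failover.log",
        os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
)

func init() {
    // Illustrative thresholds; real values come from load tests.
    hystrix.ConfigureCommand("kafka-produce", hystrix.CommandConfig{
        Timeout:               500, // milliseconds
        MaxConcurrentRequests: 1000,
        ErrorPercentThreshold: 25,
    })
}

// produceWithFailover tries Kafka first; if the circuit is open or the
// produce fails, the payload is appended to the local failover log instead,
// so delivery stays guaranteed at the cost of latency.
func produceWithFailover(p KafkaProducer, topic string, payload []byte) error {
    return hystrix.Do("kafka-produce",
        func() error {
            return p.Produce(topic, payload)
        },
        func(err error) error {
            failoverMu.Lock()
            defer failoverMu.Unlock()
            _, werr := failoverLog.Write(append(payload, '\n'))
            return werr
        })
}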

Circuit Breaking Flow

Automated Failover During Data Surges

We’ve built an automated failover recovery mechanism (a source connector) that reads these files from S3 and publishes them to Kafka. At Hotstar, we ❤ Kafka Connect, and given the requirements of this failover scenario, it fit the situation well. Some of the reasons for choosing Kafka Connect are:

  • It’s always running → no manual switches to turn it on or off
  • It can be scaled up during longer outages → we can increase the number of tasks if more files need to be read from S3
  • It has retry & state management built in → it automatically resumes from the last file offset where it left off

This system, though, is still not 100% perfect. In the rare scenario where data has been written to a local log but the box dies before that log is rotated & uploaded to S3, we will still lose that data. We’ve minimised this window by aggressively rotating logs at quick intervals, but there’s still that chance. Preserving the log-disks, re-mounting & then processing those failed upload logs is definitely doable, but we haven’t encountered a situation that required adding this additional layer of complexity to our system.
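For completeness, the rotate-and-upload step could look roughly like the sketch below; the bucket, paths and naming are placeholders, it builds on the failover log variables from the earlier sketch, and the Kafka Connect replay side is not shown here.

package ingest

import (
    "fmt"
    "os"
    "path/filepath"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// rotateAndUpload renames the active failover log, reopens a fresh one and
// ships the rotated file to S3. Rotating at short intervals keeps the
// possible data-loss window small if the box dies.
func rotateAndUpload(uploader *s3manager.Uploader, bucket string) error {
    rotated := fmt.Sprintf("/var/log/bifrost/failover-%d.log", time.Now().UnixNano())

    failoverMu.Lock()
    failoverLog.Close()
    if err := os.Rename("/var/log/bifrost/failover.log", rotated); err != nil {
        failoverMu.Unlock()
        return err
    }
    failoverLog, _ = os.OpenFile("/var/log/bifrost/failover.log",
        os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    failoverMu.Unlock()

    f, err := os.Open(rotated)
    if err != nil {
        return err
    }
    defer f.Close()

    // Upload the rotated segment; the source connector later replays
    // these objects back into Kafka.
    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String("bifrost/failover/" + filepath.Base(rotated)),
        Body:   f,
    })
    return err
}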

You send, I’ll scale

We’ve built a system that is agnostic to its backend dependencies, provides a high degree of availability to clients & guarantees that data, once sent, will eventually reach its consumers. The API can now be inundated with millions of requests and, given its latency SLAs, will sustain the traffic. Let’s go into the details of how we provisioned & scaled this API.

We are entirely on the AWS stack & while simple CPU-based auto-scaling works for most cases, it’s not robust enough here.

Traffic pattern for Bifrost API

Hotstar viewing behaviour is very similar to TV viewing behaviour; as we head into primetime, traffic grows. We see traffic varying from 100k RPM to over 4 million RPM during the day. Neither over-provisioning for the peak nor relying on plain auto-scaling was an option, given the rapid build-up in volumes.

We leveraged target tracking scaling which works very well for our use-case, since it keys off requests per second (RPS).
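For reference, attaching a request-count target tracking policy to the API’s auto scaling group with the AWS SDK for Go looks roughly like the sketch below; the group name, resource label and target value are placeholders rather than our production configuration.

package scaling

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

// configureTargetTracking attaches a request-count target tracking policy
// to the API's auto scaling group.
func configureTargetTracking() error {
    svc := autoscaling.New(session.Must(session.NewSession()))
    _, err := svc.PutScalingPolicy(&autoscaling.PutScalingPolicyInput{
        AutoScalingGroupName: aws.String("bifrost-api-asg"),
        PolicyName:           aws.String("bifrost-rps-target-tracking"),
        PolicyType:           aws.String("TargetTrackingScaling"),
        TargetTrackingConfiguration: &autoscaling.TargetTrackingConfiguration{
            PredefinedMetricSpecification: &autoscaling.PredefinedMetricSpecification{
                // Scale on ALB request count per target, i.e. RPS per instance.
                PredefinedMetricType: aws.String("ALBRequestCountPerTarget"),
                // Placeholder resource label in the
                // app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id> format.
                ResourceLabel: aws.String("app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>"),
            },
            // Illustrative target request count per instance.
            TargetValue: aws.Float64(90000),
        },
    })
    return err
}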

While this works for 90% of the use-cases, target tracking is horrible for surges. Certain traffic patterns cause traffic to spike 5x and then fall back to normal within a minute; these could be due to key game moments or notifications. Target tracking, though, takes at least a few minutes to react, so scaling lags the actual traffic pattern.

Number of requests (yellow/green) against the left axis and number of instances from target tracking (blue) against the right axis

We are actively working on figuring out a better strategy to handle these traffic patterns.

Not everything is infinitely elastic

As we build for Hotstar’s scale, we realise that there are limits to how fast we can react. We’ve seen outages where our origin crashes &, as it tries to recover, is inundated with client retries & pending requests in the surge queue. That’s a recipe for cascading failure.

Outage during which the API returned 5xx. Notice the small spikes during the outage & the large spike in retried requests post recovery

Velvet Rope: You hang on to it for a bit

Adding target tracking scaling & better instance configurations does help to a certain extent; but at some point, you’ve got to draw the velvet rope.

We want to continue serving the requests we can sustain; for anything over that, sorry, no entry. So we added a rate-limit to each of our API servers. We arrived at this configuration after a series of simulations & load tests, to understand exactly at what RPS our boxes can no longer sustain the load. We use nginx to cap the number of requests per second using a leaky bucket algorithm.
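As an illustration, a per-server cap like this can be expressed with nginx’s limit_req directives, which implement the leaky bucket; the zone name, path, upstream, rate and burst below are placeholders, not our tuned values.

# In the http block: one shared bucket per server name, i.e. a per-box cap
limit_req_zone $server_name zone=bifrost_ingest:10m rate=1000r/s;

server {
    location /v1/events {
        # Leaky bucket: absorb a small burst, reject everything else with 429
        limit_req zone=bifrost_ingest burst=100 nodelay;
        limit_req_status 429;
        proxy_pass http://bifrost_api;
    }
}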

The target tracking scaling trigger is set at three-quarters of the rate-limit to leave room to scale; but there are still occasions where large surges are too quick for target tracking to react to.

Want more?

There’s a lot more that went into the Bifrost API than what was mentioned here. Queue tuning & Kafka producer optimisations are huge aspects that deserve blogs of their own. If you found this interesting, also check out the related articles around Bifrost & our data platform, and our other talks.

We’re hiring — if you want to participate in similar adventures, please check out tech.hotstar.com for job listings!
