How we use Apache Druid’s real-time analytics to power kidtech at SuperAwesome
Co-authored with Saydul Bashar
Here at SuperAwesome, our mission is to make the internet safer for kids; to help accomplish this goal, our products power over 12 billion kid-safe digital transactions every month.
Digital transactions come in many forms and could be:
- An ad served to a kid’s device through AwesomeAds
- A video view on our new kids’ video gaming platform, Rukkaz
- A like, comment, post or re-jam on PopJam
Every digital transaction is processed to be instantly available for real-time analytics.
In kidtech, kid-safety and privacy protection are paramount, and a traditional approach to analytics and data engineering wouldn’t necessarily be COPPA and GDPR-K compliant.
What makes a traditional digital transaction kid-safe is the absolute absence of personal identifiable information (PII), which is the foundation of our zero-data approach. This is the main characteristic that makes our real-time analytics kid-safe.
Our kid-safe real-time analytics allow us to make the best and quickest decisions for our products and services, as well as our customers, and it enables us to work and iterate in a data-driven way.
Aside from helping us make product and customer decisions, real-time analytics is also used to power some of our products. This is the case for AwesomeAds, where we use this data to drive real-time decision making.
When it comes to collecting, processing and storing this mammoth number of transactions with efficiency and durability, we found Apache Druid to be the perfect database for the job.
What is Apache Druid?
Apache Druid is an open source distributed data store. Druid’s core design combines ideas from data warehouses, time series databases, and search systems to create a unified system for real-time analytics for a broad range of use cases. Druid merges key characteristics of each of the 3 systems into its ingestion layer, storage format, querying layer, and core architecture. — druid.io
In essence, Apache Druid is a highly performant, scalable and resilient database with low latency and high ingestion rate.
Druid has such high performance due to a number of reasons:
- It uses column-oriented storage, therefore it only needs to load the exact columns needed for a particular query
- The high scalability of the database allows query times of sub-second to a couple of seconds
- Druid can perform queries in parallel across a cluster, meaning a single query could be processed on many nodes
- Druid creates indexes that power extremely fast filtering and searching across multiple columns
How have we set up our Druid infrastructure?
Infrastructure-as-Code
At SuperAwesome, we’re big believers in Infrastructure-as-code. We use Terraform to set up our infrastructure on AWS, and we use helm to set up and manage our infrastructure inside our Kubernetes clusters.
Some key advantages of following Infrastructure-as-code practices are:
- Everything is in code! We have a single source of truth with the definition of all our infrastructure
- We have immutable infrastructure. We don’t make changes through the AWS console or the Kubernetes dashboard, we just define what our infrastructure should look like in a declarative language and we let the tool do it for us
- It helps us maintain environment parity. We can then have the same template with different values for different environments
Druid, Kubernetes and Hadoop
Our different Druid processes run in StatefulSets in order for us to provide them with persistent volumes. Although our Druid setup is stateless, having data available on disk enables our processes to recover much faster when they restart.
These Druid processes communicate with each other through domain names. Kubernetes pods change IP addresses when they are recreated, therefore by using the domain names, we allow the overlord to see that the recreated middle manager / historical nodes are the same. This helps us avoid any need of re-networking between the processes and allows for smooth re-creations.
What does our Druid architecture look like?
Data Ingestion
In order to collect the best analytics from our products, our digital transactions are sent to Druid using Apache Kafka events. We found that Apache Kafka is an ideal system to integrate with Druid, particularly for its properties that make exactly-once ingestion possible.
Kafka events are pulled into Druid by the middle manager
nodes as opposed to being pushed in by brokers. This allows the middle managers
to manage their own rate of ingestion, and as each Kafka event is tagged with metadata related to their partition and offset, the middle managers
can verify that they received what they expected and that no messages were dropped or re-sent.
These data ingestion tasks are distributed to the middle manager nodes using the overlords
— once consumed, the Druid segments are stored in deep storage (S3). As long as Druid processes can see this storage infrastructure and access these segments, there will be no data loss in the event of losing any Druid nodes.
Data Loading
The coordinator
process is responsible for assigning newly added segments to the historical
nodes for local caching — this is done with the help of Zookeeper
.
When a historical
node notices a new segment has been added to deep storage, it will check in its local cache for whether it has information on this segment. If that information doesn’t exist, the historical
node will download metadata about this new segment from Zookeeper
. This metadata would include details about where the segment is located in deep storage, as well as how to decompress and process the segment.
Once the historical
node has finished processing the segment, Zookeeper
is then made aware of the processed segment, and the segment becomes available for querying.
Querying
When it comes to relaying these events back to the client, queries are processed by the broker
nodes which evaluate the metadata published to Zookeeper
about what particular segments exist on each set of nodes (either middle managers
or historicals
), and then routes the queries to the correct nodes to retrieve those particular segments.
Re-indexing
We ensure we regularly re-index segments using Apache Hadoop, which has been set up in AWS EMR using Terraform. Re-indexing (also known as re-building segments) reduces the number of rows in the database, therefore helping to ensure our queries won’t get slower over time.
When we ingest real-time data, the granularity we maintain is hourly. At the end of each day, we have Kubernetes cron jobs which deploy re-indexing jobs on Hadoop to roll-up these hourly segments into daily segments. Finally, at a certain point during the month, we re-index these daily segments into monthly segments. In essence, the idea behind re-indexing is to roll-up our metrics into larger time frames — this, of course, means we aren’t able to query the smaller granularities, but it greatly reduces our storage requirements costs and query times.
Final Thoughts
Thanks to our approach to real-time analytics, we’re able to relay key insights and data points back to our stakeholders and customers, as well as use this data to power our products themselves. This all happens in a kid-safe way and enables us to deliver the best level of service to billions of U16s every month.
Our approach to kid-safe analytics and data engineering keeps evolving and we are always looking for bright minds and driven individuals to take it to the next level. Do you want to be part of it? Check out our job openings here.