Real-time analysis. Part 1: Meet Apache Druid, Apache Kylin and Apache Pinot

An overview of real-time data processing

Dima Baranetskyi
bigdatarepublic
9 min read · Aug 6, 2024


“The future is already here — it’s just not evenly distributed.” — William Gibson

Real-time data analysis holds immense potential, yet its value seems underrepresented in the market. While many data pipelines utilize stream ingestion, processing, and transformation, they often end up sinking data into relational databases or, at best, into well-known big-data-warehouse-lake-mesh-swiss-knife platforms. This article aims to explore the untapped potential of true real-time analysis.

I believe that real-time data analysis will allow us to anticipate trends and react to events as they unfold. In a sense, it gives us the ability to operate in the “future”. Imagine a world where vital sectors of society process data in real-time. How many opportunities could this open for us?

Real-time analysis

But what does a real-time data pipeline look like? What are the components of an ideal real-time data processing flow? Let’s examine this step by step.

First, we have to collect and import data from various sources into the big data system in a streaming fashion. We want to have certain guarantees like consistency, correct ordering, etc. Well-known players come to mind: Apache Kafka, Apache Pulsar, Apache Flume, and others. Real-time data ingestion is well-adopted in the market.
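
To make this concrete, here is a minimal ingestion sketch using the confluent-kafka Python client. The broker address, topic name, and event shape are illustrative assumptions, not a prescribed schema; keying messages by user ID is one way to get the per-key ordering guarantee mentioned above.

```python
import json

from confluent_kafka import Producer  # pip install confluent-kafka

# Broker address and topic are placeholders for your own setup.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Invoked once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

event = {"user_id": "42", "action": "page_view", "ts": "2024-08-06T12:00:00Z"}

# Keying by user_id routes all of a user's events to one partition,
# so Kafka preserves their relative order.
producer.produce(
    "clickstream",
    key=event["user_id"],
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()
```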

As soon as we have this stream of valuable information, we need to process and transform it continuously. This can be as simple as filtering or as complex as joining multiple streams utilizing different windowing and watermarking strategies. It sounds complicated, but we have some powerful tools available. The most prominent are Apache Flink and Apache Spark Streaming (Structured Streaming), the somewhat outdated Apache Storm, and Kafka Streams, which is tightly connected to Kafka. Real-time processing and transformation is not a trivial task, but it is definitely well-explored.
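
As a sketch of what windowing and watermarking look like in practice, here is a Spark Structured Streaming job that counts actions in five-minute tumbling windows while tolerating ten minutes of event lateness. The topic, schema, and thresholds are illustrative, and the job assumes the spark-sql-kafka connector is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-agg").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Read the Kafka topic and unpack the JSON payload into columns.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Accept events arriving up to 10 minutes late, then count actions
# per 5-minute tumbling window.
counts = (
    events.withWatermark("ts", "10 minutes")
    .groupBy(window(col("ts"), "5 minutes"), col("action"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```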

With the data cleaned and transformed, what’s next? We need storage for persisting large volumes of incoming data that supports quick writes and reads. Apache Cassandra, Apache HBase, and Apache Kudu are databases designed to handle high-velocity data ingestion while maintaining low-latency access for analytics.
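
For a feel of the storage layer, here is a minimal write-and-read sketch against Cassandra using the DataStax Python driver. The keyspace, table, and columns are assumed to exist already and are purely illustrative.

```python
from datetime import datetime, timezone

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")  # keyspace assumed to exist

# Prepared statements are the idiomatic way to sustain high-velocity writes.
insert = session.prepare(
    "INSERT INTO events (user_id, ts, action) VALUES (?, ?, ?)"
)
session.execute(insert, ("42", datetime.now(timezone.utc), "page_view"))

# Low-latency point read, served directly by the partition key.
rows = session.execute("SELECT action FROM events WHERE user_id = '42'")
for row in rows:
    print(row.action)
```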

Finally, we’ve reached the main topic of this article: real-time analysis. We must be able to query and analyze large volumes of data as it’s ingested. Most importantly, we have to provide immediate insights! I’ve chosen Apache Druid, Apache Kylin, and Apache Pinot to compare here. But first, let’s finish defining our ideal real-time data pipeline.

We’ve got our insights. Now it’s time to visualize them. In real time. Tools like Apache Superset, Kibana, and Grafana allow us to monitor current trends, spot anomalies, and make data-driven decisions quickly!

It might seem like we’re done after making our decisions. But real-time analysis never ends, and there are more crucial processes to consider. For example, data quality and validation. Tools like Apache Griffin, Deequ, and Great Expectations help verify incoming data to ensure its accuracy, completeness, and consistency.
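
The checks themselves are conceptually simple; tools like Great Expectations essentially let you declare them and track results over time. Here is a hand-rolled sketch of the idea, with illustrative field names and rules:

```python
REQUIRED_FIELDS = {"user_id", "action", "ts"}
VALID_ACTIONS = {"page_view", "click", "purchase"}

def validate(event: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:  # completeness check
        errors.append(f"missing fields: {sorted(missing)}")
    if event.get("action") not in VALID_ACTIONS:  # accuracy check
        errors.append(f"unknown action: {event.get('action')}")
    return errors

good = {"user_id": "42", "action": "page_view", "ts": "2024-08-06T12:00:00Z"}
bad = {"user_id": "42", "action": "teleport"}
assert validate(good) == []
print(validate(bad))  # two violations: missing 'ts', unknown action
```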

Additionally, we need real-time monitoring and alerts. We have to watch for system health and performance. Tools like Apache Ambari, Prometheus, and Cloudera Manager help us detect issues or anomalies in the data or the infrastructure.
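
As an illustration, instrumenting a pipeline stage for Prometheus takes only a few lines with the official prometheus_client package. The metric names and the simulated workload below are made up; alerting rules would live in Prometheus itself.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

EVENTS = Counter("events_processed_total", "Events processed")
FAILURES = Counter("events_failed_total", "Events that failed validation")
LATENCY = Histogram("event_processing_seconds", "Per-event processing time")

start_http_server(8000)  # Prometheus scrapes :8000/metrics

while True:  # stand-in for the real processing loop
    with LATENCY.time():
        time.sleep(random.uniform(0.001, 0.01))  # simulated work
        EVENTS.inc()
        if random.random() < 0.01:
            FAILURES.inc()
```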

Finally, there’s data integration. We’re combining data from multiple sources in real-time, and we have to ensure the data is harmonized and consistent. Some tools that can be helpful here are Apache NiFi, Apache Gobblin, and Airbyte.

It’s important to emphasize that we’re in a real-time data processing environment. These stages are not sequential; they’re ongoing. Not all of them exist in an average stream processing pipeline, and that’s fine. We have to start somewhere.

Apache Druid, Apache Kylin and Apache Pinot

“In the fields of observation chance favors only the prepared mind.” — Louis Pasteur

The perception that real-time analysis capabilities are underutilized is a key driver for this article. By examining tools like Apache Druid, Apache Kylin, and Apache Pinot, we aim to shed light on the possibilities of true real-time analysis and how it can revolutionize data-driven decision making.

Apache Druid, Apache Kylin, and Apache Pinot are not the only tools available for real-time analysis. An ideal tool for this purpose should meet certain key criteria:

  • Real-time querying and analysis of large datasets
  • Ability to handle complex analytical queries quickly on large volumes of data
  • Support for distributed architecture
  • Capability to handle multiple simultaneous queries from many users
  • SQL support, which has become a de facto standard in data analysis

In the following sections, we’ll examine each of these technologies to see how they measure up against these criteria.

Apache Druid

“Necessity is the mother of invention.” — Plato

Indeed! The need for real-time analysis at Metamarkets led to the birth of Apache Druid. It’s a mature project, having been in development since 2011 and in production use at large companies for many years. Druid was open-sourced in 2012 and later became an Apache top-level project, indicating a high level of maturity and community support. It boasts a large and active community, and is generally more recognizable than Pinot or Kylin.

Druid excels at real-time ingestion, though at the time of writing its main streaming sources are Kafka and Amazon Kinesis. Here it performs slightly better than Kylin but has no clear edge over Pinot.
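
Concretely, Kafka ingestion in Druid is started by POSTing a supervisor spec to the ingestion API, here via the Router on its quickstart port 8888. The spec below is abbreviated with illustrative names; a production spec would also define a tuningConfig and a fuller granularitySpec.

```python
import requests

spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "action"]},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "ioConfig": {
            "topic": "clickstream",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
    },
}

# The Router proxies the Overlord's ingestion API.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor", json=spec
)
resp.raise_for_status()
print(resp.json())  # {"id": "clickstream"} once the supervisor starts
```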

Regarding query performance, Druid promises sub-second query latencies on large datasets. It outperforms Kylin in this area, but its performance relative to Pinot depends on the specific use case — each has advantages in certain scenarios.

Druid can scale to handle petabytes of data and thousands of queries per second. Being older than Kylin and Pinot, it has a proven track record in more large-scale production environments.

Its flexible data model is more oriented towards time-series data, while Kylin, for example, is optimized for OLAP cube operations.

Druid’s SQL capabilities, while robust, are not as impressive as Kylin’s SQL support and are relatively on par with Pinot’s.
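
Druid exposes its SQL dialect over HTTP on the standard /druid/v2/sql route; the datasource and query below are illustrative. Note __time, Druid’s built-in time column.

```python
import requests

# Count events per action over the last hour.
query = """
    SELECT "action", COUNT(*) AS events
    FROM "clickstream"
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY "action"
    ORDER BY events DESC
"""

resp = requests.post(
    "http://localhost:8888/druid/v2/sql",  # Router proxies the Broker
    json={"query": query},
)
resp.raise_for_status()
for row in resp.json():
    print(row["action"], row["events"])
```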

Druid’s column-oriented storage supports various compression techniques and is as efficient as Pinot’s, but Kylin may have an edge for certain OLAP workloads.

Apache Druid can be deployed on Kubernetes using its native operator, and there’s also a Helm Chart available, offering even more control over the deployment process.

Success stories:

  • Read about the role Druid plays in Airbnb’s analytics system architecture
  • Read how Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience
  • Read how Confluent is Scaling Apache Druid for Real-Time Cloud Analytics

Apache Kylin

“The best way to predict the future is to create it.” — Peter Drucker

Apache Kylin was originally developed by eBay in 2014 to address a specific challenge: performing OLAP queries on massive datasets stored in Hadoop. At the time, businesses increasingly needed to analyze big data quickly, but existing solutions were falling short. Now, Kylin is an Apache top-level project. It’s considered mature for its primary use case of OLAP on big data, but less mature in real-time capabilities. Kylin’s community is particularly strong in the OLAP and business intelligence domains.

Kylin excels at multidimensional analysis using OLAP cubes. Compared to Druid and Pinot, Kylin’s OLAP cube approach is better suited for complex, predefined analytical queries, while Druid and Pinot are more flexible for ad-hoc queries.

While Kylin’s real-time capabilities are not as strong as its batch processing, it does support streaming tables fed from sources such as Kafka.

Kylin provides comprehensive SQL support, including complex joins and subqueries. It especially shines with complex OLAP queries.
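
Here is a sketch of issuing such a query through Kylin’s REST API. The /kylin/api/query route and the ADMIN/KYLIN default account follow the Kylin documentation, while the project and SQL reference Kylin’s bundled sample cube; adjust both for a real deployment.

```python
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),  # Kylin's documented default account
    json={
        "sql": (
            "SELECT part_dt, SUM(price) AS revenue "
            "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt"
        ),
        "project": "learn_kylin",  # Kylin's bundled sample project
    },
)
resp.raise_for_status()
result = resp.json()
print(result["columnMetas"])  # column descriptions
print(result["results"][:5])  # first rows, answered from the cube
```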

Deployment and maintenance are not as straightforward as for Druid or Pinot, but Kylin can be deployed on Kubernetes, and its documentation outlines all steps and provides examples. The complexity arises from Kylin’s dependence on the Hadoop ecosystem (which is probably its main drawback) and its batch-oriented nature, which can be challenging to manage in a dynamic Kubernetes environment.

Success stories:

  • Read how eBay builds an Apache Kylin OLAP Cube Efficiently and Intelligently
  • Read how Cisco’s Big Data Team Improved the High Concurrent Throughput of Apache Kylin by 5x

Apache Pinot

“The future belongs to those who believe in the beauty of their dreams.” — Eleanor Roosevelt

Pinot has been in development since 2013 and became an Apache top-level project in 2021. While younger than Kylin and Druid as an Apache project, it has been battle-tested at large companies like LinkedIn and Uber. Its community is actively growing, and it’s particularly strong in the real-time analytics domain.

Regarding real-time ingestion, Pinot potentially has an edge in some high-throughput scenarios compared to Druid and especially Kylin. It demonstrates high query performance on both real-time and historical data.

Pinot is designed for horizontal scalability, handling large amounts of data and high query loads. It offers advantages in certain high-scale scenarios compared to its competitors.

Pinot uses a flexible schema that supports both append-only and mutable data, which gives it an advantage over Druid.

Pinot provides querying through standard SQL as well as PQL (Pinot Query Language); its SQL support is on par with Druid’s but not as comprehensive as Kylin’s.
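
For instance, the pinotdb DB-API client queries a Pinot broker with plain SQL; the host, port, and table name below are illustrative assumptions.

```python
from pinotdb import connect  # pip install pinotdb

# Connect to the Pinot broker's SQL endpoint.
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()

# Pinot answers this over the combined view of the real-time and
# offline segments of the same table.
cursor.execute(
    """
    SELECT action, COUNT(*) AS events
    FROM clickstream
    GROUP BY action
    ORDER BY events DESC
    LIMIT 10
    """
)
for row in cursor:
    print(row)
```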

Like Druid, Pinot offers both a Kubernetes operator and Helm Chart. Consequently, deployment and maintenance should be relatively straightforward.

Success stories:

  • Read how LinkedIn is building Talent Insights to democratize data-driven decision making
  • Read how Uber does Real-Time Analytics for Mobile App Crashes using Apache Pinot
  • Read about Stripe’s Journey to $18.6B of Transactions During Black Friday-Cyber Monday with Apache Pinot

Final comparison

“Apples and oranges are both fruit, but they taste quite different.” — Kurt Cobain

Alright, let’s break it down and see how these three big data rockstars stack up against each other! All three — Druid, Kylin, and Pinot — are open-source powerhouses designed for analytics on large datasets. They all support SQL (to varying degrees) and can handle real-time data. But that’s where the family resemblance ends.

Druid is the time-series specialist, Kylin is the OLAP cube master, and Pinot is the new kid on the block with some serious real-time chops. Druid and Pinot are more flexible for ad-hoc queries, while Kylin shines with predefined, complex analytical queries.

When it comes to query performance, Druid and Pinot are neck and neck, both promising sub-second latencies. Kylin might lag a bit in real-time scenarios but can outperform the others in certain OLAP workloads.

For ingestion, Pinot might have a slight edge in high-throughput scenarios, with Druid close behind. Kylin is more of a batch processing champion, with real-time capabilities as a nice-to-have feature.

Consider Druid as your go-to for real-time analytics on time-series data. Think monitoring, clickstream analysis, or network telemetry.

Kylin is perfect for complex, predefined OLAP queries on massive datasets. Great for business intelligence and reporting in big data environments.

Choose Pinot for user-facing analytics requiring low latency on both real-time and historical data. Think LinkedIn’s “Who Viewed My Profile” or Uber’s surge pricing.

Conclusion and looking ahead

“Learn from yesterday, live for today, hope for tomorrow.” — Albert Einstein

We’ve met three powerful contenders — Druid, Kylin, and Pinot — each with its own superpowers and quirks. But let’s be real — this is just the tip of the iceberg. In the upcoming articles, we’ll dive deeper into each of these technologies. We’ll look at their architectures, explore some real-world use cases, and get our hands dirty with some code examples.

It’s important to remember that choosing the right tool for your real-time analytics needs isn’t just about picking the shiniest new tech. It’s about understanding your specific requirements, your data patterns, and your team’s expertise.

Speaking of expertise, if you’re curious about how real-time analytics could benefit your business, feel free to reach out.

Stay tuned for the next installment in our real-time analytics series. There’s still a lot to explore!
