Rise of the Streaming Databases — Episode 2 : Apache Pinot

How Pinot solves the toughest problems in the data analytics today with its low-latency, high throughput query capabilities

Published in

Tributary Data

8 min readSep 25, 2021

This post is the second installment of my streaming databases series, which talks about a special breed of databases that are meticulously designed to perform OLAP workloads on large data sets.

Today, we discuss Apache Pinot, an open-source real-time OLAP database built at LinkedIn and powers many production workloads at many companies today.

Note that this is a beginner-level article which goes well for readers who are new to Pinot. Please check the Pinot blog for more in-depth content.

What is Apache Pinot?

The documentation says:

Apache Pinot, a real-time distributed OLAP datastore, purpose-built for low-latency high throughput analytics, perfect for user-facing analytical workloads.

If that jargon-filled sentence doesn’t answer all of your questions, let me break it down into smaller pieces and have it another go.

For the sake of simplicity, think of Pinot as a black box where you can feed both historical(like HDFS, S3) as well as real-time data (like Kafka) and lets you ask questions about them using SQL.

Pinot answers your questions while having these in mind:

Answers contain fresh data — As soon as the data is ingested, Pinot makes them available for querying, typically within seconds. So, you won’t get any stale data in the answers.

Answers will be quick — Pinot makes sure that you will always get an answer within milliseconds of latency, even though it is super busy or having to scan billions of records to find the answer.

Can answer multiple questions concurrently — You may not be the only one querying Pinot. It could be hundreds or even millions of users querying Pinot concurrently. But Pinot makes sure that it scales and be available to accommodate all that questions.

I think now you have a better understanding of what Pinot is. In the coming sections, let’s explore how Pinot achieves the above qualities.

Where is Pinot coming from?

Pinot was built by the genius minds at LinkedIn and Uber. It was initially used for user-facing analytics features like ‘Who View My Profile?’ and ‘Talent Search’ at LinkedIn.

The word has quickly spread about Pinot’s ability to crunch petabytes of data within milliseconds to process a SQL query. Hence, companies like Uber started using Pinot to power their “Restaurant Manager” dashboard.

Uber Eat’s Restaurant Manager was where Pinot had flexed its muscles for the first time.

What problems does Pinot solve?

Pinot is a real-time OLAP database. It fits nicely to solve the most challenging problems in user-facing analytics, BI and data analytics, and operational intelligence domains.

User-facing analytics

Imagine taking a classic data warehouse and exposing it to live Internet traffic — to be queried by millions of users concurrently.

It will survive for few minutes before crashing. But Pinot will not.

Dashboards — Pinot has been purpose-built to power user-facing applications and dashboards that are supposed to be accessed by millions of users concurrently. While doing so, Pinot maintains stringent SLAs, which are typically in milliseconds range to ensure a pleasant user experience.

Personalization — Apart from that, Pinot is good at performing real-time content recommendations. For example, Pinot powers the news feed of a LinkedIn user, which is based on the impression discounting technique. You can feed clickstream, view stream, and user activity data to Pinot to generate content recommendations on the fly.

Ad-hoc querying and exploratory data analysis

Although not as challenging as the above, Pinot is also good at interactive analytics, where complex queries are issued against large data sets while demanding a minimal query latency.

Pinot helps data analysts and data scientists query large amounts of data, test hypotheses, run A/B testing, and build visualizations or dashboards.

Operational intelligence and time-series data processing

Pinot can be used as a great time-series database (TSDB), making it eligible to store a vast amount of telemetry data from servers, applications, sensors and query them to detect anomalies immediately.

Also, Pinot allows querying historical data to perform debugging and root cause analysis (RCA). For example, you can quickly check the last week’s server telemetry to investigate why there was a drop in the traffic.

Infrastructure monitoring, application performance monitoring (APM), and IoT are few practical domains where Pinot comes in handy.

Third Eye, which is an ad-hoc analysis solution is used at LinkedIn.

How does Pinot stores data?

Before we get down to the internals, let’s talk about Pinot’s data storage model, which significantly influences the whole architecture.

Storage model — tables, schemas, and segments

Pinot stores data in a columnar format and adds additional indices to perform fast filtering and aggregations, which leads to faster query performance.

Segments — Raw data ingested by Pinot is broken into small data shards, and each shard is converted into a unit known as a segment. A segment is the centerpiece in Pinot’s architecture which controls data storage, replication, and scaling.

Tables and schemas — One or more segments form a table, which is the logical container for querying Pinot using SQL/PQL. A table has rows, columns, and a schema that defines the columns and their data types.

Tenants — A table is associated with a tenant. All tables belonging to a particular logical namespace are grouped under a single tenant name and isolated from other tenants.

If you are familiar with log-structured storage like Kafka, a segment resembles a physical partition while a table represents a topic. Both topics and tables expect to grow infinitely over time. Therefore, they are partitioned into smaller units so that they can be distributed across multiple nodes.

Pinot architecture

Now that we learned what Pinot is, what it can do, and how it stores data. In the coming sections, let’s explore some inner workings of Pinot.

Pinot is a distributed system

Pinot solves petabyte-scale data problems. Thus, it is designed as a distributed system with many components playing a distinct role in its architecture.

As with any well-designed distributed system, a Pinot cluster embodies the following design principles.

Highly available — There’s no single point of failure. Pinot continues to work in the face of node failures.
Horizontally scalable — You can increase the overall query throughput by adding more nodes to the cluster.
Query execution is decoupled from cluster maintenance — Queries are executed independently to data ingestion, index configurations, and cluster rebalancing.
Centralized cluster coordination and state management — Cluster state and inter-component coordination is managed by Zookeeper and Apache Helix. That controls the resource scheduling, self-healing, and scaling aspects of the cluster.

Pinot Controller

You access a Pinot cluster through the Controller, which manages the cluster’s overall state and health. The Controller provides RESTful APIs to perform administrative tasks such as defining schemas and tables. Also, it comes with a UI to query data in Pinot.

Pinot Broker

Brokers are the components that handle Pinot queries. They accept queries from clients and forward them to the right servers. They collect results from the servers and consolidate them into a single response to send it back to the client.

We will discuss the query execution model in detail soon.

Pinot Server

Servers host the data segments and serve queries off the data they host. There are two types of servers — offline and real-time.

Offline servers typically host immutable segments. They ingest data from sources like HDFS and S3. Real-time servers ingest from streaming data sources like Kafka and Kinesis.

Pinot Minion

Minion is an optional component that can run background tasks such as “purge” for GDPR (General Data Protection Regulation).

Querying data in Pinot

Pinot executes queries in a scatter-gather manner instead of the databases that leverage the materialized views where query result has been precomputed.

Everything happens on-demand in real-time. I would say this is brave!

Pinot utilizes intelligent indexing and some pre-aggregation (like theta sketches for cardinality estimation) techniques to achieve sub-second query latency.

Scatter-gather execution model

Queries are received by brokers — which checks the request against the segment-to-server routing table — scattering the request between real-time and offline servers.

The two servers then process the request by filtering and aggregating the queried data, then returned to the broker. Finally, the broker consolidates each response into one and responds to the client.

The diagram below sums up the above model.

Query interfaces

You can use standard SQL for querying. Apart from that, Pinot provides a set of RESTful APIs for programmatic querying.

For interactive query execution, Pinot gives you a query console UI.

Query Console for interactive data analysis

Takeaways

Pinot is a database that ingests data in real-time and makes them available for immediate querying with millisecond latency.
That was made possible by its columnar storage model, intelligent indexing, and pre-aggregation techniques.
Pinot executes queries on-demand using a scatter-gather model, which is in contrast with materialized view-based databases.
Pinot supports querying with standard SQL. Also, it provides you with APIs to access data programmatically.

If you are a data engineer, analyst, or data scientist, consider Pinot a one-stop-shop for large-scale data analysis. You might say it comes with an operational overhead as a price. But companies like StarTree have already started adopting Pinot to provide managed services to ease the burden of operating Pinot at scale.

If you like to get hands-on with Pinot, try the below tutorial of mine.

Building a Low-Latency Fitness Leaderboard with Apache Pinot

Use Apache Pinot to ingest fitness band events from a Kafka topic and make them available for immediate querying from a…

medium.com

References

Pinot docs