Apache Pinot Star-Tree Index: Space-Time Trade-Off in OLAP

Rich
4 min read · Oct 3, 2021


Background

Modern data lakes separate compute and storage for scalability and availability, e.g. AWS Athena + S3 and Google BigQuery. They provide reasonably fast responses (seconds) when analyzing TB- to PB-scale data, which is suitable for batch ETL and ad-hoc exploratory analytics.

However, user-facing analytics is expected to deliver millisecond response times at high throughput. A common speed-up technique is to generate a data cube over the dimensions, spending more space to buy less time.

Apache Druid and Pinot are two popular real-time distributed OLAP datastores built for this. However, Apache Pinot has two interesting features over Druid: the Star-Tree index and streaming upsert. This post covers the Star-Tree index; the next post will cover streaming upsert.

Problem

Assume an e-commerce business has a dimension table of users (100k rows, schema <customer_id, gender, country, state, city>) and a fact table of orders (1 billion rows, schema <order_id, date, customer_id, amount, num_of_items>). The goal is to find the monthly sales sum(amount) from all female customers in California over the past year.

Pre-Aggregation

Since we know the metric sum(amount) we want to query, we can pre-aggregate into a new table user_orders_aggregated (schema: <date, gender, state, city, sum_amount>).

Assuming there are 1 million unique combinations of date/gender/location per month, pre-aggregation could reduce the total number of records from 1 billion to 12 million rows for one year's worth of data.

Pre-aggregation reduces the data scanned, but some runtime aggregation is still needed, e.g. rolling daily rows up to months and summing across cities.
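The two steps can be sketched in Python. This is a toy illustration of the idea, not Pinot code; the tiny users/orders tables are hypothetical stand-ins for the 100k-row and 1-billion-row tables above.

```python
from collections import defaultdict

# Hypothetical tiny versions of the users dimension table and orders fact table.
users = {
    "u1": {"gender": "F", "state": "CA", "city": "SF"},
    "u2": {"gender": "M", "state": "CA", "city": "LA"},
    "u3": {"gender": "F", "state": "CA", "city": "LA"},
}
orders = [
    {"date": "2021-01-05", "customer_id": "u1", "amount": 10.0},
    {"date": "2021-01-05", "customer_id": "u2", "amount": 20.0},
    {"date": "2021-01-06", "customer_id": "u3", "amount": 5.0},
    {"date": "2021-02-01", "customer_id": "u1", "amount": 7.0},
]

# Join and pre-aggregate into user_orders_aggregated:
# <date, gender, state, city, sum_amount>
agg = defaultdict(float)
for o in orders:
    u = users[o["customer_id"]]
    agg[(o["date"], u["gender"], u["state"], u["city"])] += o["amount"]

# Runtime aggregation is still needed: roll the pre-aggregated rows
# up to months, for female customers in California.
monthly = defaultdict(float)
for (date, gender, state, city), sum_amount in agg.items():
    if gender == "F" and state == "CA":
        monthly[date[:7]] += sum_amount  # "YYYY-MM" month key
```

The query now scans the (much smaller) pre-aggregated table, but still has to group and sum at query time.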

Pre-Cube

Pre-cube is a special form of pre-aggregation that exactly matches the query we want. We create a new table user_orders_cubed (schema: <month, gender, state, sum_amount>).

Assuming there are hundreds of unique combinations of gender/state per month, the total number of records could be reduced to thousands of rows for a year's worth of data.
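A quick sketch of the pre-cube, again with hypothetical toy data: because the cube is built at exactly the query's grain, the target query becomes a direct lookup with no runtime aggregation.

```python
from collections import defaultdict

# Joined (date, gender, state, city, amount) rows; a toy stand-in
# for the real fact table.
rows = [
    ("2021-01-05", "F", "CA", "SF", 10.0),
    ("2021-01-05", "M", "CA", "LA", 20.0),
    ("2021-01-06", "F", "CA", "LA", 5.0),
    ("2021-02-01", "F", "CA", "SF", 7.0),
]

# Build user_orders_cubed: <month, gender, state, sum_amount>
cube = defaultdict(float)
for date, gender, state, city, amount in rows:
    cube[(date[:7], gender, state)] += amount

# The target query is now a single lookup.
jan_female_ca = cube[("2021-01", "F", "CA")]
```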

As the example shows, the record count decreases from raw to pre-aggregation to pre-cube, which gives us lower latency and higher throughput. The downside is less flexibility: if we have different queries over different dimensions, we need to build a separate cube for each query to keep throughput high, and thus use more space.

Question

Is there a system that allows developers to flexibly configure the trade-off between latency (time) and storage (space), based on different use cases and requirements?

Idea

The main factor in latency is the number of records to scan: the fewer records scanned, the faster the query. If the number of scanned records is configurable, then the latency of a query becomes predictable. That is the basic idea behind Pinot's Star-Tree index.

Solution

Given a SplitThreshold (T), the Star-Tree index splits records into child nodes on the next dimension until each node contains no more than T records. For any query, it guarantees scanning at most T records.

(Of course, the guarantee holds only when there are enough dimension key-value pairs left to keep splitting on.)
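In Pinot itself, the split threshold corresponds to maxLeafRecords in the table's star-tree config. A hedged sketch for our orders table might look like the following (field names are based on my reading of the Pinot docs, and the dimension/metric names come from the example above; check the official documentation for the exact schema):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["gender", "state", "city"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__amount"],
        "maxLeafRecords": 1000
      }
    ]
  }
}
```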

For example:

  • query select xxx where D1=V1 will scan node D1-V1 -> D2-Star
  • query select xxx where D2=V2 will scan node D1-Star -> D2-V2
  • query select xxx where D1=V1 and D2=V2 will scan node D1-V1 -> D2-V2
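The splitting-plus-star-node idea behind the three example paths above can be sketched in Python. This is a toy model, not Pinot's actual implementation: dimensions D1/D2, the metric name amount, and T = 2 are all illustrative.

```python
from collections import defaultdict

T = 2  # SplitThreshold: a leaf holds at most T records

def rollup(records, dims):
    """Aggregate the metric over records grouped by the given dims."""
    groups = defaultdict(float)
    for r in records:
        groups[tuple(r[d] for d in dims)] += r["amount"]
    return [dict(zip(dims, key), amount=total) for key, total in groups.items()]

def build(records, dims):
    """Split records on the next dimension until a node holds <= T records."""
    if len(records) <= T or not dims:
        return {"records": records}  # leaf node
    d, rest = dims[0], dims[1:]
    children = {}
    for v in set(r[d] for r in records):
        children[v] = build([r for r in records if r[d] == v], rest)
    # Star child: aggregate across d, serving queries with no filter on d.
    children["*"] = build(rollup(records, rest), rest)
    return {"dim": d, "children": children}

def query(tree, filters):
    """Follow the filtered value at each level, or the star child if the
    dimension is not filtered; sum the metric at the leaf."""
    node = tree
    while node is not None and "children" in node:
        node = node["children"].get(filters.get(node["dim"], "*"))
    if node is None:
        return 0.0  # filter value not present in the data
    return sum(r["amount"] for r in node["records"]
               if all(r[d] == v for d, v in filters.items() if d in r))

records = [
    {"D1": "a", "D2": "x", "amount": 1.0},
    {"D1": "a", "D2": "y", "amount": 2.0},
    {"D1": "b", "D2": "x", "amount": 4.0},
    {"D1": "b", "D2": "y", "amount": 8.0},
]
tree = build(records, ["D1", "D2"])
```

For example, query(tree, {"D2": "x"}) descends D1-Star -> D2-x and sums only the pre-aggregated star-node rows, mirroring the second bullet above.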

As shown in experiments on an airline trips dataset, setting SplitThreshold T = 1000 achieves throughput similar to a pre-cube, with only 1/3 of the extra space.

It is also worth pointing out that T = 1k gives 2x the throughput of T = 10k, at only 25% extra space.

Summary

The Star-Tree index provides an adjustable way to configure the space-time trade-off. In my opinion, it is most beneficial when events have a high number of dimensions with low cardinality (by default, Pinot only indexes a dimension if its cardinality is ≤ 10k).
