Apache Pinot Upsert: trade off to make the easy things easy and the hard things possible

3 min readAug 21, 2022

Background

Apache Druid and Pinot are two popular real-time distributed OLAP datastore. They have similar architecture design of online (realtime) and offline (historical) nodes. Following are architecture diagrams from their websites.

Druid

Pinot

Apache Pinot have two interesting features over Druid: StarTree index and Streaming Upsert. In previous post we have talk about Star Tree index and this post will talk about Streaming Upsert.

Challenge of Upsert

Why upsert?

Upsert is normal in OLTP (e.g. MySQL, PostgreSQL) world, but not popular in OLAP word. This blog explains some upsert is need in OLAP and this slide 12 shows several upsert use cases in Uber.

Does Druid/Pinot support update?

Yes, both supports update in batch job, but it is not real time manner.

Does Druid/Pinot support upsert?

Pinot starts supporting streaming upsert since v0.6.0 (2020–11).

Druid does not support upsert as of 2022–08.

Why is it challenging?

Druid and Pinot empowers similar design and periodically seal segments into Deep Storage. The segment is immutable. For batch job, it can rebuild the whole segment and overwrite it. But in streaming world, rebuild sealed segment is not reasonable.

How — High Level

If the segment is immutable, how can we support upsert? If you are familiar with LSM tree, one way is append update record for same primary key in new segment. Then during query time, scan by primary key from multiple segments and merge the records before response back.

In Pinot, the segments are distributed across multiple servers. The next question is how does it know which server contains which primary key?

How — Global Coordinator (discarded)

The 1st approach is to introduce a global key coordinator (GKC). GKC receives metadata (server, segment, document, primary key) from a kafka topic and keeps a global mapping in a KV store.

After more than 1 year development, Pinot still facing a few problems around scalability, stability and query rewrite complexity. For full details, please refer to Challenges/Issues with the current GKC design.

Thus pinot team revisit the problem and propose 2nd approach.

How — Local Coordinator (accepted)

The 2nd approach adds a constraint “Partition the Kafka topic by the primary key” and simplify the architecture significantly.

Under the constraint, same primary key will be dispatched to same servers and each server just need to keep a local mapping. This trade off could increase complexity for rare use case, but in general, “it make the easy things easy, and the hard things possible”.

An example of upsert flow looks like following.

Limitation and workaround

Tech design list out some limitations about partition, repartition and rebalancing. There are some workaround and maybe more complex operations, but it is doable. For full detail, please refer to Limitation.

Another limitation to highlight is Upsert is not compatible with Star Tree index. It is kind of make senses, since Star Tree pre-aggregation during ingest time. (docs)

Reference

[2020–06] Upsert design revisit
[2021–02] Upsert in Apache Pinot (youtube)
[2021–04] https://medium.com/apache-pinot-developer-blog/introduction-to-upserts-in-apache-pinot-987c12149d93
[2021–07] Kafka Summit 2021: Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot (video and slides)
[Latest] https://docs.pinot.apache.org/basics/data-import/upsert