Introducing Apache Pinot 0.4.0

Haibo Wang
Apache Pinot Developer Blog
Jun 17, 2020


Apache Pinot, a modern OLAP platform for event-driven data warehousing

We are excited to announce that Apache Pinot 0.4.0 was released in June 2020. Apache Pinot is a real-time distributed datastore designed to answer OLAP queries with low latency. This release brings many new features, refactorings, and bug fixes that make Pinot more performant, extensible, and reliable. We also have a brand-new website and documentation for our users and developers. In this post, we highlight a few of the features in this release.

Column transformation during ingestion

Previously, when data was ingested into Pinot, all columns in the source were ingested as is. When a column needed to be transformed, the only option was to do the transformation in the source before ingestion. In 0.4.0, we added support for transformation functions using Apache Groovy, which allows users to define derived fields directly in the schema. For example, if the source data has the columns firstName and lastName, we can define a new field fullName in the schema with the definition "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)", and the derived field fullName will be available in storage. Without this feature, we would have had to add the concatenation step to the source data before Pinot could ingest it.
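To make the fullName example concrete, here is a minimal sketch in Python. Only the transformFunction string comes from the example above. The schema name, the dimensionFieldSpecs layout, and the dataType values are assumptions for illustration, and derive_full_name is a hypothetical Python stand-in for what the Groovy expression computes, not part of Pinot.

```python
import json

# Hypothetical schema fragment. The "transformFunction" value is the one
# quoted in this post; the surrounding layout is an assumption.
schema = {
    "schemaName": "people",  # hypothetical name
    "dimensionFieldSpecs": [
        {"name": "firstName", "dataType": "STRING"},
        {"name": "lastName", "dataType": "STRING"},
        {
            "name": "fullName",
            "dataType": "STRING",
            "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)",
        },
    ],
}


def derive_full_name(row):
    """Python stand-in for the Groovy expression: firstName + ' ' + lastName."""
    derived = dict(row)  # copy so the source row is left untouched
    derived["fullName"] = row["firstName"] + " " + row["lastName"]
    return derived


source_row = {"firstName": "Ada", "lastName": "Lovelace"}
ingested_row = derive_full_name(source_row)

print(json.dumps(schema, indent=2))
print(ingested_row)
```

The source row keeps only firstName and lastName, while the row that reaches storage also carries the derived fullName.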

Efficient handling of range queries using indexes

Before this release, Pinot had three types of indexes: inverted index, sorted index, and star-tree index. All of them work well for exact match filters, but not necessarily for range filters. For range queries, we can use binary search when the column is sorted, but usually not all columns are sorted. An inverted index might help reduce the search space by filtering on the index, but when the column cardinality is high, performance can be as poor as a full scan, especially when there are billions of records. In 0.4.0, we introduced the range index, which gives us the best of both worlds: the limited index overhead of scanning and the fast filtering of an inverted index. It also lets us find a sweet spot between one extreme (scan) and the other (inverted index) by choosing an appropriate value for the number of ranges or the number of documents per range. This can especially improve performance for range queries on time and metric columns.
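The trade-off between scanning and an inverted index can be illustrated with a small conceptual sketch. This is not Pinot's actual range index implementation. The function names, the bucket layout, and the docs_per_range parameter are all hypothetical, chosen only to show how bucketing values into ranges lets a query accept whole buckets and scan just the boundary ones.

```python
def build_range_index(values, docs_per_range):
    """Group doc IDs into value-ordered buckets of about docs_per_range docs each."""
    order = sorted(range(len(values)), key=lambda doc_id: values[doc_id])
    buckets = []
    for start in range(0, len(order), docs_per_range):
        doc_ids = order[start:start + docs_per_range]
        lo = values[doc_ids[0]]   # smallest value in this bucket
        hi = values[doc_ids[-1]]  # largest value in this bucket
        buckets.append((lo, hi, doc_ids))
    return buckets


def range_query(values, buckets, lower, upper):
    """Return sorted doc IDs whose value satisfies lower <= value <= upper."""
    matches = []
    for lo, hi, doc_ids in buckets:
        if hi < lower or lo > upper:
            continue  # bucket lies entirely outside the query range
        if lower <= lo and hi <= upper:
            matches.extend(doc_ids)  # fully covered: accept without checking values
        else:
            # Boundary bucket: scan only the docs in this bucket.
            matches.extend(d for d in doc_ids if lower <= values[d] <= upper)
    return sorted(matches)


values = [5, 42, 17, 8, 99, 23, 61, 3]  # one value per doc ID
buckets = build_range_index(values, docs_per_range=2)
result = range_query(values, buckets, lower=10, upper=50)
print(result)
```

With one document per bucket, this sketch degenerates into something close to an inverted index. With a single bucket holding every document, it becomes a full scan. Values of docs_per_range in between trade index size against the amount of scanning per query.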

New S3 Filesystem Plugin

In release 0.3.0, we introduced the notion of plugins, allowing new extensions to be added in a plug-and-play fashion. We soon benefited from this re-architecture. In 0.4.0, an Amazon S3 filesystem integration was contributed, adding one more filesystem option alongside HDFS (Hadoop Distributed File System), ADLS (Azure Data Lake Storage) and GCS (Google Cloud Storage).

Theta-sketch Based Distinct Count Aggregation Function

In 0.4.0, we implemented the initial version of a theta-sketch based distinct count aggregation function, using the Apache DataSketches library, which provides fast approximate algorithms for big data analytics. As part of this work, we also added support for multiple arguments in aggregation functions. Stay tuned for more optimizations in the next release.
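For readers new to sketch-based counting, the idea behind approximate distinct counts can be shown with a simplified "k minimum values" estimator, which belongs to the same family of techniques as theta sketches. This is an illustrative sketch only. It is neither the Apache DataSketches implementation nor the code Pinot runs, and the function names and the default k are assumptions.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct_count(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    heap = []     # max-heap (negated hashes) of the k smallest distinct hashes
    kept = set()  # the same hashes, for fast duplicate detection
    for value in values:
        h = _hash01(value)
        if h in kept:
            continue  # duplicate of a retained value
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    theta = -heap[0]  # k-th smallest hash seen so far
    return (k - 1) / theta


small = approx_distinct_count(["a", "b", "a", "c"])
large = approx_distinct_count(i % 100_000 for i in range(300_000))
print(small, round(large))
```

Memory stays bounded by k no matter how many rows are processed, and the estimate gets tighter as k grows. That bounded-memory property is what makes sketches attractive for distinct counts over very large datasets.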

Besides the new features highlighted above, we also added many optimizations to existing features, including text search, star-tree index, and cloud integration. Feel free to check out the release notes for more details.

Download page: https://pinot.apache.org/download
Getting started: https://docs.pinot.apache.org/getting-started

Special thanks

We would like to take a moment to thank our mentors Felix Cheung, Jim Jagielski, Kishore Gopalakrishna and Olivier Lamy for their mentorship and support during the incubation of Apache Pinot, and to give a huge shout-out to our committers and contributors for their contributions to this release: Akshay Rai, Alexander Pucher, Bo Zhang, Charlie Summers, Chethan UK, Dan Hill, Daniel Lavoie, Elon Azoulay, Haibo Wang, Harley Jackson, James Shao, Jialiang Li, Jihao Zhang, Kartik Khare, Kenny Bastani, Konrad Malik, Mayank Shrivastava, Neha Pawar, Ravi Singal, Seunghyun Lee, Siddharth Teotia, Subbu Subramaniam, Tamas Nemeth, Ting Chen, Vincent Chen, Xiang Fu, Xiaohui Sun, Xiaotian (Jackie) Jiang.
