Designing and Scaling a PetaByte Scale System — Part 4: Extended Predicate Pushdown in Spark with Apache ORC

Dilip Kasana · Airtel Digital · May 25, 2022

Why it is important, how we achieve it, and the resulting performance gains

The biggest recent trend in the big data architecture world is the separation of compute and storage clusters. This design has many benefits: both clusters can be scaled independently, giving development teams greater agility and lowering TCO for the organization.

But it also introduces new performance challenges: loss of data locality, serialization/deserialization overhead, and constrained bandwidth between the compute and storage clusters.

In one of our big data applications, we receive around 5PB of data every year, and our storage holds multiple years of data. A single query usually needs to scan a full year of data, and there are multiple such queries per day. Scanning one year's worth of data at 100GB/sec takes around 15 hours per query. So we needed to solve this problem: bandwidth between storage and compute was limited, yet the query application's SLAs effectively demanded unlimited bandwidth.
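As a back-of-the-envelope sanity check on that figure (using only the 5PB and 100GB/sec numbers from the text above):

```python
# Back-of-the-envelope: time to scan one year of data (~5 PB)
# at an aggregate storage-to-compute bandwidth of 100 GB/s.
PB_IN_GB = 10**6          # 1 PB expressed in GB (decimal units)

data_gb = 5 * PB_IN_GB    # ~5 PB of data per year
bandwidth_gbps = 100      # aggregate read bandwidth, GB/s

seconds = data_gb / bandwidth_gbps
hours = seconds / 3600
print(f"{hours:.1f} hours per full-year scan")  # ~13.9 hours, i.e. around 15
```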

Default Predicate Pushdown in ORC with Spark

Spark pushes part of the filter down to the ORC reader in the form of a SearchArgument, which the ORC reader executes. First, the file footer is read, along with file-level metadata. Then the stripe footer is read, where min/max/null/bloom filter information is loaded into memory for the selected columns; the filter is applied to this stripe footer metadata, and the row index groups satisfying the SearchArgument are selected for reading. A row group generally contains 10,000 rows, configured by orc.row.index.stride.
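This metadata-only pruning can be illustrated with a small simulation (plain Python, not the actual ORC API; the stats layout and column name are simplified assumptions):

```python
# Simplified model of ORC row-group pruning: each row group carries
# min/max stats per column; a SearchArgument such as `age > 40` is
# evaluated against those stats, and only row groups that *might*
# contain matching rows are selected for reading.
ROW_INDEX_STRIDE = 10_000  # default orc.row.index.stride

row_group_stats = [
    {"min": 1,  "max": 25},   # row group 0: can never satisfy age > 40
    {"min": 18, "max": 60},   # row group 1: may contain matches
    {"min": 45, "max": 90},   # row group 2: may contain matches
]

def select_row_groups(stats, lower_bound):
    """Keep row groups whose max exceeds the predicate's lower bound."""
    return [i for i, s in enumerate(stats) if s["max"] > lower_bound]

selected = select_row_groups(row_group_stats, lower_bound=40)
print(selected)  # [1, 2] -- row group 0 is skipped without reading its data
```

Note that the pruning is conservative: row group 1 is selected even though it may contain only a handful of matching rows, because metadata alone cannot prove otherwise.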

For each stripe, the list of disk ranges to read is derived from the selected columns and selected row index groups. This list is sorted, and then the actual disk reads are performed.
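The "sort then read" step can be sketched as follows (a hypothetical helper, not ORC's actual reader code); sorting and coalescing adjacent byte ranges turns many small random reads into fewer sequential ones:

```python
def coalesce_ranges(ranges, max_gap=0):
    """Sort byte ranges (start, end) and merge ones that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous range instead of issuing a separate read.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Ranges selected for two columns across two row groups, out of order:
reads = [(4096, 8192), (0, 4096), (16384, 20480), (8192, 12288)]
print(coalesce_ranges(reads))  # [(0, 12288), (16384, 20480)]
```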

The key insight here is that the default ORC filters are executed on metadata only; based on that, data is read from disk and transferred over the network.

Below is the ORC file layout diagram:

Extended Predicate Pushdown in ORC with Spark (New Approach)

Because the default predicate pushdown is evaluated only on metadata, not on data, it is limited by the capabilities of ORC stripe metadata, which consists of min/max statistics and bloom filters. Min/max statistics work well mostly when the file is sorted on that column, which in practice limits them to 1 or 2 columns, since sorting by more columns is ineffective in most use cases. Bloom filters are useful only for EQUALS conditions, and only when the values have high cardinality and do not collide across row index groups. For IN filters, a list of more than 20 values is generally enough to select more than 70% of the actual data when the bloom filter FPP is 5%.
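The IN-list effect is easy to quantify. If each probe false-positives independently with probability p, a row group containing none of the k values is still selected whenever at least one probe false-positives. A quick calculation under that independence assumption, roughly consistent with the ~70% figure above:

```python
# Effective row-group selection rate for an IN list probed against a
# bloom filter: each of the k values independently false-positives
# with probability p, so a row group that contains none of the values
# is still selected with probability 1 - (1 - p)**k.
def effective_fpp(p, k):
    return 1 - (1 - p) ** k

print(f"{effective_fpp(0.05, 1):.0%}")    # a single EQUALS probe: 5%
print(f"{effective_fpp(0.05, 20):.0%}")   # a 20-item IN list: ~64%
```

In other words, the bloom filter's usefulness degrades rapidly as the IN list grows, which is exactly the gap the data-based filtering below closes.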

To support generic SQL filter pushdown to ORC, where part of the filter is first evaluated using metadata and then the full SQL filter is evaluated using data, we added some properties to OrcConf (including orc.enable.datafilter, orc.predicate.colomnsto.remaining.ratio, and orc.sql.filter) and implemented data-based filtering support in the ORC reader. We further extended it to support Spark UDFs as well. The SearchArgument filter is first evaluated on metadata, as before. Then, after selecting the row index groups, instead of reading the data from disk for all projection columns, we split the projection columns into two sets: predicate columns and remaining projection columns. We apply the metadata filters first, then apply data-based filtering on the actual data of the predicate columns; the remaining projection columns are read only for the (reduced) set of selected row index groups.
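The two-phase read described above can be sketched in miniature (a plain-Python simulation, not the actual ORC reader; the column names and predicate are invented for illustration):

```python
# EPPD sketch: read only the narrow predicate column first, evaluate
# the full SQL filter on actual data, then read the remaining (wide)
# projection columns for the surviving rows only.
rows = [
    {"ip": "97.134.1.1",  "payload": "A" * 100},
    {"ip": "10.0.0.5",    "payload": "B" * 100},
    {"ip": "97.134.2.9",  "payload": "C" * 100},
    {"ip": "192.168.1.1", "payload": "D" * 100},
]

# Phase 1: scan only the predicate column and record matching row ids.
matching = [i for i, r in enumerate(rows) if r["ip"].startswith("97.134.")]

# Phase 2: fetch the heavy remaining columns only for matching rows.
result = [rows[i]["payload"] for i in matching]

print(len(matching), "of", len(rows), "rows survive the predicate")
```

In a real file the payoff comes from phase 2: the bulky remaining columns are never read for row groups (or rows) eliminated in phase 1.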

The table below summarizes the predicates supported by the default ORC filter pushdown as implemented by Spark, Presto (Trino), etc.

Limitations of Default ORC Filter Pushdown on Metadata

Default ORC predicate pushdown works great in most cases, but in larger (PB+) deployments, where disk reads and network bandwidth must be optimized, it is essential to skip irrelevant reads whenever possible. At Airtel, our endeavour is always to achieve the best performance and save resources beyond what the default framework supports.

When ORC tables have many columns and a query projects many of them (like a SELECT * query) but the predicate selects very few rows, it makes sense to skip the irrelevant reads for the remaining columns. If the predicate is applied to the actual data of the predicate columns, far less data needs to be read from the remaining columns.

The default ORC filter pushdown supports only the filters given in the table above; all other filters are skipped.

In cases where predicate selectivity is low (the query returns limited data) and the number of output columns in the projection is large, the new predicate pushdown approach described here performs best, skipping huge disk reads and reducing network I/O. It supports all SQL filters.

The below table summarizes the Predicate pushdown supported by EPPD.

Let’s go through the benefits using the example of an ORC file containing hundreds of columns, where each column accounts for roughly 0% to 2–3% of the data, and the overall data volume is high.

Scenario 1: Like Query

SELECT * FROM big_table WHERE ip LIKE '97.134.%';

Default ORC predicate pushdown will read the whole data from disk; the new EPPD approach reads only 0–3% of it. In large deployments, this saves petabytes of disk reads.

Scenario 2: IN Query even with Bloom Filters

SELECT * FROM big_table WHERE number IN(list of 20 numbers);

Even with a bloom filter whose FPP is around 5%, probing a 20-item IN list pushes the effective false-positive rate as high as 70%, so default ORC predicate pushdown will read around 70% of the data. The new EPPD reads around 2% of the data (70% of one column, plus the selected rows of the remaining columns).

Scenario 3: IN Query with Large IN Lists

SELECT * FROM big_table WHERE VALUE_EXIST_IN_FILE(value);

With default ORC predicate pushdown, 100% of the data will be read. The new EPPD reads only around 1% of the data.

Scenario 4: Specific Queries

SELECT * FROM big_table WHERE port=9668;

In the above query, a rare destination port is used and bloom filters were not feasible for such a column, so with default ORC predicate pushdown 99% of the data will be read. The new EPPD reads only 3–4% of the data.

Below are two screenshots from a Spark job run for the query given below, where the total file size is ~400MB and a bloom filter is created on the phone_number column. The output is only a single row, as 19 of the 20 phone_number entries do not exist in this file (they exist in some other file). When we run this on a whole year of data, the effect is multiplied by the number of files. The read size for default ORC PPD is 370 MB per file; the read size for EPPD is 26 MB.
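The per-file savings in that run reduce to a simple ratio (using only the 370 MB and 26 MB figures reported above):

```python
# Per-file read volume reported for the phone_number IN-list query.
default_read_mb = 370   # default ORC PPD read per file
eppd_read_mb = 26       # EPPD read per file

saving = 1 - eppd_read_mb / default_read_mb
print(f"EPPD reads {saving:.0%} less data per file")  # ~93% less
```

Across a year of files, the same ratio applies to every file that does not contain the queried numbers.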

SELECT * FROM big_table WHERE phone_number IN (list of 20 no's);

Default Predicate Push Down
Extended Predicate Push Down

This can save a lot of resources in large (PB+) ORC deployments holding telecom, transactional, historical, or archival data.

Previous links in the series:
