Designing and Scaling a PetaByte Scale System — Part 3

Dilip Kasana
Published in Airtel Digital
Jul 17, 2020 · 4 min read

At Airtel, data storage brings unique scalability challenges: we receive roughly 40 PB of CDR data per year (about 5 PB compressed), arriving at a daily rate of 200 TB (25 TB compressed). Over the last few years the incoming CDR volume has grown significantly and is expected to keep growing at more than 25% per year. Given this rate of growth, storage efficiency is a constant focus for our engineering team.

Because the data we store is queried many times, we store it in ORC, the columnar format best suited to our query patterns, with the highest compression we can achieve. Using ORC from Spark with the best available compression algorithm, however, ran into the challenges below.

  1. ZSTD support with Spark : Zstandard (zstd) currently offers the best real-time compression available, with high compression ratios, so we wanted to use it when writing data to the warehouse. However, the Cloudera distribution bundled an outdated ORC version, so we decided to integrate upstream Spark's ORC support with Cloudera Spark, only to find that Spark itself also shipped an outdated ORC. We therefore added a different reader, modifying Spark's default ORC reader to use the latest ORC 1.6.3 (see the write sketch after this list).
  2. Intelligently Ordering the Data : Beyond compression, we sort our data by columns ordered so that the maximum number of other columns transitively depend on the leading sort columns, much like putting partition-style columns first. This yields the best possible compression over the data. It not only reduces data size but also pays off strongly during filter push-down, since whole row indexes and stripes can be skipped. Because sorting within a file only helps with row-index and stripe skipping, we also range-partition the data on the most frequently filtered predicate column, which is usually the join key, before sorting. That gives us file-level skipping when querying, while keeping the benefit of large files (see the layout sketch after this list).
  3. Optimizing the ORC File Metadata : In addition to compressing the data, we store extra metadata in the form of Bloom filters for fast querying. Mobility records have well-defined data types and a very high number of rows per unit of data, so applying Bloom filters increases file size significantly, and the larger metadata also hurts query performance, which makes optimizing the Bloom filters important. We ran many experiments to identify heuristics that maximize compression of this metadata without hurting ORC read/write performance. The key change was to measure the actual cardinality of each column per ORC row index and use it as the number of bits for the ORC Bloom filter, which required a small change to the ORC writer's Bloom filter code. This reduced the metadata size drastically (see the Bloom filter sketch after this list).
  4. Caching the ORC File Stripe Metadata : A single data source holds a huge amount of data, and the number of files and stripes grows proportionally, so looking up a selective predicate across all of it takes a long time. Since per-row-index metadata is a significant fraction of the data, we extract the stripe-level metadata of each ORC file (min, max and Bloom filter) and keep it in a hot cache keyed by file name. This boosted read performance several times over by skipping the random reads that filter push-down would otherwise trigger: listing all files, opening them, reading the stripe metadata and decoding it. It also removes the file-listing cost from every query. Because we store the data in an object store, it drives the S3 list-file requests, file-footer requests and stripe-footer requests down to almost zero (see the metadata-extraction sketch after this list).
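To make item 1 concrete, here is a minimal sketch of writing ZSTD-compressed ORC from Spark. It assumes a Spark build that bundles ORC 1.6+ (for example Spark 3.2+); on distributions with older ORC jars, the ORC reader/writer has to be upgraded as described above. The output path is illustrative.

    // Minimal sketch: writing ZSTD-compressed ORC from Spark.
    // Assumes a Spark build bundling ORC 1.6+ (zstd is not available with older ORC jars).
    spark.conf.set("spark.sql.orc.impl", "native")             // use the native, vectorized ORC path
    spark.conf.set("spark.sql.orc.compression.codec", "zstd")  // default codec for ORC writes

    df.write
      .option("compression", "zstd")                           // per-write override, same effect
      .orc("s3a://warehouse/cdr/")                             // illustrative output path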
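For item 2, the sketch below shows one way to produce that layout with Spark's DataFrame API: range-partition on the frequently filtered join key so a predicate touches only a few large files, then sort within each partition so the low-cardinality, partition-style columns come first. The column names (msisdn, circle, event_date) and the partition count are illustrative assumptions, not taken from our schema.

    import org.apache.spark.sql.functions.col

    df.repartitionByRange(200, col("msisdn"))                    // file-level skipping via key ranges
      .sortWithinPartitions("circle", "event_date", "msisdn")    // stripe/row-index skipping + better compression
      .write
      .option("compression", "zstd")
      .orc("s3a://warehouse/cdr_sorted/")                        // illustrative output path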
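For item 3, the standard ORC Bloom filter knobs can be passed through the Spark writer as in the sketch below; the per-row-index cardinality sizing described above required a small patch to the ORC writer and is not available through these options. The column name and fpp value are illustrative.

    df.write
      .option("compression", "zstd")
      .option("orc.bloom.filter.columns", "msisdn")  // columns to build Bloom filters for (illustrative)
      .option("orc.bloom.filter.fpp", "0.01")        // target false-positive probability
      .orc("s3a://warehouse/cdr_bloom/")             // illustrative output path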
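For item 4, here is a minimal sketch of the extraction side of the stripe-metadata cache using the ORC core Java API: it pulls per-stripe, per-column min/max statistics so they can be stored in an external hot cache keyed by file name. Extracting the stripe-level Bloom filters and the cache store itself are omitted, and the file path is illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.OrcFile

    val reader = OrcFile.createReader(
      new Path("s3a://warehouse/cdr_sorted/part-00000.orc"),   // illustrative file
      OrcFile.readerOptions(new Configuration()))

    // One StripeStatistics entry per stripe; each holds per-column min/max/count.
    val stripeStats = reader.getStripeStatistics
    for (i <- 0 until stripeStats.size()) {
      val colStats = stripeStats.get(i).getColumnStatistics
      // persist (fileName, stripeIndex, colStats) into the hot metadata cache here
      println(s"stripe $i -> ${colStats.mkString("; ")}")
    }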

Below are the size reductions from sorting, compression and the optimized Bloom filter. The baseline "Normal ORC" is already compressed with Snappy.

+----------------------------+-----------------+
| Description | Total Data Size |
+----------------------------+-----------------+
| Uncompressed Input Data | 500 GB |
| GZIP Compressed Input Data | 71 GB |
| Snappy Normal ORC | 74 GB |
| Snappy Sorted ORC | 32 GB |
| ZSTD Sorted ORC | 24 GB |
+----------------------------+-----------------+

Intelligent sorting reduced the data size from 74 GB to 32 GB, roughly 57% smaller than the normal ORC file. Switching to ZSTD compression cut the sorted ORC by a further 25%, down to 24 GB. Overall this is about a two-thirds reduction compared to the normal ORC.

+--------------------------------------+-----------------+
| Description | Total Data Size |
+--------------------------------------+-----------------+
| ZSTD ORC with Bloom Filter | 38 GB |
| ZSTD ORC with Optimized Bloom Filter | 24.5 GB |
+--------------------------------------+-----------------+

The optimized Bloom filter reduced the total file size by roughly 35% compared to the default ORC Bloom filter (38 GB down to 24.5 GB).

Read performance

Turning to read performance, one need we noticed quickly was ORC file metadata optimization. Consider a query that selects many columns while applying a very selective filter on one column. Without metadata optimization, all files are listed, all files are opened, every ORC file footer is read, and every stripe's metadata is read. Since we have large files and large ORC stripes, ideally only the few files and stripes that pass the filter would be considered and decoded, so that the query does not spend the majority of its time evaluating metadata for files that are never used.

To support this, we extract the ORC file metadata in a much more compact form and, instead of making custom changes to the ORC file reader, we only configure it to read selected stripes. For each query we evaluate the min, max and Bloom filter from the hot-cached metadata against the query's predicate. In the case described above, only the files whose metadata satisfies the filter are read further by the ORC reader.
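As a sketch of that read path, once the cached min/max/Bloom metadata identifies the one stripe in a file that can satisfy the predicate, the ORC reader can be restricted to that stripe's byte range instead of scanning the whole file. This reuses the reader from the earlier metadata-extraction sketch, and the stripe index k is assumed to come from the cache lookup.

    val stripe = reader.getStripes.get(k)                  // stripe selected via the cached metadata
    val rows = reader.rows(reader.options()
      .range(stripe.getOffset, stripe.getLength))          // read only this stripe's byte range
    val batch = reader.getSchema.createRowBatch()
    while (rows.nextBatch(batch)) {
      // evaluate the remaining predicate / project columns from batch.cols here
    }
    rows.close()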

Summary

By applying all of these improvements, we evolved ORCFile to deliver a significant boost in compression ratios on our warehouse data, down to roughly one third of the size of the original ORC data. On a large, representative set of queries and data from our warehouse, this also improved query performance by about 3x in the general case and up to 20x for highly selective filters.
