Sorting and Parquet

Pankaj Gupta
4 min read · Jul 20, 2018


Apache Parquet (https://parquet.apache.org/) is a columnar data format that has gained a lot of popularity, and for good reason. Its biggest advantage is projection pushdown: only the columns your query needs are read. Another advantage is better compression: all the data in a column is of the same type and hence compresses much better; Delta Encoding, for example, is very effective for integer columns. A third major advantage is predicate pushdown, which is slightly harder to take full advantage of and is the topic of this post.

Predicate Pushdown

A predicate is basically a condition in a query, e.g. the WHERE clause here:

SELECT * FROM Customers WHERE Country = 'Mexico';

The brute-force way of applying a predicate is to read all the rows of data and filter out the ones that don't match the condition. This can be grossly inefficient: if only 1% of customers in the above query were from Mexico, the query would still read 100% of the data, only to discard most of it. With some metadata we could do better, and that is exactly what Parquet provides, as we'll see shortly. Predicate pushdown means the query engine can push the condition right down into the storage layer and thus reduce the amount of data that is read.
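
Here is what the two paths look like in practice. This is a minimal sketch using PyArrow; the file name and schema are hypothetical. The first read is the brute-force path, the second hands the predicate to the reader so data can be skipped at the storage layer.

import pyarrow.compute as pc
import pyarrow.parquet as pq

# Brute force: read every row, then filter in memory. Every chunk is
# read, decompressed, and deserialized before any row is discarded.
full = pq.read_table("customers.parquet")
mexico = full.filter(pc.equal(full["Country"], "Mexico"))

# Pushdown: the same predicate given to the reader, which can use
# file metadata to skip non-matching data before it is ever read.
mexico = pq.read_table(
    "customers.parquet",
    filters=[("Country", "=", "Mexico")],
)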

Predicate pushdown is more efficient in a couple of ways. Reading data has many costs: disk reads, decompression, network I/O, and deserialization being the major ones. When entire chunks of data are skipped, all of those costs are avoided. Even when Parquet does have to read a chunk, filtering at the Parquet layer is often more efficient; in my experiments it was about 10% faster than doing the same filtering in Scalding, though your mileage may vary. But this second optimization is tiny compared to avoiding the read altogether.

Why Sort?

Parquet predicate pushdown works using metadata stored on blocks of data called RowGroups. For an integer column, for example, this metadata may be the minimum and maximum values of the column within that RowGroup. For a string column, it could be a dictionary listing all the distinct values. RowGroups are typically chosen to be fairly large to avoid the cost of random I/O, very commonly more than 64 MB. For most datasets that means hundreds of thousands of rows. With that many rows, unorganized data can span a huge range of values in every RowGroup, rendering pushdown ineffective.
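
You can inspect this metadata yourself. A minimal sketch with PyArrow, again assuming a hypothetical customers.parquet:

import pyarrow.parquet as pq

# Print the min/max statistics Parquet stores per RowGroup per column.
meta = pq.ParquetFile("customers.parquet").metadata
for rg in range(meta.num_row_groups):
    for col_idx in range(meta.num_columns):
        col = meta.row_group(rg).column(col_idx)
        stats = col.statistics
        if stats is not None and stats.has_min_max:
            print(rg, col.path_in_schema, stats.min, stats.max)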

Let’s take an example. Imagine a customer table where one of the columns is Country and we have the same query as before:

SELECT * FROM Customers WHERE Country = 'Mexico';

Let's say we have a dictionary at the RowGroup level that lets us check whether any row in the RowGroup has Country = 'Mexico'. If the answer is no, we can skip reading that RowGroup entirely. This can be a big win.

Let's say every RowGroup has 100k rows on average, and only 1% of customers are from Mexico. If the data is not organized, the chance of finding a RowGroup in which no row is from Mexico is vanishingly small. Let's take a shot at calculating it.

There is a 99 in 100 chance that any given row is not from Mexico. That seems promising, but we have to multiply that probability by itself 100k times to get the probability that none of the rows is from Mexico, i.e. (99/100)¹⁰⁰⁰⁰⁰. This comes out to roughly 10⁻⁴³⁷, an almost unimaginably small number.
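
A quick way to check this arithmetic (the value is easiest to read as a base-10 logarithm):

import math

# Probability that none of 100,000 independent rows is from Mexico,
# when each row has a 1% chance of being from Mexico.
log10_p = 100_000 * math.log10(0.99)
print(log10_p)  # ≈ -436.5, so p ≈ 10 ** -436.5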

So in practice, with unsorted data you will hardly ever get to skip a RowGroup. But if the data were sorted by Country, you could skip up to 99% of the RowGroups.
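
Getting that win is just a matter of sorting before you write. A sketch with PyArrow (Table.sort_by needs a reasonably recent version; the data here is made up):

import pyarrow as pa
import pyarrow.parquet as pq

# Sort on the filter column before writing, so each RowGroup covers a
# narrow range of countries and most can be skipped at query time.
table = pa.table({
    "Country": ["Mexico", "Canada", "Mexico", "Brazil"],
    "Name": ["Ana", "Bob", "Carlos", "Dina"],
})
sorted_table = table.sort_by([("Country", "ascending")])
pq.write_table(sorted_table, "customers_sorted.parquet")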

But which column should I sort on?

In short, it depends on your query pattern. You can only sort on one column, so it should be the column you filter on most often. If you already know which column that is, great. If you don't, one idea is to instrument your query system to record statistics. Another is to log every query that runs and then analyze the log to find which columns are filtered on most often.
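
As a toy illustration of the second idea, here is a sketch that tallies the first column named after each WHERE in a query log. The log format and the regex are assumptions; real SQL would need a proper parser.

import re
from collections import Counter

queries = [
    "SELECT * FROM Customers WHERE Country = 'Mexico'",
    "SELECT * FROM Customers WHERE Country = 'Canada'",
    "SELECT * FROM Customers WHERE Age > 30",
]

# Count which column each query filters on (first predicate only).
counts = Counter()
for q in queries:
    for col in re.findall(r"WHERE\s+(\w+)", q, re.IGNORECASE):
        counts[col] += 1
print(counts.most_common())  # Country is the most common filter here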

While being able to optimize filtering on just one column may seem limiting, it is often far better than not organizing the data at all. There is, of course, the overhead of sorting; it isn't free. But a few things work in sorting's favor. One is that data is often read many more times than it is written. Another is that sorted data usually compresses much better and thus takes less space. Sorted data can also enable Sort-Merge Joins, which deserve a whole other post.

Conclusion

Keeping data sorted has numerous advantages, and it is critical for getting the full benefit of Parquet predicate pushdown. Sorting should be an important consideration in your data architecture.
