Big data often comes with big costs: storage, bandwidth, compute — none of them come cheap. As a big data system grows it’s often worth setting some time aside to explore new approaches to old problems to look for efficiencies you might have missed before. For us that recently came in the form of a dive into the
.parquet file format.
Apache Parquet is an open source column oriented data storage format. It aims to offer efficient data compression and access. Originally a joint project out of Twitter and Cloudera, Parquet has now become a top level Apache Software Foundation project with a vibrant community of users. Hearing positive reviews we decided to dig in and see what it had to offer for our use case.
So what were we hoping to get out of this? At the time our standard of storage was JSON record format. Row after row of serialized JSON objects in file after file, beautiful in its simplicity, the Nokia phone of data storage, but alas dreadfully unperformant. Unlike standard web application development, in data systems columns are generally far more interesting than rows. I rarely care about a specific row, instead I care about the values of a handful of columns taken in aggregate. Row based storage necessitates reading the entire file to find the fields I care about and separates similar values between rows reducing the ability to apply effective compression. Columnar storage allows us to quickly pull just the parts of the file we are interested in investigating and keeps like values close together where compression can be done effectively. It’s this quick access and effective compression that we’d hoped would win the day.
There are a few libraries available to start writing
.parquet files in Python,
fastparquet being two of them. Each library has its advantages and disadvantages.
fastparquet has a much smaller footprint, weighing in at a modest 180 kB compared to
pyarrow‘s hulking 48 MB. It also has a much simpler and
pandas centric read/write signature which can be nice for users that are more comfortable working with a DataFrame mindset. On the other hand,
pyarrow offers greater coverage of the Parquet file format, more voluminous documentation, and more explicit column typing which turned out to be important later on. After much wavering we decided to go forward using
We selected two of our largest and most extensively used data sets to convert to Parquet. As we began working with
pyarrow we found that getting from nothing to something was pretty easy, we were able to start writing Parquet files to our data lake within a few hours. However getting from something to the right thing for our system proved more difficult. We use Presto as our SQL layer to our data lake with AWS Glue serving as the hive metastore and this is where we started to run into problems.
Problem #1: Just because you name your columns doesn’t mean Presto will use them.
When creating a Parquet file with
pyarrow our first impulse was to name our columns.
When you read this file back in, the names provide a natural way to access specific columns. However this doesn’t come so naturally to Presto. By default when Presto is given a parquet file it assumes that the columns in the file are in the same order as they are specified in the metastore. I’m sure this provides some efficiency somewhere, but it confused the heck out of us when column A decided to get out a wig and make-up and start masquerading as column B. The solution to this is fairly simple, once you realize what the problem is. Within Presto’s configuration add:
hive.parquet.use-column-names = true
This got us over one hurdle and onto the next.
Problem #2: Parquet/Presto cares deeply about types
When using JSON record format Presto has no other option than to be somewhat forgiving with types. JSON just isn’t capable of conveying all that much type information. But Parquet on the other hand — that’s ripe with types and Presto cherishes every bit of it. If a column is saved as a
float32 then it had better be declared as a
FLOAT in Presto. If you try to sneak by with a
DOUBLE you can be sure that Presto will be sitting you down for a firm talking to.¹
The overt solution to this was pretty simple: get picky with types. But as much as a programmer might be loathe to admit it, coding remains a deeply human problem and demands solutions that tolerate the best efforts of this largely water-based part of a software system. To this we apply our standard approach, wrapping the wonderfully expressive in the dead simple.
pyarrow has a lot of features, most of which we don’t need and just provide opportunities for accidents, so we wrote a small library that wraps
pyarrow's Parquet functionality, trims down the available types to the ones we plan on using, and optimizes the call signatures for our specific use case. If we accidentally cut away too much we’re always free to just use
pyarrow, but in our experience that pattern has been generally helpful in speeding development and reducing errors.
Problem #3: When it comes to dates, older is better
We’ve got Presto using our column names, we’re minding our types, and it looks like we can read all our columns so we’re good to go! That’s at least what we thought, but alas, no such luck. We actually fully launched Parquet into production only to have it surprise us one last time. The morning after launch we awoke to alerts on some of our overnight ETLs. While we were able to read all of our date fields on a simple SELECT, if we tried to perform any operations on those dates or use them in the WHERE clause it would trigger an error.
After much digging, and a touch of wild speculation we tracked our problem down to a single flag.
# Read on to see a better way
use_deprecated_int96_timestamps, now that is a parameter that hurts to type. Leave it out and dates trigger failures, put it in and everything is peachy. Note however that we don’t include this parameter in the first snippet of this post.² The nice folks over at
pyarrow provide another option that is a bit less offensive to one’s sensibilities.
flavor='spark', # The important parameter
flavor='spark' it enabled the correct timestamp formatting to maintain compatibility with Spark which appears to be what’s needed to maintain compatibility with Presto.
Results: Reduced bandwidth, reduced processor time and reduced wall time
There was definitely some shin banging along the way, but once we got the details ironed out we were very pleased with the results. We saw our overall bandwidth drop by over 60%, that’s not bandwidth utilization on this particular data set, rather bandwidth across our entire system.
For queries to these data sets we saw wall time drop by about 20% and CPU time drop by over 80%. That’s a lot to be happy with.
Parquet has a lot of upsides, but are we going to migrate our entire system over to it? No, definitely not, but not because it’s bad. One of the general benefits that we get from using object storage with Presto as an SQL layer is that the storage is laid bare to engineers. If Presto just isn’t working out for some reason or another on a given task, I don’t need to use it because I understand the storage that backs it and can use that instead.
In the same fashion there is a beauty to using JSON record files. When everything is going south and I don’t understand what is going on, I know I can at least understand those files. As long as an engineer knows how to use
head they know how to start pulling at the threads that will lead to a solution. I can’t say the same for Parquet.
Instead our approach with Parquet is to use it as a sort of booster. It isn’t our primary copy of any data, but we might use it as a performant replica. If we have a large and wide data set that we’re expecting will get heavy use, we would definitely want to maintain a copy in Parquet. With Parquet, when you’re on the happy path you’re really cruising.
 It’s not angry, it’s just disappointed.
 I have a strong moral conviction that when snippets are provided in a blog post they should work. I’ve been burned far too many times by examples of what not to do for failing to read the lines above them.