What is Parquet?

Learn all about the Parquet file format.

Parquet is a compressed, columnar file format for storing tabular data that is used widely in data engineering.

Example of a Parquet data set

The first time many developers encounter Parquet is when they want to access analytical data sets. Many tutorials use publicly available Parquet files to get a data set into a database for querying. One of the more famous examples is the TLC Trip Record data set, which stores historical trip records of New York City taxis and other for-hire vehicles. It’s a great data set to play with, but be aware that some of the files can be quite large!
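As a minimal sketch, assuming you have downloaded one of the TLC Parquet files locally and have pandas installed with a Parquet engine such as pyarrow (the filename below is illustrative):

import pandas as pd

# Illustrative filename; download any month of the TLC trip record data first.
df = pd.read_parquet("yellow_tripdata_2023-01.parquet")
print(df.shape)   # (rows, columns); a single month can hold millions of rows
print(df.head())  # peek at the first few trips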

Where do you see Parquet used in modern data stacks?

Parquet is used as a storage and transport mechanism for moving large amounts of data between different systems. (You’ll often hear data engineers refer to this process as “ETL” — Extract, Transform, Load.) For example, it’s common to see Parquet files stored on cloud storage platforms like Amazon S3. Parquet has long been the typical format for Apache Hive users — an open-source data warehouse — but commercial offerings like Snowflake’s data warehouse also support loading and unloading Parquet files to and from cloud storage. Additionally, Apache Iceberg — an open-source table format we touched on previously — is typically built around Parquet files in cloud storage.
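As a hedged sketch of that pattern, here is how you might read a Parquet file directly out of S3 with pyarrow (the bucket, key, and region below are hypothetical):

import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical bucket and key; point these at your own S3 location.
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/warehouse/trips.parquet", filesystem=s3)
print(table.num_rows)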

What’s the difference between Parquet and a CSV file?

If you are familiar with CSV (Comma-Separated Values), a text file format for storing tabular data, you might see little reason for Parquet to exist. CSV is named precisely for how it works: values are separated by a comma (,), and each line of the file usually represents one row of tabular data. There are many variations on CSV files, such as different separators like the pipe (|) and conventions such as treating the first line of values as headers, but they all behave similarly. Parquet, by contrast, is a binary, column-oriented format: values are grouped by column rather than by row, which is what makes the compression and query performance described below possible.
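To make the contrast concrete, here is a small sketch (using pandas with a Parquet engine such as pyarrow installed) that writes the same tiny table to both formats:

import pandas as pd

df = pd.DataFrame({
    "city": ["Dublin", "London", "Paris"],
    "country": ["Ireland", "UK", "France"],
})
df.to_csv("cities.csv", index=False)  # plain text, one row per line
df.to_parquet("cities.parquet")       # binary, values grouped by column

Open cities.csv in any text editor and you can read it directly; cities.parquet is binary and needs a Parquet-aware reader.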

Why is Parquet so suitable for storing large amounts of data?

Parquet files are compressed as they are written, using a variety of encoding and compression algorithms. For example, suppose many rows in a column share the same value (like lots of orders in sales data having the same Country). In that case, Parquet will store each unique Country value only once, along with a count of its occurrences and references to its locations, a technique known as dictionary encoding. Instead of storing the same values over and over as a CSV would, Parquet can achieve much smaller file sizes. This also means that writing Parquet is more expensive than writing simpler file formats. Parquet uses different algorithms for different data types, such as strings vs. integers. It also means that as new, better algorithms are developed, they can be added to Parquet to make it even better.
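You can see dictionary encoding at work with pyarrow, which exposes it directly on arrays; a minimal sketch:

import pyarrow as pa

# A column with many repeated values, like the Country example above.
countries = pa.array(["US", "US", "DE", "US", "DE", "US"])
encoded = countries.dictionary_encode()
print(encoded.dictionary)  # each unique value stored once: ["US", "DE"]
print(encoded.indices)     # compact integer references: [0, 0, 1, 0, 1, 0]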

How can I read a Parquet file?

Parquet has broad support across programming languages and tools, with implementations available for all major languages, and most big data tools support importing and exporting it. If you want a quick way to work with Parquet, a good example is pqrs on GitHub by Manoj Karthick. It’s a command-line tool for reading Parquet and outputting the data as CSV or JSON. For example, using Karthick’s tooling, it’s easy to view the schema of a Parquet file:

❯ pqrs schema data/cities.parquet
Metadata for file: data/cities.parquet
version: 1
num of rows: 3
created by: parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})
message hive_schema {
  OPTIONAL BYTE_ARRAY continent (UTF8);
  OPTIONAL group country {
    OPTIONAL BYTE_ARRAY name (UTF8);
    OPTIONAL group city (LIST) {
      REPEATED group bag {
        OPTIONAL BYTE_ARRAY array_element (UTF8);
      }
    }
  }
}
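If you would rather inspect the file from code than from the command line, pyarrow can read the schema and contents directly; a short sketch, reusing the path from the example above:

import pyarrow.parquet as pq

# Inspect the schema without reading the data itself.
print(pq.read_schema("data/cities.parquet"))

# Read the full file into an in-memory table (or a pandas DataFrame).
table = pq.read_table("data/cities.parquet")
print(table.to_pandas())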

How does Propel use Parquet?

At Propel, we equip our customers to build customer-facing analytics use cases without having to operate any complex infrastructure. You may already have a bunch of analytical data sitting as Parquet files in AWS S3, and Propel can treat a set of Parquet files in AWS S3 as a Data Source. You authorize access to an S3 bucket containing Parquet files, and Propel will ingest them, making them ready to serve blazingly fast analytics experiences. You can learn more about using AWS S3 with Propel by visiting the documentation.

Conclusion

Parquet is a columnar, binary file format that offers many advantages over row-oriented, textual file formats like CSV. It is designed for big data applications, where storage and bandwidth are at a premium. Parquet can be read and written in many different programming languages, which makes it very flexible, and its read performance on analytical queries is superior to many row-oriented formats. If you’re looking for a better way to store your data, Parquet is worth considering.
