How to choose an appropriate file format for your data pipeline

Montadhar REKAYA
7 min read · Mar 24, 2022

Introduction

Have you ever wondered which file format to use for data storage when building your data pipeline?

When and why should you use Parquet rather than Avro, ORC, or another format?

How will the file format contribute to the overall performance of your data pipeline?

Your data sources can be in any format (CSV, JSON, XML, …) and need conversion to a format suited to your specific use case.

In this article, I’m going to share the advantages and disadvantages of each format so you can achieve optimal performance and fit your business requirements.

How does data flow?

Let’s start with the following schema illustrating how data flows inside your pipeline.

Basically, you will receive data from different data sources in multiple formats. This is the raw data preceding any kind of processing such as:

  • cleaning
  • deduplication
  • quality checks
  • transformation
  • aggregation

The processed data will then be used by various application layers, such as building reports, making further analysis, and feeding machine learning pipelines.

As a matter of fact, choosing the format depends strongly on your use case.

Key considerations

Before diving into file formats, let’s introduce the following considerations to evaluate:

1. Rows or columns

Column-based storage is most useful when performing analytics queries that require only a subset of columns examined over very large data sets. If your queries require access to all or most of the columns of each row of data, row-based storage will be better suited to your needs.

Let’s consider the following sample of our data:

event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session

2019-12-01 00:00:00 UTC,remove_from_cart,5712790,1487580005268456287,,f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc
2019-12-01 00:00:00 UTC,view,5764655,1487580005411062629,,cnd,29.05,412120092,8adff31e-2051-4894-9758-224bfa8aec18

Row Data

The data basically contains user transactions in an online e-commerce shop; every row in the file has the above columns. In a row-based format, the data is:

  • stored and retrieved one row at a time
  • best suited for an online transaction system

But applying operations to the entire dataset, such as aggregation and analysis, is a compute- and time-expensive job.

Column Data

The same data above in column format will be stored as follows (each column is put on a separate line for simplicity):

2019-12-01 00:00:00 UTC,2019-12-01 00:00:00 UTC;
remove_from_cart,view;
5712790,5764655;
1487580005268456287,1487580005411062629;
,,;
f.o.x,cnd;
6.27,29.05;
576802932,412120092;
51d85cb0-897f-48d2-918b-ad63965c12dc,8adff31e-2051-4894-9758-224bfa8aec18

Data is stored sequentially by column, from top to bottom, rather than by row, left to right.

It is best suited for:

  • performing operations on the entire dataset
  • computing over specific columns (e.g., it is much easier to evaluate the average cart for an entire month or the day with the highest sales revenue using column data than row data, since only the date and sales amount columns need to be fetched; see the sketch below)
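
A minimal sketch of that idea, assuming pandas with a Parquet engine installed and a hypothetical events.parquet file containing the columns above (Parquet itself is covered later in this article):

import pandas as pd

# Only the two columns involved in the aggregation are read from storage;
# a columnar layout makes this pruning cheap compared to scanning full rows.
df = pd.read_parquet("events.parquet", columns=["event_time", "price"])

df["event_date"] = pd.to_datetime(df["event_time"]).dt.date
daily_revenue = df.groupby("event_date")["price"].sum()

print(daily_revenue.idxmax(), daily_revenue.max())  # highest-revenue day and its total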

2. Split files

Datasets are generally composed of multiple files, each file may contain thousands to millions of records or more. These file-based chunks of data are often being generated continuously.

Processing such datasets efficiently usually requires breaking the job up into parts that can be farmed out to separate processors.

In fact, large-scale parallelization of processing is key to performance. If each file in your dataset contains one massive XML structure or JSON record, the files will not be “splittable”, i.e. decomposable into smaller records that can be handled independently.
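
A minimal sketch of that kind of parallelism, assuming pyarrow and a hypothetical events.parquet file (Parquet, covered below, is splittable by row group, so each worker can process its own chunk independently):

from concurrent.futures import ProcessPoolExecutor

import pyarrow.compute as pc
import pyarrow.parquet as pq

PATH = "events.parquet"  # hypothetical file

def sum_prices(row_group_index: int) -> float:
    # Each worker opens the file on its own and reads only its assigned row group.
    chunk = pq.ParquetFile(PATH).read_row_group(row_group_index, columns=["price"])
    return pc.sum(chunk.column("price")).as_py()

if __name__ == "__main__":
    num_row_groups = pq.ParquetFile(PATH).num_row_groups
    with ProcessPoolExecutor() as pool:
        partial_sums = list(pool.map(sum_prices, range(num_row_groups)))
    print(sum(partial_sums))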

3. Schema evolution

Typically when the schema changes, there are a few key questions to ask of any data format:

  • How easy is it to update a schema (such as adding a field, removing, or renaming a field)?
  • Is it backward compatible?

4. Same data, multiple formats

It’s possible that different workloads may require different file formats and storage structures for better performance, or to reduce storage costs with your cloud provider.

File Formats

1- Apache Parquet

Launched in 2013, Parquet was developed by Cloudera and Twitter (and inspired by Google’s Dremel query system) to serve as an optimized columnar data store on Hadoop. Because data is stored by columns, it can be highly compressed and splittable (for the reasons noted above).

Parquet is commonly used with Apache Impala, an analytics database for Hadoop. Impala is designed for low latency and high concurrency queries on Hadoop.

The column metadata for a Parquet file is stored at the end of the file, which allows for fast, one-pass writing. Metadata can include information such as data types, compression/encoding scheme used (if any), statistics, element names, and more.

Parquet is especially adept at analyzing wide datasets with many columns.

Each Parquet file contains:

  • binary data organized by “row group.”
  • For each row group, the data values are organized by column.

This enables the compression benefits that we described above.

Parquet is a good choice for read-heavy workloads.

Generally, schema evolution in the Parquet file type is not an issue and is supported. However, not all systems that prefer Parquet support schema evolution optimally. For example, consider a columnar store like Impala. It is hard for that data store to support schema evolution, as the database needs to have two versions of the schema (old and new) for a table. (3)
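
A minimal sketch of that layout, assuming pyarrow and a tiny in-memory table whose column names are borrowed from the sample above; the tiny row_group_size is only there to make the row groups visible:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_type": ["remove_from_cart", "view"],
    "brand": ["f.o.x", "cnd"],
    "price": [6.27, 29.05],
})

# Data is written by row group; within each row group, values are stored by column.
pq.write_table(table, "events.parquet", row_group_size=1, compression="snappy")

meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups)                     # 2 row groups in this toy example
print(meta.row_group(0).column(2).statistics)  # min/max stats for "price" kept in the footer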

2- Apache ORC

Optimized Row Columnar (ORC) format was first developed at Hortonworks to optimize storage and performance in Hive, a data warehouse for summarization, query, and analysis that lives on top of Hadoop.

Hive is designed for queries and analysis and uses the query language HiveQL (similar to SQL).

ORC files are designed for high performance when Hive is reading, writing, and processing data. ORC stores row data in columnar format.

This row-columnar format is highly efficient for compression and storage. It allows for parallel processing across a cluster, and the columnar format allows for skipping of unneeded columns for faster processing and decompression.

ORC files can store data more efficiently without compression than compressed text files. Like Parquet, ORC is a good option for read-heavy workloads.

This advanced level of compression is possible because of its index system. ORC files contain “stripes” of data; within each stripe, lightweight index entries are kept for every 10,000 rows. These stripes are the data building blocks and are independent of each other, which means queries can skip to the stripe that is needed for any given query. Within each stripe, the reader can focus only on the columns required.

The file footer includes descriptive statistics for each column within a stripe, such as count, sum, min, max, and whether null values are present. ORC is designed to maximize storage and query efficiency.

According to the Apache Foundation, “Facebook uses ORC to save tens of petabytes in their data warehouse and demonstrated that ORC is significantly faster than RC File or Parquet.” Similar to Parquet, schema evolution is supported by the ORC file format, but its efficacy is dependent on what the datastore supports. (5)
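
A minimal sketch of reading and writing ORC, assuming a recent pyarrow build with ORC support and the same hypothetical columns as above; as with Parquet, the reader can fetch only the columns it needs:

import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "event_time": ["2019-12-01 00:00:00 UTC", "2019-12-01 00:00:00 UTC"],
    "event_type": ["remove_from_cart", "view"],
    "price": [6.27, 29.05],
})

# Stripes and their indexes are created automatically by the writer.
orc.write_table(table, "events.orc")

reader = orc.ORCFile("events.orc")
subset = reader.read(columns=["event_type", "price"])  # skip the columns we don't need
print(subset.to_pydict())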

3- Apache AVRO

Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable.

The innovative, key feature of Avro is that the schema travels with data. The data definition is stored in JSON format while the data is stored in binary format, minimizing file size and maximizing efficiency.

Avro features robust support for schema evolution by managing added fields, missing fields, and fields that have changed. This allows old software to read the new data and new software to read the old data, a critical feature if your data has the potential to change.

We understand this intuitively: as soon as you’ve finished what you’re sure is the master schema to end all schemas, someone will come up with a new use case and request to add a field.

This is especially true for big, distributed systems in large corporations. With Avro’s capacity to manage schema evolution, it’s possible to update components independently, at different times, with a low risk of incompatibility.

This saves applications from having to write if-else statements to process different schema versions and saves the developer from having to look at old code to understand old schemas.

Because the schema is stored in a human-readable JSON header, it’s easy to understand all the fields that you have available. Avro supports many different programming languages. Because the schema is stored in JSON while the data is in binary, Avro is a relatively compact option for both persistent data storage and wire transfer. Avro is typically the format of choice for write-heavy workloads since it is easy to append new rows. (1)(4)
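
A minimal sketch of that schema-evolution behaviour, using the fastavro library (one of several Avro libraries for Python) and hypothetical field names; the writer schema travels inside the file, and a newer reader schema with a defaulted field can still read the old data:

from fastavro import parse_schema, reader, writer

schema_v1 = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "price", "type": "double"},
    ],
})

# Write old data: the JSON schema is embedded in the file header alongside the binary rows.
with open("events.avro", "wb") as out:
    writer(out, schema_v1, [{"event_type": "view", "price": 29.05}])

# The schema later evolves: a "brand" field is added with a default value.
schema_v2 = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "brand", "type": "string", "default": "unknown"},
    ],
})

# New software reads old data: the embedded writer schema is resolved against
# the reader schema, and the missing field is filled with its default.
with open("events.avro", "rb") as fo:
    for record in reader(fo, reader_schema=schema_v2):
        print(record)  # {'event_type': 'view', 'price': 29.05, 'brand': 'unknown'}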

Which format to choose?

Depending on the nature of your data set and analytic objectives, you will likely value some of those features more than others.

Because all of these formats are Apache open source projects, they are constantly being updated to support new features and functionality. It’s always worthwhile to consult the latest release for specifics.

The first determination will likely be if your data is better suited to be stored by row or by column.

Transactional data, event-level data, and use cases for which you will need to leverage many columns are best-suited to row-based data.

If that’s the case, Avro is likely the best choice. Avro is typically the choice for more write-heavy workloads since its row-based format is easier to append (similar to a traditional database structure).

However, if you know your data is best suited to a columnar format, the question will become Parquet or ORC.

In addition to the relative importance of splittability, compression, and schema evolution support, consideration must be given to your existing infrastructure. ORC maximizes performance on Hive.

Both formats offer benefits and it will likely come down to which system you have access to and are most familiar with.

Columnar formats are the choice for read-heavy workloads.

If you find the article helpful, give it a clap 👏 and follow me on LinkedIn for more upcoming content about Data Engineering and Machine Learning.
