Big Data File Formats Explained Using Spark Part 1

Understand How Avro, Parquet & ORC Work

Akeel Ahamed
Analytics Vidhya
7 min read · Dec 4, 2019


Image Source: https://www.ellicium.com/orc-parquet-avro/

When dealing with large datasets, storing data in traditional CSV or JSON formats is extremely inefficient in terms of both query speed and storage cost.

Figure 1: A simple SQL query performed on the same data stored in CSV, Parquet and ORC file formats. ORC was around 10X faster than Parquet and 20X faster than CSV!

Figure 1 demonstrates the power of using the right file format to query the data. We see that the ORC format was nearly 20X faster than using CSV! The dataset used here is the taxi trajectory dataset from Kaggle, which is just over 1 GB in size and contains roughly 1.7 million records (full details of the dataset can be obtained here).
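
If you want to try this kind of comparison yourself, here is a minimal PySpark sketch of the approach, assuming the dataset has already been saved in each format. The file paths and the column used for the aggregation (CALL_TYPE) are illustrative assumptions; swap in whatever matches your copy of the data.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

def time_query(reader, path):
    """Run the same group-by query and return the elapsed wall-clock time."""
    start = time.time()
    df = reader(path)
    df.groupBy("CALL_TYPE").count().collect()   # force full execution
    return time.time() - start

print("CSV:    ", time_query(lambda p: spark.read.option("header", True).csv(p), "taxi.csv"))
print("Parquet:", time_query(spark.read.parquet, "taxi.parquet"))
print("ORC:    ", time_query(spark.read.orc, "taxi.orc"))
```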

Hopefully that has piqued your interest in learning more about the right file formats for storing big data. This article will discuss the three main file formats optimized for storing big data sets and then demonstrate the query speeds achieved using Spark on the taxi trajectory dataset. This will be done on both Azure (using Databricks) and Google Cloud Platform (using Dataproc), so you can try it on the platform of your choice.

Big Data Formats

The big data world predominantly uses three main file formats optimized for storing big data: Avro, Parquet and Optimized Row-Columnar (ORC). The key similarities and differences between these formats are outlined below.

Similarities:

· All three formats store data in a binary, machine-readable format, unlike CSV and JSON, which are human-readable.

· Datasets can be split across multiple disks, enabling large-scale parallel data processing and considerably increasing processing speed.

· They are self-describing formats: a copy of a Parquet file can be transferred to another machine without any loss of interpretability.

· They are on-the-wire formats: data can be passed easily between nodes in a cluster.

Differences:

· Parquet and ORC store data in a columnar format, which means the data is optimized for fast retrieval. This is ideal for read-heavy analytical workloads, i.e. queries that use only a few columns for analysis or that involve complex aggregations.

· Avro is a row-based data store, which means the data is optimized for write-heavy workloads, i.e. queries that need to write or return most or all of the row data.

Although not a comprehensive list, there are four main properties to consider when deciding which format is right for your dataset:

  • Row or Column Store ( R )
  • Compression ( C )
  • Schema Evolution ( E )
  • Splitability ( S )

A mnemonic to help you remember them is RaCES.

Row Vs Column Store

Table 1: The top five all-time points leaders in the NBA as of 1st December 2019. Data extracted from nba.com.

Row stores store the data row by row. Using the dataset in table 1 as an example, a row-based storage looks like the following:

Kareem Abdul-Jabbar, 1560, 57446, 38387, Karl Malone, 1476, 54852, 36928, Kobe Bryant, 1346, 48643, 33643, LeBron James, 1217, 46895, 33031, Michael Jordan, 1072, 41010, 32292.

Data is stored row by row, from left to right. In contrast, a column-based data store looks like the following:

Kareem Abdul-Jabbar, Karl Malone, Kobe Bryant, LeBron James, Michael Jordan, 1560, 1476, 1346, 1217, 1072, 57446, 54852, 48643, 46895, 41010, 38387, 36928, 33643, 33031, 32292.

Data is stored column by column, in sequence from left to right. The sequence can be summarized as shown in figure 2.

Figure 2: Demonstrates how data is stored in row-based vs column-based storage formats. In row based formats, data is stored row by row, from left to right. Columnar formats store data column by column, in sequence from left to right.
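
To make the two layouts concrete, here is a toy, plain-Python sketch of the ordering using the Table 1 data. It is only an illustration of the sequencing, not how Parquet or Avro actually serialize bytes.

```python
# The Table 1 data: each record is a player's name followed by three career totals.
rows = [
    ("Kareem Abdul-Jabbar", 1560, 57446, 38387),
    ("Karl Malone",         1476, 54852, 36928),
    ("Kobe Bryant",         1346, 48643, 33643),
    ("LeBron James",        1217, 46895, 33031),
    ("Michael Jordan",      1072, 41010, 32292),
]

# Row-based layout: values are laid out record by record, left to right.
row_layout = [value for record in rows for value in record]

# Column-based layout: values are laid out column by column.
column_layout = [record[i] for i in range(len(rows[0])) for record in rows]

print(row_layout[:8])     # the first two records
print(column_layout[:5])  # the whole first column (player names)
```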

We can now begin to understand why column-based storage is ideal for analytical, read-heavy workloads, where queries tend to touch only specific columns rather than the entire dataset. Columnar storage enables formats like Parquet and ORC to store metadata at the end of each column (containing information such as the schema, column min and max values, null counts, etc.), which is not possible for row-based storage. Using this information, a query can skip straight to the relevant columns, because the metadata indicates whether the data it needs is present in a given column or not.

For example, suppose we have a table of products and want to find out which product has the highest revenue, grouped by Product ID. You only need to query (and perform aggregations on) a few specific columns in the product table to extract this information. With the data stored in Parquet or ORC, data skipping ensures that only the relevant columns are scanned (see Figure 3).

Figure 3: High level demonstration of how data-skipping works.
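
Here is a hedged PySpark sketch of that product query, assuming the data has been written to Parquet; the path and column names (product_id, revenue) are illustrative. Because the format is columnar, Spark scans only the two columns the query touches.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

products = spark.read.parquet("products.parquet")    # illustrative path

top_product = (
    products
    .groupBy("product_id")                            # group by Product ID
    .agg(F.sum("revenue").alias("total_revenue"))     # aggregate revenue per product
    .orderBy(F.desc("total_revenue"))
    .limit(1)
)

# The ReadSchema in the physical plan shows that only product_id and revenue
# are read from disk; every other column is skipped.
top_product.explain()
top_product.show()
```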

However, if queries need to return all or most rows from a dataset, then it is more efficient to use the row-based storage instead.

For example, suppose you want to book a flight for the next day. You will want to see most or all of the rows corresponding to the flights available that day, and most or all of the columns, since each column adds information that helps you decide exactly which flight to take. In this case, you may want to use a row-based format like Avro, which is optimized for this pattern.

Compression

Compressing data reduces the amount of information needed to store, query and transmit data, saving both time and money.

Data stored in columnar formats achieves better compression because data of similar types is stored next to each other. For example, taking the dataset in Table 1, all of the string-valued data (from column 1) is stored together and all of the integer-valued data (from columns 2–4) is stored together, enabling the data to be compressed more efficiently than in row-based storage. This reduces storage costs.
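
As a rough way to see this for yourself, the sketch below (paths are illustrative) writes the same DataFrame as plain CSV and as Parquet with two common codecs. Snappy is Spark's default codec for Parquet, and gzip trades extra CPU for a smaller footprint.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Any DataFrame will do; here we re-use the taxi CSV from earlier.
df = spark.read.option("header", True).csv("taxi.csv")

df.write.mode("overwrite").csv("out/csv_plain")
df.write.mode("overwrite").option("compression", "snappy").parquet("out/parquet_snappy")
df.write.mode("overwrite").option("compression", "gzip").parquet("out/parquet_gzip")

# Compare the sizes of the three output directories (e.g. with `du -sh out/*`):
# the compressed columnar copies are typically a fraction of the CSV size.
```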

Schema Evolution

“Schema” in the context of a dataset refers to the column headers and their types. As a project matures, there may be a need to add or alter columns in the dataset, thus changing its schema. All three formats discussed support some level of schema evolution, although Avro is far superior in this respect compared to the other two.
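
As an example of what this looks like in practice with Spark, the sketch below (paths are illustrative) writes two versions of a Parquet dataset, the second with an extra column, and reads them back with schema merging enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Version 1 of the dataset has two columns.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
     .write.mode("overwrite").parquet("out/evolving/v1")

# Later, the dataset gains a third column.
spark.createDataFrame([(2, "b", 9.99)], ["id", "name", "price"]) \
     .write.mode("overwrite").parquet("out/evolving/v2")

# mergeSchema reconciles the two versions into one schema on read;
# rows written before the change simply get price = null.
merged = spark.read.option("mergeSchema", "true").parquet("out/evolving/v*")
merged.printSchema()
merged.show()
```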

Splitability

“Splitability” refers to the ease with which big data sets can be broken down into smaller, independent chunks that can be processed by multiple nodes running in a cluster. This enables large-scale parallel processing, since a single machine will not have enough compute power to process the entire dataset. Avro, Parquet and ORC are all optimized to support splitability.
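
A quick way to see splitability at work in Spark is sketched below (paths are illustrative): the dataset is written as several independent part files, and each file (or slice of one) can then be handed to a different task on read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the data as 8 independent chunks; Spark produces one part file per chunk.
df = spark.range(0, 10_000_000).repartition(8)
df.write.mode("overwrite").parquet("out/split_demo")

# Each part file can be read by a separate task, spreading the work across the
# cluster; the exact number of read tasks depends on file sizes and configuration.
back = spark.read.parquet("out/split_demo")
print(back.rdd.getNumPartitions())
```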

Let us now explore the three file formats in more detail.

Avro

Avro was developed in 2009 within the Apache Hadoop project and is a row-based storage format that is highly splittable. The distinguishing feature of Avro (as mentioned in the Nexla whitepaper) is that its schema ‘travels with the data’: the data definition/schema is stored in human-readable JSON format whilst the data itself is stored in binary format, maximizing compression and storage efficiency. Furthermore, since the data definition is human-readable, schema evolution, such as modifying column types or adding new columns, is extremely easy.
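
To see what such a human-readable data definition looks like, here is a small sketch that writes a DataFrame to Avro from Spark with an explicit schema. The schema is a made-up example, and writing Avro requires the spark-avro package (bundled with Databricks runtimes; on other setups add org.apache.spark:spark-avro to the session).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An Avro schema is plain JSON that travels with the binary-encoded records.
avro_schema = """
{
  "type": "record",
  "name": "Player",
  "fields": [
    {"name": "name",   "type": "string"},
    {"name": "games",  "type": "long"},
    {"name": "points", "type": "long"}
  ]
}
"""

df = spark.createDataFrame(
    [("Kareem Abdul-Jabbar", 1560, 38387)], ["name", "games", "points"]
)

(df.write.format("avro")                    # requires the spark-avro package
   .option("avroSchema", avro_schema)
   .mode("overwrite")
   .save("out/players_avro"))
```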

Parquet

This format was developed by Twitter and Cloudera in 2013 and is a columnar format. Since the data is stored in columns (as explained above), it can be compressed very effectively. Parquet also stores metadata at the end of its files, containing useful statistics such as schema information, column min and max values, the compression/encoding scheme, etc. This enables data skipping and allows the data to be split across multiple machines.
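
You can inspect this footer metadata directly. The sketch below uses the PyArrow library (an assumption on my part; any Parquet-aware tool would do, and PyArrow ships with most Databricks and Dataproc images) to print the per-column statistics stored in a Parquet file; the path is illustrative.

```python
import pyarrow.parquet as pq

# Open one of the part files written by Spark (path is illustrative).
pf = pq.ParquetFile("out/parquet_snappy/part-00000.snappy.parquet")

print(pf.metadata)            # row count, row groups, schema, codec, ...

# Per-column statistics (min/max, null count) are what enable data skipping.
# Some writers omit them, in which case `statistics` is None.
stats = pf.metadata.row_group(0).column(0).statistics
if stats is not None:
    print(stats.min, stats.max, stats.null_count)
```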

Optimized Row-Columnar (ORC)

Figure 4: Shows how ‘Stripes’ are used to group together data and then store it in columnar format in ORC. The stripe footer contains metadata about the columns in each stripe which is used for data-skipping. Source: Nexla Whitepaper

ORC was developed by Hortonworks in 2013 and stores row data in a columnar format, which is extremely efficient for compression and storage. This compression is achieved by ORC’s “index” system, which stores data in ‘stripes’ of roughly 10,000 rows. Each stripe groups together an index (column headers), the row data and a stripe footer (containing column statistics, schema information, etc.), as shown in Figure 4. This makes data skipping possible, as a query can jump straight to the relevant stripes. Stripes are also independent of each other, which allows the data to be split, since a stripe can be transferred to another node without loss of interpretability.
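
The sketch below (paths and column name are illustrative) shows this stripe-level skipping from the Spark side: the filter is pushed down to the ORC reader, which uses the stripe statistics to avoid reading stripes whose min/max range cannot contain matching rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1_000_000).withColumnRenamed("id", "trip_id")
df.write.mode("overwrite").orc("out/trips_orc")

# The predicate appears under PushedFilters in the physical plan, meaning the
# ORC reader, not Spark, decides which stripes actually need to be read.
filtered = spark.read.orc("out/trips_orc").where("trip_id > 990000")
filtered.explain()
print(filtered.count())
```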

A summary of the four main properties to look for when choosing a data format is given in Figure 5, along with how well each is supported by Avro, Parquet and ORC.

Figure 5: Summary of the 4 core properties to look for in the big data formats discussed along with their compatibility for different platforms. Source: Nexla Whitepaper.

Click here for Part 2 of this article, which demonstrates the query speeds achieved using these efficient file formats on both Azure & GCP.

References

· Nexla Whitepaper: An Introduction to Big Data Formats.

· Datanami: Big Data File Formats Demystified.
