Published in Analytics Vidhya

Big Data File Formats Explained Using Spark Part 1

Understand How Avro, Parquet & ORC Work

Image Source: https://www.ellicium.com/orc-parquet-avro/
Figure 1: A simple SQL query run against the CSV, Parquet and ORC file formats. ORC was around 10X faster than Parquet and 20X faster than CSV!

Big Data Formats

  • Row or Column Store ( R )
  • Compression ( C )
  • Schema Evolution ( E )
  • Splitability ( S )

Row Vs Column Store

Table 1: The top five all-time points leaders in the NBA as of 1st December 2019. Data extracted from nba.com.
Figure 2: How data is stored in row-based vs column-based storage formats. Row-based formats store data row by row, from left to right. Columnar formats store data column by column, in sequence from left to right.
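
The two layouts described in Figure 2 can be sketched in plain Python. The records are hypothetical placeholders (they are not the values from Table 1), and the helper names `to_row_layout` and `to_column_layout` are made up for illustration. Real formats add encoding, compression and metadata on top of this ordering, so treat it as a rough model only.

```python
# Hypothetical records: (player, team, points). Placeholder values only.
records = [
    ("Player A", "Team X", 300),
    ("Player B", "Team Y", 200),
    ("Player C", "Team X", 100),
]


def to_row_layout(rows):
    """Row store: all fields of row 1, then all fields of row 2, and so on."""
    return [value for row in rows for value in row]


def to_column_layout(rows):
    """Column store: all values of column 1, then all values of column 2, and so on."""
    return [value for column in zip(*rows) for value in column]


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

print(row_layout)
print(column_layout)

# In the column layout, the "points" column (index 2) is one contiguous slice,
# so an aggregate over it never has to touch the player or team values.
n = len(records)
points_slice = column_layout[2 * n : 3 * n]
total_points = sum(points_slice)
print(total_points)
```

In the row layout the same sum would have to step over every player and team value, which is why analytical queries that read a few columns tend to favour columnar formats.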
Figure 3: High-level demonstration of how data-skipping works.
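
A minimal sketch of data-skipping, assuming the metadata kept for each chunk of a column is its minimum and maximum value (a common choice, but an assumption here). The `Chunk` class, `build_chunks` and `scan_greater_than` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    values: list
    min_value: int
    max_value: int


def build_chunks(values, chunk_size):
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values > threshold, reading only chunks that could contain a match."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # the statistics prove no value qualifies, so skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


points = [5, 8, 12, 15, 21, 30, 42, 57, 63]  # hypothetical column values
chunks = build_chunks(points, chunk_size=3)
matches, chunks_read = scan_greater_than(chunks, threshold=40)
print(matches, chunks_read)
```

With these values only the last of the three chunks is read. Skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of different chunks then overlap less.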

Compression

Schema Evolution

Splitability

Avro

Parquet

Optimized Row-Columnar (ORC)

Figure 4: Shows how ‘Stripes’ are used to group together data and then store it in columnar format in ORC. The stripe footer contains metadata about the columns in each stripe which is used for data-skipping. Source: Nexla Whitepaper
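
The stripe idea in Figure 4 can be sketched as follows. This is a simplification under stated assumptions: the function `write_stripes`, the dictionary keys and the min/max footer contents are illustrative only and do not reflect ORC's actual on-disk structures.

```python
def write_stripes(rows, column_names, rows_per_stripe):
    """Group rows into stripes, then store each stripe column by column."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        group = rows[start:start + rows_per_stripe]
        # Within a stripe, values are laid out per column rather than per row.
        columns = {
            name: [row[i] for row in group]
            for i, name in enumerate(column_names)
        }
        # Assumed footer contents: per-column min/max for this stripe.
        footer = {
            name: {"min": min(vals), "max": max(vals)}
            for name, vals in columns.items()
        }
        stripes.append({"columns": columns, "footer": footer})
    return stripes


# Hypothetical rows: (player, points).
rows = [("A", 10), ("B", 20), ("C", 30), ("D", 40), ("E", 50)]
stripes = write_stripes(rows, ["player", "points"], rows_per_stripe=2)

for stripe in stripes:
    print(stripe["columns"]["points"], stripe["footer"]["points"])
```

A reader can inspect each footer first and decide whether the stripe's column data needs to be read at all.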
Figure 5: Summary of the 4 core properties to look for in the big data formats discussed along with their compatibility for different platforms. Source: Nexla Whitepaper.
