Benchmarking PARQUET vs ORC

Dhareshwarganesh
Jan 20, 2020


In this article, we conduct a few experiments on the Parquet and ORC file formats and draw conclusions about their advantages and disadvantages over each other.

BACKGROUND

We all know that Parquet and ORC are both columnar storage formats. They use compression algorithms to compress huge amounts of data and store it in very little space, which means they are storage optimised.

What is the use of compressing files before storing them when we have abundant storage? (After all, AWS S3 or Azure Blob offer ample storage space at low cost.) But then, what is the use of just storing huge files?

We need to work with these files and extract salient data from them. In order to do this, we need clusters to run Spark jobs and extract meaningful information. Clusters are charged on a “PAY AS YOU USE” principle. What I want to conclude is: when we use huge files, the cluster takes more resources to load the file and run queries on it, and consequently we pay more.

Parquet and ORC are well integrated with the Hadoop ecosystem and return results much faster than traditional file formats such as JSON, CSV and TXT.

APACHE PARQUET
In brief, it is a machine-readable binary format stored in a columnar fashion. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It also supports nested structures.

Apache ORC [Optimised Row Columnar]
Apache ORC (Optimized Row Columnar) is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is compatible with most of the data processing frameworks in the Hadoop environment. [copied from https://en.wikipedia.org/wiki/Apache_ORC]

Benchmarking

I would like to divide this experiment into 4 parts.

  1. Time Efficiency in Creating the Files
  2. Space Efficiency
  3. Time Efficiency in Finding the Count [Total Number of Records]
  4. Time Efficiency in Running a Complex Query

On a higher level,
We create a Table [Dataset] with 15 columns and make use of all the prominent Spark DataTypes in it [ https://spark.apache.org/docs/latest/sql-reference.html ].

We load ’N’ records into it [50,000, 100,000, 200,000, etc.]. We write the same Dataset to Parquet and ORC files and measure the Time and Space used.
Subsequently, we run a complex query and measure the Execution Time for both Parquet and ORC separately and summarise the results.
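To keep the timings comparable, each step below can be wrapped in a small stopwatch-style helper. The sketch below is an assumed utility (it is not part of Spark or the original code); it simply runs a block of work and reports the elapsed wall-clock time.

def time[T](label: String)(block: => T): T = {
  // Run the block and print how long it took in milliseconds.
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}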

Employee Schema :

id IntegerType
uniqueId LongType
firstName StringType
middleName StringType
lastName StringType
address StringType
pincode IntegerType
department StringType
team StringType
project StringType
sex BooleanType
ctc DecimalType
monthlySalary DoubleType
createdTime TimestampType
createdDate DateType
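
For reference, the schema above can be expressed as a Scala case class and used with Spark's Dataset API. The sketch below assumes the dataset is generated from such a case class; Spark's encoders map the Scala/Java types to the corresponding Spark DataTypes (Int to IntegerType, Long to LongType, BigDecimal to DecimalType, Timestamp to TimestampType, and so on).

import java.sql.{Date, Timestamp}

// Sketch of the Employee schema as a Scala case class (field names follow the table above).
case class Employee(
  id: Int,
  uniqueId: Long,
  firstName: String,
  middleName: String,
  lastName: String,
  address: String,
  pincode: Int,
  department: String,
  team: String,
  project: String,
  sex: Boolean,
  ctc: BigDecimal,
  monthlySalary: Double,
  createdTime: Timestamp,
  createdDate: Date
)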

To make it realistic, we follow a few conventions while loading data into particular columns.

Convention 1 :

Let’s load random values into a few columns.

employee.uniqueId = generateLongOfN(10)
employee.firstName = generateStringOfNChar(10)
employee.middleName = generateStringOfNChar(10)
employee.lastName = generateStringOfNChar(10)
employee.address = generateStringOfNChar(30)

These columns are populated with random characters/numbers, so there is a high probability that no two rows share similar values in these columns.
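
The helpers generateStringOfNChar and generateLongOfN are not spelled out in this article; a minimal sketch of what they might look like is below (the exact implementations are assumptions).

import scala.util.Random

// Assumed implementation: a random alphanumeric string of n characters.
def generateStringOfNChar(n: Int): String =
  Random.alphanumeric.take(n).mkString

// Assumed implementation: a random positive number with exactly n digits.
def generateLongOfN(n: Int): Long = {
  val min = math.pow(10, n - 1).toLong
  min + (Random.nextDouble() * 9 * min).toLong
}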

Convention 2:

Let’s load identical values into a few columns.

It is obvious that when data is stored in a columnar fashion, repetitive [identical] values in a column are stored adjacent to each other, so select queries become faster. Moreover, identical values can be dictionary-encoded: cells store an index rather than the actual data, which optimises space. [Compression and Encoding algorithms]

What I want to conclude is that Parquet and ORC each have their own compression and encoding algorithms to minimise space. I want to see how efficient and optimised these algorithms are when we load our data. Hence, we populate a few columns with the same value in every row.

employee.department = "tech"
employee.sex = false
employee.pincode = 1234XX

Convention 3:

val TEAM_LIST: List[String] = List("DATA", "REPORTING", "REVENUE", "FINANCE", "PORTAL", "BI")

val PROJECT_LIST: List[String] = List("DASHBOARD", "DATA-ANALYTICS", "PLATFORM")

Assumption 1 : We assign each employee a team and a project randomly from TEAM_LIST and PROJECT_LIST. Evidently, many employees end up with the same team and project values. But when we filter on both columns, the file format still has to go through all the records and retrieve the matching ones. We will measure the Time and Space used by both the Parquet and ORC file formats.
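
The assignment logic itself is straightforward; a minimal sketch of picking a random team and project from the lists above could look like this (the exact logic is an assumption):

import scala.util.Random

// Pick one element uniformly at random from a list, so that many employees
// end up sharing the same team/project combination.
def randomFrom(values: List[String]): String =
  values(Random.nextInt(values.length))

val team = randomFrom(TEAM_LIST)
val project = randomFrom(PROJECT_LIST)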

System Setup :

We need a cluster to run these Spark jobs. Even though this experiment does not depend on cluster size and resources,
I would like to mention its specification for the benchmarking.

Location : South India
Cluster type, HDI version : Spark 2.3 (HDI 3.6)

Node Type        Node Size               No. of Nodes
Head Node        4 Cores, 28 GB RAM      2
Worker Node      16 Cores, 128 GB RAM    8

I am making sure that no other jobs are running on this cluster and that it is dedicated to this experiment.

LET’S DO IT :

  1. Time Efficiency in Creating the Files on Azure Blob

  2. Space Efficiency

  3. Time Efficiency in Finding the Count

  4. Time Efficiency in Running the Complex Query

select count(*) from employeeOrc where (team='REPORTING' and project='DATA-ANALYTICS') or (team='FINANCE' and project='DASHBOARD')
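
Putting the pieces together, the four measurements can be taken roughly as follows. This is a hedged sketch: spark is an existing SparkSession, employees is the generated Seq[Employee], the output paths are placeholders, and time is the stopwatch helper sketched earlier.

import spark.implicits._

// Build the Dataset from the generated employees (employees: Seq[Employee] is assumed).
val employeeDS = spark.createDataset(employees)

// (1) Time Efficiency in Creating the Files: write the same Dataset in both formats.
time("Parquet write") { employeeDS.write.mode("overwrite").parquet("/data/employeeParquet") }
time("ORC write")     { employeeDS.write.mode("overwrite").orc("/data/employeeOrc") }

// (2) Space Efficiency: compare the on-disk size of the two output directories.

// (3) Time Efficiency in Finding the Count.
val parquetDF = spark.read.parquet("/data/employeeParquet")
val orcDF     = spark.read.orc("/data/employeeOrc")
time("Parquet count") { parquetDF.count() }
time("ORC count")     { orcDF.count() }

// (4) Time Efficiency in Running the Complex Query, once per format.
parquetDF.createOrReplaceTempView("employeeParquet")
orcDF.createOrReplaceTempView("employeeOrc")

def complexQuery(table: String): String =
  s"""select count(*) from $table
     |where (team = 'REPORTING' and project = 'DATA-ANALYTICS')
     |   or (team = 'FINANCE' and project = 'DASHBOARD')""".stripMargin

time("Parquet complex query") { spark.sql(complexQuery("employeeParquet")).show() }
time("ORC complex query")     { spark.sql(complexQuery("employeeOrc")).show() }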

Conclusion :

  1. From (1) & (2), it is clearly visible that Parquet takes significantly more Time and Space than ORC while creating the files, which indicates that Parquet's compression and encoding algorithms are more Time and Space consuming.
  2. In return, from (3) & (4), running queries on Parquet is very fast [takes less time] in comparison to ORC.

Limitation :

I was not able to conduct the above experiment for Records > 200,000: creating the Employee dataset throws a ‘java.lang.OutOfMemoryError: Java heap space’ exception.

Final Statement :

It is evident that if you allocate MORE SPACE, you get FASTER results. It is up to the user whether you need SPACE EFFICIENCY [select ORC] or TIME EFFICIENCY [select PARQUET].

GitHub Link :
