HDFS Storage Formats — What & When to use?

Siddharth Ghosh
7 min read · May 25, 2022


With the growing popularity of Hadoop and the ever-evolving Big Data domain, understanding the different file storage formats is important. In this article, I have tried to summarize the information I could gather from different sources on the storage formats available in Hadoop.

What is the Need?

File formats define how data is stored in the HDFS file system, and the right format differs by use case, so choosing an appropriate file format pays off in both storage and performance. Because data may be structured, semi-structured, or unstructured, the way it is moved and stored varies, with direct implications for performance. The factors below can help when choosing a file format:

  1. Read/Write Performance — Different file formats have different read/write characteristics, and applications benefit when data can be read and written faster.
  2. Schema Evolution Support — A change in the schema, such as adding or modifying fields, can affect how we want the data to be stored.
  3. Compression Support — Compression saves storage space, but picking a compression format that still allows quick data retrieval is just as important.

File storage formats can be broadly classified into two categories —

  1. Traditional or Basic File Formats — Text (CSV/JSON), and Key-Value or Sequence File formats.
  2. Hadoop-Specific File Formats — Avro, Parquet, RC (Record Columnar), or ORC (Optimized Row Columnar) formats.

Let us discuss each of the file formats in detail in the below sections.

Text Input Format

Text Input Format is a text-based format that stores one record per line, with lines separated by a newline character and fields within a line separated by a specified delimiter (for example, a comma in CSV).

Pros

  1. Simple and easy to interpret and read.
  2. Compression is supported, but only at the file level (e.g., Gzip or BZip2).
  3. Commonly used formats are CSV, plain text, and JSON.
  4. Files can easily be split on the newline delimiter.

Cons

  1. A file compressed with most codecs (e.g., Gzip) cannot be split; it must first be decompressed in full before it can be read, which carries a significant read-performance cost. BZip2 is a notable exception that produces splittable output.
  2. Metadata has to be stored separately and kept track of.
  3. Limited support for schema evolution: changing the order of fields is not possible, though appending new fields at the end of a line is.
  4. Does not support record-level or block-level compression.

In the case of the JSON file format, each record is a JSON document that carries its metadata (field names) along with the data. JSON files can still be split on the newline delimiter.
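
To make this concrete, here is a minimal PySpark sketch (paths and column names are hypothetical) that reads a delimited text file and writes it back out as newline-delimited JSON:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-formats").getOrCreate()

# Each line of the CSV is one record; fields are comma-delimited.
df = (spark.read
      .option("header", True)
      .option("delimiter", ",")
      .csv("hdfs:///data/users.csv"))

# One JSON document per output line, so the result is still
# splittable on the newline delimiter.
df.write.mode("overwrite").json("hdfs:///data/users_json")
```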

Sequence File Format

A sequence file is a binary key-value file format. Its structure resembles a CSV file in that each row is delimited and the file is splittable, but its binary encoding makes it more compact than a text file, and it is typically used as intermediate storage (for example, between MapReduce jobs).
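
As a minimal sketch, PySpark's pair-RDD API can write and read a sequence file directly (the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-demo").getOrCreate()
sc = spark.sparkContext

# Write (key, value) pairs; PySpark converts them to Hadoop Writables.
pairs = sc.parallelize([(1, "alpha"), (2, "beta"), (3, "gamma")])
pairs.saveAsSequenceFile("hdfs:///tmp/pairs_seq")

# Read the binary file back into Python objects.
print(sc.sequenceFile("hdfs:///tmp/pairs_seq").collect())
```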

Pros

  1. Files can be split easily, and many small files can be merged into larger sequence files.
  2. Supports record-level and block-level compression.

Cons

  1. Data is not in a human-readable format.
  2. Limited support for schema evolution: changing the order of fields is not possible, though appending new fields at the end is.

Let us look at how a sequence file is laid out under each compression mode, starting with a brief look at the header.

Header

The header stores the version number, the key and value class names, flags indicating whether compression is on or off and whether block compression is on or off, the compression codec in use, user metadata, and a sync marker that denotes the end of the header.

No Compression

With compression disabled, the file stores the header, then each record as (record length, key length, key, value), with a sync marker after every few records.

Record Compression

With record compression, the file stores the header, then each record as (record length, key length, key, compressed value), again with a sync marker after every few records. Record compression compresses only the value of each record.

Internals of a Header & Record in a Sequence File

Block Compression

With block compression, the file stores the header followed by a series of blocks, each terminated by a sync marker. A block holds the uncompressed count of records it contains, followed by four compressed sections, each preceded by its compressed size: the key lengths, the keys, the value lengths, and the values.

Internals of a Block in a Sequence File
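
As a sketch of how block compression can be requested in practice, the snippet below saves a pair RDD through the Hadoop OutputFormat API using standard Hadoop configuration keys (the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-block").getOrCreate()
sc = spark.sparkContext

# Ask the SequenceFile output format for BLOCK compression.
conf = {
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
    "mapreduce.output.fileoutputformat.compress.codec":
        "org.apache.hadoop.io.compress.DefaultCodec",
}

pairs = sc.parallelize([(1, "alpha"), (2, "beta")])
pairs.saveAsNewAPIHadoopFile(
    "hdfs:///tmp/pairs_seq_block",
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=conf,
)
```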

RC (Record Columnar) File

The RC File was first introduced in Hive 0.6.0. Table data is stored in the file as binary key-value pairs in a row-and-column combination: the data in the table is first divided horizontally into row groups, and within each row group it is stored column by column. The RC File stores the metadata of a row group as the key and all of the row group's data as the value, so it can be viewed as similar to the sequence file format; combining the advantages of row-store and column-store, however, gives it an edge over sequence files.
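
Since RC Files are most commonly produced through Hive, here is a minimal sketch using Spark SQL with Hive support (the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rcfile-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hive lays this table's data out in the RC File format on HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS page_views_rc (
        user_id INT,
        url     STRING
    )
    STORED AS RCFILE
""")
spark.sql("INSERT INTO page_views_rc VALUES (1, 'example.com/home')")
```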

Pros

  1. The RC File combines the advantages of row-store and column-store to deliver fast data loading and query processing.
  2. As a row-store, it guarantees that data in the same row is located on the same node.
  3. As a column-store, it speeds up query processing and data loading by reading only the columns needed.
  4. Supports significant compression due to its binary format.

Cons

  1. Does not support schema evolution; the file needs to be rewritten, so this format is not recommended for tables with frequent schema updates.
  2. For a wide-column table, if the queried columns end up in different locations on disk, or perhaps on different nodes, query performance can be heavily impacted.

Internal storage of RC File

ORC (Optimized Row Columnar) File

The ORC File is an enhancement of the RC File, providing better compression and faster processing. Like the RC File, it divides rows into groups, called stripes, and stores index data and footer data alongside the rows. The index data holds column statistics such as min, max, count, and sum. ORC can reduce data size by up to roughly 75%.
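
As an illustration, here is a minimal PySpark sketch writing and reading ORC (the path is hypothetical; codecs other than zlib, such as snappy, are also accepted):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

# Write ORC with ZLIB compression.
df.write.mode("overwrite").option("compression", "zlib") \
    .orc("hdfs:///warehouse/demo_orc")

# The per-stripe column statistics (min/max, etc.) let readers
# skip stripes that cannot satisfy a filter like this one.
spark.read.orc("hdfs:///warehouse/demo_orc").where("id > 1").show()
```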

Pros

  1. Faster processing and better compression lead to less storage space.

Cons

  1. Does not support schema evolution; the file needs to be rewritten, so this format is not recommended for tables with frequent schema updates.

Internal storage of ORC File

Avro

Avro is a row-based storage format that serializes Hadoop data and stores it in binary form. Avro defines the schema in JSON format for interoperability. Because the schema is specified independently of the data-handling code, Avro supports schema evolution: fields can be added, deleted, or updated simply by supplying a new schema. This makes it a good fit for tables whose schemas evolve, and it also supports nested and complex data structures.
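
As a minimal sketch, Spark can write and read Avro through the external spark-avro module (the package version and paths are illustrative assumptions):

```python
from pyspark.sql import SparkSession

# Assumes spark-avro is on the classpath, e.g. started with
#   --packages org.apache.spark:spark-avro_2.12:3.5.0
spark = SparkSession.builder.appName("avro-demo").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

# The JSON schema is embedded in the file at write time.
df.write.mode("overwrite").format("avro").save("hdfs:///data/events_avro")

# Reading resolves the writer's schema stored in the file.
spark.read.format("avro").load("hdfs:///data/events_avro").show()
```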

Pros

  1. Files can be split and support block compression.
  2. Schema information is stored along with the data in the file for further processing.

Cons

  1. Not in a human-readable format.
  2. Slower serialization.
  3. The schema is needed to read or write the data; the reader resolves the schema stored in the file against its own.

Internal structure for Avro

Parquet

Parquet is a column-oriented binary file format that is good at handling nested data structures. Like the RC and ORC file formats, Parquet provides good compression along with read benefits from its columnar layout; however, writing the data is computationally intensive. Snappy is its commonly used compression codec, though others (e.g., Gzip, ZSTD) are also supported. The metadata is stored along with the data.
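
As a minimal sketch, here is how Parquet's columnar layout pays off in PySpark when only a subset of columns is read (paths and names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alpha", 3.14), (2, "beta", 2.72)], ["id", "name", "score"])

# Write Parquet with Snappy compression, set explicitly here.
df.write.mode("overwrite").parquet(
    "hdfs:///warehouse/demo_parquet", compression="snappy")

# Only the requested columns are read and decoded from disk;
# the untouched "name" column is skipped entirely.
spark.read.parquet("hdfs:///warehouse/demo_parquet") \
    .select("id", "score").show()
```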

Pros

  1. Optimized for compression, with strong query performance when reading data.
  2. Good for wide-column tables where only a subset of columns needs querying.
  3. Files can be split, and both block-level and file-level compression are supported.
  4. Supports schema evolution by adding new columns at the end of the structure.

Cons

  1. Computationally write-intensive and slower to write than non-columnar file formats.
  2. Not a good choice for tables where a large number of columns, or all columns, must be queried.

The Parquet file format handles nested data structures by flattening them into a columnar layout based on Google's Dremel paper, using the concepts of repetition and definition levels (out of scope for this article).

With so many storage formats available, analyzing the use case requirements and the complexity of the data before choosing a file format will not only save storage space but also improve processing performance, whether the data is being written or read.
