Advanced File Formats and Compression Techniques

Solon Das
Towards Data Engineering
Apr 6, 2024

There are several file formats used for data processing and data storage. In this blog post we will look at the most relevant ones: why we need them, what features they bring to the table, and when to use which format.

The diagram below gives a brief overview of why we have various file formats, the features associated with them, and the formats widely used in today's industry.

Types of file formats : why we need them and the features they provide

There are broadly two categories of file formats:

  1. Row based :
    - The entire record is stored together: all the column values of a row, followed by the values of the subsequent rows.
    - Used when fast writes are a requirement, as appending new rows is easy in a row-based layout.
    - Slower reads: reading a subset of columns is not efficient, since the entire dataset has to be scanned.
    - Provides less compression.
    - Examples : Avro, CSV, etc.
  2. Column based :
    - The values of a single column across all records are stored together, followed by the values of the next column for all records.
    - Efficient reads when only a subset of columns has to be read, because of the way data is stored underneath: only the relevant column data is read, without going through the entire dataset.
    - Slower writes, as the data has to be updated in different places to write even a single new record.
    - Provides very good compression, as all the values of the same datatype are stored together.
    - Examples : Parquet, ORC, etc. (a short PySpark sketch comparing the two layouts follows the diagram below)
Row based and Column based file format — Data storage methods
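
A minimal sketch, assuming illustrative paths and sample values (not from the original dataset): it writes the same dataframe in a row-based format (CSV) and a columnar format (Parquet), then reads only two columns back from the columnar copy.

```python
# A minimal sketch with illustrative paths and sample values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-vs-column").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-01", 101, "CLOSED"),
     (2, "2024-01-02", 102, "PENDING")],
    ["order_id", "order_date", "customer_id", "order_status"],
)

# Row based: whole records are written together; appending rows is cheap.
orders.write.mode("overwrite").csv("/tmp/orders_csv", header=True)

# Column based: values of each column are stored together.
orders.write.mode("overwrite").parquet("/tmp/orders_parquet")

# Column pruning: only the order_id and order_status data is read.
spark.read.parquet("/tmp/orders_parquet") \
     .select("order_id", "order_status") \
     .show()
```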

File formats that are not the best fit for Big Data Processing:

  1. Text files like CSV :
  • Values are stored as strings/text internally, which consumes a lot of memory for storage and processing.
  • If numeric operations like addition or subtraction have to be performed on numeric-looking values that are internally stored as strings, they first have to be cast to the desired types (integer, date, long, etc.); a short sketch follows this list.
  • Casting / conversion is a costly and time-consuming operation.
  • The data size is larger, so the network bandwidth required to transfer the data is also higher.
  • Since the data size is larger, I/O operations also take more time.
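
A minimal sketch, assuming the CSV written above sits at an illustrative path: supplying an explicit schema lets Spark parse numeric and date columns into typed values while reading, instead of keeping everything as strings and casting later.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (DateType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("csv-schema").getOrCreate()

# Explicit types avoid a separate casting pass after the read.
orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("order_date", DateType()),
    StructField("customer_id", LongType()),
    StructField("order_status", StringType()),
])

orders = (spark.read
          .option("header", True)
          .schema(orders_schema)
          .csv("/tmp/orders_csv"))

# Numeric and date operations now work without an extra cast step.
orders.groupBy("order_status").count().show()
```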

2. XML and JSON : These file formats have a partial schema associated with them.

  • All the disadvantages of plain text files also apply to XML and JSON.
  • Since the schema travels along with the data in every record, these file formats are bulky.
  • These file formats are not splittable, which implies no parallelism can be achieved while reading them.
  • A lot of I/O is involved. (A short sketch follows this list.)
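
A small sketch, assuming a line-delimited JSON file at a hypothetical path: the field names are repeated in every record, which is part of what makes the format bulky, and Spark has to scan the data to infer a schema.

```python
# Hypothetical path; e.g. each line: {"order_id": 1, "order_status": "CLOSED"}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-read").getOrCreate()

orders_json = spark.read.json("/tmp/orders_json")
orders_json.printSchema()  # schema inferred by scanning the records
```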

File Formats that are best suited for Big Data Processing :

There are three main file formats well suited to big data problems.

  1. PARQUET
  2. AVRO
  3. ORC
Best Suited File Formats for Big Data Processing

Structure of a Parquet File:

Consider a Parquet file (orders_data) of 500 MB with 80,000 rows.

HEADER : PAR1

BODY :
1st Row Group : consists of 20,000 records
  Column Chunk 1 - orderid
    - Pages (1 MB each - hold the actual data + metadata)
  Column Chunk 2 - orderdate
    - Pages
  Column Chunk 3 - customerid
    - Pages
  Column Chunk 4 - orderstatus
    - Pages

2nd Row Group : consists of 20,000 records
  Column Chunk 1 - orderid
    - Pages (1 MB each - hold the actual data + metadata)
  Column Chunk 2 - orderdate
    - Pages
  Column Chunk 3 - customerid
    - Pages
  Column Chunk 4 - orderstatus
    - Pages

3rd Row Group : consists of 20,000 records
  Column Chunk 1 - orderid
    - Pages (1 MB each - hold the actual data + metadata)
  Column Chunk 2 - orderdate
    - Pages
  Column Chunk 3 - customerid
    - Pages
  Column Chunk 4 - orderstatus
    - Pages

...

FOOTER : Consists of metadata
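
A sketch (not part of the original post) showing how the row groups and column chunks described above can be inspected with pyarrow; the part-file name is an assumption, since Spark generates its file names automatically.

```python
import pyarrow.parquet as pq

# Assumed part-file name inside the Parquet output directory.
meta = pq.ParquetFile("/tmp/orders_parquet/part-00000.snappy.parquet").metadata
print(meta.num_rows, "rows in", meta.num_row_groups, "row group(s)")

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    print(f"Row group {rg}: {row_group.num_rows} rows")
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(f"  {chunk.path_in_schema}: codec={chunk.compression}, "
              f"compressed={chunk.total_compressed_size} bytes")
```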

Why is Schema Evolution important ?

Schema evolution allows changes in the schema of evolving data to be incorporated easily. Over time, the data may change, and the schema has to be updated accordingly.

Events that bring about a schema change are :

  • Adding new columns/ fields
  • Dropping existing columns/ fields
  • Changing the datatypes.

Example :

  • Consider an orders dataset with the columns order_id and order_date, loaded into a dataframe and written out in the Parquet file format.
  • Now consider a newer orders dataset with a newly added customer_id column, so its columns are order_id, order_date and customer_id. This new data is loaded into a dataframe and written out in the Parquet file format to the same location.
  • If we now try to read and display the combined data, where the schema of the old data and the newly added data differ, no schema merge takes place, because it is disabled by default.
  • To enable schema evolution, the mergeSchema property has to be turned on while reading (see the sketch below):
  • option("mergeSchema", True)
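
A minimal PySpark sketch of the example above; the output path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Old data: order_id, order_date
old_orders = spark.createDataFrame(
    [(1, "2024-01-01")], ["order_id", "order_date"])
old_orders.write.mode("append").parquet("/tmp/orders_evolved")

# New data with an additional customer_id column
new_orders = spark.createDataFrame(
    [(2, "2024-01-02", 101)], ["order_id", "order_date", "customer_id"])
new_orders.write.mode("append").parquet("/tmp/orders_evolved")

# Without mergeSchema, Spark picks the schema of one part file; with it
# enabled, the schemas are merged and customer_id appears as null for
# the older records.
merged = spark.read.option("mergeSchema", True).parquet("/tmp/orders_evolved")
merged.printSchema()
merged.show()
```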

Compression Techniques :

Why do we need compression ?

  1. To save Storage Space
  2. To reduce I/O cost

What does compression involve ?

  1. Additional cost in the form of CPU cycles
  2. Time to compress and decompress the files, especially if complex algorithms are used.

Generalised Compression Techniques :

  1. Snappy:
  • Optimized for speed with a moderate level of compression. It is highly preferred because it is very fast while still compressing reasonably well.
  • It is the default compression codec for Parquet and ORC (see the sketch after this list).
  • Snappy is not splittable by default when used with CSV or other plain text file formats.
  • Snappy has to be used with container-based file formats like ORC and Parquet to make the files splittable.

2. LZO:

  • Optimized for speed with moderate compression. It requires a separate license, as it is not generally distributed along with Hadoop. It is splittable (once the files are indexed).

3. Gzip:

  • Provides a high compression ratio and is therefore comparatively slow. It is not splittable on its own and has to be used with container-based file formats to make the files splittable.

4. Bzip2:

  • Very optimized for storage and provides the best compression and thereby very slow in terms of processing. It is inherently splittable.
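
A short sketch of picking a codec when writing data from PySpark; the paths are illustrative, and the codec names follow Spark's "compression" option for the respective writers.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-codecs").getOrCreate()
orders = spark.read.parquet("/tmp/orders_parquet")

# Snappy (the Parquet default): fast, moderate compression.
(orders.write.mode("overwrite")
       .option("compression", "snappy")
       .parquet("/tmp/orders_snappy"))

# Gzip: smaller files, but slower to write and read.
(orders.write.mode("overwrite")
       .option("compression", "gzip")
       .parquet("/tmp/orders_gzip"))

# A gzip-compressed CSV is a single non-splittable stream, so only one
# task can read it.
(orders.write.mode("overwrite")
       .option("compression", "gzip")
       .csv("/tmp/orders_csv_gzip", header=True))
```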

Note : Some codecs are optimized for speed, others for compression ratio.

(If the requirement is a higher compression ratio, compression will be slower, as it involves more CPU cycles to run more complex algorithms. Conversely, quick compression yields a lower compression ratio.)

  • Moderate compression with high speed is preferred for most processing workloads.
  • Data archival calls for a high compression ratio.
