Big Data File Formats Explained — Data Engineer Beginners

Rahul Sounder
8 min read · Dec 29, 2023


** JSON (JavaScript Object Notation) **

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate. JSON is often used to transmit data between a server and a web application, as well as to store configuration settings and exchange data between different programming languages.

  • JSON represents data as key-value pairs, similar to a dictionary or an associative array in other programming languages.
  • Data is organized in a hierarchical and nested structure using objects and arrays.
  • JSON uses a simple and readable syntax. Data is enclosed in curly braces {} for objects and square brackets [] for arrays.
  • Key-value pairs are separated by colons (:), and elements in an array are separated by commas.
  • JSON supports several data types, including strings, numbers, objects, arrays, booleans, and null.
  • JSON is widely used for configuration files, APIs, and as a data storage format for various applications.

Example of a simple JSON object representing information about a person:

{
  "name": "John Doe",
  "age": 30,
  "city": "New York",
  "isStudent": false,
  "hobbies": ["reading", "traveling"]
}

In the example below, the JSON file contains an object with a single key, “employees,” which maps to an array of two objects representing employee information. Each employee object has keys like “firstName,” “lastName,” “age,” and “department.” This nesting shows the flexibility JSON offers for representing various types of data.

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe",
      "age": 30,
      "department": "Engineering"
    },
    {
      "firstName": "Jane",
      "lastName": "Smith",
      "age": 28,
      "department": "Marketing"
    }
  ]
}
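
To make this concrete, here is a minimal sketch of parsing such a file with Python’s standard-library json module (the filename employees.json is assumed purely for illustration):

import json

# Load the employees file shown above (the filename is illustrative)
with open("employees.json") as f:
    data = json.load(f)

# After parsing, the JSON object becomes a plain Python dict of lists/dicts
for employee in data["employees"]:
    print(employee["firstName"], employee["lastName"], "-", employee["department"])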

** CSV (Comma Separated Values) **

CSV is a simple and widely used file format for storing tabular data (numbers and text) in plain text. In a CSV file, each line represents a row of data, and the values within each row are separated by commas (or another delimiter).

  • Structure — Data is organized in rows, where each row corresponds to a record or entry.
  • Within each row, values are separated by commas or other delimiters (such as semicolons or tabs).
  • Delimiter — The comma is the most common delimiter used in CSV files, but other delimiters like semicolons or tabs may be used depending on regional conventions or specific requirements.
  • The choice of delimiter is important to avoid conflicts with the data itself. For example, if the data contains commas, using a comma as a delimiter may cause parsing issues.
  • Text Qualification — If a field value contains the delimiter or special characters, the value is often enclosed in double quotes to distinguish it from the delimiter used to separate fields. For example: "John Doe",25,"New York, NY","Male"
  • Header Row — CSV files often include a header row at the beginning that contains the names of the columns, for example: Name,Age,City,Gender. This row helps to identify the meaning of each column.
  • CSV files typically have a “.csv” file extension.
  • CSV is a platform-independent format and can be easily created and read by a variety of software applications, including spreadsheet programs like Microsoft Excel and database systems.
  • CSV files store data as plain text, so all values are treated as strings. It’s up to the interpreting software to recognize and handle data types appropriately.
  • CSV is commonly used for data interchange between different systems and applications.

Example CSV file:

Name,Age,City,Gender
John Doe,25,New York,Male
Jane Smith,30,San Francisco,Female
Bob Johnson,22,Chicago,Male
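
As a quick sketch, Python’s built-in csv module can read the file above, using the header row for field names (the filename people.csv is assumed for illustration):

import csv

# Read the example CSV above (the filename is illustrative)
with open("people.csv", newline="") as f:
    reader = csv.DictReader(f)  # maps each row to a dict keyed by the header row
    for row in reader:
        # CSV stores plain text, so numeric fields must be converted explicitly
        print(row["Name"], int(row["Age"]), row["City"], row["Gender"])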

** Parquet **

Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is designed to be highly efficient for both the storage and processing of large datasets. Parquet is widely used in the Apache Hadoop ecosystem, particularly with tools like Apache Spark and Apache Hive.

  • Columnar Storage — Unlike row-oriented storage formats, such as CSV or JSON, Parquet stores data in a columnar format. This means that values from the same column are stored together, allowing for better compression and improved query performance for analytical workloads.
  • Compression — Parquet uses compression techniques to reduce storage space requirements. The columnar storage format allows for effective compression because similar data types and values are grouped together.
  • Common compression algorithms used with Parquet include Snappy, Gzip, and LZO.
  • Schema Evolution — Parquet supports schema evolution, allowing changes to the data schema over time without requiring the entire dataset to be rewritten. This is beneficial for evolving data structures without significant disruptions to data processing workflows.
  • Predicate Pushdown — Parquet enables predicate pushdown, a feature that allows the filtering of data at the storage level before it is read into memory. This minimizes the amount of data that needs to be processed, leading to improved query performance.
  • Metadata — Parquet files contain metadata, including schema information and statistics about the data. This metadata is used by processing engines to optimize queries and filter data efficiently.
  • Data Types — Parquet supports a wide range of data types, including primitive types (integers, floating-point numbers, strings, etc.) and complex types (arrays, maps, structs). This flexibility makes it suitable for diverse data processing needs.
  • Performance and Scalability — Due to its columnar storage and compression, Parquet is well-suited for analytical processing on large datasets. It allows for efficient scanning of specific columns and parallel processing in distributed environments.
  • File Extension — Parquet files typically have a “.parquet” file extension.

Example Parquet File Structure Illustration

<Column 1>
<Value 1>
<Value 2>
...
<Column 2>
<Value 1>
<Value 2>
...
...

Because Parquet is a binary format, an actual byte-level dump would not be readable here. Instead, below is a simplified example of the logical structure of a Parquet file along with some sample data. Remember that the actual on-disk format is more complex due to the use of advanced compression and encoding techniques.

Let’s consider a scenario where we have a dataset containing information about users, and we’ll represent this dataset using a few columns: user_id, name, age, and city.

Here’s a simplified representation of a Parquet file structure with sample data:

+-----------------------------------------------+
| File header ("PAR1" magic number)             |
+-----------------------------------------------+
| Row Group 1                                   |
|  +---------+---------+-----+---------------+  |
|  | user_id | name    | age | city          |  |
|  +---------+---------+-----+---------------+  |
|  | 1       | Alice   | 25  | New York      |  |
|  | 2       | Bob     | 30  | San Francisco |  |
|  | 3       | Charlie | 28  | Chicago       |  |
|  +---------+---------+-----+---------------+  |
+-----------------------------------------------+
| Row Group 2                                   |
|  +---------+---------+-----+---------------+  |
|  | user_id | name    | age | city          |  |
|  +---------+---------+-----+---------------+  |
|  | 4       | Dave    | 35  | Los Angeles   |  |
|  | 5       | Eve     | 22  | Seattle       |  |
|  +---------+---------+-----+---------------+  |
+-----------------------------------------------+
| File footer (schema, metadata, statistics)    |
+-----------------------------------------------+
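
As a rough sketch of how this looks in practice (assuming pandas and pyarrow are installed; the filename users.parquet is illustrative), the users table above can be written to and read back from Parquet. Reading only a subset of columns is where the columnar layout pays off:

import pandas as pd

# The sample users table from the illustration above
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "age": [25, 30, 28, 35, 22],
    "city": ["New York", "San Francisco", "Chicago", "Los Angeles", "Seattle"],
})

# Write Parquet with Snappy compression via the pyarrow engine
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Read back only two columns; a columnar format scans just the data it needs
subset = pd.read_parquet("users.parquet", columns=["name", "city"])
print(subset)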

** Avro **

Avro is a binary serialization format developed within the Apache Hadoop project. It is designed to provide a compact and fast serialization mechanism for data exchange between systems, especially in big data processing environments.

  • Schema-Based Serialization — Avro uses a schema to define the structure of the data being serialized. The schema is often defined in JSON format and is used to encode and decode the data.
  • Data Types — Avro supports a rich set of data types, including primitive types (int, long, float, double, boolean, string, bytes) and complex types (record, enum, array, map, union, fixed).
  • Binary Format — Avro serializes data in a compact binary format, resulting in smaller file sizes compared to some text-based formats like JSON or XML. The binary format also contributes to faster data serialization and deserialization.
  • Compression — Avro files can be compressed to further reduce storage requirements. Common compression algorithms, such as Snappy or deflate, can be applied to Avro data. Compression helps minimize storage costs and improve data transfer efficiency.
  • Self-Describing Data — Avro data files are self-describing, meaning they include the schema information along with the serialized data. This makes it easy to interpret the data without needing the schema in advance. The schema is stored at the beginning of the Avro file, allowing readers to understand the structure of the data without external schema files.
  • Forward and Backward Compatibility — Avro supports schema evolution, allowing for changes to the schema over time without breaking compatibility. Both forward and backward compatibility are maintained, meaning new data can be read by old readers, and old data can be read by new readers.
  • Language-Independent — Avro is designed to be language-independent, meaning it can be used across various programming languages. Avro schemas can be defined in JSON, and libraries for reading and writing Avro data are available in multiple programming languages, including Java, Python, C++, and more.
  • File Extension — Avro files typically have a “.avro” file extension.

Example Avro schema, defined in JSON:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "city", "type": "string"}
  ]
}

Avro data (shown here as JSON for readability; in an actual .avro file the records are stored in Avro’s compact binary encoding):

{"id": 1, "name": "Alice", "age": 25, "city": "New York"}
{"id": 2, "name": "Bob", "age": 30, "city": "San Francisco"}
{"id": 3, "name": "Charlie", "age": 28, "city": "Chicago"}

In this example, the Avro schema defines a record type named “User” with four fields. The Avro data represents instances of this record with specific values for each field.
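
As a minimal sketch (assuming the third-party fastavro package is installed; the filename users.avro is illustrative), the schema and records above can be written to and read from an Avro container file like this:

from fastavro import parse_schema, reader, writer

# The "User" schema from the example above
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
    ],
})

records = [
    {"id": 1, "name": "Alice", "age": 25, "city": "New York"},
    {"id": 2, "name": "Bob", "age": 30, "city": "San Francisco"},
    {"id": 3, "name": "Charlie", "age": 28, "city": "Chicago"},
]

# Write an Avro container file; the schema is embedded in the file itself
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Read it back; no external schema file is needed because Avro is self-describing
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)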

** ORC **

ORC (Optimized Row Columnar) is a columnar storage file format designed for use with the Apache Hive data warehouse system. It is highly optimized for performance, especially for complex query processing in big data analytics. ORC files are often used in conjunction with Apache Hive, Apache Spark, and other big data processing frameworks.

  • Columnar Storage — Data is stored in a columnar format, which allows for better compression and improved query performance. This is particularly advantageous for analytical workloads where only a subset of columns is often queried.
  • Compression — ORC supports various compression algorithms, including Zlib, Snappy, and LZO. Compression is applied at the column level, providing efficient storage and reduced I/O.
  • Predicate Pushdown — ORC files support predicate pushdown, a feature that allows the filtering of data at the storage level before it is read into memory. This reduces the amount of data that needs to be processed during query execution.
  • Lightweight Indexing — ORC files include lightweight indexes (min/max statistics recorded at the file, stripe, and row-group level) and optional bloom filters that help skip irrelevant data during query execution. This further improves query performance.
  • Statistics and Metadata — ORC files store statistics and metadata about the data, including column statistics like minimum and maximum values. This information is used by query engines to optimize query execution plans.
  • Data Types — ORC supports a wide range of data types, including primitive types (integers, floating-point numbers, strings, etc.) and complex types (arrays, maps, structs). This flexibility makes it suitable for diverse data processing needs.
  • Hive Integration — ORC is closely integrated with Apache Hive, making it a popular choice for storing and processing Hive tables.

Sample dataset with information about users:

| user_id | name    | age | city     |
|---------|---------|-----|----------|
| 1       | Alice   | 25  | New York |
| 2       | Bob     | 30  | San Fran |
| 3       | Charlie | 28  | Chicago  |

Example ORC File Structure

<Column 1: user_id>
<Value 1>
<Value 2>
<Value 3>

<Column 2: name>
<Value Alice>
<Value Bob>
<Value Charlie>

<Column 3: age>
<Value 25>
<Value 30>
<Value 28>

<Column 4: city>
<Value New York>
<Value San Fran>
<Value Chicago>

Each column is stored separately, and the values within each column are stored in a compressed, columnar format. This structure allows for efficient compression and retrieval of specific columns during query processing.
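
As a sketch of reading and writing ORC from Python, pyarrow ships an orc module when built with ORC support (the filename users.orc is illustrative, and the exact set of compression options can vary by pyarrow version):

import pyarrow as pa
import pyarrow.orc as orc

# Build the sample users table from the example above
table = pa.table({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 28],
    "city": ["New York", "San Fran", "Chicago"],
})

# Write an ORC file; compression is applied column by column
orc.write_table(table, "users.orc", compression="zlib")

# Read back only the columns a query actually needs
subset = orc.read_table("users.orc", columns=["name", "age"])
print(subset)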


Rahul Sounder

Senior Engineering Manager - Data at Xiaomi Technology | Ex-Amazon, Merck | SAFe® 5 Agilist | Certified AWS Solutions Architect