From JSON to CSV to Parquet: The Rise of Apache Parquet as the Ultimate Data Storage Solution

Sarumathy P · Published in featurepreneur · 5 min read · May 31, 2023

[Image: Parquet vs JSON vs CSV]

The ability to store and analyze enormous amounts of data effectively is crucial in the world of big data processing and analytics. Apache Parquet, a columnar storage file format, has emerged as a powerful answer to these challenges. In this article, we’ll examine what Parquet is, its role in the big data universe, and the noteworthy advantages it provides for data storage, and then compare it against other common file formats (CSV and JSON).

What is columnar data storage?

Columnar data storage is a method of organizing and storing data in a database or file format where data is stored and accessed by columns rather than rows. In traditional row-based storage, data is stored and retrieved in a row-wise manner, where each record or row contains values for all columns. In contrast, columnar storage stores the values of each column together, creating separate storage areas for each column.

In a columnar storage format, all values of a particular column are stored contiguously, allowing for efficient compression and encoding techniques specific to that column’s data type.
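To make the layout difference concrete, here is a minimal Python sketch using plain lists and dicts (not Parquet itself, just the idea; the field names are illustrative):

# Row-based layout: each record carries values for every column
rows = [
    {"id": 1, "name": "Dora", "score": 91},
    {"id": 2, "name": "Buji", "score": 84},
]

# Columnar layout: all values of one column live together
columns = {
    "id": [1, 2],
    "name": ["Dora", "Buji"],
    "score": [91, 84],
}

# Reading one column touches a single contiguous list
# instead of scanning every record
print(columns["score"])  # [91, 84]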

What is Parquet?

Apache Parquet is an open-source columnar storage file format designed to store and process large-scale datasets efficiently. It was developed as a collaborative effort by the Apache Software Foundation, with a focus on optimizing performance and enabling high-speed data analysis on big data platforms.

Here’s an example of a student dataset using the columns Roll No, Registration No, Name, Department, and Section:

Roll No | Registration No | Name | Department             | Section
149     | 3123187         | Dora | Computer Science       | A
150     | 3123188         | Buji | Electrical Engineering | B
151     | 3123189         | Lisa | Mechanical Engineering | C

In Parquet, the data would be stored column-wise, like this:

Column: Roll No
[149, 150, 151, ...]

Column: Registration No
[3123187, 3123188, 3123189, ...]

Column: Name
[Dora, Buji, Lisa, ...]

Column: Department
[Computer Science, Electrical Engineering, Mechanical Engineering, ...]

Column: Section
[A, B, C, ...]

In this example, each column represents a specific attribute of the student data. The values for each attribute are stored separately and consecutively within their respective columns.

With this columnar storage format, the dataset benefits from the advantages of Parquet:

  1. Efficient Compression: Parquet’s compression techniques can exploit patterns and similarities within each column. For instance, Roll No and Registration No might have certain patterns or ranges that can be compressed effectively. This helps reduce the overall storage requirements of the dataset.
  2. Selective Column Reading: With Parquet, selective column reading is possible. If you only need to access specific columns, such as Roll No and Name, only those columns would be read from the disk, as shown in the sketch after this list. This reduces I/O operations and improves query performance, especially when dealing with large datasets.
  3. Column-Level Operations: Parquet’s columnar storage allows for efficient column-level operations. You can perform computations, aggregations, or filtering operations on individual columns, leveraging vectorized operations and parallel processing. This enables faster and more efficient data analysis.
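The second point maps directly onto the pandas API: pd.read_parquet accepts a columns argument. A minimal sketch, assuming the student data above has been written to a (hypothetical) students.parquet file:

import pandas as pd

# Only Roll No and Name are read from disk; the remaining
# columns are skipped entirely
df = pd.read_parquet('students.parquet', columns=['Roll No', 'Name'])
print(df.head())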

Purpose in Big Data Processing and Analytics: Parquet plays a crucial role in big data processing and analytics by providing a highly optimized storage format. It addresses the limitations of traditional row-oriented storage formats by organizing and compressing data column by column instead of row by row.

Analysis of various file formats (CSV, JSON, Parquet):

Now, I have a dataset in JSON format.

import os

# File size in bytes, divided by 10**6 to get megabytes
os.path.getsize("path-to-the-file") / 1000000

os.path.getsize, from the os module, returns the size of the file passed as an argument in bytes, which is converted to MB by dividing it by 10⁶.

The JSON file is converted to CSV using the following code:

import pandas as pd

def json_to_csv(file):
    # Load the JSON file into a DataFrame and write it out as CSV
    df = pd.read_json(file)
    df.to_csv('data.csv')

json_to_csv('data.json')

Measuring both files with the helper above shows a noticeable decrease: the same set of data takes less space stored as CSV than as JSON (the exact numbers appear below).

Next, we shall move on to Parquet.

def json_to_parquet():
    # Load the JSON file and write it back out in Parquet format
    # (pandas needs pyarrow or fastparquet installed for this)
    data = pd.read_json('data.json')
    data.to_parquet('data.parquet')

json_to_parquet()

This function converts the JSON file to Parquet.
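To reproduce the comparison, a small loop using os.path.getsize prints each size in MB (this assumes all three files sit in the working directory):

import os

# Print the size of each file in megabytes
for name in ('data.json', 'data.csv', 'data.parquet'):
    print(f"{name} File Size: {os.path.getsize(name) / 1000000} MB")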

data.json File Size: 325.487323 MB

data.csv File Size: 191.802363 MB

data.parquet File Size: 61.8331 MB

When comparing the file sizes of the data.json, data.csv, and data.parquet files, it is evident that there are significant differences in their storage requirements.

The data.json file, with a size of 325 MB, stands out as the largest among the three formats. This can be attributed to the nature of JSON, which utilizes a plain text representation with explicit field names and values stored in a hierarchical structure. Although JSON is a widely used and human-readable format, its larger file size indicates that it may consume more disk space compared to other formats.

On the other hand, the data.csv file has a size of 191 MB. CSV files are known for their simplicity and widespread usage for tabular data: each record is stored as a row, with values separated by commas. Because CSV does not repeat field names for every record the way JSON does, it is noticeably smaller than the JSON file, but it is still uncompressed plain text, so it occupies far more storage space than Parquet.

In contrast, the data.parquet file demonstrates the advantages of the Parquet format. With a considerably smaller size of 61 MB, Parquet stands out as an efficient columnar storage file format. By storing data column by column rather than by row, Parquet achieves better compression ratios and reduces storage requirements. Additionally, Parquet incorporates advanced compression techniques, which contribute to its smaller file size. The selective column reading capability of Parquet further enhances its efficiency, as only the necessary columns are accessed, resulting in faster query performance and minimized disk I/O operations.
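pandas also exposes Parquet’s compression codecs through the compression argument of to_parquet: snappy is the default, while gzip typically trades write speed for a smaller file. Which codec wins depends on the data, so treat this as a sketch (the output filenames are hypothetical):

import pandas as pd

df = pd.read_json('data.json')

# Snappy (the pandas default): fast, with moderate compression
df.to_parquet('data_snappy.parquet', compression='snappy')

# Gzip: usually smaller files, slower to write and read
df.to_parquet('data_gzip.parquet', compression='gzip')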

In conclusion, the analysis shows that Parquet outperforms both CSV and JSON in terms of file size and storage efficiency. Parquet’s columnar storage and advanced compression enable a significantly smaller file size compared to CSV, making it an ideal choice for big data environments. JSON, on the other hand, tends to have larger file sizes due to its plain text representation and lack of compression.
