What Are the Differences Between Data Serialization Formats: YAML, JSON, Parquet, Avro, CSV, Pickle, and XML?

Rayan Yassminh
8 min read · Mar 21, 2023


YAML, JSON, Parquet, Avro, CSV, Pickle, and XML are all examples of data serialization formats. These formats are commonly used to represent data in a structured way that can be easily stored, exchanged, and processed by computer systems and applications. Each of these formats has its own strengths and weaknesses, and may be better suited for specific use cases depending on factors such as the complexity of the data, performance requirements, and compatibility with other systems. In this article, I will discuss several data formats and highlight their differences.

1. YAML, short for "YAML Ain't Markup Language," is a human-readable data serialization format that is often used for configuration files, data exchange, and persistence. YAML is designed to be easy to read and write, and it aims to be a more human-friendly alternative to formats like JSON and XML. In YAML, data is organized into key-value pairs, where each key is a string and each value can be a string, number, boolean, array, or object. Data can be nested and indented to create a hierarchy of values, and comments can be added to provide additional context or documentation. Here's an example of how data can be stored in YAML format:
import yaml

# Define the YAML data as a Python string
yaml_data = '''
# This is a YAML file that lists fruits and their properties
- name: apple
  color: red
  weight: 0.25
  price:
    usd: 0.50
    eur: 0.42
  nutrients:
    - fiber
    - vitamin C
    - potassium

- name: banana
  color: yellow
  weight: 0.18
  price:
    usd: 0.30
    eur: 0.25
  nutrients:
    - fiber
    - vitamin B6
    - potassium
'''

# Load the YAML data into a Python object
data = yaml.load(yaml_data, Loader=yaml.FullLoader)

# Do something with the data (e.g., print it)
print(data)

# Write the data to a YAML file
with open('fruits.yaml', 'w') as f:
    yaml.dump(data, f)

# Read the YAML data from a file
with open('fruits.yaml', 'r') as f:
    data = yaml.load(f, Loader=yaml.FullLoader)

# Do something with the data (e.g., print it)
print(data)

2. JSON, short for JavaScript Object Notation, is a lightweight and widely used data-interchange format. It is a text format that is easy to read and write, designed to be both human- and machine-readable. JSON is often used to transmit data between a server and a web application, as well as to store data in files. JSON is based on a key-value pair structure, where data is organized into objects, arrays, and values. An object is enclosed in curly braces {} and contains a collection of key-value pairs, where each key is a string. An array is enclosed in square brackets [] and contains an ordered collection of values. A value can be a string, number, boolean, null, array, or object.

import json

# Define the JSON data as a Python string
json_data = '''
[
    {
        "name": "apple",
        "color": "red",
        "weight": 0.25,
        "price": {
            "usd": 0.50,
            "eur": 0.42
        },
        "nutrients": [
            "fiber",
            "vitamin C",
            "potassium"
        ]
    },
    {
        "name": "banana",
        "color": "yellow",
        "weight": 0.18,
        "price": {
            "usd": 0.30,
            "eur": 0.25
        },
        "nutrients": [
            "fiber",
            "vitamin B6",
            "potassium"
        ]
    }
]
'''

# Load the JSON data into a Python object
data = json.loads(json_data)

# Do something with the data (e.g., print it)
print(data)

# Write the data to a JSON file
with open('fruits.json', 'w') as f:
    json.dump(data, f)

# Read the JSON data from a file
with open('fruits.json', 'r') as f:
    data = json.load(f)

# Do something with the data (e.g., print it)
print(data)

3. Parquet is a columnar storage format: the values from a single column are stored together, as opposed to the traditional row-wise storage approach. This layout enables more efficient compression and faster analytical queries that only touch a few columns. Parquet was originally developed as a collaboration between Cloudera and Twitter, and is now an open standard that is commonly used in big data ecosystems such as Apache Hadoop and Apache Spark. Partitioning is a major feature of Parquet, allowing data to be split into smaller files based on the values of specific columns; this enables efficient query execution and optimized memory usage (a partitioned write is sketched after the example below). Additionally, Parquet supports complex data types like arrays and maps, as well as nested structures like structs and nested arrays. These complex types are encoded and stored in a compact, efficient manner, making Parquet a suitable format for storing hierarchical data.

import pyarrow as pa
import pyarrow.parquet as pq

# Define the data as a Python list of dictionaries
data = [
    {'name': 'apple', 'color': 'red', 'weight': 0.25, 'price_usd': 0.50, 'price_eur': 0.42, 'nutrients': ['fiber', 'vitamin C', 'potassium']},
    {'name': 'banana', 'color': 'yellow', 'weight': 0.18, 'price_usd': 0.30, 'price_eur': 0.25, 'nutrients': ['fiber', 'vitamin B6', 'potassium']}
]

# Convert the row-oriented data to a column-oriented PyArrow table
table = pa.Table.from_pydict({k: [d.get(k) for d in data] for k in data[0]})

# Write the data to a Parquet file
pq.write_table(table, 'fruits.parquet')

# Read the Parquet data from a file
table = pq.read_table('fruits.parquet')

# Convert the PyArrow table back to a list of dictionaries
data = table.to_pylist()

# Do something with the data (e.g., print it)
print(data)
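
Since partitioning is one of Parquet's headline features, here is a minimal sketch of a partitioned write using pyarrow's dataset writer, reusing the table from above. Partitioning by 'color' and the 'fruits_dataset' output directory are illustrative choices, not part of the original example:

import pyarrow.parquet as pq

# Write one directory per distinct value of the partition column,
# producing a layout like fruits_dataset/color=red/<file>.parquet
pq.write_to_dataset(table, root_path='fruits_dataset', partition_cols=['color'])

# Reading the root directory back recombines all partitions into one table
print(pq.read_table('fruits_dataset').to_pylist())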

4. Avro is a data serialization system designed to facilitate the exchange of data between different platforms, programming languages, and systems. Developed by the Apache Software Foundation, it is an essential component of the Hadoop ecosystem and is widely used in big data systems. The Avro format is designed to be efficient and compact, and it uses a schema-based approach to describe the data structure. Because the schema travels with the data, Avro files are self-describing, and the schema-based design allows for easy schema evolution (sketched after the example below). Additionally, Avro uses a binary encoding and supports data compression, which makes it significantly more compact than text formats like JSON and XML.

Here’s an example of how data can be stored in Avro format:

import fastavro

# Define the Avro schema
schema = {
    "type": "record",
    "name": "fruit",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "color", "type": "string"},
        {"name": "weight", "type": "float"},
        {"name": "price_usd", "type": "float"},
        {"name": "price_eur", "type": "float"},
        {"name": "nutrients", "type": {"type": "array", "items": "string"}}
    ]
}

# Define the Avro data as a Python list of dictionaries
data = [
    {"name": "apple", "color": "red", "weight": 0.25, "price_usd": 0.50, "price_eur": 0.42, "nutrients": ["fiber", "vitamin C", "potassium"]},
    {"name": "banana", "color": "yellow", "weight": 0.18, "price_usd": 0.30, "price_eur": 0.25, "nutrients": ["fiber", "vitamin B6", "potassium"]}
]

# Write the data to an Avro file
with open("fruits.avro", "wb") as avro_file:
    fastavro.writer(avro_file, schema, data)

# Read the Avro data from a file
with open("fruits.avro", "rb") as avro_file:
    reader = fastavro.reader(avro_file)
    data = [record for record in reader]

# Do something with the data (e.g., print it)
print(data)
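
To make the schema evolution mentioned above concrete, here is a minimal sketch, assuming the file written above: a newer reader schema adds an 'organic' field with a default value, so records written with the old schema can still be read. The field name and its default are illustrative assumptions, not part of the original example:

import fastavro

# A newer version of the schema: the added "organic" field (illustrative)
# carries a default, so records written without it still resolve cleanly
new_schema = {
    "type": "record",
    "name": "fruit",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "color", "type": "string"},
        {"name": "weight", "type": "float"},
        {"name": "price_usd", "type": "float"},
        {"name": "price_eur", "type": "float"},
        {"name": "nutrients", "type": {"type": "array", "items": "string"}},
        {"name": "organic", "type": "boolean", "default": False}
    ]
}

# fastavro resolves the writer schema stored in the file against the reader schema
with open("fruits.avro", "rb") as avro_file:
    for record in fastavro.reader(avro_file, reader_schema=new_schema):
        print(record)  # each old record now includes organic=False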

5. CSV stands for Comma-Separated Values. It is a simple file format used for storing and exchanging tabular data, such as spreadsheets or database tables. In a CSV file, each row represents a record and each column represents a field within that record; fields are separated by commas, and rows are separated by newline characters. CSV files are widely used because they are easy to create, edit, and parse: they can be opened by most spreadsheet applications, database software, and programming languages, and their lightweight, text-based structure keeps overhead low even for large datasets.

CSV does have limitations, however. It cannot store complex data structures, such as nested objects or arrays, and it does not support data typing or schema validation. CSV files can also run into issues with character encoding, escaping of delimiters, and handling of null or empty values, so the format is a poor fit when data must carry structure, types, or validation rules.

import csv

# Writing the fruit data to a CSV file
data = [
    {'name': 'apple', 'color': 'red', 'weight': 0.25, 'price_usd': 0.50, 'price_eur': 0.42, 'nutrients': 'fiber, vitamin C, potassium'},
    {'name': 'banana', 'color': 'yellow', 'weight': 0.18, 'price_usd': 0.30, 'price_eur': 0.25, 'nutrients': 'fiber, vitamin B6, potassium'}
]

with open('fruits.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Color', 'Weight', 'Price (USD)', 'Price (EUR)', 'Nutrients'])
    for fruit in data:
        writer.writerow([fruit['name'], fruit['color'], fruit['weight'], fruit['price_usd'], fruit['price_eur'], fruit['nutrients']])

# Reading the fruit data from the CSV file
with open('fruits.csv', mode='r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)

In this example, each row represents a fruit, with columns for the fruit's name, color, weight, prices in USD and EUR, and nutrients. The first row contains the column headers, which give the name of each field. The values in each row are separated by commas, and each row is separated by a newline character. Note that the list of nutrients has to be flattened into a single comma-separated string (which the writer quotes automatically), because CSV cannot represent nested structures.

6. Pickle is a Python-specific format used for serializing Python objects into a binary representation that can be stored or transmitted. It supports essentially all Python data types and structures, including references to functions and classes, and is often used for inter-process communication or saving program state. Pickle is not interoperable with other programming languages, and it poses a real security risk: unpickling data from an untrusted source can execute arbitrary code.

import pickle

fruits = [
    {
        'name': 'apple',
        'color': 'red',
        'weight': 0.25,
        'price': {'usd': 0.50, 'eur': 0.42},
        'nutrients': ['fiber', 'vitamin C', 'potassium']
    },
    {
        'name': 'banana',
        'color': 'yellow',
        'weight': 0.18,
        'price': {'usd': 0.30, 'eur': 0.25},
        'nutrients': ['fiber', 'vitamin B6', 'potassium']
    }
]

# Store the data using pickle
with open('fruits.pickle', 'wb') as f:
    pickle.dump(fruits, f)

# Load the data from pickle
with open('fruits.pickle', 'rb') as f:
    loaded_fruits = pickle.load(f)

print(loaded_fruits)
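
Because Pickle is often used for inter-process communication, here is a minimal sketch of the in-memory variant, round-tripping the same data through a byte string of the kind that would travel over a pipe or socket:

import pickle

# Serialize to an in-memory byte string instead of a file
payload = pickle.dumps(fruits)

# Deserialize on the receiving side; only unpickle data from a trusted source
received = pickle.loads(payload)
print(received == fruits)  # True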

7. XML (Extensible Markup Language) is a markup language that is widely used for data exchange between different systems, programming languages, and platforms. It is a standard format that is easy to read and interpret by both humans and machines. XML allows users to define their own tags, which makes it highly customizable and versatile.

XML uses a tree structure to represent the data, with each element represented by a tag. These tags can contain attributes, text, or child elements. The tags and attributes are defined by the user, which makes it easy to represent complex data structures.

Here is an example of an XML document:

<?xml version="1.0" encoding="UTF-8"?>
<fruits>
    <fruit>
        <name>apple</name>
        <color>red</color>
        <weight>0.25</weight>
        <price>
            <usd>0.50</usd>
            <eur>0.42</eur>
        </price>
        <nutrients>
            <nutrient>fiber</nutrient>
            <nutrient>vitamin C</nutrient>
            <nutrient>potassium</nutrient>
        </nutrients>
    </fruit>
    <fruit>
        <name>banana</name>
        <color>yellow</color>
        <weight>0.18</weight>
        <price>
            <usd>0.30</usd>
            <eur>0.25</eur>
        </price>
        <nutrients>
            <nutrient>fiber</nutrient>
            <nutrient>vitamin B6</nutrient>
            <nutrient>potassium</nutrient>
        </nutrients>
    </fruit>
</fruits>

In this example, the XML document represents a list of fruits, with each fruit element containing information about the fruit's name, color, weight, price, and nutrients. The document starts with an XML declaration that specifies the version and encoding used. Each fruit is represented as an XML element, with its properties represented as child elements, and the fruit elements are grouped together under a parent element called "fruits". This format is easy to read and interpret by both humans and machines, making it a useful format for data exchange.
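
As a minimal sketch of reading this document in Python, the standard library's xml.etree.ElementTree module can parse it; the file name 'fruits.xml' is an assumption for illustration:

import xml.etree.ElementTree as ET

# Parse the document above, assuming it was saved as fruits.xml
root = ET.parse('fruits.xml').getroot()

for fruit in root.findall('fruit'):
    name = fruit.find('name').text
    usd = float(fruit.find('price/usd').text)
    nutrients = [n.text for n in fruit.findall('nutrients/nutrient')]
    print(name, usd, nutrients)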

In summary, the choice of serialization format depends on the specific use case and requirements. YAML and JSON are good choices for configuration files, simple data structures, and inter-language communication; Parquet is optimal for columnar big data analytics; Avro suits big data pipelines that need compact binary records and schema evolution; CSV is useful for tabular data; XML fits document-centric exchange with user-defined markup; and Pickle is best for Python-specific serialization needs.

Rayan Yassminh

I am a machine learning scientist with a broad STEM background. I aim to explain basic concepts using practical and straightforward examples.