Open File Formats: Avro

Seckin Dinc
8 min read · Jun 25, 2024


Photo by israel palacio on Unsplash

Selecting the right file format is crucial for data-intensive applications, significantly impacting how data is collected, stored, and accessed. Among the various open-source file formats available, Avro and Parquet stand out as the top choices for data and software engineering teams.

Despite their technical similarities, Avro and Parquet excel in different use cases, making it essential for organizations, teams, and developers to choose the appropriate format based on their specific needs.

In this article, we will dive into Avro and work through hands-on examples with it.

What is Apache Avro?

Apache Avro™ is the leading serialization format for record data and a first choice for streaming data pipelines. It was developed as part of Apache’s Hadoop project. It offers excellent schema evolution and has implementations for the JVM (Java, Kotlin, Scala, …), Python, C/C++/C#, PHP, Ruby, Rust, JavaScript, and even Perl.

Key features of Avro

  • Schema-Based: Avro uses a schema to define the structure of the data. The schema is written in JSON and is included in the serialized data, allowing data to be self-describing and ensuring that the reader can understand the data structure without external information.
  • Compact and Fast: Avro data is serialized in a compact binary format, which makes it highly efficient in terms of both storage and transmission.
  • Compression: Avro supports various compression codecs such as Snappy, Deflate, Bzip2, and Xz.
  • Schema Evolution: Avro supports schema evolution, allowing the schema to change over time without breaking compatibility with old data. This is particularly useful in big data environments where data structures might evolve.
  • Rich Data Structures: Avro supports complex data types, including nested records, arrays, maps, and unions, allowing for flexible and powerful data modelling (see the schema sketch after this list).
  • Interoperability: Avro is designed to work seamlessly with other big data tools and frameworks, especially within the Hadoop ecosystem, such as Apache Hive, Apache Pig, and Apache Spark.
  • Language Agnostic: Avro has libraries for many programming languages, including Java, C, C++, Python, and more, enabling cross-language data exchange.
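
To make the rich data structures point concrete, below is a small sketch of a schema that combines a union, an array, a map, and a nested record. The Employee and Address names and fields are made up purely for illustration; they are not part of the examples later in this article.

# A hypothetical schema mixing complex Avro types
employee_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        # Union type: the field is either null or a string, defaulting to null
        {"name": "nickname", "type": ["null", "string"], "default": None},
        # Array of strings
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # Map from string keys to long values
        {"name": "project_hours", "type": {"type": "map", "values": "long"}},
        # Nested record
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "country", "type": "string"},
                ],
            },
        },
    ],
}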

Avro Examples

In order to use Avro in our local environment, we can install one of two related Python packages:

Official Avro package: pip install avro

Fast Avro package: pip install fastavro

I will use fastavro in my demonstrations.
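
For reference, below is a minimal sketch of the same kind of write-and-read round trip with the official avro package. It follows the DataFileWriter / DataFileReader pattern from the Avro quickstart; depending on your installed version, the schema parsing helper may be avro.schema.parse or avro.schema.Parse.

# official_avro_roundtrip.py - a rough sketch using the official avro package
import json

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse a small schema from its JSON definition
schema = avro.schema.parse(
    json.dumps(
        {
            "type": "record",
            "name": "Person",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "age", "type": "int"},
            ],
        }
    )
)

# Write one record to an Avro file
writer = DataFileWriter(open("person.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alice", "age": 30})
writer.close()

# Read the records back
reader = DataFileReader(open("person.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()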

Example 1: Serialise and Read Avro Files

As we mentioned in the beginning, Avro is mainly used for serialisation and we will demonstrate it in the first example.

Below you will find two Python scripts. The first one reads a list object containing dictionaries and writes it as an Avro file to the file system. The second one reads the Avro file from the file system and prints out the data in it.

# avro-write.py
import fastavro


# Define a function to write Avro records to a file
def write_avro_file(filename, schema, records):
    """
    This function takes three parameters: filename (the name of the Avro file to be created),
    schema (the Avro schema defined in JSON format), and records (a list of dictionaries
    representing individual records to be serialized into Avro format).
    """
    with open(filename, "wb") as f:
        fastavro.writer(f, schema, records)


# Example data records
# The people list contains two dictionaries, each representing a person with attributes
# such as name, age, city, and skills.
people = [
    {
        "name": "Alice",
        "age": 30,
        "city": "New York",
        "skills": ["Python", "Data Analysis", "Machine Learning"],
    },
    {
        "name": "Bob",
        "age": 25,
        "city": "San Francisco",
        "skills": ["Java", "Big Data", "Cloud Computing"],
    },
]

# The schema variable holds the Avro schema that describes the structure of the data.
# It specifies a record type named Person with fields name (string), age (int),
# city (string), and skills (array of strings).
schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
        {"name": "skills", "type": {"type": "array", "items": "string"}},
    ],
}

# Filename for Avro file
avro_filename = "people.avro"

# Write Avro file
write_avro_file(avro_filename, schema, people)

print(f"Avro file '{avro_filename}' successfully created.")

Executing the first Python script to generate the Avro file.

Image by the author

Now that we have successfully generated our Avro file, we can write the second script to read it.

# avro-read.py
import fastavro


# Function to read Avro file and return records
def read_avro_file(filename):
    records = []
    with open(filename, "rb") as f:
        reader = fastavro.reader(f)
        for record in reader:
            records.append(record)
    return records


# Example usage
avro_filename = "people.avro"

# Read Avro file and get records
records = read_avro_file(avro_filename)

# Print or process records as needed
for record in records:
    print(record)

Executing the second Python script to read the Avro file.

Example 2: Schema Evolution

Schema evolution in Avro refers to the ability to handle changes in the schema over time while maintaining compatibility with existing data. This is a powerful feature in Avro that allows applications to evolve their data models without breaking compatibility with previously serialised data.

In the first example, we created the schema inside the Python script. Normally, schemas are managed separately from the code to improve scalability and modularity. We will follow this practice and create our schema as a separate file in the file system.

Below you see the V1 schema. We will save it with the Avro schema extension as person_schema_v1.avsc.

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"}
    ]
}

Now we need to develop the serialisation Python script. It is similar to the write script in the first example, except that it loads the schema from the file system.

import fastavro
from fastavro.schema import load_schema


# Define a function to write Avro records to a file
def write_avro_file(filename, schema, records):
    with open(filename, "wb") as f:
        fastavro.writer(f, schema, records)


# Example data records
people = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
]

# Avro schema (Version 1) loaded from the file system
schema_v1 = load_schema("person_schema_v1.avsc")

# Filename for Avro file
avro_filename = "people.avro"

# Write Avro file using initial schema (Version 1)
write_avro_file(avro_filename, schema_v1, people)

print(f"Avro file '{avro_filename}' with schema version 1 successfully created.")

Executing the Python script to generate the Avro file with the V1 schema.

Image by the author

At this step, we need to demonstrate schema evolution. In order to do that, we need to alter the V1 schema. For demonstration purposes, the email field is added to the schema as an optional field ("type": ["null", "string"]) with a default value of null. This ensures backward compatibility with the existing data serialised using the initial schema (person_schema_v1.avsc).

Below you see the V2 schema. We will save it with the Avro schema extension as person_schema_v2.avsc.

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
    ]
}

Now we need to write the read Python script. It is similar to the read script in the first example, except that it loads the schema from the file system.

import fastavro
from fastavro.schema import load_schema


# Define a function to read Avro file and return records
def read_avro_file(filename, schema):
    records = []
    with open(filename, "rb") as f:
        reader = fastavro.reader(f, schema)
        for record in reader:
            records.append(record)
    return records


# Avro schema (Version 2) loaded from the file system
schema_v2 = load_schema("person_schema_v2.avsc")

# Read Avro file with evolved schema (Version 2)
records = read_avro_file("people.avro", schema_v2)

# Print or process records as needed
print("Records read from Avro file with schema version 2:")
for record in records:
    print(record)

Executing the Python script to read the Avro file with the V2 schema.

Image by the author

As we can see, the evolution of the Avro schema through the versions didn’t break the Python script, and we successfully read the serialised data with the new version.
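
To round off the example, here is a minimal sketch of the opposite direction as well: writing brand-new records under the V2 schema, where the optional email field is actually populated, and reading them back. The record values and the people_v2.avro filename are made up for illustration, and the script assumes the person_schema_v2.avsc file from above exists.

import fastavro
from fastavro.schema import load_schema

# Load the evolved schema (Version 2) that contains the optional email field
schema_v2 = load_schema("person_schema_v2.avsc")

# New records written under the V2 schema; email can be a string or None
new_people = [
    {"name": "Carol", "age": 28, "city": "Berlin", "email": "carol@example.com"},
    {"name": "Dave", "age": 35, "city": "London", "email": None},
]

# Write the new records and read them back with the same evolved schema
with open("people_v2.avro", "wb") as f:
    fastavro.writer(f, schema_v2, new_people)

with open("people_v2.avro", "rb") as f:
    for record in fastavro.reader(f):
        print(record)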

Example 3: File Compression

Avro supports various compression codecs such as Snappy, Deflate, Bzip2, and Xz. In this example, I will generate a large dummy CSV file to demonstrate the compression capabilities of Avro.

First, we need to generate the dummy data. I will develop a simple Python script for that purpose. Below you can see the contents of dummy_data_generator.py.

import csv
import random
import string


# Function to generate a random string
def random_string(length):
    letters = string.ascii_letters
    return "".join(random.choice(letters) for i in range(length))


# Number of rows to generate
num_rows = 10000000

# File path for the large CSV file
csv_file = "large_file.csv"

# Create CSV file
with open(csv_file, "w", newline="") as f:
    writer = csv.writer(f)
    # Write header
    writer.writerow(["id", "name", "age", "city"])
    # Write data rows
    for i in range(num_rows):
        writer.writerow(
            [i, random_string(10), random.randint(18, 80), random_string(10)]
        )

print(f"CSV file '{csv_file}' successfully created.")

Executing the Python script to generate the dummy CSV file, and then checking the file size.

Image by the author

The next step is to create a Python script that reads the CSV file and then serialises it. The main difference we are going to include in the code is the compression codec. For this demonstration, I will use the Snappy codec.

import csv

import fastavro

# Avro schema definition
avro_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
    ],
}


# Function to write Avro file with compression
def write_avro_file_with_compression(csv_filename, avro_filename, schema, compression):
    records = []
    # Read CSV file
    with open(csv_filename, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            records.append(
                {
                    "id": int(row["id"]),
                    "name": row["name"],
                    "age": int(row["age"]),
                    "city": row["city"],
                }
            )

    # Write Avro file with compression
    with open(avro_filename, "wb") as f:
        fastavro.writer(f, schema, records, codec=compression)


# File path for the Avro file
avro_file = "compressed_data.avro"

# Compression codec to use; change to 'deflate', 'bzip2', or 'xz' for other options
compression_codec = "snappy"

# Write Avro file with compression
write_avro_file_with_compression(
    "large_file.csv", avro_file, avro_schema, compression_codec
)

print(
    f"Avro file '{avro_file}' with compression '{compression_codec}' successfully created."
)

Executing the Python script to read the dummy CSV file, and then checking the file size of the new Avro file.

We compressed the 323 MB CSV file to 25 MB with Avro serialisation. Not bad, not bad!
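
If you want to verify the numbers on your own machine instead of checking the file manager, a small sketch like the one below prints both file sizes; the exact ratio will vary with the random data and the chosen codec.

import os

# Compare the original CSV with the Snappy-compressed Avro file
csv_size = os.path.getsize("large_file.csv")
avro_size = os.path.getsize("compressed_data.avro")

print(f"CSV size:  {csv_size / (1024 * 1024):.1f} MB")
print(f"Avro size: {avro_size / (1024 * 1024):.1f} MB")
print(f"Compression ratio: {csv_size / avro_size:.1f}x")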

Conclusion

Avro is a powerful and versatile open-source data serialisation framework that excels in efficiently handling large volumes of data, making it a popular choice in big data ecosystems. Its schema-based architecture ensures robust data integrity and supports schema evolution, allowing seamless updates to data structures without compromising backward compatibility.

Avro’s compact binary format significantly reduces storage requirements and enhances data transmission speeds, while its support for various compression codecs further optimises storage and performance.

Additionally, Avro’s interoperability across different programming languages and its seamless integration with data tools make it an indispensable asset for organisations aiming to manage and process complex datasets efficiently.

Overall, Avro’s combination of efficiency, flexibility, and compatibility positions it as a premier choice for data serialisation in modern data-intensive applications.
