The Power of Avro: Unlocking Data’s Potential

Mariam Khan
Bazaar Engineering
Sep 15, 2023

The right data format can make or break your data strategy: it determines how efficiently and effectively information is stored and exchanged. Think of data formats as the language through which computers communicate and understand data.

For years, JSON, XML, and CSV have served as the go-to data formats for various applications, from web APIs to data storage and exchange. However, as data complexity and volume continue to surge, these traditional formats often fall short. Let’s illustrate this with a simple example:

Imagine you’re managing a database of customer information for an e-commerce platform. You’re storing data like customer names, email addresses, and purchase history.

Here’s how each format handles this data.

JSON: It is human-readable and flexible. You might represent a customer’s data like this:

{
  "name": "John Doe",
  "email": "johndoe@email.com",
  "purchases": [
    {
      "product": "Widget A",
      "price": 19.99
    },
    {
      "product": "Widget B",
      "price": 29.99
    }
  ],
  "accountStatus": "ACTIVE"
}
  • JSON’s readability and simplicity are its strengths, and it is often the first choice when flexibility is needed. But because it is schema-less by design, it falls short on data consistency, compatibility, and evolution.
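To see the schema-less problem concretely, here is a small Python sketch (the records are invented for illustration): two "purchase" records that disagree about the type of `price`, and JSON happily accepts both.

```python
import json

# Both records parse without complaint, yet "price" is a float in one
# and a string in the other -- JSON itself enforces nothing.
a = json.loads('{"product": "Widget A", "price": 19.99}')
b = json.loads('{"product": "Widget B", "price": "29.99"}')
print(type(a["price"]).__name__, type(b["price"]).__name__)  # float str
```

Every consumer of this data now has to defend against both shapes, which is exactly the kind of inconsistency a schema would catch at write time.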

XML: It offers hierarchical structure and flexibility. It could look like this:

<customer>
  <name>John Doe</name>
  <email>johndoe@email.com</email>
  <purchases>
    <purchase>
      <product>Widget A</product>
      <price>19.99</price>
    </purchase>
    <purchase>
      <product>Widget B</product>
      <price>29.99</price>
    </purchase>
  </purchases>
  <accountStatus>ACTIVE</accountStatus>
</customer>
  • XML’s flexibility comes at the cost of increased verbosity, making it less efficient for large-scale data processing.

CSV: It is simple and widely supported. In CSV, the data is stored as comma-separated values like this:

"John Doe","johndoe@email.com","Widget A",19.99,"ACTIVE"
"John Doe","johndoe@email.com","Widget B",29.99,"ACTIVE"
  • CSV’s simplicity is admirable, but it struggles with complex, nested data structures (note how the customer row must be duplicated for each purchase) and lacks a standardized way to represent data types.

While these formats have proven their worth, they fall short of the demands of modern data-driven ecosystems. In response to these limitations, we now have more elegant and efficient alternatives such as Avro and Protobuf. For the focus of this article, however, we will delve specifically into Avro’s capabilities and how it addresses the challenges posed by traditional formats.

What is Avro?

Apache Avro is a data serialization system that provides a way to exchange data between systems.

It was developed as part of the Apache Hadoop ecosystem but is now a standalone project in the Apache Software Foundation.

Why Avro?

Efficient Data Serialization/Deserialization

Avro uses a compact binary format, which reduces the size of the serialized data significantly compared to text-based serialization formats like XML and JSON. This compactness leads to faster data transmission over networks and reduces storage requirements.
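As a rough illustration of why binary encodings are smaller (this is a hand-rolled sketch, not Avro’s actual wire format), compare a JSON-encoded record with a length-prefixed binary encoding built from the Python standard library:

```python
import json
import struct

record = {"name": "John Doe", "email": "johndoe@email.com", "price": 19.99}

# Text encoding: field names, quotes, and punctuation are repeated
# in every single record.
json_bytes = json.dumps(record).encode("utf-8")

def pack_str(s):
    """Length-prefixed UTF-8 string: 1 length byte followed by the bytes."""
    b = s.encode("utf-8")
    return struct.pack("B", len(b)) + b

# Binary encoding: just the values. No field names -- a schema known
# to both writer and reader supplies them, which is Avro's key idea.
binary_bytes = (
    pack_str(record["name"])
    + pack_str(record["email"])
    + struct.pack("<f", record["price"])
)

print(len(json_bytes), len(binary_bytes))
```

The binary form is roughly half the size here, and the gap widens as records repeat, because the per-record overhead of field names disappears entirely.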

Language independence

Avro is language-neutral, meaning it supports multiple programming languages and enables seamless data exchange between different systems.

Schema Definition

In Avro, a schema serves as a blueprint for data. It defines the structure of the data, including field names, types, and optional attributes, making the data self-descriptive. Avro also allows us to add documentation to the schema, which helps avoid confusion when handling data and lets other team members understand the data’s purpose without needing to ask you for explanations.

Avro schemas are written in JSON, making them human-readable and versatile. Let’s take a look at how we define a schema in Avro.

{
  "type": "record",
  "name": "CustomerInfo",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    {
      "name": "purchases",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Purchase",
          "fields": [
            { "name": "product", "type": "string" },
            { "name": "price", "type": "float" }
          ]
        }
      }
    },
    {
      "name": "accountStatus",
      "type": {
        "type": "enum",
        "name": "AccountStatus",
        "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED"]
      }
    }
  ]
}

Let’s dive deeper into the Avro schema and discuss its elements:

  • Type: Avro schemas specify the data type of a field. Avro supports primitive types (e.g., string, int, float) and complex types (e.g., records, enums, arrays, maps, unions).
  • Field Names: Each field in a schema has a name that uniquely identifies it within the schema.
  • Default Values: You can define default values for fields, which are used when a field is not present in the data.
  • Documentation: Schemas can include optional documentation to describe the purpose and usage of fields, providing clarity for users.
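For instance, the `accountStatus` field from the schema above could carry both a default and a doc string (the `default` and `doc` values here are hypothetical additions, not part of the original schema):

```json
{
  "name": "accountStatus",
  "type": {
    "type": "enum",
    "name": "AccountStatus",
    "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED"]
  },
  "default": "ACTIVE",
  "doc": "Lifecycle state of the customer account."
}
```

The `default` is what readers fall back to when the field is absent, and the `doc` travels with the schema so every consumer sees the same explanation.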

This schema is stored along with the serialized data, enabling schema validation during both serialization and deserialization. This helps catch data inconsistencies early and ensures data integrity.
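To get a feel for what schema validation buys you, here is a toy validator in plain Python. This is purely illustrative (real Avro libraries such as `fastavro` or the official `avro` package do this, and far more, for you); the function and mapping names are invented:

```python
# Map Avro primitive type names to the Python types we accept for them.
SCHEMA_TYPES = {"string": str, "int": int, "float": float, "boolean": bool}

def validate(record, fields):
    """Check that every declared field is present with the declared primitive type."""
    for field in fields:
        name, ftype = field["name"], field["type"]
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if ftype in SCHEMA_TYPES and not isinstance(record[name], SCHEMA_TYPES[ftype]):
            raise TypeError(f"{name}: expected {ftype}, got {type(record[name]).__name__}")
    return True

fields = [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
]
print(validate({"name": "John Doe", "email": "johndoe@email.com"}, fields))
```

A record missing `email`, or carrying a number where a string is declared, is rejected before it is ever serialized, which is exactly where you want to catch inconsistencies.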

Schema Evolution

Schema evolution allows you to evolve the structure of your data over time (adding new fields, removing existing ones, or changing data types) while maintaining compatibility with previously serialized data.

When reading Avro data, the reader’s schema is compared to the schema that was used to write the data. This process is known as schema resolution. Avro handles any differences between the schemas by applying compatibility rules, ensuring that the data can be correctly interpreted. Schema evolution has three main directions:

Backward Compatibility: Backward compatibility ensures that readers using the newer schema can read data written with an older schema, so consumers can upgrade before their producers do.

Forward Compatibility: Forward compatibility ensures that readers still using the older schema can read data written with a newer schema, so producers can upgrade first.

Full Compatibility: The combination of both backward and forward compatibility: old and new schemas can each read data written with the other.
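The mechanism that makes this work is the interplay of schema resolution and default values. Here is a toy sketch of the idea in plain Python (illustrative only; Avro libraries implement the real resolution rules, and the `loyaltyTier` field is a hypothetical addition):

```python
def resolve(record, reader_fields):
    """Toy schema resolution: fields the writer did not emit are filled
    from the reader schema's defaults; anything without a value or a
    default is an incompatible change and fails loudly."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

# Data written with the old schema has no "loyaltyTier";
# the reader's default fills the gap, so old data stays readable.
old_record = {"name": "John Doe", "email": "johndoe@email.com"}
reader_fields = [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "loyaltyTier", "type": "string", "default": "NONE"},
]
print(resolve(old_record, reader_fields))
```

This is why the practical rule of thumb for evolvable schemas is: always give newly added fields a default value.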

Dynamic Typing

Avro’s schema travels with the serialized data (in Avro object container files, the schema is stored in the file header). This allows readers to dynamically understand the structure and types of the data they’re reading without needing to know the schema beforehand. This is in contrast to formats like JSON, where schema information is not typically embedded in the data.

Conclusion

Avro has emerged as a compelling and versatile solution designed to tackle the challenges of modern data. Avro combines the elegance of human-readable structure with the efficiency of binary encoding, striking a harmonious balance that empowers organizations to harness the full potential of their data.

Through schema-awareness, Avro not only ensures data integrity but also simplifies the process of adapting to evolving data structures. Its compactness reduces storage costs and accelerates data transmission, making it a go-to choice for big data, streaming applications, and distributed systems.

Disclaimer

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences, and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on this publication mostly describes and highlights the opinions of the authors, which might or might not be the actual and official perspective of Bazaar Technologies.
