Using Avro to enhance cloud storage: a 77% cost reduction

Michał Bogacz
VirtusLab
Sep 25, 2023

By moving from CosmosDB to Azure Storage and using the Avro format, we reduced storage costs by 77% and storage size by 63%.

You can find the details of this success story here.

This is the story of how we achieved that result.


Use case

Before I explain what we did to improve storage efficiency, I want to share some of our system requirements.

1. Scope

We build, maintain and improve data-intensive applications for a global freight forwarder.

While we deal with many dimensions of their business, this blog post is about one. Shipment is the primary data entity we work with, and it is huge, consisting of thousands of rows. The Shipment entity describes packages moving around the world. It is not data that can easily be broken down into smaller pieces: it has many options, events and related parts that we need to work with as a whole, a bit like a document.

In normal system design, we would break the data down into smaller pieces, which makes it easier to reason about, apply logic to, and store compactly.

To split information into smaller sections, developers must first understand it completely. Our client is striving to standardise shipment data, but the project is still ongoing. At present, they cannot offer a complete definition and segmenting the data would lead to confusion.

Due to time constraints, we opted not to divide the shipment data into smaller parts. Instead, we chose to employ more advanced logic in the code.

2. Performance

As mentioned earlier, this is a data-intensive application.

For the Shipment entity storage:

  • Read/Get: 54.68 million requests per month
  • Update/Create/Replace: 29.86 million requests per month

These numbers are increasing every month.

One of the most essential performance optimisations is denormalisation. This means that if you create a lot of entities and find that retrieving them from persistent storage is slow (and you need them together to apply business logic), you can merge them into one.

Another way of looking at this is that forcing normalisation without understanding all the business logic (even future logic) can be detrimental to performance.
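As a rough illustration of what that merging looks like in code (the field names below are hypothetical, not our actual model), a denormalised Shipment aggregate embeds its related parts directly, so the whole entity is read and written in one round trip:

    import java.time.Instant

    // Hypothetical, simplified model: everything the business logic needs is
    // embedded in one aggregate instead of being normalised into separate tables.
    final case class ShipmentEvent(code: String, occurredAt: Instant)
    final case class Container(number: String, sealNumber: Option[String])

    final case class Shipment(
      id: String,
      origin: String,
      destination: String,
      events: List[ShipmentEvent],   // denormalised: no joins needed for business logic
      containers: List[Container]
    )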

3. The Cloud

Our customer has chosen a cloud provider, but they understand that the business environment is continually evolving. Relying on technology that is unavailable from other cloud providers, or that cannot be run independently, puts you at the vendor's mercy. That's why, at the start of the project, we were careful to choose technologies that are widely used and can be run locally, preventing dependence on a single supplier.

Early decisions

When we started the project, we had little time to try out different options. With a short time to production and no clarity on normalisation, we decided to play it safe and store all the shipment data as a single entity.

After a few discussions and proofs of concept, the decision was made to use CosmosDB with the MongoDB API. This was reasonable for the following reasons:

  • We can store quite large entities as JSON (up to 4 MB)
  • Vendor lock-in is limited: we can easily migrate to MongoDB (and use MongoDB for local testing)
  • It's fast
  • We can search within all fields, which is likely to be useful in the future
  • Geo-replication is built in and would be easy to enable.

There were other options we considered, including Azure Blob Storage. But we were concerned that unstructured data without the ability to search would greatly increase the likelihood of data corruption.

Verified assumptions

As time passed, we began to see problems with our storage design. Don’t get me wrong, the system worked, but the number of small problems we had to solve was growing. And the benefits we were hoping for were diminishing.

What were the issues? Here are the main ones:

  1. Serialisation — I don't want to go into details, but serialising such large entities was not a pleasure. The official MongoDB driver doesn't work well with large structures combined with Scala ADTs. We definitely spent too much time trying to make this work.
  2. Cost — the price of CosmosDB was much higher than we expected.
  3. Geo-replication — our client really wanted this in the beginning, but after seeing the cost, the decision was made not to enable it.
  4. Search — Marketing misled us. A search that works with 1GB of data may not work as expected with 500GB. We were constantly hitting timeouts and the cost of a full search query in CosmosDB was huge. After a while, we stopped using it altogether. Parts of the data we needed to search were copied to storage elsewhere (e.g. PostgreSQL), and we accessed data in CosmosDB only by partition ID.
  5. Data corruption — a schemaless database has no schema validation, so you can put anything in there. It's very flexible, but you have to make sure that your changes are compatible with what is already in storage.

We were confident that data corruption would not affect us. We do a lot of testing. We check pull requests carefully. But how do you remember every change made to schema-less storage? Imagine logic you created a year ago that was never released to users, hidden behind a feature flag, and then removed because of changing requirements. A piece of logic that added a field to 0.5% of entities, and the storage was never “cleaned”. Then, a year later, you must reintroduce it, but with a different field type.

Do you know what happens when your application tries to read it? Yes, it will throw an error and may block your entire processing flow. We knew there had to be a way to protect against that. And the idea came from Kafka, which we also use.
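Here's a minimal sketch of that failure mode, assuming a JSON library such as circe (the library, the model and the "priority" field are illustrative, not our actual code):

    import io.circe.generic.auto._
    import io.circe.parser.decode

    // The current model expects an integer priority...
    final case class Shipment(id: String, priority: Int)

    // ...but a handful of documents were written a year ago, when the
    // (since removed) feature-flagged logic stored priority as a string.
    val oldDocument = """{ "id": "SH-1", "priority": "HIGH" }"""

    // Schemaless storage never caught the mismatch; it only surfaces as a
    // runtime decoding failure when the old document is read back.
    decode[Shipment](oldDocument) match {
      case Left(error)     => println(s"Read failed: $error")
      case Right(shipment) => println(shipment)
    }

Nothing in the database flags those 0.5% of entities; the first hint is an exception in production.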

With Kafka, we were constantly changing fields, but the Confluent Cloud Schema Registry protected us from corrupting data: you have to specify schemas and validate their compatibility.

It got us thinking, what if we could do the same thing with our shipping storage?

Apache Avro

On the advice of Confluent Cloud, we started using Kafka with Apache Avro. It took us a while to understand why it’s a really good format for data.

The basics are as follows:

  1. You can think of it like JSON, but with a schema.
  2. You always need a schema to write or read data.
  3. The schema can be stored separately. It doesn't need to be stored with the data. Imagine that in JSON you have:
    { "thisIsVeryLongKey": "value" }
    In Avro, "thisIsVeryLongKey" would live in the schema, and only "value" would be in the data part. More than a 2× size reduction! (See the sketch after this list.)
  4. The schema can evolve according to the rules you specify. A good article on why schemas are important can be read here: http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html
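To make point 3 concrete, here is a minimal sketch in plain Avro (the record name and field come from the example above; this is an illustration, not our production code). The long key lives only in the schema, while the binary payload contains just the value:

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.EncoderFactory

    // The key "thisIsVeryLongKey" is defined once, in the schema.
    val schema: Schema = new Schema.Parser().parse(
      """{"type":"record","name":"Example","fields":[
        |  {"name":"thisIsVeryLongKey","type":"string"}
        |]}""".stripMargin)

    val record = new GenericData.Record(schema)
    record.put("thisIsVeryLongKey", "value")

    // Binary-encode the record: only the value ends up in the payload.
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    println(out.toByteArray.length) // a few bytes, versus ~30 characters of JSON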

And, if you want a deeper understanding of Avro, take a look at this great description: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

From our point of view, the main advantages are:

  • The Avro schema can evolve in a controlled way. You can write a unit test to make sure your change is compatible with ALL the versions you have in storage (see the sketch after this list).
  • You don't store or send the schema with every record; only the encoded values are stored and transmitted. This greatly reduces the size.
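For the first point, the check can live in an ordinary unit test that runs against every schema version ever written to storage. A sketch using Avro's built-in SchemaCompatibility (the schemas here are illustrative; in a real project every historical version would typically come from version-controlled .avsc files):

    import org.apache.avro.{Schema, SchemaCompatibility}
    import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType

    def parse(json: String): Schema = new Schema.Parser().parse(json)

    val v1 = parse(
      """{"type":"record","name":"Shipment","fields":[
        |  {"name":"id","type":"string"}
        |]}""".stripMargin)

    // v2 adds a field with a default value, so it can still read v1 data.
    val v2 = parse(
      """{"type":"record","name":"Shipment","fields":[
        |  {"name":"id","type":"string"},
        |  {"name":"customsCleared","type":"boolean","default":false}
        |]}""".stripMargin)

    val historicalSchemas = List(v1)
    val currentSchema     = v2

    // The current (reader) schema must be able to read every version ever written.
    val incompatible = historicalSchemas.filterNot { writer =>
      SchemaCompatibility
        .checkReaderWriterCompatibility(currentSchema, writer)
        .getType == SchemaCompatibilityType.COMPATIBLE
    }

    assert(incompatible.isEmpty, s"Schema change cannot read old data: $incompatible")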

Remember that in the cloud you pay for transfer and storage. Storing and transmitting less data can be a significant cost reduction.

We chose to combine Avro with Azure Storage. This gave us storage that was really well suited to this use case. We never needed a schemaless solution; we needed controlled flexibility without affecting the whole data set.

With Avro, Azure Storage became storage for structured data with controlled evolution.
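To show how the pieces might fit together, here is a rough sketch of writing an Avro-encoded shipment to Azure Blob Storage with the Azure SDK for Java (the container name, blob naming scheme and helper signature are assumptions; retries, error handling and schema versioning are omitted):

    import java.io.ByteArrayInputStream
    import com.azure.storage.blob.BlobServiceClientBuilder

    // avroBytes: the shipment encoded with a known schema version, as in the
    // earlier sketch. The schema itself is not stored in the blob.
    def storeShipment(connectionString: String, shipmentId: String, avroBytes: Array[Byte]): Unit = {
      val blob = new BlobServiceClientBuilder()
        .connectionString(connectionString)
        .buildClient()
        .getBlobContainerClient("shipments")  // hypothetical container name
        .getBlobClient(s"$shipmentId.avro")   // one blob per shipment

      // Overwrite the previous version of the entity in place (overwrite = true).
      blob.upload(new ByteArrayInputStream(avroBytes), avroBytes.length.toLong, true)
    }

Reads go the other way: download the bytes and decode them with the current reader schema, together with the writer schema version the blob was written with (tracked however you prefer, for example in blob metadata).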

Pricing and storage calculations

Here’s a small calculation of how our storage and pricing changed in a typical month.

Storage reduced by 63%:

[Table: storage size before and after the migration]

Number of operations:

  • Find: 54.68M
  • Update: 29.86M

Operations cost reduced by 77%:

[Table: operation costs before and after the migration]

This may not look like an astonishing dollar reduction, but it is for a single month in a single environment. And that is just for Shipment storage. In a global business, it is not the only huge entity we deal with :)

Summary

Learning about Apache Avro was really refreshing for me. At first, we didn't want to use binary formats for fear of compatibility problems. With Avro we have compatibility and a storage reduction of more than 50%. It's safer and cheaper.

I hope this blog post will inspire others to take an interest in data formats and also create some great optimisations.


Michał Bogacz
VirtusLab

Senior Software Developer at VirtusLab. Scala and Akka enthusiast.