CSV vs Parquet vs Avro: Choosing the Right Tool for the Right Job

Mikhail Levkovsky
Jun 13, 2019 · 7 min read
Image Source

Recently at SSENSE, my team ‘Machete’, which is responsible for building part of a recommendation engine, handled a task for which we had to prepare and work with a relatively large data set. The data wasn’t quite big enough to warrant using something like Spark or Hadoop MapReduce, but it was large enough to force us to consider the space and time complexity footprint of storing and reading this data respectively. In short, we needed a file format that could be reasonably compressed and provide fast read times. While investigating solutions to this problem, I had the opportunity to explore the inner workings of a few commonly used file types for storing data. In this article, I share my notes and learnings from this process.

The Contenders

The 3 file types tested are:

  1. CSV
  2. Parquet
  3. Avro

CSV

To test CSV I generated a fake catalogue of about 70,000 products, each with a specific score and an arbitrary field simply to add some extra fields to the file.

Given that our requirements were minimal, the files just included a timestamp, the product I.D., and the product score.

An example output would be:

spring_exclusives.csv

timestamp,product_id,product_score1555480889,46260,0.33818901664441661555480889,85461,0.347867006784373431555480889,85462,0.29329979524843851555480889,113723,0.38037360862177531555480889,113724,0.396843752266727631555480889,113727,0.38768682121073605

Final result

A file with roughly 70,000 lines with a size of 1.3MB. Using Node.js’ built-in fs createReadStream, it took about 122ms to go through the whole file to parse and process each line.

Easy peasy!

Larger Data Sets

CSV works well for a small data set of 70,000 lines. However, what if you need to increase to 2 million records? What if each record has nested properties? While CSV files are simple and human-readable, they unfortunately do not scale well. As the file size grows, load times become impractical, and reads cannot be optimized. Their primitive nature allows them to be a great option for smaller data sets, as shown above, but very inconvenient for managing larger sets of data. This is where both Parquet and Avro come in.

The following examples assume a hypothetical scenario of trying to store members and what their brand color preferences are. For example: Sarah has an ID of 10 and she really likes Nike (ID 1) in red, Adidas (ID 2) in blue, and Reebok (ID 3) in green.

Parquet

Apache Parquet defines itself as:

“a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language”

Columnar storage is fundamentally different from row based storage. While row based storage is optimized for reading a sequence of rows, columnar storage is optimized for picking a section of columns. Let’s look at some code. In-depth documentation can be found on Parquet’s website.

With Parquet I created a nested schema of the following shape:

var schema = new parquet.ParquetSchema({
memberID: { type: 'UINT_32'},
brand_color: {
repeated: true,
fields: {
brand: { type: 'UINT_32'},
color: { type: 'UTF8' },
score: { type: 'FLOAT'}
}
},
timestamp: { type: 'TIMESTAMP_MILLIS' }
});

With this, I generated a file with 2,000,000 members with 10 brand color preferences each. The total file was 335 MB in size.

In order to slightly reduce the file size, I applied snappy codec compression. Doing this was trivial as it was just a small change to the schema.

var schema = new parquet.ParquetSchema({
memberID: { type: 'UINT_32'},
brand_color: {
repeated: true,
fields: {
brand: { type: 'UINT_32', compression: 'SNAPPY'},
color: { type: 'UTF8', compression: 'SNAPPY' },
score: { type: 'FLOAT', compression: 'SNAPPY'}
}
},
timestamp: { type: 'TIMESTAMP_MILLIS' },
});

The file size decreased from 335 MB to 69.7 MB.

It took my script about 50 seconds to go through the whole file for just a simple read. The primary advantage of Parquet, as noted before, is that it uses a columnar storage system, meaning that if you only need part of each record, the latency of reads is considerably lower. Here is the result from 10 reads:

By just plucking out the memberID column instead of the entire record, the reading time goes from 50 seconds to just 1!

Here is the read function that only plucks the memberID:

async function testParquetRead(){
let reader = await parquet.ParquetReader.openFile('test.parquet');
let cursor = reader.getCursor(['memberID']);
let record = null;
let i = 0;
let lastone;

while(record = await cursor.next()) {
lastone = record;
i++;
}
await reader.close();
console.log(lastone);
console.log(i);
}

I decided to run this 10 times in order to get some significance on the results and make sure that the 1 second isn’t just a one off.

However, in our case, we needed the whole record at all times, so this wasn’t much of an advantage.

Avro

Avro’s big advantage is the schema, which is much richer than Parquet’s. It allows us to evolve the schema by adding, removing or modifying the columns of a record, with much greater ease than Parquet.

As per Avro’s own website:

“Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing”

Also, Avro’s storage is row based and not columnar. This makes it more optimized for reading series of entire rows.

With Avro, I created a very similar nested schema with the following shape:

var schema = new avro.Type.forSchema({
type: 'record',
fields: [
{name: 'memberID', type:'int'},
{
name: 'brand_color',
type: {
type: 'array',
items: {
"type": "record",
"name":"brandColor",
"fields":[
{name: 'brand', type: 'int'},
{name: 'color', type: 'string'},
{name: 'score', type: 'float'}
]
},
}
},
{name: 'timestamp', type:'long'}
]
});

With this schema, I then generated a file containing 2,000,000 members with 10 brand color preferences. The total file size was 181 MB, with no compression at all, about half the size of the Parquet file!

While this looked promising, in order to test how it scales, I decided to increase brand color preferences ten-fold, up to 100 for 2,000,000 members, to see how large this file could get. This increase in data closely reflected our actual needs as well. The total file size grew to 1.5GB. While the increase in memory footprint was almost proportional to the increase in the size of the data, it was far too large for quick access and reads.

The next step was to apply some compression and see how much smaller we could make the file. I decided to try this out with the same snappy code as the one used during the Parquet test. Applying the code was trivial, all that is necessary is to change the write from:

var writer = avro.createFileEncoder(‘test.avro’, schema);

to

var writer = avro.createFileEncoder(‘test.avro’, schema, {codec: ‘snappy’, codecs: {snappy: snappy.compress}});

The result:

A file size of 93MB! A fraction of the previous size. For only 10 brand color preferences, the file size dropped to a mere 16.3 MB.

The following JavaScript code goes through the whole file, turns each row into a JSON object, and benchmarks the operation.

function testAvroRead(schema){
let i =0;
console.time('read')
let lastone;
avro.createFileDecoder('test.avro', {codecs: {"snappy": snappy.uncompress }})
.on('metadata', (data)=> {
console.log(data);
})
.on('data', (row) => {
lastone = schema.toString(row);
i++;
})
.on('end', () => {
console.log(i);
console.log(lastone)
console.timeEnd('read')
});
}

Note that the schema passed into the testAvroRead method is the same one that is defined above. This guarantees schema validation, which has to be performed explicitly with Parquet. The read, decompression, and validation of the entire file took just under 3 minutes. The read speed test was run 10 times to make sure the results were significant, and the output is as follows:

While we lost a bit on the speed of reading, we gained much more on storage and got validation for free!

Conclusion

For our specific case, the winner was clearly Avro. Note however, that this does not discredit Parquet by any means. The data lake at SSENSE heavily relies on Parquet, which makes sense given that we deal with immutable data and analytics queries, for which columnar storage is optimal.

All these proof of concepts were produced using Node.js and JavaScript. This allowed us to compress much of our data in a robust and scalable manner. The built in schema checks each record and throws an error if and when something doesn’t match the definition, making validation a breeze.


Editorial reviews by Deanna Chow, Liela Touré & Prateek Sanyal

Want to work with us? Click here to see all open positions at SSENSE!

ssense-tech

Ideas and research from the software, data & product teams behind the global fashion platform SSENSE.

Mikhail Levkovsky

Written by

Code. Ship. Repeat. Build great things with great people. cofounder @configtree

ssense-tech

Ideas and research from the software, data & product teams behind the global fashion platform SSENSE.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade