CSV vs Parquet vs Avro: Choosing the Right Tool for the Right Job

Mikhail Levkovsky
Jun 13, 2019

Recently at SSENSE, my team ‘Machete’, which is responsible for building part of a recommendation engine, handled a task for which we had to prepare and work with a relatively large data set. The data wasn’t quite big enough to warrant using something like Spark or Hadoop MapReduce, but it was large enough to force us to consider both the storage footprint and the read performance of the data. In short, we needed a file format that could be reasonably compressed and provide fast read times. While investigating solutions to this problem, I had the opportunity to explore the inner workings of a few commonly used file types for storing data. In this article, I share my notes and learnings from this process.

The Contenders

  1. CSV
  2. Parquet
  3. Avro

CSV

Our requirements here were minimal: the files just included a timestamp, the product ID, and the product score.

An example output would be:

spring_exclusives.csv

timestamp,product_id,product_score
1555480889,46260,0.3381890166444166
1555480889,85461,0.34786700678437343
1555480889,85462,0.2932997952484385
1555480889,113723,0.3803736086217753
1555480889,113724,0.39684375226672763
1555480889,113727,0.38768682121073605


Easy peasy!
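The writing side is just as simple. Here is a minimal sketch in plain Node.js, using the rows from the sample above; no CSV library is needed for output this flat:

const fs = require('fs');

// Score rows taken from the sample above: [timestamp, product_id, product_score].
const rows = [
  [1555480889, 46260, 0.3381890166444166],
  [1555480889, 85461, 0.34786700678437343],
  [1555480889, 85462, 0.2932997952484385]
];

const header = 'timestamp,product_id,product_score';
const lines = rows.map((row) => row.join(','));

// One header line plus one line per record.
fs.writeFileSync('spring_exclusives.csv', [header, ...lines].join('\n') + '\n');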

Larger Data Sets

The following examples assume a hypothetical scenario of trying to store members and what their brand color preferences are. For example: Sarah has an ID of 10 and she really likes Nike (ID 1) in red, Adidas (ID 2) in blue, and Reebok (ID 3) in green.
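As a concrete illustration, Sarah's record could be represented as a nested object like the one below; the field names mirror the schemas used later, and the scores are invented:

// Sarah, member 10, with three hypothetical brand color preferences.
const sarah = {
  memberID: 10,
  brand_color: [
    { brand: 1, color: 'red', score: 0.9 },   // Nike
    { brand: 2, color: 'blue', score: 0.7 },  // Adidas
    { brand: 3, color: 'green', score: 0.5 }  // Reebok
  ],
  timestamp: Date.now()
};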

Parquet

As per Parquet's own documentation, it is:

“a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language”

Columnar storage is fundamentally different from row-based storage. While row-based storage is optimized for reading a sequence of rows, columnar storage is optimized for reading only a subset of columns. In-depth documentation can be found on Parquet’s website. Let’s look at some code.
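To make the distinction concrete before diving in, here is a rough, purely illustrative sketch of how three small records might be laid out on disk under each model:

// Row-based layout (CSV, Avro): all fields of a record sit next to each other,
// so reading whole records is cheap:
//   [10, "red", 0.9] [11, "blue", 0.7] [12, "green", 0.5]
//
// Columnar layout (Parquet): all values of one column sit next to each other,
// so a read that only needs memberID can skip the color and score data entirely:
//   memberID: [10, 11, 12]
//   color:    ["red", "blue", "green"]
//   score:    [0.9, 0.7, 0.5]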

With Parquet I created a nested schema of the following shape:

const parquet = require('parquetjs');

// Nested Parquet schema: one row per member, with a repeated group of
// brand/color/score preferences and a millisecond-precision timestamp.
var schema = new parquet.ParquetSchema({
  memberID: { type: 'UINT_32' },
  brand_color: {
    repeated: true,
    fields: {
      brand: { type: 'UINT_32' },
      color: { type: 'UTF8' },
      score: { type: 'FLOAT' }
    }
  },
  timestamp: { type: 'TIMESTAMP_MILLIS' }
});

With this, I generated a file with 2,000,000 members with 10 brand color preferences each. The total file was 335 MB in size.
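A write loop for a file like this might look roughly as follows; the member data generated here is invented for illustration, while the ParquetWriter.openFile, appendRow, and close calls come from the parquetjs API:

async function writeParquetFile() {
  // Open a writer bound to the schema defined above.
  const writer = await parquet.ParquetWriter.openFile(schema, 'test.parquet');

  // Hypothetical generation loop: 2,000,000 members, 10 preferences each.
  for (let memberID = 0; memberID < 2000000; memberID++) {
    const brand_color = [];
    for (let brand = 1; brand <= 10; brand++) {
      brand_color.push({ brand, color: 'red', score: Math.random() });
    }
    await writer.appendRow({ memberID, brand_color, timestamp: new Date() });
  }

  await writer.close();
}

writeParquetFile().catch(console.error);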

In order to slightly reduce the file size, I applied snappy codec compression. Doing this was trivial as it was just a small change to the schema.

var schema = new parquet.ParquetSchema({
  memberID: { type: 'UINT_32' },
  brand_color: {
    repeated: true,
    fields: {
      brand: { type: 'UINT_32', compression: 'SNAPPY' },
      color: { type: 'UTF8', compression: 'SNAPPY' },
      score: { type: 'FLOAT', compression: 'SNAPPY' }
    }
  },
  timestamp: { type: 'TIMESTAMP_MILLIS' }
});

The file size decreased from 335 MB to 69.7 MB.

It took my script about 50 seconds to go through the whole file for just a simple read. The primary advantage of Parquet, as noted before, is that it uses a columnar storage system, meaning that if you only need part of each record, the latency of reads is considerably lower. Here is the result from 10 reads:

[Chart: timings for 10 full-record reads of the Parquet file, each around 50 seconds]

By just plucking out the memberID column instead of the entire record, the reading time goes from 50 seconds to just 1!

Here is the read function that only plucks the memberID:

// Read only the memberID column from test.parquet and count the records.
async function testParquetRead() {
  let reader = await parquet.ParquetReader.openFile('test.parquet');
  // Passing a list of columns to getCursor limits the read to those columns.
  let cursor = reader.getCursor(['memberID']);
  let record = null;
  let i = 0;
  let lastone;

  while ((record = await cursor.next())) {
    lastone = record;
    i++;
  }
  await reader.close();
  console.log(lastone);
  console.log(i);
}

I ran this read 10 times to make sure the 1-second result wasn’t just a one-off.
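The repetition itself can be done with a small harness along these lines; the console.time calls and labels are my own addition rather than part of the original script:

// Run the memberID-only read ten times, timing each pass.
async function benchmarkParquetReads(runs = 10) {
  for (let run = 1; run <= runs; run++) {
    console.time(`parquet read #${run}`);
    await testParquetRead();
    console.timeEnd(`parquet read #${run}`);
  }
}

benchmarkParquetReads().catch(console.error);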

[Chart: timings for 10 memberID-only reads, each around 1 second]

However, in our case, we needed the whole record at all times, so this wasn’t much of an advantage.

Avro

As per Avro’s own website:

“Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing”

Also, Avro’s storage is row-based rather than columnar, which makes it better suited to reading entire rows in sequence.

With Avro, I created a very similar nested schema with the following shape:

const avro = require('avsc');
const snappy = require('snappy'); // used further down for the snappy codec

// Avro schema mirroring the Parquet one: a named record with an array of
// brand/color/score preferences and a timestamp. Note that Avro requires
// a name for record types.
var schema = avro.Type.forSchema({
  type: 'record',
  name: 'Member',
  fields: [
    { name: 'memberID', type: 'int' },
    {
      name: 'brand_color',
      type: {
        type: 'array',
        items: {
          type: 'record',
          name: 'brandColor',
          fields: [
            { name: 'brand', type: 'int' },
            { name: 'color', type: 'string' },
            { name: 'score', type: 'float' }
          ]
        }
      }
    },
    { name: 'timestamp', type: 'long' }
  ]
});

With this schema, I then generated a file containing 2,000,000 members with 10 brand color preferences each. The total file size was 181 MB with no compression at all, about half the size of the uncompressed Parquet file!
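Writing the file amounts to pushing records into the writable stream returned by avro.createFileEncoder. Here is a rough sketch with an invented generation loop; a production script would also respect stream backpressure:

const writer = avro.createFileEncoder('test.avro', schema);

// Hypothetical generation loop: 2,000,000 members, 10 preferences each.
for (let memberID = 0; memberID < 2000000; memberID++) {
  const brand_color = [];
  for (let brand = 1; brand <= 10; brand++) {
    brand_color.push({ brand, color: 'red', score: Math.random() });
  }
  // write() queues the record for encoding into the Avro container file.
  writer.write({ memberID, brand_color, timestamp: Date.now() });
}
writer.end();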

While this looked promising, in order to test how it scales, I decided to increase the brand color preferences ten-fold, up to 100 per member for the same 2,000,000 members, to see how large the file could get. This increase in data also closely reflected our actual needs. The total file size grew to 1.5 GB. While the growth in storage footprint was almost proportional to the growth in the data, the file was far too large for quick access and reads.

The next step was to apply some compression and see how much smaller we could make the file. I decided to try this out with the same snappy codec as the one used during the Parquet test. Applying the codec was trivial; all that was necessary was to change the write from:

var writer = avro.createFileEncoder('test.avro', schema);

to

var writer = avro.createFileEncoder('test.avro', schema, {codec: 'snappy', codecs: {snappy: snappy.compress}});

The result:

A file size of 93 MB! A fraction of the previous size. For only 10 brand color preferences, the file size dropped to a mere 16.3 MB.

The following JavaScript code goes through the whole file, turns each row into a JSON object, and benchmarks the operation.

// Stream through test.avro, decoding each row with the schema, and time it.
function testAvroRead(schema) {
  let i = 0;
  console.time('read');
  let lastone;

  avro.createFileDecoder('test.avro', { codecs: { snappy: snappy.uncompress } })
    .on('metadata', (data) => {
      // The schema used at write time is embedded in the file and surfaced here.
      console.log(data);
    })
    .on('data', (row) => {
      lastone = schema.toString(row); // serialize the decoded row back to JSON
      i++;
    })
    .on('end', () => {
      console.log(i);
      console.log(lastone);
      console.timeEnd('read');
    });
}

testAvroRead(schema);

Note that the schema passed into the testAvroRead method is the same one that is defined above. This guarantees schema validation, which has to be performed explicitly with Parquet. The read, decompression, and validation of the entire file took just under 3 minutes. The read speed test was run 10 times to make sure the results were significant, and the output is as follows:

[Chart: timings for 10 full Avro reads with decompression and validation, each just under 3 minutes]

While we lost a bit on the speed of reading, we gained much more on storage and got validation for free!
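As an aside, avsc also exposes explicit validation through the type's isValid method, which is handy for checking a record before writing it; here is a small illustration with hypothetical records:

// Returns true: the record matches the schema defined above.
console.log(schema.isValid({
  memberID: 10,
  brand_color: [{ brand: 1, color: 'red', score: 0.9 }],
  timestamp: Date.now()
}));

// Returns false: memberID must be an int, not a string.
console.log(schema.isValid({
  memberID: 'ten',
  brand_color: [],
  timestamp: Date.now()
}));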

Conclusion

All of these proofs of concept were built with Node.js and JavaScript. For our use case, Avro’s strong compression and built-in validation made it the best fit, letting us compress our data in a robust and scalable manner; its schema checks each record and throws an error if and when something doesn’t match the definition, making validation a breeze.

Editorial reviews by Deanna Chow, Liela Touré & Prateek Sanyal
