Serialization and Deserialization in Python

Comparison of popular serialization formats

“No one can whistle a symphony. It takes a whole orchestra to play it.”
— Halford E. Luccock

One of the most notable human traits, which has allowed mankind to overcome countless obstacles on its way towards progress, is collaboration. Collaboration allows a whole to become greater than the sum of its parts, and the opening quote describes that neatly.

Another trait, which becomes more and more important nowadays, is the ability to automate routines and free up mental resources for creative activities.

Collaboration is possible through communication and automation is a prerogative of computers.

Modern computers are very powerful. However, just as with a single human, the performance of a single computer is limited, and multiple computers have to collaborate to step beyond their individual limits. Thus, computers need to communicate too.

Communication is quite a complex activity which involves multiple overheads. One of them is SerDe, serialization and deserialization of messages: the process of converting a structured message into a format suitable for serial transmission over a medium, and of restoring it back.


Goals

This article examines and compares a set of popular serialization formats which can be used for transporting and processing big batches of structured data (hundreds of thousands of records) in the Python programming language. Its goal is to share the results of examining the pros and cons of the following formats:

  1. CSV
  2. JSON
  3. JSON Lines
  4. msgpack
  5. Avro
  6. Protobuf
  7. Cap’n Proto

I hope that this article will help its readers to decide which format may fit their needs better.


A thing to think about

Also, I’d like you to keep in mind that being a software engineer and operating almost exclusively with such immaterial entities as ideas does not mean you should not think about the impact of your ideas on the material world. Actually you should, especially nowadays, when a simple software solution can easily propagate and scale around the globe. So, out of respect for nature, dear reader, please make sure your solutions are efficient and do not waste energy and materials. Selecting a proper serialization format relates firmly to this.


Comparison apparatus

To compare several entities with each other, a list of criteria must be defined. This article uses quite a few of them, so to make the comparison easier to follow, they are split into three groups: primary, secondary and extra.

Primary criteria group

The primary group includes the most important criteria, which relate to resource usage:

  1. CPU consumption: measured by execution time;
  2. main memory consumption: measured by resident set size (RSS);
  3. external memory consumption: measured by the size of data which is going to be transferred and may be buffered to disk. This includes both compressed (tar.gz) and uncompressed data.

Secondary criteria group

Next, the secondary group includes criteria related to data type detection and validation:

  1. supported set of data types;
  2. available level of validation;
  3. versioning.

Extra criteria group

Finally, the extra group defines criteria which might be useful, but are not as critical as the previous ones:

  1. support of empty (optional) values;
  2. ability to process batches of data as streams;
  3. restrictions for format of field names;
  4. steepness of the learning curve: how hard it is to start using the format;
  5. human-orientation: how easily the format can be comprehended by humans;
  6. computer-orientation: how efficiently the format can be processed by machines.

Test environment

The primary criteria group requires us to examine the consumption of fundamental resources: time and space. This requirement is satisfied empirically: several measurements are taken for each selected serialization format. Of course, measurement results depend on the environment they are taken in: hardware and software.

Although we are going to end up comparing the selected criteria using relative numbers (percent), absolute values must also be provided so that measurements can be repeated and compared. The list below describes the system configuration used to perform the tests:

  • GIGABYTE GA-Z170-D3H motherboard (with Intel Z170 chipset);
  • Intel Core i7–6700 CPU (4 physical cores, 8 virtual cores provided by hyper-threading, 3.40GHz);
  • 32 GB of DDR4 RAM (2x Kingston KHX2666C15/16GX, 2133 MHz);
  • SSD 850 EVO SATA III (540MB/s seq. read, 520MB/s seq. write);
  • 64-bit GNU/Linux kernel (4.8.0–58-generic, elementary OS 0.4.1 Loki);
  • Python 3.4.5 (default, Jul 15 2016).

Test data set

As stated in the “Goals” section, we are interested in processing big batches of structured data containing different data types: dates, strings, integers, floats, nulls, etc. So, before starting any measurements, a test data set must be prepared.

For this article we take the Airline On-Time Performance Data provided by the Bureau of Transportation Statistics. We will use an exported data set which includes data from all US regions for January 2017 (the data can be fetched from here).

Our data set includes 450'017 records with the following fields (see glossary):

┌────────────────────────────────┐
│ Table 1 — Data field types │
├───────────────────────┬────────┤
│ Name │ Type │
├───────────────────────┼────────┤
│ FL_DATE │ Date │
│ AIRLINE_ID │ UInt32 │
│ CARRIER │ String │
│ TAIL_NUM │ String │
│ FL_NUM │ UInt32 │
│ ORIGIN_AIRPORT_ID │ UInt32 │
│ ORIGIN_AIRPORT_SEQ_ID │ UInt32 │
│ ORIGIN_CITY_MARKET_ID │ UInt32 │
│ ORIGIN │ String │
│ ORIGIN_CITY_NAME │ String │
│ ORIGIN_STATE_ABR │ String │
│ ORIGIN_STATE_FIPS │ UInt32 │
│ ORIGIN_STATE_NM │ String │
│ ORIGIN_WAC │ UInt32 │
│ DEST_AIRPORT_ID │ UInt32 │
│ DEST_AIRPORT_SEQ_ID │ UInt32 │
│ DEST_CITY_MARKET_ID │ UInt32 │
│ DEST │ String │
│ DEST_CITY_NAME │ String │
│ DEST_STATE_ABR │ String │
│ DEST_STATE_FIPS │ UInt32 │
│ DEST_STATE_NM │ String │
│ DEST_WAC │ UInt32 │
│ DEP_DELAY │ Float │
│ TAXI_OUT │ Float │
│ WHEELS_OFF │ Float │
│ WHEELS_ON │ Float │
│ TAXI_IN │ Float │
│ ARR_DELAY │ Float │
│ AIR_TIME │ Float │
│ DISTANCE │ Float │
└───────────────────────┴────────┘

The fetched data comes as a dynamically generated CSV file. Unfortunately, the raw data is not suitable as-is, and the source file needs to be normalized a bit.

First of all, we need to strip away trailing commas at the end of each line:

sed -i 's/,$//g' source.csv

Next, we need to quote dates so that they conform to the string format:

perl -pi -e 's/^(\d{4}-\d{2}-\d{2}),/"\1",/g' source.csv

Now we have a ready-to-use data set with the following attributes:

┌──────────────────────────────────┐
│ Table 2 — Source data attributes │
├────────────────────────┬─────────┤
│ Attribute │ Value │
├────────────────────────┼─────────┤
│ Number of records, # │ 450'017 │
│ Uncompressed size, MiB │ 95.3 │
│ Compressed size, MiB │ 13.1 │
└────────────────────────┴─────────┘

Measurement approach

We need to run several tests to measure the attributes for the primary criteria group. The logic of the tests is the same for all serialization formats being compared:

  1. Load data from source CSV with automatic type detection.
  2. Save that data multiple times using the target serializer and measure the average run time and RSS. Each test runs in a separate process and writes to a separate temporary file.
  3. Save a single permanent file.
  4. Load the serialized file multiple times using the target deserializer and measure the average run time and RSS. Each test runs in a separate process.

Multiple tests for a single serializer are run in parallel to save time. This is achieved by using concurrent.futures.ProcessPoolExecutor with the number of parallel workers equal to the number of virtual CPU cores (8 for the described environment). The process pool is recreated after each subprocess has executed one test run, which avoids memory caching effects. The number of tests per serializer equals double the size of the pool, so there are 16 separate tests executed in 2 sequential batches of 8 tests each.
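Below is a minimal sketch of how such a measurement loop could be organized. It is not the code from the repository: json.dump stands in for the serializer under test, the worker only measures elapsed time, and the RSS bookkeeping is omitted.

import concurrent.futures
import json
import os
import time


def run_single_test(records):
    # One test run in its own process: write to a per-process temporary file,
    # report the elapsed time (the real tests also record RSS) and clean up.
    path = '/tmp/serde_test_{}.json'.format(os.getpid())
    start = time.perf_counter()
    with open(path, 'w') as f:
        json.dump(records, f)  # stand-in for the serializer under test
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed


def measure(records, workers=8):
    timings = []
    # Two sequential batches of `workers` runs each; the pool is recreated
    # between batches so that no cached memory survives between runs.
    for _ in range(2):
        with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(run_single_test, records) for _ in range(workers)]
            timings += [f.result() for f in futures]
    return sum(timings) / len(timings)


if __name__ == '__main__':
    sample = [{'CARRIER': 'AA', 'FL_NUM': i} for i in range(1000)]
    print('average save time: {:.4f} s'.format(measure(sample)))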

Measurement errors are not evaluated, as the measurements are accurate enough for comparison even with statistical errors taken into account.

All sources are available in a GitHub repository. The number of test cycles per serializer is configurable via the --cycles argument of run.py.


CSV

CSV (comma-separated values) is the simplest serialization format among all compared in this article. Support for it is provided by the standard Python csv library. Alternative libraries such as pandas can be used as well.

CSV stores data in plain-text files, keeping each data record on a separate line. This allows records to be processed sequentially. As a result, it is possible to process files which are much larger than the available amount of RAM.

Records in a CSV file are organized as a table, with the data fields of a single record separated by a comma or a user-defined delimiter. If file lines contain trailing delimiters, the Python csv library will assume there is an extra nameless column with None values. That’s why we removed all trailing commas from our source data file.

By default the Python csv library does not distinguish data types and treats all values as strings. This can be very limiting.

To avoid this limitation, the QUOTE_NONNUMERIC format option can be used. It makes Python quote all string values, which allows strings to be distinguished from numbers. However, there is still a nuance: all numbers will be treated as floats, which can be inconvenient overkill. Moreover, explicit quoting of strings is very likely to increase the size of the stored data, because extra characters are used. Consequently, parsing time increases as well.
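Here is a minimal round-trip sketch showing the effect of QUOTE_NONNUMERIC (the file name sample.csv is arbitrary). Note how integer values come back as floats:

import csv

rows = [['2017-01-01', 'AA', 19805, -5.0],
        ['2017-01-02', 'DL', 19790, 12.0]]

# Write with QUOTE_NONNUMERIC: every non-numeric value gets quoted.
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)

# Read with the same option: unquoted fields are restored as float,
# so even AIRLINE_ID comes back as 19805.0 rather than an int.
with open('sample.csv', newline='') as f:
    for row in csv.reader(f, quoting=csv.QUOTE_NONNUMERIC):
        print(row)   # ['2017-01-01', 'AA', 19805.0, -5.0]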

One may use pandas or some other library to extend the set of supported data types and to enable intelligent type detection. For example, pandas understands dates and converts them to an appropriate type. It produces an output file whose size is comparable to that of a default CSV file with explicit string quoting. pandas saves data a bit faster than the default csv library, but loads it dramatically slower.
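A hedged sketch of the pandas approach, assuming the normalized source.csv produced earlier; parse_dates is the option that gives us date detection:

import pandas as pd

# parse_dates makes pandas detect FL_DATE as a datetime column; other
# columns get inferred numeric types instead of plain strings.
df = pd.read_csv('source.csv', parse_dates=['FL_DATE'])
print(df.dtypes.head())

# Saving back produces a file comparable in size to the quoted CSV variant.
df.to_csv('pandas.csv', index=False)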

Being schemaless and having poor type detection facilities, CSV does not provide a validation mechanism for stored data. Its implementation is up to the users. As an example, we will use the schematics Python library to define a schema with basic validation of data types during deserialization. Using this library may be overkill for our needs and data volume, but it shows how much cost validation can add to a SerDe.
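For illustration, a cut-down schematics model covering a few of the fields might look like this (the real schema in the repository declares all of them); validate() raises an error when a value does not match the declared type:

from schematics.models import Model
from schematics.types import DateType, FloatType, IntType, StringType


class FlightRecord(Model):
    FL_DATE = DateType()
    CARRIER = StringType()
    AIRLINE_ID = IntType()
    DEP_DELAY = FloatType(required=False)


# Raw values as they come from csv.DictReader: everything is a string.
row = {'FL_DATE': '2017-01-01', 'CARRIER': 'AA',
       'AIRLINE_ID': '19805', 'DEP_DELAY': '-5.0'}

record = FlightRecord(row)
record.validate()            # raises an error on type mismatch
print(record.AIRLINE_ID)     # 19805, coerced to int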

The table below shows the differences in resource usage of the described CSV serialization approaches.

┌────────────────────────────────────────────────────────────────┐
│ Table 3 — CSV resource usage │
├──────────┬───────┬───────┬─────────┬──────┬───────┬────────────┤
│ │ │ │ Load │ │ │ │
│ │ │ │ with │ │ │ │
│ │ Save │ Load │ valida- │ Load │ File │ Compressed │
│ │ time, │ time, │ tion │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ time, s │ MiB │ MiB │ MiB │
├──────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ default │ 13.8 │ 7.9 │ 544.0 │ 1385 │ 82.5 │ 12.2 │
│ quoted │ 13.9 │ 9.5 │ 578.4 │ 1127 │ 90.3 │ 12.5 │
│ pandas │ 8.5 │ 30.5 │ 539.5 │ 1017 │ 89.9 │ 13.0 │
└──────────┴───────┴───────┴─────────┴──────┴───────┴────────────┘

The next table provides information about compliance of CSV to other defined criteria.

┌──────────────────────────────────────────────────────────────────┐
│ Table 4 — Compliance of CSV to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ Mainly strings │
│ data types │ │
│ │ Numbers can be used also, but only as │
│ │ floats │
│ │ │
│ │ Additional types can be provided by │
│ │ external libraries at the expense of │
│ │ type-detection time │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Absent │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Absent │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ Yes │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Simple │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ Yes, can be accessed as-is or via │
│ │ spreadsheet applications like Excel │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ No │
└──────────────────────┴───────────────────────────────────────────┘

JSON

JSON (JavaScript Object Notation) is a lightweight human-readable serialization format which operates with collections of key-value pairs and ordered lists of values. It is a very popular data format used by modern web services.

It’s very easy to start using JSON in Python. The json module from the standard library provides extensive support for this format and allows you to configure many things, including rules for serializing non-primitive data types.
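For example, the default argument of json.dumps lets you define a conversion rule for a type the format does not know about, such as dates (a minimal sketch with a made-up record):

import json
from datetime import date


def encode_extra(obj):
    # Conversion rule for a non-primitive type: dates become ISO-8601 strings.
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError('{!r} is not JSON serializable'.format(obj))


record = {'FL_DATE': date(2017, 1, 1), 'CARRIER': 'AA', 'DEP_DELAY': -5.0}
print(json.dumps(record, default=encode_extra))
# FL_DATE is emitted as "2017-01-01" instead of raising a TypeError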

It’s worth mentioning that there are plenty of 3rd-party libraries which can be used as an alternative to the json module from the standard library.

One such alternative is the UltraJSON (ujson) library, whose implementation is written in pure C. The C implementation provides high speed, but it has a couple of drawbacks.

One drawback is the limited set of supported data types: there is no way to implement a custom serializer and deserializer as with the standard module.

Another drawback is potential memory leakage caused by the difficulty of robust memory management in C: a slight bug may introduce a big leak when dealing with large volumes of data. Please note: this article does not claim that the ujson library has a memory leak; however, it consumes almost double the amount of RAM used by the standard library, which is definitely worth mentioning.

Just like CSV, JSON is a schemaless format, hence it does not provide data validation mechanisms. It’s up to the user to implement them.

As for the storage approach, JSON treats the whole data structure as a single batch, hence the system must have enough RAM to process the data as a whole.

Like CSV, JSON is stored as plain text and can therefore be comprehended by humans. Readability can be controlled by the indentation level; however, indents bloat the size of the stored file and should be used only in special cases. For example, if a web service uses JSON for response serialization, an API endpoint can accept a pretty flag in the request to produce an indented response, which is very useful for examination by humans.
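A tiny illustration of the cost of such a pretty flag: the indent argument makes the payload readable but noticeably larger:

import json

record = {'CARRIER': 'AA', 'ORIGIN': 'JFK', 'DEST': 'LAX'}

compact = json.dumps(record)            # what a service would return by default
pretty = json.dumps(record, indent=2)   # what a hypothetical ?pretty=1 flag could enable

print(len(compact), len(pretty))        # the indented payload is noticeably larger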

The table below shows the results of resource usage measurements for the standard json module and the 3rd-party ujson library.

┌────────────────────────────────────────────────────────────────┐
│ Table 5 — JSON resource usage │
├──────────┬───────┬───────┬─────────┬──────┬───────┬────────────┤
│ │ │ │ Load │ │ │ │
│ │ │ │ with │ │ │ │
│ │ Save │ Load │ valida- │ Load │ File │ Compressed │
│ │ time, │ time, │ tion │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ time, s │ MiB │ MiB │ MiB │
├──────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ standard │ 45.5 │ 8.1 │ 584.4 │ 1111 │ 287.9 │ 25.8 │
│ ujson │ 2.5 │ 6.9 │ 580.8 │ 1962 │ 287.9 │ 25.8 │
└──────────┴───────┴───────┴─────────┴──────┴───────┴────────────┘

The following table provides information about compliance of JSON to other defined criteria.

┌──────────────────────────────────────────────────────────────────┐
│ Table 6 — Compliance of JSON to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ All primitive data types │
│ data types │ │
│ │ │
│ │ Complex data structures can be processed │
│ │ by implementing custom conversion rules │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Absent │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Absent │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Very simple │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ Yes │
│ │ │
│ │ Readability can be enhanced by using │
│ │ indents │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ No │
└──────────────────────┴───────────────────────────────────────────┘

JSON Lines

JSON Lines is a variation of JSON which keeps each record on a separate line. This makes it convenient for processing large files with a limited amount of RAM.

This format is very easy to implement, as the sketch below shows. All that is needed is to serialize each item separately and write it on its own line. However, using the newline character as the record delimiter implies that indentation cannot be used.
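A minimal sketch of both directions, using the standard json module and an arbitrary data.jsonl file:

import json

records = [{'CARRIER': 'AA', 'FL_NUM': 1}, {'CARRIER': 'DL', 'FL_NUM': 2}]

# Write: one JSON document per line; no indentation is possible here.
with open('data.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# Read: stream line by line instead of loading the whole batch at once.
with open('data.jsonl') as f:
    for line in f:
        print(json.loads(line))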

As with plain JSON, the table below shows the results of resource usage measurements for implementations based on the standard json module and the 3rd-party ujson library.

┌────────────────────────────────────────────────────────────────┐
│ Table 7 — JSON Lines resource usage │
├──────────┬───────┬───────┬─────────┬──────┬───────┬────────────┤
│ │ │ │ Load │ │ │ │
│ │ │ │ with │ │ │ │
│ │ Save │ Load │ valida- │ Load │ File │ Compressed │
│ │ time, │ time, │ tion │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ time, s │ MiB │ MiB │ MiB │
├──────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ standard │ 10.8 │ 10.0 │ 598.1 │ 1971 │ 287.9 │ 25.8 │
│ ujson │ 3.0 │ 5.8 │ 591.8 │ 1971 │ 287.9 │ 25.8 │
└──────────┴───────┴───────┴─────────┴──────┴───────┴────────────┘

As you can see, RAM and disk consumption is equal for both approaches.

Interestingly, compared to plain JSON, the save time for the standard library has decreased by 76.3%, while the load time has increased by 23.5%. As for ujson, the save time has increased by 20%, while the load time has decreased by 15.9%.

As for the other criteria, they are almost the same as for plain JSON. The only differences are that streaming support is added and indentation support is removed. The table below reflects that.

┌──────────────────────────────────────────────────────────────────┐
│ Table 8 — Compliance of JSON Lines to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ All primitive data types │
│ data types │ │
│ │ │
│ │ Complex data structures can be processed │
│ │ by implementing custom conversion rules │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Absent │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Absent │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ Yes │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Very simple │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ Yes │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ No │
└──────────────────────┴───────────────────────────────────────────┘

msgpack

msgpack (MessagePack) is a JSON-like serialization format, with the exception that it stores data as a compact binary blob.

The official Python implementation is provided by the msgpack-python library, which is written in Cython. There are other libraries as well, for example u-msgpack-python, which is implemented in pure Python and can be vendored as a single file. The performance of both libraries is compared in this section.

As already mentioned, msgpack stores data in binary format. It is worth mentioning that msgpack-python by default restores all strings as bytes. This is true even for dictionary keys, which may be confusing. Two different options must be set for the serializer and the deserializer respectively to force strings to be restored as strings. Note: this makes deserialization almost twice as slow (72.4% slower, to be precise). As for u-msgpack-python, it restores strings as strings by default.
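A hedged sketch of those options as they are named in recent msgpack-python releases (use_bin_type for the packer and raw for the unpacker; older releases used encoding='utf-8' on the unpacking side):

import msgpack

record = {'CARRIER': 'AA', 'ORIGIN': 'JFK'}

# use_bin_type=True makes the packer mark str values as UTF-8 strings
# rather than raw bytes.
packed = msgpack.packb(record, use_bin_type=True)

# raw=False makes the unpacker restore them as str instead of bytes.
restored = msgpack.unpackb(packed, raw=False)
print(restored)   # {'CARRIER': 'AA', 'ORIGIN': 'JFK'}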

Just like JSON, msgpack is a schemaless format and does not provide validation mechanisms.

Storing data in binary format makes it hard for humans to comprehend. However, binary packing results in a smaller output size.

In contrast to JSON, msgpack supports streaming out of the box, which is very good. However, streaming makes serialization 41.3% slower; deserialization shows no visible impact.
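A streaming sketch using msgpack.Packer and msgpack.Unpacker with an arbitrary file name:

import msgpack

records = [{'FL_NUM': 1}, {'FL_NUM': 2}, {'FL_NUM': 3}]

# Serialize records one by one instead of packing the whole list at once.
with open('data.msgpack', 'wb') as f:
    packer = msgpack.Packer(use_bin_type=True)
    for record in records:
        f.write(packer.pack(record))

# Unpacker consumes the file incrementally, yielding one record at a time.
with open('data.msgpack', 'rb') as f:
    for record in msgpack.Unpacker(f, raw=False):
        print(record)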

The table below shows the results of resource usage measurements for different usage scenarios of msgpack-python and for one simple usage scenario of u-msgpack-python.

┌──────────────────────────────────────────────────────────────────┐
│ Table 9 — msgpack resource usage │
├────────────┬───────┬───────┬─────────┬──────┬───────┬────────────┤
│ │ │ │ Load │ │ │ │
│ │ │ │ with │ │ │ │
│ │ Save │ Load │ valida- │ Load │ File │ Compressed │
│ │ time, │ time, │ tion │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ time, s │ MiB │ MiB │ MiB │
├────────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ default │ 2.9 │ 2.9 │ N/A │ 1710 │ 250.6 │ 27.0 │
├────────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ with UTF │ 2.9 │ 5.0 │ 464.8 │ 1997 │ 250.6 │ 27.0 │
├────────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ with UTF & │ 4.2 │ 4.8 │ 470.1 │ 2015 │ 250.6 │ 26.2 │
│ streaming │ │ │ │ │ │ │
├────────────┼───────┼───────┼─────────┼──────┼───────┼────────────┤
│ u-message- │ 66.3 │ 72.2 │ 533.5 │ 1967 │ 250.6 │ 27.0 │
│ -pack │ │ │ │ │ │ │
└────────────┴───────┴───────┴─────────┴──────┴───────┴────────────┘

The other criteria are described in the table below.

┌──────────────────────────────────────────────────────────────────┐
│ Table 10 — Compliance of msgpack to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ All primitive data types │
│ data types │ │
│ │ │
│ │ Complex data structures can be processed │
│ │ by implementing custom conversion rules │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Absent │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Absent │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ Yes │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Very simple │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ Yes │
└──────────────────────┴───────────────────────────────────────────┘

Avro

Apache Avro is a binary serialization format which relies on schemas.

Notably, Avro embeds the schema into the stored data, so readers do not need to know it in advance to understand the serialized data. It may sound nice, but it’s hard to tell whether this feature provides real value or not.

The use of schemas enables basic validation, which is limited to data type validation.

Avro supports primitive as well as complex data types. Streaming is also supported.

As for Python support, Avro delivers official packages for Python 2 and for Python 3. There’s also a fastavro implementation available.

Getting started with the official packages can become a quest. First of all, the packages for different Python versions differ. For example, in Python 2 a schema is loaded using the avro.schema.parse function, which accepts the raw binary content of the schema file. In contrast, in Python 3 avro.schema.SchemaFromJSONData must be used, which takes a schema already deserialized from JSON. However, these differences are not documented, as both packages refer to the same documentation for Python 2.

One more thing is worth mentioning. The official Python implementations of Avro support streaming; however, they do not let you operate on batches of data. So, you have to save and load each record manually in a loop.
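A sketch of that record-by-record loop with the official package, assuming the record schema lives in a hypothetical flight.avsc file and using the Python 3 schema loading function discussed above:

import json

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Build the schema from already-deserialized JSON (Python 3 API);
# on Python 2, avro.schema.parse would take the raw file content instead.
with open('flight.avsc') as f:
    schema = avro.schema.SchemaFromJSONData(json.load(f))

records = [{'FL_DATE': '2017-01-01', 'AIRLINE_ID': 19805}]

# Records are appended one at a time, in a loop.
writer = DataFileWriter(open('flights.avro', 'wb'), DatumWriter(), schema)
for record in records:
    writer.append(record)
writer.close()

# Reading is also record by record.
reader = DataFileReader(open('flights.avro', 'rb'), DatumReader())
for record in reader:
    print(record)
reader.close()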

Finally, Avro produces the smallest raw data files, but its official implementation is almost the slowest SerDe observed in this article.

As stated earlier, there’s an alternative Avro implementation for Python called fastavro. Its documentation states that it outperforms the default implementation, but supports fewer features.

It also has a simpler API, which expects the schema to be passed as decoded JSON.

In contrast with the default implementation, fastavro allows data to be saved in bulk, as shown below.
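A fastavro sketch with a hypothetical, cut-down schema (the real one declares all 31 fields):

import fastavro

schema = {
    'name': 'FlightRecord',
    'type': 'record',
    'fields': [
        {'name': 'FL_DATE', 'type': 'string'},
        {'name': 'AIRLINE_ID', 'type': 'int'},
        {'name': 'DEP_DELAY', 'type': ['null', 'float']},
    ],
}

records = [{'FL_DATE': '2017-01-01', 'AIRLINE_ID': 19805, 'DEP_DELAY': -5.0}]

# The whole batch of records is written in one call.
with open('flights.avro', 'wb') as out:
    fastavro.writer(out, schema, records)

with open('flights.avro', 'rb') as f:
    for record in fastavro.reader(f):
        print(record)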

The table below shows the results of resource usage measurements for the official Avro implementation (default) and for fastavro.

┌────────────────────────────────────────────────────────┐
│ Table 11 — Avro resource usage │
├────────────┬───────┬───────┬──────┬───────┬────────────┤
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ Save │ Load │ Load │ File │ Compressed │
│ │ time, │ time, │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ MiB │ MiB │ MiB │
├────────────┼───────┼───────┼──────┼───────┼────────────┤
│ default │ 262.8 │ 270.7 │ 1161 │ 74.0 │ 12.9 │
│ fastavro │ 33.2 │ 103.0 │ 1161 │ 74.0 │ 12.9 │
└────────────┴───────┴───────┴──────┴───────┴────────────┘

As can be seen, fastavro saves data almost 8 times faster than its default counterpart. It also loads data 2.6 times faster.

The next table provides information about compliance of Avro to other defined criteria.

┌──────────────────────────────────────────────────────────────────┐
│ Table 12 — Compliance of Avro to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ All primitive data types │
│ data types │ │
│ │ │
│ │ Several complex collection-like and │
│ │ record-like data types are supported as │
│ │ well │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Basic │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Absent │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ Yes │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Moderate │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ Yes │
└──────────────────────┴───────────────────────────────────────────┘

Protobuf

Protobuf (Protocol Buffers) is a binary serialization format developed by Google with simplicity and performance in mind. It was designed to be smaller and faster than XML.

Like Avro, Protobuf uses its own schema definition language, which is programming-language agnostic. To start using a protobuf schema, a code generator for the target language must be used.

Protobuf supports primitive scalar types, enumerations, maps and unions. Fields can be defined as optional or required and can be organized into lists using the repeated directive. Complex schemas can be created by nesting and grouping messages. More information can be found in the language guide.

What distinguishes protobuf from the previous serialization formats is its support for schema evolution. This is made possible by mandatory unique field tags, the optional directive, default values for added fields, etc. Check out the documentation for more info about evolving schemas while preserving backward compatibility.

Python support is provided by the official protobuf Python package. The official documentation provides a guide for using it.

Notably, the code created by the generator does not allow a message record to be created from a dictionary. First a message object must be created, and then each field must be set manually during serialization, as the sketch below illustrates.
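A sketch of that workflow, assuming a hypothetical flight_pb2 module generated by protoc from a schema with a FlightBatch message holding repeated FlightRecord records:

import flight_pb2  # hypothetical module generated by protoc

source = {'fl_date': '2017-01-01', 'airline_id': 19805, 'dep_delay': None}

batch = flight_pb2.FlightBatch()
record = batch.records.add()            # append a new message to the repeated field

# There is no dict constructor: each field is assigned manually, and fields
# with empty (None) values are simply skipped.
record.fl_date = source['fl_date']
record.airline_id = source['airline_id']
if source['dep_delay'] is not None:
    record.dep_delay = source['dep_delay']

data = batch.SerializeToString()        # the whole batch is one protobuf message

restored = flight_pb2.FlightBatch()
restored.ParseFromString(data)
print(restored.records[0].airline_id)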

To convert message objects to dictionaries during deserialization, protobuf-to-dict or protobuf3-to-dict may be used. However, the former does not support Python 3 and the latter does not support Protobuf v2. So, it’s quite possible you will end up implementing your own message converter.

It’s worth mentioning how protobuf handles empty values. Even if a field is defined with the optional directive, protobuf does not allow None to be used as the field’s value. Instead, fields with empty values must simply be skipped during serialization.

Lastly, protobuf does not allow streaming: the whole data blob is treated as a single message, which can contain a collection of records within. Thus, streaming seems to be impossible, as the whole message must be read first.

The table below shows the results of resource usage measurements for the official Python implementation of protobuf. Loading figures are given both for raw message objects and for objects manually converted to dictionaries.

┌────────────────────────────────────────────────────────┐
│ Table 13 — protobuf resource usage │
├────────────┬───────┬───────┬──────┬───────┬────────────┤
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ Save │ Load │ Load │ File │ Compressed │
│ │ time, │ time, │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ MiB │ MiB │ MiB │
├────────────┼───────┼───────┼──────┼───────┼────────────┤
│ default │ 10.6 │ 3.7 │ 128 │ 80.3 │ 15.1 │
│ as dict │ N/A │ 11.7 │ 1165 │ N/A │ N/A │
└────────────┴───────┴───────┴──────┴───────┴────────────┘

The following table provides information about compliance of protobuf to other defined criteria.

┌──────────────────────────────────────────────────────────────────┐
│ Table 14 — Compliance of protobuf to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ Primitive scalar data types, enums, │
│ data types │ maps, unions, groups, nesting │
│ │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Basic │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Present │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ Yes (field values just must not be set) │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ No, same as for Python lexical │
│ restrictions │ identifiers │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Moderate │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ Yes │
└──────────────────────┴───────────────────────────────────────────┘

Cap’n Proto

Cap’n Proto is another binary serialization format, developed by the author of protobuf v2. The library documentation claims it was built on years of experience working with protobuf and on feedback from protobuf users.

The Cap’n Proto documentation also makes the notable claim that the library is infinitely faster than protobuf. To me, this claim is nonsense: protobuf takes finite time to execute, so the execution time of Cap’n Proto would have to tend to zero. The official documentation makes it clear that this is exactly what is meant. But is it possible for useful work to be completed instantly? The law of energy conservation, together with common sense, says it is not. So what is the point of such a claim? I guess it’s just a marketing move, though I can hardly imagine the target audience who would believe it.

From a functional point of view, one may think of Cap’n Proto as an attempt to create an improved protobuf. It supports primitive scalar types, enums, lists, groups and so on, much like protobuf does. The library uses schemas whose syntax is very similar to protobuf’s. Other similarities, along with the differences, are described in the schema language documentation.

So Cap’n Proto is clearly very similar to protobuf. At this point we could move on to measurements, but there are a couple of things definitely worth mentioning: the implementation and its Python support.

To start using Cap’n Proto with Python, the compiled library itself and the pycapnp Python bindings must be installed. They can be installed separately, or Cap’n Proto can be installed automatically during installation of the Python bindings. It’s possible to get lost even at the installation stage, for two reasons:

  1. It’s possible to install Cap’n Proto via the system’s package manager (e.g., on Ubuntu, Ubuntu-based distros and on macOS), but the version of the library will be outdated. If you install the latest version of the library available through the system’s package manager and the latest version of pycapnp via PyPI (with the undocumented --force-system-libcapnp flag), you will end up with incompatible libraries.
  2. The pycapnp documentation is very outdated. At the moment it covers version 0.5.4, while the current version of Cap’n Proto is 0.6.1, and it contains installation steps for version 0.5.0. It also includes links to the author’s fork of the official repository. Those links redirect to the official repository, but you can still follow the documentation and clone the outdated fork instead of the origin. Moreover, the official repository points back to that documentation. Finally, the same documentation is available at 2 different addresses: capnproto.github.io and jparyani.github.io. So, if you want to install libcapnp separately from pycapnp, it is very likely you will try to build them from sources, but you will not be able to compile or use pycapnp because its version will not be compatible with the version of libcapnp. You may figure out that you have cloned the wrong repo only after hours of struggling to compile it and to find the right combination of libraries. That can be horrible.

So, if you decide to use pycapnp, it’s better to simply install the package from PyPI, which will automatically download and compile a compatible version of libcapnp.

The next thing you may want to do is define and compile a data schema. While doing this, you will find out that:

  1. Field names must be defined in camel case and must start with a lower-case letter.
  2. Unlike protobuf, Cap’n Proto has no optional and required directives.
  3. The schema generator simply does not work. It tries to import its own schema definition, which is present in the package but not compiled. This makes the generator raise an error, preventing you from using it. As a workaround, a schema can be loaded without being compiled.

Obviously, the restriction on field name format means that extra field-name conversion work must be done before serialization and after deserialization. Such work can hardly be automated away, and a custom field-name converter must be applied to each field of each record.

Next, the absence of optional fields means that field values cannot be omitted. Cap’n Proto does not allow empty values either. This means you need to use a sentinel value to indicate emptiness: e.g., an empty string for the string type and -1 for numeric types. But this adds the extra complexity of telling which value is real and which is just an emptiness indicator. For example, -1 is not suitable for unsigned integers. And even if you find values which can serve as emptiness indicators, you will need to set them before serialization and interpret them after deserialization.

As for streaming, Cap’n Proto, like protobuf, does not support it. To save a batch of data, you have to initialize a list of the proper size and then fill it with records by iterating over them. A record’s data is filled by manually assigning values to its fields, as the sketch below shows. There is support for multi-message files, but the examples look suspicious to me.
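A pycapnp sketch of that batch-filling approach, assuming a hypothetical flight.capnp schema with a Flight struct and a FlightBatch struct declaring records :List(Flight):

import capnp  # pycapnp

# Load the schema at runtime instead of compiling it (the documented workaround).
capnp.remove_import_hook()
flight_capnp = capnp.load('flight.capnp')

source = [{'flDate': '2017-01-01', 'airlineId': 19805}]

batch = flight_capnp.FlightBatch.new_message()
records = batch.init('records', len(source))   # the list must be pre-sized
for target, item in zip(records, source):
    target.flDate = item['flDate']             # camel-case field names are mandatory
    target.airlineId = item['airlineId']

with open('flights.bin', 'wb') as f:
    batch.write(f)                             # unpacked; write_packed(f) also exists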

Cap’n Proto allows data to be stored in packed or unpacked format. The unpacked format is used by default, and it takes 38.4% more disk space.

One more notable thing is the message size limit, controlled by the traversalLimitInWords parameter. By default it equals 8 * (2 ** 20) words, which the comments say corresponds to 64 MiB, and the comments also state that the limit was introduced for security reasons. To load bigger messages with pycapnp, the limit needs to be increased via the traversal_limit_in_words argument of the read() or read_packed() methods.
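Continuing the sketch above, the limit can be raised at read time (2 ** 30 words is an arbitrary value chosen for illustration):

with open('flights.bin', 'rb') as f:
    # Raise the default 64 MiB traversal limit to read a large unpacked message.
    batch = flight_capnp.FlightBatch.read(f, traversal_limit_in_words=2 ** 30)
    print(batch.records[0].airlineId)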

Next, just as with protobuf, deserialized records are stored in data structures generated from the schema. It may be fine to use them as-is, but you might also need to convert them into dictionaries so that other libraries can process them.

Finally, a couple of words about bugs. They are present, and everyone should be aware of them. It seems the claim that Cap’n Proto is infinitely faster than protobuf holds in one sense: Cap’n Proto accesses data on demand and does not load it until you need it. This means that if you “read” serialized data from a file, close the file descriptor and then try to access some record, you will get an error about a bad file descriptor being used. But even if you access all records while the file is open, to load all data explicitly, there is no guarantee that you will not run into an error. For example, my attempt to load packed data raised a file descriptor error similar to the one described in this issue.

The table below shows the results of resource usage measurements for serialization and deserialization of unpacked data, deserialization of unpacked data with conversion of records to dicts, and serialization of packed data. Measurements for deserialization of packed data are not provided due to the bug described in the previous paragraph.

┌────────────────────────────────────────────────────────┐
│ Table 15 — Cap'n Proto resource usage │
├────────────┬───────┬───────┬──────┬───────┬────────────┤
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ Save │ Load │ Load │ File │ Compressed │
│ │ time, │ time, │ RSS, │ size, │ file size, │
│ Approach │ s │ s │ MiB │ MiB │ MiB │
├────────────┼───────┼───────┼──────┼───────┼────────────┤
│ default │ 30.3 │ 1.4 │ 231 │ 163.1 │ 26.5 │
│ as dict │ N/A │ 279.2 │ 2753 │ N/A │ N/A │
│ packed │ 31.2 │ N/A │ N/A │ 100.8 │ 24.3 │
└────────────┴───────┴───────┴──────┴───────┴────────────┘

As can be seen, loading data into the internal structures takes only 1.4 seconds and 231 MiB of RAM. However, converting them to dictionaries increases the load time to 279 seconds and consumes 2.75 GiB of RAM, which is 199 and 12 times more, respectively.

The table below provides information about compliance of Cap’n Proto to other defined criteria.

┌──────────────────────────────────────────────────────────────────┐
│ Table 16 — Compliance of Cap'n Proto to other defined criteria │
├──────────────────────┬───────────────────────────────────────────┤
│ Criteria │ Compliance level │
├──────────────────────┼───────────────────────────────────────────┤
│ Supported set of │ Primitive scalar data types, enums, │
│ data types │ lists, unions, groups, nesting │
│ │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Available level │ Basic │
│ of validation │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Versioning │ Present │
├──────────────────────┼───────────────────────────────────────────┤
│ Support of empty │ No │
│ values │ │
├──────────────────────┼───────────────────────────────────────────┤
│ Streaming support │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Field name │ Camel-case only, field name must start │
│ restrictions │ with a lower-case letter │
├──────────────────────┼───────────────────────────────────────────┤
│ Learning curve │ Extremely hard │
├──────────────────────┼───────────────────────────────────────────┤
│ Human-orientation │ No │
├──────────────────────┼───────────────────────────────────────────┤
│ Computer-orientation │ Yes │
└──────────────────────┴───────────────────────────────────────────┘

Comparison

Now, after a detailed examination of the selected SerDes, it’s time to look at them side by side.

The figure below displays a chart with the collected data on save and load times of the different SerDes, including different options and implementations. Note that the time axis uses a logarithmic scale.

Figure 1 — SerDe save and load times (less is better)

Several conclusions can be drawn from this chart:

  1. Even basic type validation implemented in pure Python will introduce a huge penalty for execution time. Even the slowest SerDe with built-in basic validation will perform better.
  2. The official Avro implementation is incredibly slow. fastavro improves Avro’s position, but it is still far from sane values.
  3. If you still trust Cap’n Proto and do not need to convert its records into something else, you can try using it. But I would not recommend this, due to the high risk of getting into trouble.
  4. Protobuf looks quite good among the other formats which rely on a data schema.
  5. msgpack looks like a very solid choice among schemaless formats. As for its streaming mode, it can compete with uJSON Lines.
  6. It’s highly probable that CSV is not an option for you, unless your client forces you to use it so that the data can be imported into Excel or similar spreadsheet software.

The following figure depicts a chart with information about RAM consumed by different SerDe during deserialization.

Figure 2 — SerDe RAM consumption during deserialization (less is better)

Note that for an honest comparison of RAM consumption it’s better to consider only the cases which can be reduced to a common denominator, which for us is dict. Dicts in Python are heavy. It’s much cheaper to use a namedtuple or a custom class with __slots__, but none of the schemaless libraries allow a custom output format, hence dict is the only option.

From this perspective, three major groups can be clearly outlined:

  1. Not hungry: CSV, JSON, Avro, Protobuf.
  2. Hungry: uJSON, JSON Lines, uJSON Lines, msgpack.
  3. Monsters: Cap’n Proto.

Finally, the figure below displays a chart of disk space consumption for the different SerDes, including different options.

Figure 3 — SerDe disc space consumption (less is better)

Here we can see that CSV, Avro and Protobuf have the smallest raw and compressed file sizes.

Notably, JSON Lines, which uses an extra message delimiter, does not consume much more space than plain JSON. Their raw size is not much bigger than that of the raw msgpack file, and the compressed files are almost the same size for these formats.

The size of Cap’n Proto files lies somewhere in between the others. Remember that Cap’n Proto has a limit on the size of data being deserialized. Also, its packed version can throw rather unfunny errors at you.


Conclusions

The main goal of this article was to examine and compare popular serialization formats and to reveal the peculiarities of their usage in Python.

Note that schema-dependent formats, such as Avro, Protobuf and Cap’n Proto, also provide their own RPC facilities in addition to serialization and deserialization. These facilities were not covered in this article because they use their own IPC clients and may rely on technologies which are incompatible with your project’s technology stack. Please refer to the official documentation of those formats for details about their RPC facilities.

And now let’s recap several key points and draw the bottom line.

First of all, it’s clear that Cap’n Proto is an outsider among the formats observed. Its Python implementation has horrible documentation, it’s buggy, it consumes too much time and memory when deserialized data is converted into dicts, it forces you to use camel-case field names, it does not support empty values, and so on. I believe it will improve along with its documentation and some day will find its place on the market. However, at the moment it doesn’t look ready for production.

Secondly, Avro does not look so bad, but it’s overwhelmingly slow.

CSV might be an option, but only if there’s no choice. It’s a text-centric format with a limited ability to store numeric data. If there’s a hard requirement to store data in a format accessible via Excel and the like, it may be better just to export the needed data to CSV instead of keeping everything in it. By the way, applications like Excel have a limit on the number of records which can be stored in a single file. So, if your data set is very big (hundreds of thousands of records), you will need to split it into chunks, which means you will end up with an export procedure anyway.

Next, if your data is structured, then Protobuf can be your big friend. It is really compact, fast and supports schema evolution. It can look a bit clumsy and it does not provide streaming support, but even if you need to convert its deserialized records into dicts, it can still perform better than schemaless formats with custom validation.

Finally, JSON, JSON Lines and msgpack offer speed and streaming support. One of them will be the best fit, depending on your needs and requirements. If data needs to be inspected by humans (e.g. an HTTP API), then JSON will be fine. Otherwise, give msgpack a try.

As you can see, there’s no silver bullet, and the most suitable SerDe depends on your needs and requirements. I hope this article helps you make a wise choice. Feel free to share your experience, and good luck!