Python Serialization Benchmarks

This post compares the performance, pros and cons of serialization libraries and formats in Python, as I found no good online source comparing them (check out the repo on GitHub).

(jump to the bottom if you only want to see some shiny graphs)

JSON is the first thing we think of when someone says “serialization”, but there is a lot more to it than that. Even for Python there are multiple libraries just for handling JSON, and of course JSON is just one format from a long list.

In a system at my workplace we found JSON serialization and deserialization to be a CPU-intensive task. Using it to process billions of messages daily made us wonder whether we could speed things up.

Let’s first go over some terminology:

Self Describing

First, let’s distinguish self-describing formats (like JSON/YAML) from non-self-describing formats (such as protobuf/pickle). Generally speaking, a serialization format is self-describing if one can deserialize a message serialized in that format with no additional information. A JSON message can be deserialized with no additional info, while with pickle we’ll sometimes need imports and other globals.
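A minimal sketch of the difference, using only the stdlib. The sample values are made up for illustration and are not taken from the benchmark code:

```python
import datetime
import json
import pickle

# JSON: the payload alone is enough to rebuild the data.
json_payload = json.dumps({"title": "example", "sales": 3})
decoded = json.loads(json_payload)

# pickle: the payload names the module and class to import at load time,
# so the reader needs that code available to rebuild the object.
pickle_payload = pickle.dumps(datetime.date(2019, 1, 1))
restored = pickle.loads(pickle_payload)

print(decoded)
print(b"datetime" in pickle_payload)  # the module name travels inside the payload
print(restored)
```

With a user-defined class instead of `datetime.date`, the reader would need that class importable under the same module path, which is the "imports and other globals" caveat.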

Schema vs Schema-less

Some serialization formats require a prior definition of the schema (fields, types and/or their order) while others are schema-less. JSON, for example, is schema-less, while protobuf requires a schema: a .proto file.

To choose the right serialization format for a system, I summarized some guiding questions:

Does speed really matter to us? What is the system’s data access pattern and expected growth in the near future?
If speed is not a concern, usually prefer JSON.

Does the system (de)serialize a large number of messages?
As a rule of thumb, less than several GBs per day is usually not worth even thinking about changing format.

Is it a new system that changes often, where we want to keep our data models flexible?
If so, a schema-less format is a requirement.

What is the serialize/deserialize ratio?
Is the system write-once read-many (e.g. a log aggregation system) or one-write one-read (e.g. event based using Celery)? Take this ratio into account.

Does the system interact with other systems that have existing serialization formats?
If so, think twice before changing the existing format(s).

Are messages consumed / created by humans?
Human-readable formats are easier to manipulate, debug and understand than binary formats.

Is the data mainly numeric?
Some formats, like HDF5 (not benchmarked here), provide a great compression ratio and speed for numeric values.

Are the system’s components written in more than one language?
If so, it is better to use a format that is well supported by all of the system’s languages.
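The serialize/deserialize ratio question can be explored with a quick measurement. This is a rough sketch using the stdlib `json` and `timeit`; the sample message and the `reads_per_write` value are assumptions you should replace with numbers from your own system:

```python
import json
import timeit

# Hypothetical sample message; use a representative one from your system.
message = {"title": "example", "sales": 42, "price": 9.99, "languages": ["en", "he"]}
payload = json.dumps(message)


def measure(func, arg, number=10_000):
    """Return the average time in seconds of a single call to func(arg)."""
    return timeit.timeit(lambda: func(arg), number=number) / number


dump_time = measure(json.dumps, message)
load_time = measure(json.loads, payload)

# Assumed workload: each message is written once and read reads_per_write times.
reads_per_write = 5
cost_per_message = dump_time + reads_per_write * load_time

print(f"dumps: {dump_time:.2e}s  loads: {load_time:.2e}s  weighted: {cost_per_message:.2e}s")
```

For a read-heavy system the deserialization time dominates the weighted cost, so a format that is slow to write but fast to read may still come out ahead.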

Serialization Formats and Libraries

JSON

Python’s stdlib JSON implementation (format, library)

Pros:
• well known and widely used standard
• schema-less
• self describing
• human readable and writable
• in Python’s stdlib

Cons:
• relatively slow compared to other formats
• no binary support (usually use base64 encoding for binary fields)
• the serialized payload is relatively large (especially for number fields)

When to use:
This is my default format and, unless there is a special use case, it should be the preferred option.
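The base64 workaround for binary fields mentioned in the cons could look roughly like this (the field name `blob` is just an example):

```python
import base64
import json

raw = bytes([0, 255, 16, 32])  # arbitrary binary field

# json.dumps cannot handle bytes directly and raises TypeError.
try:
    json.dumps({"blob": raw})
    bytes_supported = True
except TypeError:
    bytes_supported = False

# Workaround: encode the bytes as ASCII text before serializing.
encoded = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
decoded = base64.b64decode(json.loads(encoded)["blob"])

print(bytes_supported, encoded, decoded == raw)
```

Keep in mind that base64 makes the binary field larger (4 output characters for every 3 input bytes), which adds to the payload size con.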

Ultra-JSON

Faster drop-in replacement for stdlib json (format, library)

Pros:
• fast drop-in replacement for the stdlib json
• API almost identical to the stdlib dump & load functions

Cons:
• float precision is lower than with Python’s stdlib json library
• not all optional arguments to stdlib’s json.dump and json.load are implemented

When to use:
If you want a relatively easy speedup without changing format and don’t care about floating point precision.
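A common way to try it is an import with a fallback, so the code keeps working where ujson is not installed. This is a sketch that only uses `dumps` and `loads`; the alias `json_impl` is my own naming:

```python
try:
    import ujson as json_impl  # third-party, assumed installed via `pip install ujson`
except ImportError:
    import json as json_impl  # stdlib fallback keeps the sketch runnable

record = {"title": "example", "price": 0.1 + 0.2, "sales": 3}

payload = json_impl.dumps(record)
restored = json_impl.loads(payload)

print(json_impl.__name__, payload)

# Compare floats with a tolerance rather than ==, since an encoder may round them.
price_matches = abs(restored["price"] - record["price"]) < 1e-9
print(price_matches)
```

If your code passes optional arguments to the stdlib functions, check that ujson supports them before switching.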

Parquet

Efficient columnar data representation with predicate pushdown support (format, library)

Pros:
• columnar format, fast at deserializing data
• has a good compression ratio thanks to its columnar storage
• good integration with pandas

Cons:
• slow serialization
• loses its benefits when data is not serialized in batches
• requires a schema — less flexible

When to use:
Ideal for large files with a uniform structure and a write-once read-many access pattern, and when working with pandas.

MsgPack

It’s like JSON. but fast and small. (format, library)

Pros:
• one of the fastest schema-less self-describing formats
• flexible — can serialize anything json can
• small serialized payload size, especially for numbers, bools and nulls
• great Python bindings and docs. The API is very similar to json’s dump & load
• supports binary data as well as user-defined extension types
• self-delimiting, meaning messages can be streamed to a file/socket and deserialized one by one on the other side
• Packer & Unpacker classes for handling streams of data in a memory-efficient way

Cons:
• binary format not human readable or writable
• not common when working with web apps or rest APIs

When to use:
When we want a speedup but still want to keep a flexible schema, or when our system streams messages between services or into files.
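The streaming property can be sketched as follows. This assumes the third-party `msgpack` package is installed; the guard around the import and the helper name `stream_roundtrip` are mine and only there to keep the sketch runnable without it:

```python
import io

try:
    import msgpack  # third-party, assumed installed via `pip install msgpack`
except ImportError:
    msgpack = None  # lets the sketch degrade gracefully when msgpack is missing


def stream_roundtrip(messages):
    """Pack messages back to back into one buffer, then unpack them one by one."""
    buffer = io.BytesIO()
    for message in messages:
        # No separator is written between messages: the format is self-delimiting.
        buffer.write(msgpack.packb(message, use_bin_type=True))
    buffer.seek(0)
    # Unpacker reads from a file-like object and yields one message at a time.
    return list(msgpack.Unpacker(buffer, raw=False))


messages = [{"id": 1, "ok": True}, {"id": 2, "blob": b"\x00\x01"}]

if msgpack is not None:
    print(stream_roundtrip(messages))
```

Note that the second message carries raw bytes without any base64 step, which is the binary support listed in the pros.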

Pickle

serializing and deserializing Python objects (format & library)

Pros:
• can serialize most python objects
• part of the stdlib — no external library needed
• relatively fast and flexible schema

Cons:
• only supported by Python
• not self describing — needs the correct imports and globals to deserialize
• may need the same Python version on both ends to work correctly (newer pickle protocols cannot be read by older Python versions)
• it’s not secure to load pickled messages from untrusted sources

When to use:
It is generally better to avoid it; in any case we must trust the source of the pickled object for security reasons. It is used, for example, by multiprocessing to pass Python objects between processes. Another use case might be storing the state of a Python object that is hard to reduce to pure “data” (e.g. JSON), like scikit-learn models.
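To make the security concern concrete, here is a small stdlib-only sketch. The `Malicious` class is a harmless stand-in I made up for attacker-controlled data, and the restricted unpickler follows the "restricting globals" pattern from the Python docs; treat it as an illustration, not a complete defence:

```python
import builtins
import io
import pickle


class Malicious:
    """Stand-in for attacker-controlled data: unpickling it calls eval()."""

    def __reduce__(self):
        # Harmless here, but this could be any callable with any arguments.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Malicious())
executed_result = pickle.loads(payload)  # the expression runs during loading
print(executed_result)

# Allow only a small, explicit set of builtins to be resolved while loading.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside SAFE_BUILTINS."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError as error:
    blocked = True
    print("blocked:", error)
```

Plain containers of basic types still load through `restricted_loads`, since they do not reference any globals.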

ProtoBuf

Protocol Buffers, language and platform neutral format for serializing structured data (format, library)

Pros:
• brought to us by Google, widely used in microservice and event-based systems
• small serialized size
• provides type checking
• enum support
• good integration with many languages (python, c, c++, java…)
• schema can be extended

Cons:
• biggest caveat in Python: every access to a member of the object (e.g. a string value) creates a new Python object, which takes back all the speed benefits
• requires a schema: if you have messages from multiple schemas or schema versions to deserialize in the same stream, you are in trouble
• the schema needs to be compiled (even for Python); not much of an issue but worth mentioning
• binary format — not human readable or writable
• the Python API is similar to the C++ API, which is not so fun or Pythonic

When to use:
Because of the first con, I would generally avoid using ProtoBuf in Python unless you are integrating with a system that already uses it (e.g. a Java or C++ based system). One particular use case is using the gRPC protocol for real-time communication between services.

BSON

Binary JSON (format, library)

Pros:
• flexible schema & self describing as the name states - Binary JSON
• used by MongoDB — might be suitable if you’re using it

Cons:
• binary format — not human readable or writable
• rarely used outside of MongoDB
• the Python implementation is relatively slow

When to use:
Unless you’re sharing the data with MongoDB (e.g. MongoDB backup files), don’t use it.

CBOR

The Concise Binary Object Representation by IETF (format, library)

Pros:
• relatively fast
• schema-less & self describing
• IETF standard
• shares msgpack’s pros; some say it’s the same ideas with a different spec

Cons:
• lacks a good Python library (not maintained, minimal docs and tests)
• slightly worse performance compared to msgpack, so why bother
• binary format — not human readable or writable

When to use:
I advise using it only if you already have a running system that uses it; I prefer msgpack in Python.

To compare the performance of these formats, we’ll compare the speed of serializing and deserializing schema-based and schema-less objects, randomly generated with str (unicode & ascii), dict, list, bool, float and int types.

The Results

Talk is cheap - show me the benchmarks

Figure: serialize / deserialize time per library for a single object
Figure: serialize / deserialize time per library per object, averaged over 1M objects
Figure: serialize + deserialize time per library per object, averaged over 1M objects
Figure: object size in bytes per library, averaged over 1M objects (no compression)

The benchmark data is a randomly generated list of objects with mixed fields of bool, int, float, str (with unicode), lists and dicts. For schema-based formats (parquet & protobuf) we’re using namedtuples instead of dicts as input.

A single object looks like this:

{
    'title': str,
    'author': str,
    'sales': int,
    'is_published': bool,
    'languages': [str],
    'reviews': [
        {
            'author': str,
            'comment': str
        },
        {
            'author': str,
            'comment': str
        }
    ],
    'price': float
}
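To give a feel for how such a benchmark can be put together, here is a stdlib-only sketch that generates objects of this shape and times `json` and `pickle`. It is not the code from the repo: the helper names (`random_str`, `random_book`, `benchmark`), the string lengths and the sample size of 1,000 are my assumptions, and other libraries can be plugged in through the same `dumps`/`loads` pair:

```python
import json
import pickle
import random
import string
import timeit


def random_str(length=12):
    # Mix ASCII letters with a few non-ASCII characters to exercise unicode handling.
    alphabet = string.ascii_letters + "äéñש中"
    return "".join(random.choice(alphabet) for _ in range(length))


def random_book():
    """Build one random object with the shape shown above."""
    return {
        "title": random_str(),
        "author": random_str(),
        "sales": random.randint(0, 1_000_000),
        "is_published": random.choice([True, False]),
        "languages": [random_str(2) for _ in range(random.randint(1, 3))],
        "reviews": [{"author": random_str(), "comment": random_str(40)} for _ in range(2)],
        "price": round(random.uniform(1, 100), 2),
    }


def benchmark(name, dumps, loads, objects):
    """Return average per-object serialize/deserialize times and payload size in bytes."""
    payloads = [dumps(obj) for obj in objects]
    serialize = timeit.timeit(lambda: [dumps(o) for o in objects], number=1)
    deserialize = timeit.timeit(lambda: [loads(p) for p in payloads], number=1)
    return {
        "name": name,
        "serialize_s": serialize / len(objects),
        "deserialize_s": deserialize / len(objects),
        "avg_size": sum(len(p) for p in payloads) / len(payloads),
    }


def json_dumps(obj):
    # Encode to bytes so sizes are comparable with binary formats.
    return json.dumps(obj).encode("utf-8")


objects = [random_book() for _ in range(1_000)]
results = [
    benchmark("json", json_dumps, json.loads, objects),
    benchmark("pickle", pickle.dumps, pickle.loads, objects),
]

for row in results:
    print(row)
```

The absolute numbers depend heavily on the machine and on the shape of the data, so treat the output as relative only.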

Note about ProtoBuf serialization: real-world performance is probably better than benchmarked here. My code converts a tuple to a ProtoBuf object, but in a real-world use case the message is already a ProtoBuf object (no need to convert from a tuple).

Machine Info: Linux 64bit, CPython 3.7.1 build: GCC 7.3.0 default-Dec 14 2018 19:28:38

Conclusions

Use JSON for most applications (stdlib or ujson), but when performance is critical msgpack is a good alternative. For handling structured files, columnar data formats such as parquet are a good choice. In any case, always benchmark on your specific use case before deciding to move from one format to another.

