Picking the right message encoder

We’re designing a messaging solution for the SensorFleet product. The protocol will be built on a messaging library such as ZeroMQ, so we also need a way to encode messages sent over the wire.

Consider this:

  • We will have a lot of different message types, including kinds we don’t control.
  • Messages/sec matters.
  • Size over the wire matters.
  • Ease of use matters. We probably don’t want to maintain 100 protocol definition files, unless there is a really good gain from it.
  • Our primary language is Python, but other languages need to be supported too. At some point, we might want to rewrite the messaging implementation in a language faster than Python, but right now we just want a Version 1 that is maintainable and “fast enough”. Also, we can’t assume that all 3rd party developers will want to, or can, use Python.

Instead of blindly choosing one that seemed good, I wanted to test how the different encoders perform in the real world. There are encoder benchmarks available, but I wanted to test our use case: a message envelope that contains some metadata, like ids, and a payload field carrying binary or text data ranging from 0 bytes to maybe tens of megabytes. The implementation is in Python.

The code is available at GitHub: https://github.com/mkorkalo/encoder_benchmarks
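The harness in the repo follows roughly this shape (a sketch only; the function and field names here are illustrative, not the actual repo code): build an envelope dict with some metadata and a payload, then count how many encode/decode round-trips fit in a fixed time window.

```python
import json
import time

def benchmark(encode, decode, payload, seconds=1.0):
    """Count encode+decode round-trips of one envelope per second."""
    envelope = {
        "id": "msg-1",           # illustrative metadata fields
        "sender": "sensor-42",
        "payload": payload,      # 14 bytes up to tens of MB in the real tests
    }
    count = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        decode(encode(envelope))
        count += 1
    return count

# e.g. the stdlib JSON encoder with a small text payload:
small_ops = benchmark(json.dumps, json.loads, "x" * 14)
```

Each encoder then just plugs in its own encode/decode pair.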

Results with Python 3.6 implementations

│ Encoder │ 14 B payloads/sec │ 1 MB payloads/sec │ Encoded size (bytes) │
│ JSON │ 79698 │ 86 │ 112 │
│ Cap'n Proto │ 177571 │ 3030 │ 80 │
│ BSON │ 23255 │ 2036 │ 104 │
│ MsgPack │ 159359 │ 4537 │ 74 │

The results show how many 14-byte and 1-megabyte payloads were handled per second; the last column gives the final encoded wire size of the 14-byte payload plus envelope.

Disclaimer: I didn’t dig into why the performance looks like this. I also noticed that my Cap’n Proto implementation doesn’t have packing enabled. YMMV.

The Cap’n Proto message spec is easy to write, and I really like the idea of writing/reading directly to/from the wire format. It sounds like a huge improvement over Protocol Buffers. However, all of the protobuf-like encoders share the same drawback: you have to maintain protocol definitions. Whether that’s a problem depends on the use case. We could use one for the message envelope, but probably not for the payload, where we will have tons of different message types.

All the other message encoders presented here simply take a Python dictionary and encode it.

JSON has one obvious, big drawback: it cannot efficiently encode binary data. This makes performance with the large payload (1 MB of /dev/urandom) suffer badly, although it’s fast with smaller payloads.
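To see why the binary case hurts: JSON has no binary type, so raw bytes have to be base64-encoded (or escaped) before they fit in a string field, which inflates the payload by roughly a third and burns extra CPU on both ends. A small sketch of that overhead:

```python
import base64
import json
import os

payload = os.urandom(1024)  # stand-in for the 1 MB /dev/urandom payload

# JSON has no binary type, so raw bytes must be base64-encoded first,
# inflating the encoded size by about a third:
encoded = json.dumps({"payload": base64.b64encode(payload).decode("ascii")})
overhead = len(encoded) / len(payload)  # roughly 1.35 here

# round-trip back to the original bytes:
decoded = base64.b64decode(json.loads(encoded)["payload"])
```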

BSON seems to be inefficient with small messages. I honestly don’t know what causes this, and I don’t have the time to research it. BSON also differs from JSON in other ways: it breaks compatibility. For example, you cannot have a list as the top-level item. I would like to see something that is fully compatible with JSON in terms of data types.
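To illustrate that compatibility gap: JSON accepts any JSON value at the top level, including a list, while a BSON document must be a key/value mapping, so the same data has to be wrapped first (the key name below is made up):

```python
import json

# JSON allows a list as the top-level value:
wire = json.dumps([1, 2, 3])

# A BSON document, by contrast, must be a mapping, so the list
# would need wrapping before encoding (hypothetical key name):
wrapped = {"items": [1, 2, 3]}
```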

Behold, MessagePack comes to the rescue. It is surprisingly fast even though it has to encode the dictionary’s field names into the wire format. It is nearly as fast as Cap’n Proto with small messages, and actually faster with big payloads, where the field-name encoding is irrelevant; maybe the Python implementation is simply better optimized than Cap’n Proto’s. And it doesn’t need protocol definitions. You just:

import msgpack

encoded = msgpack.packb({"foo": "bar"}, use_bin_type=True)  # bytes, ready for the wire
decoded = msgpack.unpackb(encoded, raw=False)               # back to {"foo": "bar"}

Other related projects that were not included in this benchmark:

Which one do you use and why?