Binary message encoding is a common pattern in modern distributed systems. Tools like Protobuf, Avro and Thrift are increasingly used to decrease payload size and provide typed contracts for message produces and consumers. Many developers will need to utilise these tools at some point, but they can be opaque and difficult to understand. I was tasked with creating a Protobuf schema recently and I decided to use some time to dive into the implementation details of Protobuf. In doing so, I realised that these systems contain some fascinating lessons about handling of binary data.
In this tutorial, we’re going to learn some of those lessons by making a binary message encoder in Ruby. I hope that by following the steps you will develop skills in binary data encoding, as well as an intuitive understanding of how message encoding protocols work and why they’re beneficial. We’re going to end up with something that looks a bit like a simplified Protobuf, although we’re not going to be faithful to the Protobuf spec. I encourage you to follow along and type the code in for each step: this will help you retain the information in a way that copy-pasting won’t.
Why encode messages?
Encoding messages provides a number of significant benefits. Firstly, a schema based encoding acts as a contract between the applications that write the encoded message, and the applications that read it. For example: if the schema specifies that the Person message has an “age” field that is an integer and is required, the reader application does not need to handle empty or invalid values for this field. The schema guarantees the correctness of this field by refusing to encode invalid messages. This makes the job of readers a lot easier than it would be otherwise.
Secondly, the schema acts as a form of documentation. Developers do not need to find examples of messages to write their reader code, they can simply reference the schema to find out what can be in the messages.
Finally, encoding can decrease the byte size of messages. In modern distributed systems, many millions of messages are sent around between services every day. Sending, storing, and reading these messages all incur financial costs which can be reduced through good encoding systems.
Step 0: Bits and Bytes
Let’s start with bits and bytes. We hear all the time about how computers represent everything as bits, but we rarely see them in practice. To begin with, I’ll write my name to a file and read it out again in binary. Use your name instead to see how it is represented in binary. Note that you may need to install the xxd utility for this to work.
Here I’ve written my name to a file and read back the file as bits grouped in bytes (the first group of zeroes is the line number) using xxd. For example, “r” is “01110010” which is the binary representation of the decimal number 114 (we’ll discuss how this number is mapped to a letter below). The last byte “00001010” represents a new line.
Before we get explore the details of encoding and decoding messages, let’s play a little bit with this simple message to get some intuition about how bits and bytes can be handled in Ruby.
Step 1: Output bit and byte representations
To start with, we’ll create a file called decoder.rb that can do something similar to the xxd tool we saw above.
Here, we read the bytes in, convert to their binary representation and pad with zeroes so each series is 1 byte (8 bits) long.
We can now call it with the filename as its argument:
Let’s make things a bit easier to read by also outputting the decimal number that is being represented:
Now let’s run it.
As you can see, the decimal number is actually the “byte” that Ruby reads in from the file. Ruby does the work of decoding the bytes into decimal numbers for us. Next, we’ll convert the byte to a Unicode character and add a little greeting.
Isn’t it sweet?
The Integer#chr method converts a numeric value to a character in the given encoding by using a “lookup table” (see here for a UTF8 lookup table). In my example, “r” is the Unicode value for 114. So, when we echoed “r” to a file, we actually wrote a binary encoding of the number 114 to the disk. Note the final number 10 is the UTF8 encoding of newline. Ruby will automatically convert this binary representation to text when we read the file, because it assumes that the file is encoded in UTF8 unless we tell it otherwise.
Step 2: A more complex message
I’m going to make the message more complex by adding my age:
This doesn’t look so good: I added the age to the message, but our decoder didn’t know that it was a different field. We’ll need to update our decoder:
A few points about this development:
1. Before, we were telling our decoder to ignore newlines, now we need to tell it to use it as a field separator. Not only that, but we’ve ascribed specific meanings to the first and second fields. This knowledge is essentially implicit: there is nothing in the message itself to describe this behaviour. If someone else needs to write some code to parse this message, they will need to inspect the message decoder or encoder to figure out what fields there are, what values they have, and how they are separated.
2. Note how many bytes we used to encode the number 35: 2 bytes. This is because we have encoded 35 as a UTF8 string. 51 corresponds to the UTF8 representation of “3” and 53 corresponds to the UTF8 representation of “5”. What’s interesting is that 35 can be represented directly in binary as “100011”, i.e. with only 6 bits rather than 16. To see for yourself, try running the following:
So essentially, we’ve wasted ten bits by defaulting to string encoding.
Let’s solve 2 first by encoding our numbers directly in binary rather than using UTF8.
Step 3: Variable encoding
To do this, we’re going to need to write a simple encoder.
This is a very naive encoder implementation. The String#to_i method will return a best guess at converting a string to an integer, falling back to 0 if no integer can be parsed from the string. If you pass a name with numbers like 5carl3t or an age with 0 in it, the encoder will break. But we’ll accept these limitations for the purpose of this exercise.
Before moving on, we should talk a little bit about the Array#pack method. As you can see, we call pack with a single argument, C. This argument is called a “template string” which refers to a specific binary encoding standard. In the same way that dates can be represented in lots of different string formats, integers (and other values) can be represented in lots of different binary formats. For example, C means “represent this number in 8 bits without a sign”, i.e. negative numbers should not be representable in this format. We could also choose another format to assign more bits if we wanted to represent numbers that require more than 8 bits.
Let’s see how our encoder now works end to end:
This is interesting. If you look at the “decimal” column, you can see that the number 35 has been written correctly in binary, but the decoder is converting it to a string. The reason it’s outputting “#” is that “#” is the character at position 35 in the UTF8 lookup table, i.e. the decoder doesn’t know that this byte should be interpreted as a number rather than a string. We’re going to need to make the decoder aware of what type of data it’s dealing with. To begin with, we can add a field type marker at the start of every field; since there are only two field types at present, we can use 1 bit to encode this information: 0 for string, 1 for integer:
We’ll also need to update our decoder to read this extra data:
Let’s run it now and see what happens:
As we can see, the decoder is now able to identify the different field types and handle them accordingly.
Step 4: Field numbers
You might remember that we introduced the concept of field types in order to save space. However, adding field type signifiers means we’re actually using more space now than we did previously (10 bytes vs 9 bytes for this message). The reason for this is that we’re using 1 byte per field signifier even though we’re only interested in one of those bits. Allocating 8 bits to represent field types allows us to represent 255 possible field types (255 is the max value that can be represented by 8 bits), but this seems a little bit excessive.
This extra space can definitely be used more efficiently: what if we used 4 of the 8 bits (known rather delightfully as a nibble) to represent a field number? This will allow us to give some semantics to our message by associating field numbers with field names explicitly. But we’ll need to do some Ruby bit fiddling to get this to work. Disclaimer: I’ve no experience in low level programming so my implementation is not likely to be particularly elegant.
Before we begin our implementation, let me demonstrate what I mean. A byte consists of 8 bits or 2 nibbles, each comprised of 4 bits. If we use the first nibble to represent a field number, say 3 (0011), and the second to represent a type, say, an integer (0001), the combined byte will look like this: 0011 0001. Let’s try this in Ruby’s IRB:
Our encoder will do something similar:
Let’s try encoding our standard message and piping through xxd to output the bytes.
Let’s just look at the first byte for each line. “00000000” is the field byte for the value “Ronan”. The first nibble is “0000” because this is the first field; the second is “0000” because it is of type string. The field byte for “35” is “00010001”. The first nibble is “0001” because this value is the second field number; the second is “0001” because it is of type int. Let’s update our decoder so that it is aware of the field numbers:
Putting these together we get this:
This is great! Now our message is structured with both field number and type.
Step 5: Writing with a schema
Now that we are able to represent field numbers and types, we have the tools we need to make a simple message schema that is decoupled from the encoder and decoder. A schema can be shared between message writers and message readers and allow them to make assumptions about what the message will include and how that will be formatted. This makes it easier to consume messages as you can just look up the schema in a standard place to understand what a message contains without needing to look at the producing code or talk directly to the team responsible for writing the message.
We’ll start with making a generic Message class that contains some functionality for defining a schema and storing attributes.
Then we add a simple message type that inherits from this class:
As we can see, the main work is going on in the Message class which allows inheriting classes to use a nice syntax for declaring fields. Now, instances of Person can be initialised with their attributes as follows:
Our encoder will use the generic interface provided by Message to encode messages in the same manner as we’ve already done:
As you can see, the encoding methodology is almost identical, except that now we’re using the object interface exposed by the Message superclass instead of rudimentary type checking. Let’s put it all together:
Calling write_message.rb should produce the same output as running the encoder with command line arguments from the previous step:
Step 6: Reading with a schema
We’ve now added a simple message schema that the encoder can use when writing messages, and that also exposes a handy interface for creating message objects. Let’s now see how this schema can be utilised to decode messages. First, we’ll add a couple of convenience methods to our `Message` class:
Next, we’ll rewrite our decoder so that it uses the message schema to decode messages:
Finally, a little script to tie it all together:
As we can see, the decoder is now able to read the message, parse it with the help of the schema, and cast it to a Person object:
Using the schema allows us to make stronger assumptions about what the message will look like, and it also enables us to wrap a useful Domain Specific Language (DSL) around the message.
Step 7: Variable field lengths for strings
Our message encoding system has now got quite a few cool features, but there’s one piece of technical debt we need to address. Like the CSV format, we are separating fields using a field separator, in our case “\n” rather than “,” or “;”. This means that string fields can never include newlines without breaking the encoding. Newlines are obviously very common in text, so this isn’t an acceptable limitation. We could change the separator to be a less commonly used character or series of characters, but we will still have to impose this limitation on clients. Moreover, even if we chose a really random bunch of unusual characters (such as “||*||”), we would be expanding the byte size of our messages which is a waste of disk space and network bandwidth. There must be a better way of doing this.
Luckily this problem is solved by existing message encoding schemes such as Thrift and Protobuf. The solution depends on the data type: for strings, an additional field is added to describe the length in bytes of the coming field; for numbers a specialised number encoding scheme known as varint is used. We’ll look at these in turn.
First, let’s update the .write_string method of our message encoder to add an extra byte to string fields describing their byte length:
Let’s resurrect our xxd clone from earlier to see how this changes things. This time we’ll read from standard input and omit the greeting.
Now put things together using a pipe to connect the two scripts:
As we can see, the second byte of the message now contains a binary representation of the number five, as there are five characters in “Ronan”. In addition, there is no longer a newline separator at the end of “Ronan” or “35”. Now let’s write the decoder code to handle this new format. It will suffice to rewrite the Decoder.decode method:
As we can see, we can no longer split by newline as before. Instead, we use a index which we move forward through the message as we process it. If we use the same read_message.rb script as before it will produce the same result:
This allows us to encode strings of variable length in our messages, but our integers are still limited to one byte. This might not be such a big problem with age, but it would be a problem for other use cases such as ids which can be extremely large. We won’t implement our own solution to this, but will instead use the varint implementation from Google’s Protobuf gem.
Step 8: Variable field lengths for integers
Google provide a good description of varints in the Protobuf documentation where they describe varints as “a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.”. This is achieved by using the leftmost bit (also known as the most significant bit, or msb) in each byte as a flag. If this bit is set, then there is another byte to come. If it is unset, then the current byte is the last byte in the integer. This means that integers of any length can be represented using the same format, without using more bytes than necessary apart from the small overhead of the msb flag. We’re not going to implement varints from scratch in this tutorial; instead I’ve taken the implementation from the Ruby Protobuf gem, added debug code and modified it extensively to make it easier to understand. We’ll create a new file for this module.
We can load this code into an IRB session and see how it works with different values.
The encode method works by taking the rightmost 7 bits of the value at a time, setting the msb and adding the resulting byte to the bytes array. The value is then bit-shifted 7 bits right, effectively discarding the bits that have now been processed. When the remaining value can fit within 7 bits, all of the pieces are serialised in binary format.
We can now try and decode to see how the decoding works:
As we can see, the byte chunks are read in, the msb is removed and the bits are shifted left. The shifted chunk is then combined with the value using binary OR, eventually giving the final value.
We need to remove the debug code and make one small change to the Varint.decode method’s return value before we can leverage it within our encoding flow.
Note that we have also updated the Varint.decode method to return the number of bytes in a varint alongside the actual value. This will make it easier for the decoder to handle varints.
Let’s update Encoder.write_int and Decoder#decode:
Now we can update our write_message.rb script with a higher age value:
If we pipe the output to dump_bits.rb we can see that the age value now occupies 2 bytes:
Our Decoder should be able to handle this:
Success! We are now able to encode and decode integers of arbitrary length in a space efficient manner.
If you’ve made it this far, I hope you have learnt something about encoding. We’ve started from the very basics and worked our way up to build an encoding system with some pretty cool features. I encourage you to dig deeper if it interests you. Martin Kleppman’s Designing Data Intensive Applications gives an excellent overview of the encoding landscape while the official Protocol Buffers documentation offers a well written and extensive reference on the specification.