Visual Programming with Elixir: Learning to Write Binary Parsers
Distributed programming isn’t the only area Elixir excels. One of the less advertised strengths of erlang’s little brother is that it contains one of the most elegant and compact ways to define binary parsers in a programming language. Code visually represents data structures making it easy to match and extract bytes and even individual bits in the function heads themselves. This capability makes Elixir an unexpected front-runner for learning and prototyping binary protocols.
Binary Pattern Fundamentals
Patterns are a deeply central component of writing idiomatic Elixir code. They are a visual representation of data-structures which act as guards in function-heads and case-statements and assign parts of the data-structure to variables. Elixir includes a number of binary patterns which are very similar to erlang’s.
Our First Parser
MessagePack should be considered the Rosetta Stone of binary parsers. With implementations in hundreds of languages, its compact but expressive grammar means that critical bits of the parsers are normally only a few hundred lines are so. Datatypes are clearly prefixed and there is no need for complicated text unescaping as with JSON.
Strings in MessagePack start with a byte-prefix of either
0xDB or the bit-prefix
101 followed by an unsigned-big-endian-integer defining the byte-size of the string.
Just like that, we already have a parser for a binary grammar in Elixir! Add booleans and floats and we have a parser capable of defining the basic types.
Adding collections gets interesting. Our parser needs to recursively loop over the binary input to build an accumulator. The parse function needs to change to return the value and the rest of the string.
Elixir makes it clear visually what you can expect for each byte. You can continue to build the parser yourself by implementing the rest of the MessagePack spec. The rest of this post instead will cover an important optimization: combining the list’s loop and extracting the value in the same function.
Optimizing the loop
To see what combining the loop and parser into a single function looks like, simplify the grammar so that we only have parse a list of floats. Instead of
parse and needing to unpack a tuple, we can directly parse the float in the loop itself.
This simple optimization can drastically cut memory usage and speed up your parser (sometimes as much as 2x). But there is a problem: we need to repeat the patterns from
parse_list so that lists can contain any element.
Macros are Elixir’s solution to this problem. We can define our patterns once and declare functions based on those patterns. However, macros tend to obfuscate the beautiful simplicity of Elixir’s pattern matching so they should only be used as an optimization step.
A more complete version of combining the loop with the parser would look something like this:
A pull-request based on this technique has been opened against MsgPax, the primary MessagePack implementation for Elixir, improving the performance and memory usage of the library. MsgPax already combined the loop and parser, but this pull-request further specializes the loop into multiple functions. As an exercise you can add the rest of the MessagePack datatypes.
The expressiveness of Elixir’s binary patterns was surprising to me when I first discovered them. They make parsing easy and free you from worrying about bit shifting and masking. This makes elixir is one of the easiest languages to explore binary protocols.
A special thanks to lexmag for inspiring this post!