Writing a Bit Torrent Client: Step 1

I’m not sure I’ve been explicit about this, but I’m participating at the Recurse Center and I’m about halfway through, and figured I’d spend batch part 2 writing a BitTorrent client. At the very least, I’ll have some interesting components made if not the whole thing.

I’m working from Bram Cohen’s BEP 0003 spec, and the first item after a broad overview of how BitTorrent works from the application level is Bencoding. Bencoding is BitTorrent’s own message encoding system, and it’s meant to be terse and easy to parse.

For me, my mind went straight to grammars and Noam Chomsky and all the other cool folks who have profound thoughts on language and computer science. I knew writing a parser was in my future, and that got me excited. Parsers were things smart people write!

But first, I focused on encoding, translating the obvious Python equivalent types to bencoding’s types. Python’s bytestrings (‘bytes’), integers (‘int’), lists (‘[]’), and dictionaries (‘{}’) can all operate just great under these rules with small restrictions. For example, a Python dict’s key can be many types, but here that key must be a string.

encode_integer and encode_string are both fairly simple, just following the rules to turn their types into strings wrapped with the right flags. And encode_list and encode_dict recurse back to encode for the objects they can hold.

Doing it in reverse was more confusing. While I knew of grammars and how to make a parse tree, I’d never written a parser. My Theory of Language Design professor was old fashioned and believed that theory was best diagrammed/discussed on paper. But that’s good, because it gave me a great first step towards building a recursive descent parser.

I’m sorry?

Here’s my sketch of Bencoding grammar, following the rules that should apply to a recursive descent parser, the big one being no left recursion. That means I shouldn’t have B -> B or anything like that.

Terminals are symbols that can’t be replaced, so i, e, integer digits, and so on are terminals. Nonterminals are the symbols we can break own in our parse trees, to the point we have a string of terminals.

Each non-terminal becomes a function, and that function manages each of the options the terminal can become. And then it calls the function for whatever non-terminals it contains and returns those results, or parses and returns whatever terminal it found.

Here we have pieces of my Parser class, which has two fields to keep track of state, cursor and s. s is the string buffer, and cursor is the current spot the client is looking at. And a function that makes sure those variables are set properly for an unread, well-formed bencode bytestring.

And we have B, our most broad Bencode statement rule, which envelops all the other nonterminals, and which is always our root. It lacks terminals of its own, and just dispatches the appropriate nonterminal function for whatever character it finds. Luckily, there’s no ambiguity in bencoding, a digit means a string, an l means a list, a d means a dict, and a i means an int.

(As we can see, I found a collision between a variable name and a terminal name, and I choose to rename the terminal. What can I say? Naming things is hard.)

These two functions are relatively simple, and just translate the bytestring to their respective types with regexes and built-in python parsing functions.

And here we have the functions that put the recursion into this problem. I got a bit excited in the comments, because even after earning a degree and working at a place literally called the Recurse Center, divide and conquer recursion solutions are still exciting. It took me a bit of thinking for some reason to get how a kleene star (*) would work in logic, and I even wondered if it was legal. Then I realized a while loop would do a fine job, especially since the functions deeper in the call-stack would move the cursor to the next token of interest. While next token is not e, keep parsing. What to parse? I don’t know, it’s some flavor of B, let B() deal with it. And then it just works, whether we have an empty list, a list of dicts, a dict of lists, a list of ints, or some other wild structure.

And that’s my first parser.

See the whole mess at https://github.com/mccartykim/bittorrent/ in the file bencode.py