Necktie knots, formal languages and network security

Sartorial mathematics and why we should care about the computational complexity of internet protocols

Mikael Vejdemo-Johansson
The Message

--

Do you like neckties?

How do you like to tie them?
4-in-hand? Windsor? Half-Windsor?

Or maybe you (like me) jumped on the novelty tie knot bandwagon? Enjoy a Trinity? An Eldredge? Or any of the many, many new knots that have shown up lately?

I used to wear bowties a lot. Two years ago or so, I’d insist on always wearing a self-tied bowtie, and I gathered up a good 15-20 different ones to swap between. Then, in December 2012, Alex Krasny’s Trinity and Eldredge tutorials exploded across the internet. My wife showed me the tutorials, and I switched from bowties to neckties.

But what about the mathematics?

Once I had gotten the hang of tying these new tie knots, I started wondering about them. I had heard some time before that there was research on the possible tie knots, and so I looked it up.

Thomas Fink and Yong Mao published a Nature paper in 1999, a Physica A paper in 2000 and a book — The 85 Ways to Tie a Tie — in 2001. The book is out of print, but the papers are accessible (as scientific papers go).

Their line of attack for counting possible tie knots was to create a formal language specification of necktie knots: a way to describe any knot with a symbol sequence in such a way that you can study the collection of all valid symbol sequences. Formal languages are a mathematical way of discussing how languages work: they define what a grammar is, and they build a hierarchy of complexity for languages that still drives programming language design today.

[Figure: regions of the torso for the tie knot language]

The core observation from Fink and Mao was that a tie knot can be described by a sequence of moves, each telling you in which direction the active end (the part you use to tie) drapes over the partially constructed knot. The idea here is that the tie divides your torso into three regions: Left, Right and Center.

The system proposed by Fink and Mao worked very well for their ambitions, but in the same way that simplistic assumptions about computer systems can lead to exploitable bugs, the assumptions Fink and Mao made reduced how many tie knots seemed to exist, and missed out on many of the beautiful novel tie knots that have since appeared.

For legibility, I will describe their language using the symbols proposed by Carlos Castillo-Gorsaw:

Starting the tie
SiW Drape [S]eam [i]n, tie with [W]ide end
SiN Drape [S]eam [i]n, tie with [N]arrow end
SoW Drape [S]eam [o]ut, tie with [W]ide end
SoN Drape [S]eam [o]ut, tie with [N]arrow end

Continuing the tie
oL Wrap [o]ver the knot to your [L]eft
oR Wrap [o]ver the knot to your [R]ight
oC Wrap [o]ver the knot to your [C]enter
uL Wrap [u]nder the knot to your [L]eft
uR Wrap [u]nder the knot to your [R]ight
uC Wrap [u]nder the knot to your [C]enter
T [T]uck under a loop

We can describe the 4-in-hand in this language as:

SiW oL uR oL uC T

Fink and Mao articulate several rules for combining these symbols into a tie knot:

  1. A tie knot starts with either SiW oL or SoW uL.
  2. A tie knot ends with either uR oL uC T or uL oR uC T.
  3. No two adjacent under moves are allowed.
  4. No two adjacent over moves are allowed.
  5. No two consecutive moves may go to the same L/R/C region.

Using these rules, and the observation that a tie knot going on for more than 9 moves uses up too much cloth, they were able to generate all sequences that satisfy all the rules. There are 85 of them. They also proposed a string rewriting test for knottedness of the tie: will it dissolve automatically as you pull it apart? And they proposed a pair of numeric quantities that correspond somewhat to the aesthetics of the tie knot. Based on these quantities, they uncovered a list of 9 nice-looking tie knots, greatly extending the collection known at the time.
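To make this concrete, here is a brute-force sketch of the enumeration in Python. This is my reading of the rules above, not Fink and Mao’s own code: region sequences start by going Left, never revisit the region they are already in, and end with the regions R, L, C or L, R, C before the final tuck. The o/u pattern is then forced, since the final winding is an under move and overs and unders alternate.

from itertools import product

def fink_mao_knots(max_moves=9):
    # Brute force over region sequences; each valid sequence yields
    # exactly one knot, because the o/u annotations are forced by parity.
    knots = []
    for n in range(3, max_moves + 1):
        for seq in product("LRC", repeat=n):
            if seq[0] != "L":
                continue  # rule 1: the first move always goes Left
            if any(a == b for a, b in zip(seq, seq[1:])):
                continue  # rule 5: never move to the region you are in
            if "".join(seq[-3:]) not in ("RLC", "LRC"):
                continue  # rule 2: the ending before the final tuck
            knots.append("".join(seq))
    return knots

knots = fink_mao_knots()
print(len(knots))  # 85
print(knots[0])    # LRC, tied SoW uL oR uC T: the small Oriental knot

Counted by length, the knots of 3 to 9 moves come out as 1, 1, 3, 5, 11, 21 and 43, adding up to 85.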

But what about all the new knots?

My own interest was most definitely piqued when I realized that the new knots that had caught my attention did not occur in this list.

Take the Trinity, for instance. In this notation system, we can write it down as

SiN oL uC oL uR oC uR oL uC T oR uL T

There are two problems with this knot under the Fink and Mao rules:
It doesn’t start right.
And it doesn’t end right.

It also has that sudden T in the middle, an option that was not taken into account when enumerating the 85 knots.

In particular, the prescribed starts mean that you could only really tie knots with the wide end. And the prescribed ends mean that you could only really tie knots that end with a flat façade: where you stretch the tie fabric across the knot, hiding all the previous work, before finishing the knot.

Both of these go straight against the grain of what was interesting about the new knots. So I set out — and after a while recruited Meredith L Patterson, Anders Sandberg, Ingemar Markström and Dan Hirsch to help me — to revisit the enumeration. Our work has passed peer review, and is due to appear in PeerJ Computer Science by the end of the month. Clearly, if there are knots that the 85 did not take into account, then a fresh enumeration should find more.

Simplifying the language

The choice of a language, the choice of a grammar, constrains the possible output from a system. When describing necktie knots, the language limits output to what we expect to be valid tie knots — and preferably all of them. When analyzing computer security, the span of possible output or breadth of possible input defines the ease with which you can eliminate security-critical bugs and problems.

The language Fink and Mao proposed is very nice for building instructions for tying ties. It is less nice for theoretical analysis — and a theoretical analysis was what we needed to do to our ties.

The main problem in analyzing the language as stated was all the redundancy: the rules excluding repetitions guarantee that any of the obvious ways to count all possible strings would overcount. We would pick up, and then have to exclude, strings that look plausible but break the rules.

This can be fixed, however, by removing the redundancy.

Our first fix was to scrap the o/u annotations. As it turns out, this eliminates some novel tie knots — but if we are trying to establish a higher lower bound than 85, maybe we can survive capturing most but not all of the new knots? Our new assumption is that o/u always alternate and that the tie always ends with a tuck in the front. With this assumption we can just count backwards from the end and add alternating o/u annotations.

For the second step, we realized that the L/R/C progression and the rule about not repeating zones (another rule that eliminates some novel tie knots) mean that from any one zone we have exactly two directions to go; and these directions fit neatly into the language of clockwise / counterclockwise.

Now, CW and CCW make for awkwardly similar and unwieldy symbols — so we decided to write Turnwise for clockwise and Widdershins for counterclockwise.

This all reduces the language to three symbols:

T — Turnwise move: L -> C -> R -> L
W — Widdershins move: L -> R -> C -> L
U — Tuck the tie under itself
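To see how the two notations line up, here is a small Python sketch that expands a T/W/U string back into the earlier instruction style. Two conventions here are my assumptions rather than anything from the papers: the blade starts out hanging in the Center region, and the o/u annotations are filled in backwards from the end, the last winding before the final tuck being an under move.

# successor regions for each winding move, as in the list above
TURNWISE = {"L": "C", "C": "R", "R": "L"}     # T
WIDDERSHINS = {"L": "R", "R": "C", "C": "L"}  # W

def to_instructions(knot, start="C"):
    # walk the regions: assume the blade hangs at Center, then
    # follow each T or W move around the knot
    windings = [c for c in knot if c in "TW"]
    regions, here = [], start
    for c in windings:
        here = (TURNWISE if c == "T" else WIDDERSHINS)[here]
        regions.append(here)
    # o/u alternate; counted backwards, the last winding is an under
    n = len(windings)
    ou = ["u" if (n - 1 - i) % 2 == 0 else "o" for i in range(n)]
    moves, w = [], 0
    for c in knot:
        if c in "TW":
            moves.append(ou[w] + regions[w])
            w += 1
        else:
            moves.append("T")  # a tuck, in the earlier notation
    return " ".join(moves)

print(to_instructions("WWTTU"))  # oL uR oL uC T: the 4-in-hand

With these (assumed) conventions, the Trinity from earlier becomes WTWWWTTTUTTU, which the function expands back to the move sequence written above, minus the SiN start symbol that the simplified language drops.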

Several knots, such as the Christensen / Cross described by Fink and Mao, or the Allwin and Van Wijk, tuck deeper: under two, three, even four earlier bows across the knot. We denote these by simply repeating the U symbol.

As long as you only care about single-depth tucks, this new language is easy to handle. There is only one rule:

  1. U is valid after either TT or WW

So the entire language of single-tuck tie knots can be described by the following regular expression:

(T | W | TTU | WWU)*

We can easily generate all possible combinations of T/W up to a certain length — and we can easily find all occurrences of TT or WW in a generated string of Ts and Ws.
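In Python, for instance (a sketch; the cap of 12 winding moves is my reading of the Eldredge-sized bound used below):

import re
from itertools import product

KNOT = re.compile(r"(T|W|TTU|WWU)*")  # the regular expression above

# every T/W sequence of at most 12 winding moves ending in TT or WW
windings = ["".join(p)
            for n in range(2, 13)
            for p in product("TW", repeat=n)
            if p[-1] == p[-2]]
print(len(windings))  # 4094

# appending the mandatory final tuck keeps each one in the language
assert all(KNOT.fullmatch(s + "U") for s in windings)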

We did this. If you take the Eldredge as a gold standard for how many moves you can fit on a normal-sized tie, then there are 4,094 sequences of Ts and Ws that end with either TT or WW. Summing up all possible choices of putting Us into these sequences, we get a grand total of 24,882 possible single-tuck tie knots.

24,882.

That’s a bit more than 85.

If we allow ties to tuck deeper than just single tucks, the number grows: we counted 266,682 of these tie knots.

Not only can we count ties, but since it is so easy to work with regular languages and regular expressions, we can build a toy to generate random knots. So we did.
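As a stand-in for the actual toy, here is a minimal sketch of the idea in Python; the tuck probability and the size limits are arbitrary choices of mine:

import random

def random_knot(min_windings=4, max_windings=12):
    # draw a random windings sequence, force the last two moves to
    # agree so the final tuck is valid, and sprinkle in optional
    # tucks wherever two like moves have just happened
    n = random.randint(min_windings, max_windings)
    moves = [random.choice("TW") for _ in range(n - 1)]
    moves.append(moves[-1])  # ensure the sequence ends in TT or WW
    out, since_tuck = [], 0
    for i, m in enumerate(moves):
        out.append(m)
        since_tuck += 1
        # a mid-knot tuck needs two like windings since the last
        # tuck, and must leave room for the final TT or WW
        if (since_tuck >= 2 and moves[i - 1] == m
                and i < n - 2 and random.random() < 0.3):
            out.append("U")
            since_tuck = 0
    out.append("U")  # the mandatory final tuck in front
    return "".join(out)

print(random_knot())  # e.g. WTTUWWTTU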

This is as far as you have to read if all you care about are tie knots.
Next up, we’ll talk formal languages and eventually network security.

Formal languages and complexity classes

If you are trying to prove things about computers, there are some famous models to use. It turns out that any sort of computation can be reinterpreted as trying to determine whether a string of symbols fits in a language defined by some abstract grammar, and computational complexity corresponds to the complexity of the grammar itself.

Chomsky (1956) proposed a hierarchy of language types that captures the different theoretical levels of complexity that show up.

Regular languages: These are the simplest languages. Any finite language fits here. So does any language that can be recognized without using memory, and without going back to check again. All regular languages can be described by regular expressions (using string equality, branching between different options, repetition and optional presence — but not back-references) and they generate highly efficient computer programs.

Searching for text in documents typically falls within regular languages: even somewhat more complex searches, like “find me all instances of «a» that precede a vowel”, are easy to capture with a regular language. This is also the complexity class where Fink and Mao’s tie-knot language falls.
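In Python’s regular expressions, that example is a one-liner. The lookahead only peeks a single bounded character ahead, so the pattern stays regular:

import re

text = "an aorta, an aeon, and some aioli"
# every 'a' that immediately precedes a vowel
print(re.findall(r"a(?=[aeiou])", text))  # ['a', 'a', 'a']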

Regular languages fail on relatively simple tasks though: recognizing the language of all palindromes, for instance, is impossible with a regular language — because we cannot remember the part we need to match against. Thus, searching for all palindromes in a text document requires more computational complexity.

Context-free languages: These are regular languages with a bit of memory. A context-free language is allowed to use a single stack for storage to remember things for later — and thus is able to recognize palindromes.
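Here is a sketch of what that one stack buys you, for even-length palindromes. A real pushdown automaton guesses the midpoint nondeterministically; this Python stand-in cheats by computing it, but the memory discipline is the same:

def is_even_palindrome(s):
    if len(s) % 2:
        return False
    stack = []
    for ch in s[: len(s) // 2]:
        stack.append(ch)  # first half: remember it on the stack
    for ch in s[len(s) // 2:]:
        if not stack or stack.pop() != ch:
            return False  # second half must mirror the first
    return not stack

print(is_even_palindrome("abba"))  # True
print(is_even_palindrome("abab"))  # False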

Context-free languages can be found in DTDs: the formal specifications of XML-based document types. Hence, HTML, XHTML, SVG, RSS, and many of the technologies that build the internet are built on context-free specifications.

We were able to prove that no matter how deep you allow your tucks, even working with an arbitrarily long, infinitely thin tie, the language that describes the possible tie knots is context-free.

Context-sensitive languages: These are more complex; they are recognized by Turing machines with (linear) bounded memory usage.

Anything that has a variable-length encoding will very likely end up being a context-sensitive language. Thus, examples here include HTTP — the communication system under the hood of web browsing — as well as protocols like ASN.1 and X.509 that form the backbone of encrypted and secure websurfing. We’ll return to these examples later; Meredith Patterson, together with Len Sassaman and Dan Kaminsky, used the high complexity of these protocols to demonstrate a large number of separate security issues.
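The classic culprit is the length-prefixed field: read a length, then read that many bytes. Here is a toy tag-length-value reader with the rough shape of ASN.1’s encoding; the real length rules are considerably hairier, so treat this as an illustration of the pattern, not of the standard:

def read_tlv(buf):
    # 'the value is as long as the length byte says' is exactly the
    # kind of context-sensitivity that a regular or context-free
    # specification cannot express
    fields, i = [], 0
    while i < len(buf):
        if i + 2 > len(buf):
            raise ValueError("truncated header")
        tag, length = buf[i], buf[i + 1]
        value = buf[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("length overruns the buffer")
        fields.append((tag, value))
        i += 2 + length
    return fields

print(read_tlv(bytes([0x04, 0x03, 0x61, 0x62, 0x63])))  # [(4, b'abc')]

In a memory-unsafe language, skipping that overrun check is exactly how a parser ends up reading memory it was never meant to see.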

Recursively enumerable languages: The most general of Chomsky’s classes, these are recognized by unbounded Turing machines.

Many programming languages are exactly as powerful as an unbounded Turing machine. This is the class where everything that doesn’t happen to be simpler can be found.

We have a formal language for ties.
There is a way to classify formal languages.
So where do tie knots fall on this scale?

The boring answer is that because ties have finite length, there are only finitely many tie knots, and any finite language is regular.

A far more interesting question considers infinitely long, infinitely thin ties. Here, we can show that with slight restrictions, we still get a regular language:

Theorem: If tucks at most k levels deep are allowed, then the language is regular, and can be recognized by a finite state machine with O(2^(2k)) states.

If we don’t restrict tuck depths, then the complexity class goes up. The point is that the TTU/WWU rule expands into a rule connecting U^k to the balance of #T vs. #W (mod 3) in the preceding 2k T/W symbols.

There is a context-free grammar that captures these knots: a grammar where you keep rewriting just one single symbol the same way, no matter where that symbol sits. Next, we have to figure out whether we could possibly simplify this to a regular grammar.

Regular grammars all obey a certain property, called the pumping property: you can split any sufficiently long valid string into pieces such that one piece can be repeated over and over again, always producing valid strings. We are able to write down one specific tie knot that won’t allow this: the repeatable piece would have to be part of a tuck description, and so repeating it would eventually tuck deeper than the existing knot allows. This language cannot possibly be regular.

Hence, tie knots are context-free, but not regular.

And what about security?

I did promise something about network security in the title of this piece, didn’t I?

The connection is this: Provable security of the bits and pieces of software that make up this internet of ours is intrinsically connected to the Chomsky hierarchy of languages. Any protocol, specifying some forms of communication as valid and others as invalid, specifies a language. Most of these specifications do so in a standardized way, defining grammars for these formal languages.

If you try to implement a parser — any sort of code that tries to understand external input — then this matters to you. If your parser uses techniques from one level of the hierarchy, while the language it parses comes from a higher level, then whatever you parse is not going to be what you were trying to parse. Parsing XML with regular expressions is bad, because the nesting of XML tags means that XML is not a regular language. Your parser needs more complexity to be able to cope.
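A quick illustration of the XML point, with hypothetical markup (any regex of this shape fails the same way once nesting appears):

import re

# looks like it extracts the contents of <b>...</b> elements
shallow = re.compile(r"<b>([^<]*)</b>")

print(shallow.findall("<b>bold</b>"))        # ['bold']: looks right
print(shallow.findall("<b>a<b>b</b>c</b>"))  # ['b']: nesting silently mangled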

Similarly, if you try to parse something context-sensitive — like, say, ASN.1 and most other formats with dynamic length allocations — and you use something that cannot deal with context-sensitive grammars? You won’t be parsing what you thought you were.

There will be bugs. And bugs are (often) exploitable.

The story gets worse. Suppose you have two grammars G and H — say the implicit grammars of two distinct implementations of the same underlying protocol. A natural question would be

Is the language specified by G identical to that specified by H?

In fact, when we try to avoid bugs rooted in misinterpreting protocol packets, this is a pretty crucial question to deal with. If the sender and the receiver actually work with different languages, there will be misunderstandings. These generate bugs that, in turn, might generate exploits.

Here’s the good news: this question can be answered if G and H both specify languages that are regular or deterministic context-free.

Here’s the bad news: this question is undecidable in all more complex cases.

So you think you have written an ASN.1 parser? You think it conforms to the standard? Because of the inherent complexity of the language, your parser may well overlap with the standard quite a lot — but it’s likely that you either admit things you shouldn’t, or reject things you shouldn’t.

And this breaks things.

For more on this final perspective, I refer you to langsec.org.

--


Mikael Vejdemo-Johansson
The Message

Applied algebraic topologist with widespread interests: functional programming, cooking, music, LARPs, fashion, and much more.