A gentle introduction to CKY parsing with a working Swift example

Parsing lies at the heart of natural language processing: it assigns syntactic structure to a sentence so that we can make sense of it. As humans, we take our extraordinary parsing ability for granted, to the extent that we don’t even know it’s there. This ability forms part of the core of human intelligence, and its details remain mysterious even today. As the capabilities of our computing systems increase, our ambitions grow. More and more, people try to attack real-world problems by leveraging language knowledge.

As programmers, we might even take programming languages for granted and think of them as just a basic facility we can all use to make cool things. Anyone familiar with compilers knows that parsing is a building block of every programming language.

One of my life goals is to figure out how the human brain processes language, but before that I find it very interesting to learn how we can do it in software. I have a strong feeling that however we solve this problem on computers, the solutions will shed some light on human linguistics. After all, we are solving problems of a similar nature.

It’s frustrating that you don’t usually find working, complete programs in a book that’s supposed to tell you everything about an algorithm. Instead, all you get is simplistic, hard-to-understand pseudo-code. That might be fine for academics, who don’t usually write that much code, but for engineers like myself, a sample written in real, working code is much better. I also feel that many books omit important details of an algorithm, making it harder to understand than it needs to be.

My goal here is to highlight some details of the algorithm that are hard to grasp (at least for me) and then provide a working code sample so everybody can play with it and learn from it.

This piece will serve as the first episode of a series dedicated to parsing algorithms.

I do not intend to give you a thorough introduction to these algorithms, since there are already plenty of books on the topic. You can get a copy of Speech and Language Processing if you are interested. I only strive to give you some useful material that complements those books.

I also do not intend to introduce rudimentary natural language processing (NLP) concepts like context-free grammars or the lexicon. You can easily find them in textbooks or online. To understand the algorithm described here, you will need to know at least some basic elements of NLP and English grammar.

The Cocke-Kasami-Younger (CKY) algorithm is a widely used parsing algorithm, and it is shockingly elegant and beautiful. The input to the algorithm is a language with a set of well-defined grammar rules and a sentence in that language. The output is all the possible trees that describe the syntactic structure of the sentence.

CKY dictates that the input grammar must be in Chomsky Normal Form (CNF) rather than an arbitrary context-free grammar. The definition of CNF is the following: the right-hand side of each rule must expand to either exactly two non-terminals or a single terminal. I will show you later how this requirement drastically simplifies the algorithm. Keep in mind that any context-free grammar can be converted into a CNF grammar without losing any expressiveness.

For the sake of this article, we define a mini language with the following production rules:

VP -> Verb NP
Verb -> book
NP -> Det Noun
Det -> that
Noun -> flight

As you can see, this grammar is CNF-compliant.

Now let’s define our sentence: “book that flight”. Our goal is to parse this sentence and obtain one or more syntactic trees describing its structure.

Now let’s start coding by defining our syntactic tree:
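What follows is a minimal sketch; the name ParseTreeNode matches its use later in the article, while the exact fields (symbol and children) are my assumptions.

class ParseTreeNode {
    let symbol: String              // a non-terminal like "NP", or a word at a leaf
    let children: [ParseTreeNode]   // empty for leaf nodes

    init(symbol: String, children: [ParseTreeNode] = []) {
        self.symbol = symbol
        self.children = children
    }
}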

As you can see, this is just the most basic tree data structure; we need nothing more than that.

You might wonder how we should represent our language in a way that is easy to work with. Let’s do it now:
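Below is a sketch of a Language type consistent with how it is used in the rest of the article. The dictionary-backed storage and the space-joined rule keys are my assumptions; the two recognize methods mirror the calls that appear later in the trace.

struct Language {
    // Lexicon: word -> possible part-of-speech tags, e.g. "book" -> ["Verb"]
    let lexicon: [String: [String]]
    // Production rules keyed by their right-hand side, with the two
    // non-terminals joined by a space, e.g. "Det Noun" -> ["NP"]
    let rules: [String: [String]]

    // Look up the part-of-speech tags of a single word.
    func recognize(word: String) -> [String] {
        return lexicon[word] ?? []
    }

    // Pair up every node from arr1 with every node from arr2 and return
    // the new nodes licensed by the grammar, or nil if no rule applies.
    func recognize(arr1: [ParseTreeNode], arr2: [ParseTreeNode]) -> [ParseTreeNode]? {
        var results: [ParseTreeNode] = []
        for left in arr1 {
            for right in arr2 {
                let rhs = "\(left.symbol) \(right.symbol)"
                for lhs in rules[rhs] ?? [] {
                    results.append(ParseTreeNode(symbol: lhs, children: [left, right]))
                }
            }
        }
        return results.isEmpty ? nil : results
    }
}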

The only confusing part of Language is the last recognize method. You might be curious why it takes arrays of trees as parameters; I will explain that later when we get to the algorithm itself.

The reason why I said CKY is shockingly simple is that it’s actually pretty dumb. It doesn’t try to do anything fancy; it just systematically explores all the possibilities. This might sound wildly inefficient, but with the aid of dynamic programming, the performance is usually very good in practice.

Now let’s forget about programming for a second and try to parse this sentence ourselves: “book that flight”.

Since we assume all the words are in our lexicon, we start by assigning a part-of-speech tag to each individual word:

book (Verb) that (Det) flight (Noun)

Now we have completed our first step: parsing substrings of the sentence with a length of one word. Let’s move on to parsing substrings of length two. We do this by looking at two words together and gradually shifting the two-word window toward the right end of the sentence.

We start with the first two words in the sentence: book (Verb) + that (Det)

According to our mini grammar, Verb + Det is not sanctioned by any production rule. So we move to the right.

that (Det) + flight (Noun)

Our mini grammar says Det + Noun will give us NP (NP -> Det Noun), so let’s write it down:

that flight (NP)

Now that we have completed the second step, let’s move on to the third one: examining three words at a time. The trouble we went through to convert the context-free grammar into Chomsky Normal Form pays off here: because CNF dictates that any non-terminal can only be composed of two sub-parts, we can safely try out all the ways of splitting a string into two parts.

To be more mathematically precise: let a and b be two numbers with a < b, where a is the start index of a substring and b is its end index. Then any split of this substring into two parts is marked by a number s between them (a to s and s to b), and the algorithm tries every such s.

For the case at hand (book that flight) we can try:

a. book + that flight 
b. book that + flight

In a, we have:

book (Verb) + that flight (NP)

According to our mini grammar (VP -> Verb NP), we have this:

book that flight (VP)

In b, “book that” doesn’t belong to our grammar, so we can stop immediately.

Therefore, our little manual parsing exercise produces the following syntactic tree:
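VP
├── Verb
│   └── book
└── NP
    ├── Det
    │   └── that
    └── Noun
        └── flight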

Believe it or not, at this point we have already covered the bulk of CKY parsing; the parser we are going to implement merely streamlines the manual process we just went through together.

In chapter 13 of the book Speech and Language Processing, the authors did a great job of introducing the algorithm in detail. What I will do here is not reinvent the wheel, but illustrate two points I think have not been made clear in the book:

1. A step-by-step trace of the algorithm on a small example, with visual aid.
2. A clear account of where dynamic programming is leveraged to speed up performance.

Let’s continue to work on our example “book that flight”, only this time we will be doing it by tracing through our code.

This is a simple implementation of a CKY parser:
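The sketch below is reconstructed to match the trace later in this article: the table layout, loop structure, and variable names (table, col, row, t, arr1, arr2) all mirror the walkthrough, while the exact function signature is my assumption.

func parse(language: Language, words: [String]) -> [[[ParseTreeNode]]] {
    let n = words.count

    // An n x n matrix of ParseTreeNode arrays; cell [row][col] collects
    // the parses of the substring spanning words row through col.
    var table: [[[ParseTreeNode]]] = Array(
        repeating: Array(repeating: [], count: n),
        count: n
    )

    // Fill the diagonal with part-of-speech nodes for the individual words.
    for i in 0..<n {
        for tag in language.recognize(word: words[i]) {
            let leaf = ParseTreeNode(symbol: words[i])
            table[i][i].append(ParseTreeNode(symbol: tag, children: [leaf]))
        }
    }

    for col in 0..<n {
        for row in stride(from: n - 1, through: 0, by: -1) {
            // Try every split point t of the substring row...col.
            var t = row
            while t < col {
                let arr1 = table[row][t]
                let arr2 = table[t + 1][col]
                if arr1.count == 0 || arr2.count == 0 {
                    t += 1
                    continue
                }
                if let nonTerminals = language.recognize(arr1: arr1, arr2: arr2) {
                    for nonTerminal in nonTerminals {
                        table[row][col].append(nonTerminal)
                    }
                }
                t += 1
            }
        }
    }
    return table
}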

I find Swift’s optional binding especially handy when implementing algorithms like this.

The method takes a language as a parameter, along with a sentence (broken into words) in that language. It returns a matrix of ParseTreeNode arrays. Don’t be intimidated by the matrix; it’s just an intuitive device for carrying out CKY, as you will find out soon enough during our trace.

The first thing we do is create a matrix of ParseTreeNode arrays of the appropriate size.
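In the sketch above, that is the table initialization:

let n = words.count
var table: [[[ParseTreeNode]]] = Array(
    repeating: Array(repeating: [], count: n),
    count: n
)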

Immediately after that, we convert each word into the part-of-speech tags specified in our lexicon.
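That is the diagonal-filling loop from the sketch:

for i in 0..<n {
    for tag in language.recognize(word: words[i]) {
        let leaf = ParseTreeNode(symbol: words[i])
        table[i][i].append(ParseTreeNode(symbol: tag, children: [leaf]))
    }
}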

Why does recognize return an array of strings instead of just one? Because in natural language it’s very common for a word to have multiple part-of-speech tags. We call this ambiguity. For example, the word “book” can be both a verb and a noun.

After these preparations, we have a matrix that looks like this:
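        col 0     col 1     col 2
row 0   [Verb]    []        []
row 1   []        [Det]     []
row 2   []        []        [Noun]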

Now let’s take a look at where the magic happens.
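These are the nested loops from the sketch above:

for col in 0..<n {
    for row in stride(from: n - 1, through: 0, by: -1) {
        var t = row
        while t < col {
            let arr1 = table[row][t]
            let arr2 = table[t + 1][col]
            if arr1.count == 0 || arr2.count == 0 {
                t += 1
                continue
            }
            if let nonTerminals = language.recognize(arr1: arr1, arr2: arr2) {
                for nonTerminal in nonTerminals {
                    table[row][col].append(nonTerminal)
                }
            }
            t += 1
        }
    }
}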

The outer for loop iterates over columns, and the inner for loop iterates over rows backwards. As we will see later, this order ensures that we can boost performance using [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming). The inner while loop then systematically tries out all the two-part combinations of the current substring. To get a clearer understanding, let’s trace the algorithm on “book that flight”.

col = 0, row = 2, t = row = 2, while loop terminates immediately
col = 0, row = 1, t = row = 1, while loop terminates immediately
col = 0, row = 0, t = row = 0, while loop terminates immediately
col = 1, row = 2, t = row = 2, while loop terminates immediately
col = 1, row = 1, t = row = 1, while loop terminates immediately
col = 1, row = 0, t = row = 0, enter the while loop:

arr1 = table[row][t] = table[0][0]
arr2 = table[t + 1][col] = table[1][1]
arr1 is now an array containing one node whose non-terminal is “Verb”. arr2 is also an array containing one node, whose non-terminal is “Det”.

language.recognize(arr1: arr1, arr2: arr2)

This will return nil because the combination Verb + Det is not sanctioned by any of our production rules. So we move on to the next iteration.

col = 2, row = 2, t = row = 2, while loop terminates immediately
col = 2, row = 1, t = row = 1, enter the while loop

arr1 = table[row][t] = table[1][1]
arr2 = table[t + 1][col] = table[2][2]
arr1 is now an array containing one node whose non-terminal is “Det”. arr2 is also an array containing one node, whose non-terminal is “Noun”.

language.recognize(arr1: arr1, arr2: arr2)

This will return an array with one new node, thanks to the production rule NP -> Det Noun. So we place this node into table[1][2]:

if let nonTerminals = language.recognize(arr1: arr1, arr2: arr2) {
    for nonTerminal in nonTerminals {
        table[row][col].append(nonTerminal)
    }
}

Now our matrix looks like this:
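        col 0     col 1     col 2
row 0   [Verb]    []        []
row 1   []        [Det]     [NP]
row 2   []        []        [Noun]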

col = 2, row = 0, t = row = 0, enter the while loop

arr1 = table[row][t] = table[0][0]
arr2 = table[t + 1][col] = table[1][2]

It’s now easy to relate this to the manual parsing we did. We are splitting our string into two parts: word 0 (“book”) and words 1–2 (“that flight”).

table[0][0] gives an array of one node whose non-terminal is Verb, and table[1][2] gives an array of one node whose non-terminal is NP. Recall that we inserted the latter node in a previous iteration. Because we stored the NP in table[1][2], we don’t need to recalculate what should be in table[1][2]. This is exactly what I meant by CKY being a dynamic programming algorithm: by caching previously computed values, later steps can use them directly without redoing work that’s already been done.

According to the production rule VP -> Verb NP,

language.recognize(arr1: arr1, arr2: arr2)

will give us an array that contains a new node with the non-terminal VP.

At this point we have already successfully parsed our sentence. For completeness, let’s finish our trace:

Now t has been incremented to 1:

arr1 = table[row][t] = table[0][1]
arr2 = table[t + 1][col] = table[2][2]

We are now breaking our sentence into two parts in a different way: “book that” + “flight”.

Since table[0][1] does not contain any nodes, the while loop simply continues:

if arr1.count == 0 || arr2.count == 0 {
    t += 1
    continue
}

Our method will return after this iteration.

In the returned matrix, we find the root of our syntactic tree at table[0][2]. The tree produced is the same as the one from our manual parsing exercise earlier.
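Putting the sketches together, a minimal end-to-end run might look like this (the Language initializer shape follows the assumed type above):

let language = Language(
    lexicon: ["book": ["Verb"], "that": ["Det"], "flight": ["Noun"]],
    rules: ["Verb NP": ["VP"], "Det Noun": ["NP"]]
)
let table = parse(language: language, words: ["book", "that", "flight"])
print(table[0][2].map { $0.symbol })   // prints ["VP"]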

Hopefully, by tracing working code you have gained a more concrete understanding of how CKY parsing really works. Seeing how dynamic programming plays out beautifully in solving a real-world problem is also pretty gratifying.

Again, the complete code and unit tests are available on GitHub.