Trie: from theory to practice

Makhmud Islamov Sunnatovich
7 min readDec 16, 2019

--

For my CS 2.1 class I was assigned to implement a tree structure in one of my old applications. I used prefix tree in my Markov chain based “Matrix Quote Generator”. In this article I will go though the data structure, it’s operations and how I implemented it.

Right of the bat, let’s clarify the word trie and what is it!

The name trie comes from its use for retrieval. It is pronounced like “try” by some, like “tree” (the pronunciation of “trie” in “retrieval”) by others. Here, we will discuss a particular implementation of a trie, which may be somewhat different than how it is described elsewhere.

We use a trie to store pieces of data that have a key (used to identify the data) and possibly a value (which holds any additional data associated with the key).

Here, we will use data whose keys are strings.

Suppose we want to store a bunch of name/age pairs for a set of people (we’ll consider names to be a single string here).

Here are some pairs:

amy	56
ann 15
emma 30
rob 27
roger 52

Now, how will we store these name/value pairs in a trie? A trie allows us to share prefixes that are common among keys. Again, our keys are names, which are strings.

Let’s start off with amy. We’ll build a tree with each character in her name in a separate node. There will also be one node under the last character in her name (i.e., under y). In this final node, we’ll put the nul character (\0) to represent the end of the name. This last node is also a good place to store the age for amy.

      .     <- level 0 (root)
|
a <- level 1
|
m <- level 2
|
y <- level 3
|
\0 56 <- level 4

Note that each level in the trie holds a certain character in the string amy. The first character of a string key in the trie is always at level 1, the second character at level 2, etc.

Now, when we go to add ann, we do the same thing; however, we already have stored the letter a at level 1, so we don’t need to store it again, we just reuse that node with a as the first character. Under a (at level 1), however, there is only a second character of m…But, since ann has a second character of n, we’ll have to add a new branch for the rest of ann, giving:

     .
|
a
/ \
m n
| |
y n
| |
\0 56 \0 15

Note: Again, ann’s data (an age of 15) is stored in her last node.

Now, let’s add emma. Remember e is the first character and should go at level 1. Since there is no node with character e at level 1, we’ll have to add it. In addition, we’ll have to add nodes for all the other characters of emma under the e. The first m will be a child of the e, the next m will be below the first m, etc., giving:

          .
/ \
a e
/ \ |
m n m
| | |
y n m
| | |
\0 56 \0 15 a
|
\0 30

Now, let’s add the last two names, namely rob and roger, giving:

              .
/ | \
a e r
/ \ | |
m n m o
| | | / \
y n m b g
| | | | |
\0 56 \0 15 a \0 27 e
| |
\0 30 r
|
\0 52

Because the key for each piece of data is a sequence of characters, we will sometimes refer to that sequence as the keys (plural) for that data. For example, ann’s data is referenced using the keys a, n, n (in that order).

To better understand how a trie works, answer the following questions.

  • What would the trie look like if we now added anne with age 67? How about ro with age 23?
  • Would the trie look different if we added the names in a different order, say: rob, ann, emma, roger, amy?
  • Is this a binary tree, tertiary tree or what? In other words, each node has at most how many children?

Trie Applications: When and Why Use Tries

Logarithmic performance and linear memory isn’t bad. But there are a few more characteristics of our application domain that can lead us to better performance:

  • We can safely assume that all words are lowercase.
  • We accept only a-z letters — no punctuation, no hyphens, no accents, etc.
  • The dictionary contains many inflected forms: plurals, conjugated verbs, composite words (e.g., house –> housekeeper). Therefore, many words share the same stem.
  • Words have a limited length. For example, if we are working on a 4x4 board, all words longer than 16 chars can be discarded.

This is where the trie (pronounced “try”, do you remember?) comes in. Tries are neglected data structures, found in books but rarely in standard libraries.

For motivation, let’s first consider Computer Science’s poster child: the binary tree. Now, when we analyze the performance of a binary tree and say operation x is O(log(n)), we’re constantly talking log base 2. But what if, instead of a binary tree, we used a ternary tree, where every node has three children (or, a fan-out of three). Then, we’d be talking log base 3. (That’s a performance improvement, albeit only by a constant factor.) Essentially, our trees would become wider but shorter, and we could perform fewer lookups as we don’t need to descend quite so deep.

Taking things a step further, what if we had a tree with fan-out equal to the number of possible values of our datatype?

This is the motivation behind the trie. And as you may have guessed, a trie is indeed a tree, a trie tree so to speak!

But, contrary to most binary-trees that you’d use for sorting strings, those that would store entire words in their nodes, each node of a trie holds a single character (and not even that, as we’ll see soon) and has a maximum fan-out equal to the length of the alphabet. In our case, the length of the alphabet is 26; therefore the nodes of the trie have a maximum fan-out of 26. And, while a balanced binary tree has log2(n) depth, the maximum depth of the trie is equal to the maximum length of a word! (Again, wider but shorter.)

Within a trie, words with the same stem (prefix) share the memory area that corresponds to the stem.

To visualize the difference, let’s consider a small dictionary made of five words. Assume that the Greek letters indicate pointers, and note that in the trie, red characters indicate nodes holding valid words.

Tree vs Trie. Courtesy of TopTal

Trie Operations (in Python)

1. contains(string)

Return True if this prefix tree contains the given string.

2. insert(string)

Insert the given string into this prefix tree.

3. find_node(string)

Return a tuple containing the node that terminates the given string in this prefix tree and the node’s depth, or if the given string is not completely found, return None and the depth of the last matching node. Search is done iteratively with a loop starting from the root node.

4. complete(self, prefix=’’)

Return a list of all strings stored in this prefix tree that start with the given prefix string.

5. traverse(self, node, prefix, visit)

Traverse this prefix tree with recursive depth-first traversal. Start at the given node and visit each node with the given function

How I tried to implement trie?

For implementation of trie, I decided to revisit my old “Tweet Generator” now “Matrix Quote Generator” project.

The app takes source text, cleans it our from whitespace, punctuation and other symbols. Then my Markov chain function builds histogram (dictionary implementation) of user chosen order. From there, sentence generator function takes Markov chained histogram and generates sophisticated sentences. The sentences may not be super grammatical but they make sense. Side note: The original “Tweet Generator” supposed to produce funny drunk tweets, which usually are lack grammar but full of unintentional humor.

When it came to prefix tree implementation, I just utilized autocomplete function that we worked on during our CS class. Autocomplete takes prefix and returns full words that we have in given source text.

Final state of my “Matrix Quote Generator” allows user to do the following:

takes user input (prefix) on client >>> the app retrieves all words that contain given prefix >>> user chooses one word to generate a quote >>> the app generates random quote. If there is no word with given prefix, app will just generate random quote.

If you want to play around with the app here is the link. Please be patient with Heroku. And check out the code base as well if you are interested.

--

--