Ternary Search Tree : Amazing Data Structure for near-neighbour, spell check, auto-complete

Ever wondered the data structure used for a most common use-case ( Auto-Complete ) ? Searching a ‘name’ in contact list of phone, near-neighbour look up of a given word, Spell-Check, URLs auto-complete in the brower with history of urls coming up in sorted manner.

How useful it is and how common it is :)

There are so many ways to solve this.

Sorted Input list, Prefix Trees etc…

With each different way, there is trade-off of memory utilisation and complexity.

What is the most optimum way from space point of view ?

Ternary Search Tree (rather a trie) (by Jon Bentley and Robert Sedgewick) : https://en.wikipedia.org/wiki/Ternary_search_tree

A tree with 3 children. Left and Right is same as in Binary search tree. Mid child contains next letter of the word which parent is part of.

Time Complexities : Similar to Binary Search Tree

Space efficient : Better space utilisation than other prefix trees

/ | \
a u h
| | | \
t t e u
/ / | | / |
s p e l i s
l --> put a marker here to say 'hell' is a word as well

Here the words are : “as”, “at”, “cup”, “cute”, “he”, “i” and “us”.

Root node here is seen as ‘c’, just because the first word which got inserted was ‘cute’. And other words like ‘as’ will follow left child of ‘c’.

The only different thought process to understand this trie is to distinguish left and right child as NOT A PART OF THE SAME WORD which its just immediate PARENT is of.

That is, ‘a’ looks to be left child of ‘c’ but its not a part of the word which ‘c’ is a part of.

Take the case of word ‘cup’, it might look like ‘t’ is a parent of ‘p’ but it is not. While traversing from ‘c’-’u’ when you do in-order traversal, you will ignore ‘t’ and move to left child of it which is ‘p’ so the word is ‘cup’.

One more very important optimisation is : you can put a marker at a non-leaf node to say its a word. So say there is a word “HELL” and a “HELLO”. The node which contains last ‘L’ will have a marker to say its a word in itself. It saves the space for a substring.

I have committed the code at : https://github.com/sisingh/tst

I have implemented wild-card ( e.g, str* ) and not ‘*’ in between the string. Because that is not what TST is used for. You can add the code for str*str1 search.

Let me know your thoughts.