Bad programmers worry about the code.
Good ones worry about data structure and their relationships.
Strings can essentially be viewed as the most important and common topics for a variety of programming problems. String processing has a variety of real world applications too, such as:
- Search Engines
- Genome Analysis
- Data Analytics
All the content presented to us in textual form can be visualized as nothing but just strings. There are many applications which use strings as keys. (For example: Instagram uses your username as a key to identify you.) These applications can be made easily and more efficiently using the TRIE data structure. Before we learn about tries, let’s learn about prefixes.
The prefix of a string is nothing but any n letters n≤|S| that can be considered beginning strictly from the starting of a string. For example, the word “abacaba” has the following prefixes:
Now, we are all set to start!
What is a trie?
Formally defining, a trie, also called digital tree or prefix tree, is a kind of search tree — an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings.
The term “trie” comes from the word retrieval (the purpose for which this structure is generally used), and is usually pronounced “try”, to distinguish it from other “tree” structures.
A Trie is a special data structure used to store strings that can be visualized like a graph. It consists of nodes and edges. Each node consists of at max n children and edges connect each parent node to its children. These n pointers are nothing but pointers for each of the n characters of the alphabet used. A separate edge is maintained for every alphabet.
Unlike a binary search tree, no node in the tree stores the key associated with that node. Instead, each node contains the character which comes after the word formed by its parents in the possible key it leads to. Its position in the tree defines the key with which it is associated; i.e., the value of the key is distributed across the structure. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Keys tend to be associated with leaves, though some inner nodes may correspond to keys of interest. Hence, keys are not necessarily associated with every node.
A trie can be seen as a tree-shaped deterministic finite automaton. A basic trie looks like this:
(In the example shown, resultant keys are listed beside the nodes and values below them)
As shown above, all nodes which have no children, are always leafs (i.e., are keys and have corresponding values). However, the internal nodes may or may not act as leaves (i.e., are keys and have corresponding values). Like, in the example shown, “ac”, “apt”, and “do” have to be keys as they have no children, additionally, even “ap” is a key.
Note:- Every node in a trie has n number of children (where n is the number of characters in the alphabet used. Here, 26) even though they are not shown here to avoid clumsiness.
Before diving into the operations, we should first know it’s merits and limitations over monolithic string arrays so that we know when to use it and when not to.
- The biggest merit of a trie is that the time required for a search query is dependent only on the length of the word and not the number of words.
- With Trie, we can insert and find strings in O(L) time where L represents the length of a single word. This is obviously faster than BST(which takes O(L log n) time where n is the number of words). This is also faster than Hashing because- there are no collisions of different keys in a trie; buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value; there is no need to provide a hash function (which believe me, is one of the most difficult tasks) or to change hash functions as more keys are added to a trie .
- Another advantage of Trie is, we can easily print all words in alphabetical order which is not easily possible with hashing.
- Tries can quickly answer queries about words with shared prefixes, like- how many words start with “choco”? What’s the most likely next letter in a word that starts with “strawber”? This is the best method to use for predictive typing (or auto-complete)
Limitations (The major ones I could come up with):-
- The main disadvantage of tries is that they need a lot of memory for storing the strings. For each node we have too many node pointers(equal to number of characters of the alphabet). However, this can turn into an advantage when working with a set of strings where a lot of strings have common prefixes. For example- house, housekey, housekeep, housekeeper, etc.
- This does not work well with floating point numbers as keys.
- Most languages don’t come with a built-in trie implementation. You’ll need to implement one yourself. However, I don’t consider this to be a major limitation, I have just listed it for the sake of completeness. Because, where is the fun in coding if you keep using built-in services and can’t create for yourself.
Now that we know the merits and limitations of tries, and we can perceive when to use them, let’s dive into the operations.
In order to look for a key (say x) in a trie, what we basically do is, extract characters from x one by one and check if the block in the array containing links to the children nodes at index corresponding to our current character points to null or not. If it points to null, the key does not exist, and if it points to another node, we traverse to that node, extract the next character , and repeat the process.
However, there is a catch to this. Even if we find x in our trie, if the value stored in it is null, the key still does not exist. For example, our trie has a key- “algorithm”, and we are searching for “algo”. We obviously get a success while looking for “algo”, but it is not a key. Thus, this condition is a very important condition while searching and can not be skipped.
Time Complexity: O(L) where L is the length of x.
Example: Let’s look for ap in the trie shown above.
This is how we go about the process:-
In the above image, the current character at each node is underlined.
We let the user know that “ap” exists in the trie because there is a value assigned to it. On the other hand, if we were looking for just “a”, we would have to report that the key does not exist even though the prefix is present.
While inserting a key to the trie, we go through the above process. That is, we search if the key is present in the trie or not, because it is possible that we are inserting a key which is the prefix of another key in the trie. Another reason for doing this is that there is a possibility that our new key shares a common prefix with another existing key.
While searching, if the key exists as a prefix, we simply update the value of our current node. If the key does not exist as a prefix, we stop at the node and the character we could not find, and start making new nodes with values as null for each of the characters left(except, we assign the given value to the node corresponding to the last character).
Time Complexity: O(L) where L is the length of the new key.
Example: Let’s insert for apolo in the trie shown above.
This is how we go about the process:-
In the above image, newly inserted nodes are shown in blue, and the key to be inserted is enclosed in a rectangle with the current character underlined. The resultant key for each node is not shown to avoid clumsiness.
In the above example, if we had to insert just “a”, we could have simply changed the value of the first child of our root(labelled “a”) to the given value; and if we had to insert “aca”, we would simply go to the node corresponding to “ac” which is a leaf, and create a child with the given value for the node and label it “a”.
To delete a key, we do not delete the node corresponding to the key as it might have some children which still contain a key. Instead, we simply have to search for it and set it’s value to null.
However, to improve efficiency, if the node corresponding to the key has no children or all it’s children have null values, we might also delete the entire node.
Time Complexity: O(L) where L is the length of the key to be deleted.
Example: Let’s delete apt from the trie we started with.
This is how we go about the process:-
In the above image, the key to be deleted is enclosed in a rectangle with the current character underlined. The resultant key for each node is not shown to avoid clumsiness.
In the above example, we first change the value stored in our resultant node to n. Then, since it has no children, we delete the node itself, which results in the following tree:
However, if we had to remove “ap” instead of “apt”, we would not have deleted the node.
Okay, this is cool, but when and where can I use this?
As mentioned above, this structure works best when we are using strings as keys. The performance in terms of time is already among the best. However, in terms of space, it could use some improvements.
A notable point mentioned above is that the problem with its performance in terms of space is solved when using strings with similar prefixes.
Another very cool and notable point is, generally when working with strings as keys, the length of the words can be a fixed number. For example, your college identity or job id is always of fixed length. This vastly improves the search time as now, we are looking at a constant number of steps required to search any possible key, i.e., constant search time. Some examples of such a case include:
- My college identity(9 characters)
- My school identity(6 characters)
- Credit card numbers(16 characters)
Such cases even result in many common prefixes. For example, the first 2 characters in the identity for all the people who joined my school or college in a specific year would be the same.
Now let’s look at an example to see why Tries are a good substitute to hashing when working with strings:
Let’s say you are making a platform like Instagram where you let people choose whatever username they want to.
If you use hashing, first of all, you would have to come up with a really good hash function which you could use so that you get the least number of collisions possible. Still, there is a possibility of some occasional collisions which would require you to use a collision handling method.
Now let’s look at the time complexity of this method:
O(L) — to compute the hash
O(n) — for collision handling during insertion, searching and deletion(worst case)
Now, if you use a trie, all you would have to do is create the required trie structure. All the queries could be performed upfront without the need of any pre-processing.
The time complexity of this method:
O(L) — the best, worst and average case time complexity for all the operations.
The trie is a very specialized data structure that requires more memory than trees, lists and hashes. However, what it lags in terms of space, it more than makes up for it in terms of time. When specific domain characteristics apply, it can be very effective in addressing performance optimization.