What is a Trie or Prefix tree

Linux School Tech
16 min readJun 29, 2024

--

A Trie, also known as a prefix tree, is a type of search tree used in computer science for storing a dynamic set or associative array where the keys are usually strings. Tries are particularly useful when you have to deal with a large number of strings and you need to perform operations like prefix matching, autocomplete, or searching.

A Trie, also known as a digital tree or prefix tree.

A Trie is a specialized tree data structure used for efficient retrieval of keys in a dataset. These keys are typically strings, and the links between nodes in the Trie are defined by individual characters of these strings. This structure allows for quick searches, insertions, and deletions of keys, especially when dealing with large datasets or when searching for keys with a common prefix.

Key Characteristics

  • Prefix-based Lookup: All children of a node share a common prefix of the string associated with that parent node. This makes Tries particularly useful for applications like auto-completion, spell checkers, and IP routing tables.
  • Memory Efficiency: While Tries can be space-intensive due to their nature of storing every character of a string, variations like Radix Trees (also known as Compressed Tries) optimize space usage by merging nodes with a single child.
  • Dynamic Operations: Tries support dynamic operations such as insertion, deletion, and search efficiently. The time complexity for these operations is generally O(m), where m is the length of the key.
  • Variety of Applications: Beyond string manipulation, Tries can be adapted for other types of keys, such as permutations of digits or shapes, making them versatile for various applications beyond just text processing.

Operations

Here are some common operations that can be performed on a Trie:

  • Insert: Inserting a new string into the Trie.
  • Search: Searching for a specific string in the Trie.
  • Delete: Deleting a string from the Trie.
  • Prefix matching: Finding all strings that start with a given prefix.
  • Autocomplete: Suggesting possible completions for a partially typed string.

Applications

Tries are commonly used in many applications, such as:

  • Autocomplete systems
  • Spell checkers
  • Text editors
  • Search engines
  • IP address routing (Read the suggested links at the end of the story)

Understanding the nodes and edges in a Trie tree is crucial for grasping how this data structure operates. Let’s delve deeper into these components based on the information provided in the sources.

Nodes in a Trie

  • Definition: Each node in a Trie represents a character or a part of a string. The root node, being the starting point of the Trie, represents an empty string. This hierarchical organization allows Tries to efficiently manage and retrieve strings, especially those sharing common prefixes.
  • Role in Strings: In a Trie, each node symbolizes a string, with the path from the root to that node representing the string in its entirety. This means that any path traced from the root node to another node symbolizes a word or a string, making it easy to identify and retrieve strings.

Edges in a Trie

  • Definition: The edges connecting nodes in a Trie represent individual characters. This means that traversing an edge essentially adds a character to the string being represented by the path from the root to that node.
  • Functionality: Each edge emanating from a node signifies a specific character. By following the path of characters from the root to a specific node, we can reconstruct the string associated with that path. This structural characteristic allows Tries to effectively share common prefixes among strings, leading to efficient storage and retrieval.
              (root)
/ | \
t b c
/ | \
o a a
/ \ | / \
p m t r t
| | |
h t c
|
h

In this trie:

  • The root node is empty.
  • The first level has nodes representing the letters ‘t’, ‘b’, and ‘c’.
  • The second level has nodes representing the continuation of the words: ‘to’, ‘ba’, and ‘ca’.
  • This continues until the words “top”, “tom”, “bat”, “car”, “cat” and “catch” are formed.

Each path from the root to a leaf node (or a node without further children) represents a word stored in the trie.

In a trie tree, nodes and edges have specific meanings and purposes:

Nodes

  • Root Node: The starting point of the trie, typically represented as an empty node. In the diagram, this is the node at the top labeled “(root)”.
  • Internal Nodes: These are nodes that have child nodes. They represent prefixes of the words stored in the trie. For example, the nodes labeled ‘t’, ‘b’, and ‘c’ are internal nodes.
  • Leaf Nodes: These nodes represent the end of a word in the trie. They do not have any child nodes. For example, the nodes labeled ‘e’ in “top”, ‘p’ in “tom”, ‘t’ in “bat”, ‘r’ in “car”, ‘t’ in “cat”, ‘s’ in “cats”, and ‘l’ in “catch” are leaf nodes.

Edges

  • Edges: These are the connections between nodes and represent a single character of a word. For example, the edge from the root to ‘t’ represents the character ‘t’.
  • Path: A sequence of edges from the root to a leaf node represents a complete word. For example, the path from the root through the nodes ‘c’, ‘a’, ‘t’ represents the word “cat”.

Detailed Breakdown:

Root Node

  • Empty node, the starting point of the trie.

First Level

  • Nodes: ‘t’, ‘b’, ‘c’
  • Each node at this level represents the first character of a word stored in the trie.

Second Level

  • Nodes: ‘o’ (from ‘t’), ‘a’ (from ‘b’), ‘a’ (from ‘c’)
  • These nodes represent the continuation of the words starting with ‘t’, ‘b’, and ‘c’.

Third Level

  • Nodes: ‘p’, ‘m’ (from ‘o’), ‘t’ (from ‘a’), ‘r’, ‘t’ (from ‘a’)
  • These nodes represent further continuation of the words.

Fourth Level and Beyond

  • Nodes: ‘h’ (from ‘t’), ‘c’ (from ‘t’), ‘h’ (from ‘c’)
  • These nodes represent the remaining characters of the words.

Examples:

  • Word “top”:
  • Path: root → ‘t’ → ‘o’ → ‘p’
  • Word “tom”:
  • Path: root → ‘t’ → ‘o’ → ‘m’
  • Word “bat”:
  • Path: root → ‘b’ → ‘a’ → ‘t’ → ‘h’
  • Word “car”:
  • Path: root → ‘c’ → ‘a’ → ‘r’ → ‘t’
  • Word “cat”:
  • Path: root → ‘c’ → ‘a’ → ‘t’
  • Word “catch”:
  • Path: root → ‘c’ → ‘a’ → ‘t’ → ‘c’ → ‘h’

In this trie structure, each node represents a state in the prefix of a word, and each edge represents a transition based on the next character of the word.

Another Example

let’s construct a simple example of a Trie tree using five words: “cat”, “car”, “bat”, “bar”, and “apple”. This example will demonstrate how to insert these words into the Trie and visualize the structure.

Start with an empty trie

Imagine a single root node without any children. This represents the starting point for all words.

Add the first word — “cat”

  • Traverse the trie from the root node.
  • Since “c” is the first character, check if there’s a child node with “c”. As this is the first word, there won’t be one.
  • Create a new child node under the root node labeled “c”.
  • Continue traversing. Since “a” is the second character, create a child node labeled “a” under the “c” node.
  • Finally, “t” is the last character. Create a child node labeled “t” under the “a” node. Mark this “t” node as the end of a word (often indicated by a special symbol or boolean flag).

Add the second word — “car”

  • Start traversing again. The first character “c” already exists, so move down to the “c” child node.
  • The second character “a” also exists, so move down to the “a” child node under “c”.
  • Here, “r” is different from “t” in “cat”. Create a new child node labeled “r” under the “a” node.
  • Since “r” is the last character, mark this node as the end of a word.

Add the third word — “bat”

  • Follow the same process as the previous words. Traverse to the “b” child node (created for “bar” later).
  • Since “a” is the second character, create a new child node labeled “a” under “b”.
  • Finally, add the “t” node and mark it as the end of the word.

Add the fourth word — “bad”

  • Traverse to the root. Since “b” is the first character, create a new child node labeled “b”.
  • “a” is the second character. As “bat” already has this path, no new nodes are needed here. Move down to the existing “a” node under “b”.
  • Finally, add the “d” node and mark it as the end of the word.

Add the fourth word — “bar”

  • Traverse to the root. Since “b” is the first character, create a new child node labeled “b”.
  • “a” is the second character. As “bat” already has this path, no new nodes are needed here. Move down to the existing “a” node under “b”.
  • Finally, add the “r” node and mark it as the end of the word.

Add the fifth word — “body”

  • Traverse to the root.
  • Since “b” is the first character, check if there’s a child node labeled “b” (already exists from “bat”).
  • Move down to the “b” node.
  • “o” is the second character. Create a new child node labeled “o” under “b”.
  • Continue with “d”, “y”, by creating new child nodes for each one under the previous node.
  • Mark the node for “y” as the end of the word (“body”).

If you encounter words like “bodybuilder” or “bodyguard” while adding words, treat them like any other word. Add them to the trie following the same steps as before.

in a standard trie tree, each edge is labeled with a single character because trie trees are designed for efficient retrieval and insertion of whole words, not parts of words. Here’s why labeling edges with every character is crucial:

1 — Unique Word Representation

Each character in a word acts as a unique step along a path in the trie. Labeling edges with characters allows the trie to differentiate between words that share some initial characters but diverge later.

For example, “cat” and “car” share the initial “c” but have different second characters (“a” and “ar”). By labeling edges, the trie creates separate paths for these words, making retrieval and insertion efficient.

2 — Fast Lookups

When searching for a word, you can traverse the trie by following the character-labeled edges. If you reach the end of a word (marked by a special symbol or flag), you’ve found the word.

If you reach a dead end (no edge with the next character), the word doesn’t exist in the trie. Labeling edges with characters allows for quick comparisons at each step, leading to fast lookups.

3 — Space Efficiency

While labeling every character might seem redundant, it actually saves space compared to storing entire words. Each character node can be shared by multiple words with a common prefix. This reduces memory usage compared to storing whole words independently.

Applications: Efficient Searching

One common problem that can be solved using a Trie (also known as a prefix tree) is efficient searching of a large collection of strings.

For example, suppose you have a dictionary containing millions of words and you want to efficiently check if a given string exists in the dictionary or not. A simple way would be to iterate over each word in the dictionary and compare it with the given string, but this approach has a time complexity of O(n*m), where n is the number of words in the dictionary and m is the average length of the words. This becomes impractical for very large dictionaries.

A more efficient solution is to use a Trie data structure to store the dictionary. Each node in the Trie represents a character, and a path from the root node to any leaf node corresponds to a valid string in the dictionary. To search for a string, we start at the root node and traverse down the edges labeled with the characters of the string until we reach either a leaf node (indicating the string is present in the dictionary) or a null edge (indicating the string is not present). The time complexity of this operation is proportional to the length of the string being searched, which makes it much faster than the naive approach.

Therefore, Tries are useful when dealing with problems related to storing and querying large collections of strings efficiently.

Complexity

The time complexity of efficient searching with a Trie depends on the length of the input string rather than the size of the Trie. Specifically, the time complexity of searching for a string of length `m` in a Trie is `O(m)`, since we need to traverse at most `m` nodes in the Trie to determine whether the string is present or not.

This is because each character in the input string corresponds to one level of traversal in the Trie, so the total number of steps required to find the string is equal to its length. Therefore, even for very large Tries, the time taken to perform a lookup remains constant, making them ideal for applications that require fast lookups of large datasets.

However, it’s important to note that building a Trie also requires some upfront cost. Inserting a new string into a Trie typically takes `O(m)` time, where `m` is the length of the string. So, while searches may be fast, insertions can take longer compared to other data structures like hash tables or arrays. Nevertheless, if your application involves frequent lookups and infrequent updates, then a Trie could be an excellent choice due to its logarithmic time complexity during searches.

Maximum Path Length

Here’s a list of English words sharing a common prefix:

Automated: automation, automatic, automatically
Computational: computation, computations, computational
Cryptographic: cryptography, cryptographically
Decipherable: decipher, decipherment, decipherable
Electrical: electricity, electrical, electronics
Geographical: geography, geographical, geographer
Horticultural: horticulture, horticulturist, horticultural
Interactive: interaction, interactions, interactive
Juxtaposition: juxtapose, juxtapositions, juxtaposed
Kinetic: kinematics, kinesiology, kinetic
Logistical: logistics, logistical, logistician
Mathematical: mathematics, mathematical, mathematicians
Narrative: narrative, narratives, narrator
Obstetric: obstetric, obstetrics, obstetrical
Photocopy: photocopier, photocopying, photocopied
Quizzical: quiz, quizzical, quizzes
Rhetorical: rhetoric, rhetorician, rhetorical
Scientific: science, scientific, scientist
Telegraphic: telegram, telegraphed, telegraphy
Unpredictable: predict, unpredictability, unpredictable
Vehicular: vehicle, vehicular, vehicles
Wrestling: wrestler, wrestling, wrestled
Xylophonic: xylophone, xylophones, xylophonist
Yogurtlike: yoghurt, yogurtlike, yogurts
Zoological: zoo, zoology, zoological

Based on the list of English words shared above, the maximal height (maximum path length) of the Trie Tree formed from these strings would be determined by the longest word in terms of character count. Among these words, the longest one is “unpredictable.” Including the root node, the Trie Tree would thus consist of 13 levels (including the root node).

Keep in mind that some words might overlap and share a few levels in the Trie Tree, meaning that the effective depth explored during traversals or queries might be less than the theoretical maximum height. Nonetheless, the highest level reached in the Trie Tree would still correspond to the longest word’s length plus one (for the root node).

Implementation of a Trie Tree

Here’s a Python implementation of a Trie Tree (Prefix Tree) that supports insertion and search operations. You can use this as a base class to build your desired application.

First, create a file named trie_tree.py with the following content:

class TrieNode:
def __init__(self):
self.children = {}
self.is_word = False

class TrieTree:
def __init__(self):
self.root = TrieNode()

# Add a word to the Trie
def insert(self, word: str):
node = self.root
for c in word:
if c not in node.children:
node.children[c] = TrieNode()
node = node.children[c]
node.is_word = True

# Search for a word in the Trie
def search(self, word: str) -> bool:
node = self.root
for c in word:
if c not in node.children:
return False
node = node.children[c]
return node.is_word

# Returns true if the given prefix exists in the Trie
def starts_with(self, prefix: str) -> bool:
node = self.root
for c in prefix:
if c not in node.children:
return False
node = node.children[c]
return True

if __name__ == "__main__":
trie = TrieTree()
words = ["apple", "banana", "appetizer"]

for word in words:
trie.insert(word)

print("Search apple:", trie.search("apple")) # Output: Search apple: True
print("Starts with ap:", trie.starts_with("ap")) # Output: Starts with ap: True
print("Search pineapple:", trie.search("pineapple")) # Output: Search pineapple: False

You can modify this Trie Tree implementation according to your needs. Some ideas include:

  • Implement auto-completion feature for commands or file paths.
  • Create spelling correction suggestions using a dictionary stored in the Trie.
  • Build a web scraper that extracts URLs and stores their domains in the Trie to detect malicious websites.

Remember to import the trie_tree.py module whenever you want to use the Trie Tree implementation in your Python projects. Simply add import trie_tree at the beginning of your script.

The provided code consisting of two classes: TrieNode and TrieTree. Let’s analyze both classes and the main block separately.

Class TrieNode

This class represents a single node in the Trie Tree. Its purpose is to manage the connections to its child nodes and indicate whether it marks the end of a word.

__init__(self) initializes a new TrieNode object. Child nodes are represented by an empty dictionary {}, and the Boolean value is_word indicates whether it's the end of a word (default is False).

Class TrieTree

This class manages the overall Trie Tree and provides methods for inserting and searching for words.

__init__(self) creates a new TrieTree object and initializes its root node.

insert(self, word: str) accepts a word as an argument and adds it to the Trie Tree. Starting from the root node, it iterates through the letters of the word creating new child nodes if necessary and sets the final node's is_word attribute to True.

search(self, word: str) -> bool returns a boolean value indicating whether a specified word exists in the Trie Tree. Iterating through the word's characters, it looks for corresponding child nodes until finding the last character of the word or reaching a dead end. At the end of the loop, if the last node's is_word attribute is True, the method returns True. Otherwise, it returns False.

starts_with(self, prefix: str) -> bool determines whether a given prefix exists in the Trie Tree. Similar to the search method, it iterates through the prefix's characters and checks for existing child nodes. Unlike search, it does not verify whether the last node ends a word, simply returning True if the entire prefix is found in the Trie Tree.

Main Block

The main block demonstrates usage of the Trie Tree class by inserting several words, then testing whether certain words exist or whether particular prefixes are present in the Trie Tree.

  1. Creating a TrieTree object named trie.
  2. Populating the Trie Tree with three words: “apple”, “banana”, and “appetizer”.
  3. Testing existence of the word “apple”.
  4. Verifying whether the prefix “ap” exists.
  5. Confirming that the nonexistent word “pineapple” doesn’t appear in the Trie Tree.

Overall, the provided code offers a solid foundation for developing custom applications requiring a Trie Tree data structure. Users can easily extend the TrieTree class to support additional functionalities depending on their specific requirements.

Using the trie tree defined above

To split a text filename into separate strings and construct a Trie Tree from those strings, follow these steps:

  1. Read the text filename and load its contents into a variable.
  2. Split the contents into individual lines.
  3. Combine consecutive whitespaces into a single space.
  4. Tokenize each line into words.
  5. Construct the Trie Tree from the extracted words.

Here’s a sample implementation in Python:

import re
from trie_tree import TrieTree

def tokenize(lines):
"""Tokenize each line into words."""
tokens = []
for line in lines:
words = re.findall(r'\w+', line)
tokens.extend(words)
return tokens

def normalize_whitespace(input_string):
"""Combine consecutive whitespaces into a single space."""
output_string = ""
prev_char = None
for curr_char in input_string:
if curr_char.isspace():
if prev_char != " ":
output_string += " "
elif curr_char.strip():
output_string += curr_char
prev_char = curr_char
return output_string

def build_trie_tree(file_path):
"""Build a Trie Tree from a text filename."""
with open(file_path, 'r') as file:
raw_content = file.read()

normalized_content = normalize_whitespace(raw_content)
lines = normalized_content.split("\n")
words = tokenize(lines)

trie = TrieTree()
for word in words:
trie.insert(word)

return trie

if __name__ == "__main__":
trie = build_trie_tree('example.txt')
# Use the created trie
print("Search apple: ", trie.search("apple"))
print("Search python: ", trie.search("python"))
print("Start with: app", trie.starts_with("app"))
print("Start with: py", trie.starts_with("py"))

Replace example.txt with your own text filename. The build_trie_tree function reads the file, processes the contents, extracts the words, and finally constructs a Trie Tree from the words.

Make sure you have already defined the TrieTree class from the previous example before running this code. After executing this code, you can work with the generated Trie Tree in the trie variable.

Sample Text File (example.txt)

the quick brown fox jumps over the lazy dog
a dusty old book lies on the shelf
the president gave a speech to the crowd
a large apple fell from the high tree
the children played in the park all day
the sun shone brightly on the summer day
the cat chased the mouse through the house
the train sped down the tracks
the boat sailed across the ocean
the plane flew high above the clouds
the stars twinkled in the night sky
the moon cast a silvery glow over the land
the fire crackled in the fireplace
the wind whistled through the trees
the rain poured down from the sky
the snow fell softly on the ground
the flowers bloomed in the spring
the birds sang their songs
the bees buzzed around the hive
the butterflies fluttered among the flowers
the fish swam in the pond
the frogs croaked in the marsh
the crickets chirped in the night
the owls hooted in the trees
the deer grazed in the meadow
the bears hibernated in their caves
the squirrels gathered nuts for the winter
the chipmunks hid their acorns in the ground
the birds flew south for the winter
the leaves changed color and fell from the trees
the days grew shorter and the nights grew longer
the air grew colder and the wind grew stronger
the first snowflakes fell on the ground
the children bundled up in their coats and hats
they went outside to play in the snow
they built a snowman and had a snowball fight
they laughed and played until they were tired
then they went inside to warm up by the fire
they drank hot chocolate and ate cookies
they told stories and sang songs
they were happy and cozy inside on a cold winter day

A Real Application

Read More

My YouTube Channel

More shell script videos and linux tutorials on my YouTube Channel.

--

--