BK Trees: An Unexplored Data Structure

Aditya Singh · Future Vision · Mar 25, 2020 · 3 min read
Figure: basic BK Tree structure. BK Trees were proposed in 1973 by Burkhard and Keller, hence the name.

BK Trees are data structures used to perform fast approximate string matching, most commonly for spell checking. They were first proposed by Burkhard and Keller in 1973. The structure relies on two very basic concepts to improve the time complexity of various string matching operations. These concepts are:

  1. Triangular Inequality
  2. Levenshtein Distance

We will discuss both of these concepts in detail and then code a simple BK Tree.

Triangular Inequality

This is a very simple mathematical concept which states that:

The sum of any two sides of a triangle is always greater than the third side.

We won't be going into its proof; the diagram below is self-explanatory.

Figure: the triangular inequality, i.e. why the sum of any two sides is always greater than the third side.
  • d(a,b) denotes the distance between points a and b
  • d(x,y) = 0 <-> x = y (the distance between x and y is 0 if and only if x = y)
  • d(x,y) = d(y,x) (the distance from x to y is the same as the distance from y to x)
  • d(x,z) + d(y,z) >= d(x,y) (the triangular inequality itself)

Mark any point on a line. Now imagine that you can pull that point off the line. Keep pulling it without shrinking the original line. One can easily observe that the sum of the distances from the new point to the two ends of the original line is greater than the original length of the line.
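To make the picture concrete, here is a quick numeric check using ordinary 2D points and Euclidean distance (the coordinates are arbitrary, chosen only for illustration):

import math

x, y = (0, 0), (6, 0)                    # the two ends of the original line, length 6
z = (3, 4)                               # the point after "pulling" it off the line

print(math.dist(x, y))                   # 6.0
print(math.dist(x, z) + math.dist(z, y)) # 5.0 + 5.0 = 10.0 > 6.0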

Levenshtein Distance

It is a string metric used to measure the difference between two sequences. Informally speaking, it is the minimum number of single-character edits (insertions, deletions and substitutions) one has to make so that both strings become equal. It is also referred to as edit distance.
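For reference, a minimal dynamic-programming implementation in Python might look like this (the function name levenshtein is my own choice):

def levenshtein(a, b):
    # dp[i][j] = edit distance between the first i characters of a
    # and the first j characters of b
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j                     # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(levenshtein("book", "rook"))   # 1
print(levenshtein("book", "nooks"))  # 2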

Algorithm and Code

To build a BK Tree one needs a set of words, also called a dictionary.

Each node of the tree will contain a word. Each edge carries the Levenshtein distance between the words in the nodes it connects. For a particular Levenshtein distance there can be only one edge, i.e. only one child. If there exists another word with the same edit distance from a node, we move it further down into the subtree of that child.

To build the tree, choose any word; it will form the root of your tree. From here, the construction of a BK Tree is simple: each node has an arbitrary number of children, and each edge is numbered with a Levenshtein distance. Every child reached through the edge numbered n has a Levenshtein distance of exactly n to its parent node. So, for example, if we have a tree with root node “book” and two child nodes “rook” and “nooks”, the edge from “book” to “rook” is numbered 1, and the edge from “book” to “nooks” is numbered 2.
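A minimal node representation for that example might look like this (a sketch; the class and attribute names are my own choices, and the comments reuse the levenshtein function sketched above):

class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}            # maps edit distance -> child BKNode

root = BKNode("book")
root.children[1] = BKNode("rook")     # levenshtein("book", "rook") == 1
root.children[2] = BKNode("nooks")    # levenshtein("book", "nooks") == 2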

To add a word, find its edit distance to the root word. Move down the edge whose weight is equal to that edit distance. Keep recursing like this till you reach a node that has no child on the required edge; create a new child node there to store your word. Do this for all the words in your dictionary.
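A sketch of that insertion step in Python, reusing the BKNode class and levenshtein function above (the sample dictionary is my own; the loop is written iteratively rather than recursively, which is equivalent here):

def add(root, word):
    node = root
    while True:
        d = levenshtein(word, node.word)
        if d == 0:
            return                           # word is already in the tree
        child = node.children.get(d)
        if child is None:
            node.children[d] = BKNode(word)  # empty slot: store the word here
            return
        node = child                         # slot taken: descend into that subtree

# Build a tree from a small dictionary; the first word becomes the root.
dictionary = ["book", "rook", "nooks", "boon", "cake"]
root = BKNode(dictionary[0])
for w in dictionary[1:]:
    add(root, w)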

Now, to match a string against our tree, we need a threshold value (call it THRS) which tells us the maximum deviation we can allow from our query string. To query the tree, take the Levenshtein distance from your query to the root word, and recursively query every child node whose edge number lies between

Levenshtein(root word, query) - THRS and Levenshtein(root word, query) + THRS (inclusive)

If the node you are examining is within the threshold Levenshtein distance of your search term, add it to your results and continue the query. Keep collecting words this way, and in the end you will have your list of matching words. A basic implementation of the above data structure and its use in a simple scenario follows.
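One possible minimal version of such a query in Python, building on the levenshtein, BKNode and add helpers sketched above (the function name search and the sample threshold are my own choices):

def search(node, query, threshold, results=None):
    if results is None:
        results = []
    d = levenshtein(query, node.word)
    if d <= threshold:
        results.append(node.word)            # this word is close enough
    # The triangular inequality guarantees that only children whose edge
    # weight lies in [d - threshold, d + threshold] can contain matches.
    for edge, child in node.children.items():
        if d - threshold <= edge <= d + threshold:
            search(child, query, threshold, results)
    return results

print(search(root, "bok", 1))   # ['book'] with the sample dictionary above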

Potential use cases of BK Trees

  • They can be used to find near matches of a string.
  • Spell checking.
  • An auto-correct feature can be implemented with the help of BK Trees.
