Hamming-Neighbourhood Distance

8 min read1 day ago

Foreward

We all knew what Richard Hamming did, but we all forgot it. A quick reminder would be sufficient or, perhaps a memory trigger might do it. The Hamming Distance was his creation. An algorithm that could check the similarity of two strings by tinkering with its bits and finding ones that were different. But despite it’s marvelous O(n) performance, it had one major drawback, it just could not compare two strings with different lengths.

but what if it could….

Introducing HN-Distance

Hamming-Neighbourhood distance or HN-Distance for short, is a smaller, equally faster algorithm (metric) to check the magnitude of the difference between two strings. For example, let’s take two random words “Kitten” and “Sitting”, their HN-Distance will be 3. It’s basically saying that “to transform the word Kitten into Sitting, 3 maximum efficient changes would be required” and this is actually intuitive. Let’s think of this as sets. If Set A contains K, I, T, T, E,N and Set B contains S, I, T, T, I, N, G.

So (A∪B) — (A∩B) are the characters that are uncommon. Mathematically:

n((A∪B) — (A∩B)) = difference of strings

So, simply talking, HN-Difference uses the set theory to detect the difference between two strings. If none of this made sense, don’t worry this is just the introduction.

Underlying Workings (Mathematical foundations)

Before actually understanding how it works, let’s first understand the what it works on.

Taking the example from before, we have the following sets:

While this is a good analogy, it is not exactly how HND works

Now we wanted to find the characters that are distinct i.e uncommon between the two sets. A simple way would be to first find the common elements and the union of the two sets, then subtract the common elements (intersection) from the union. This will give us the elements that are different from each other:

(A∪B) — (A∩B) (find the common elements and remove them from the total characters)

Let’s compute this step by step:

{K, I, T, E, N}∪{ S, I, T, N, G } = { K, E, T I N, S, G}

Next, let’s compute the intersection:

{ K, I, T, E, N }∩{S, I, T, N, G} = {T, I, N} (intersection)

Now let’s find the difference between these two resultant sets:

{ K, E, T, I, N, S, G } — {T, I, N} = {K, E, S, G} (this is no longer an english word)

Do you see what we get now? We are getting the unique characters, i.e the characters that are different, that make the strings unique. If we count the number of characters in the resultant set, we can get the Hamming Neighbourhood Distance which, in this case is 4. It is crucial to remember that the magnitude of HN-Distance tells you how many characters are present in each set (string) that make it unique. It is a mere set operation.

But this purely mathematical function has it’s flaws. For example, it can’t consider repeating characters. Why? Because sets are described as a collection of distinct objects, so that means that Set A cannot contain two T’s like it is supposed to (you spell kitten with two t’s), and while this seems OK for this case, suppose the second string does not have two T’s that can be factored out. Perhaps the 2nd string is “yummy”. Then, we will have serious problems, the HN-Distance that we will obtain at the end, will be inaccurate and wrong. We will see later that we need to consider repeating characters so that we get accurate steps for transformation. Next, we’ll look at the proper, functional implementation that is free from this problem.

Proper implementation (Computational foundations)

Let us now get to the real stuff, the actual implementation that is NOT centred around abstract sets. Real HN-Distance follows the following steps:

Split each string into an array
Convert each character to an ordinal point
Loop for n steps such that n = the length of the longer string (this is the part that allows HN-Distance to process longer strings, unlike H-Distance)
Check the difference between the ordinal points.

The above step is a bit vague, let’s see an example where string A is “kit” and string b is “sit”. After the strings have been converted into ordinal point arrays, we will have [107,105,116] and [115,105,116] respectively. So when we minus these arrays we will have: [-8, 0, 0] which is the delta array (the delta array is the unicode difference between the two strings, we will discuss this in more detail)

After getting the delta array (value of which is [-8, 0, 0]), remove all elements that are 0s. (This is an optional step, as we will see, HN-Distance implementations take a jmp threshold which decides whether or not to eliminate distances)

And that’s it! Now we must address the elephant(s) in the room, the terminology and the implementation.

Firstly the ordinal point. You see, unicode is a table or an international encoding for every single character known to computers and is a replacement for the older ASCII table. That being said, every character has it’s corresponding unicode value and that value (expressed as an integer) is it’s ordinal point. For example, for the letter “k”, it is U+006B which, as an integer, is 107. So basically ordinal point refers to the integer value we obtain after converting the unicode of a character to it’s corresponding base-10 (decimal) form.

Next, the unicode array. It is just an array of ordinal points.

Delta arrays sure sound fancy but it’s just a term for an array that contains the distances between each unicode equivalent value. For example if we were to create a table for unicode integer values from a-z, we would have “k” at 107 and “s” at 115 (a-z would not be a perfect incremental series), so the distance between them would be 8 (or -8, depending on how you look at things), so what the delta array really tells you is how to convert string A into string B. For the kit and sit case, we know that we would need to need to add 8 to the unicode equivalent of the first character to completely transform kit into sit. And that is actually intuitive. Think of it like this, you move 8 steps down the block to get to “s”.

If you change the ordinal point of a character, you will indeed change the identity of that character.

Hopefully you understand the foundations properly. Remember, understanding it will require some thinking. Nothing comes without a cost.

Practical Implementation

When it comes to learning something, examples are good, practical usage is better but learning to make it is just amazing. So we’ll do all three here.

For simplicity and comprehension, I’ll chose python and you can view the full implementation here https://gist.github.com/Hammad-hab/5b4fbb9f8b4eeda3ad9de7c1dce6b6a6

Let’s start from where everything computer-related once started, a blank file;

""""This file will one have a function one day. I'm main.py btw""""

and then let us define a function called “hn_distance” that takes in two strings:

def hn_distance(x1: str, x2:str):
    pass

As said in the steps above, we will first convert each string to an array and then convert each element into it’s unicode equivalent. Luckily, python has a handy ord function that does the heavy lifting for us:

def hn_distance(x1: str, x2:str):
    a = [ord(x) for x in x1]
    b = [ord(x) for x in x2]
    
    z = [] # delta array

Great! Now we have a function that converts each string to it’s unicode equivalent array and initalises an empty array (the delta array). The next step is not so easy. First we have to find the longer string, which is not much of a problem:

def hn_distance(x1: str, x2:str):
    a = [ord(x) for x in x1]
    b = [ord(x) for x in x2]
    
    z = [] # delta array
    ln = max(len(x1), len(x2)) # Give the longer string's length for looping

Secondly, we have to loop for ln steps (ln is the length of the longer string) and check the difference between them (i.e x2[i]-x1[i]). But we have to be cautious not to out run either string. So we’ll also check if the read point is within both strings:

def hn_distance(x1: str, x2:str):
    a = [ord(x) for x in x1]
    b = [ord(x) for x in x2]
    
    z = [] # delta array
    ln = max(len(x1), len(x2)) # Give the longer string's length for looping
    for x in range(ln):
        pass

Here is the general description for how we’ll do this:

if the observation point exceeds the length of string A, start pushing direct B ordinals codes to delta array, otherwise if the observation exceeds the length of B, start pushing direct A ordinals to the delta array. If the observation point does not exceed A nor B, push their difference to the delta array

def hn_distance(x1: str, x2:str):
    a = [ord(x) for x in x1]
    b = [ord(x) for x in x2]
    
    z = [] # delta array
    ln = max(len(x1), len(x2)) # Give the longer string's length for looping
    for x in range(ln):
        # if the observation point exceeds the length of string A, start pushing direct B ASCII codes to delta array
        if x > (len(a) - 1):
            z.append(b[x])
        # if the observation point exceeds the length of string B, start pushing direct A ASCII codes to delta array
        elif x > (len(b) - 1):
            z.append(a[x])
        else:
            # For now the observation point is within both strings, record difference in delta array
            z.append(b[x] - a[x])

Nearly there!, all we need to do now is return the delta array:

def hn_distance(x1: str, x2:str):
    a = [ord(x) for x in x1]
    b = [ord(x) for x in x2]
    
    z = [] # delta array
    ln = max(len(x1), len(x2)) # Give the longer string's length for looping
    for x in range(ln):
        # if the observation point exceeds the length of string A, start pushing direct B ASCII codes to delta array
        if x > (len(a) - 1):
            z.append(b[x])
        # if the observation point exceeds the length of string B, start pushing direct A ASCII codes to delta array
        elif x > (len(b) - 1):
            z.append(a[x])
        else:
            # For now the observation point is within both strings, record difference in delta array
            z.append(b[x] - a[x])
    return z

Done! Now let’s try our function with two random strings “otter” and “jam”:

x = hn_distance("otter", "jam")
print(x)

the result we get is [-5, -19, -7, 101, 114]

What do these numbers tell us? Well, that’s simple! It says that you need to add these numbers to the ordinal points of the second string in order to transform it into the first string. That being said, we add -5 to the unicode of the first character of the second string (“j”), then add -19 to the unicode of the second character of the second string (“a”) and so on (you get the point).

After performing this rather tedious operation, we transform string B into string A, therefore doing justice to the function of this algorithm.

Postword

So this was it, the Hamming Neighbourhood distance, an algoritm for efficient string comparision and transformation. Why will it be particularly useful? For efficient plain text searches. There are a lot of algorithms for checking the similarity of strings but they can be slow without optimization and do not tell you HOW to transform the string.

In conclusion you should use HN-Distance if you need to make these comparisions very often and want to spare yourself the time for optimizing it or need to know how to tranform them.

Following are some practical implications of HN-Distance:

Plain text searches (As mentioned before)
Keyword based search
Spelling checks (and transformation) etc

References

HN-Distance gist