Creating My Own String Diff Algorithm + JSFiddle

I’ll show you how it’s done, without confusing code snippets.

Here’s the demo JSFiddle (best viewed on a desktop computer)

Recently I’ve been captivated by the beauty and the challenge of programming algorithms. The first challenge (unrelated to the diff algorithm) that intrigued me was finding the “longest common subsequence” of two strings. Think for a minute how you might solve that. Believe it or not, it only takes 22 lines of JavaScript to solve it! Anyway, onto the thing you came here for: the diff algorithm.

From this webpage of beginner to advanced programming challenges, here was the prompt:

Given two strings, write a program that outputs the shortest sequence of character insertions and deletions that turn one string into the other.

Before you read on, give this a thought or two. See what you come up with and comment your ideas below! After a few hours of tossing ideas around in my head, I came to a small realization. If I could sort of “map” the target word (let’s call it buggy) onto the source word (let’s call it chug), this might lead me to the answer. I know that sounds abstract at the moment, but bear with me.

After more thought, I found a way to express this idea. My code would have to answer this simple question:

How many sequential characters in the target word (buggy) can we find in the source word (chug)?
“junk” added to stress the point. You can ignore it.

In this case, the answer is 2: u and g (see above). This is the core part of the algorithm. The rest of the algorithm just says “look around and between the characters that match. If it’s in chug, delete it. If it’s in buggy, add it.” (see below)

I told myself, yeah this seems doable, and started coding. Then I quickly hit a roadblock. If I looped through buggy train (the target word, see below) only once, it would look for a t after the b in boy, but it’d find nothing:

Clearly in this case, the best fit is to line up train (instead of b) in both words and go from there. Turns out my algorithm wasn’t thorough enough. Perhaps by looping over buggy train more than once, we could fix this problem.

What if we told it that after looping over buggy train once, it should store the following data: we found a b at the [o] index in buggy train that matches the b at the [6] index in train boy. Then we can tell it to cross off the b and pretend the word is uggy train. Now we tell it to repeat what it did before, store the matches once again, and cross off the u. Now the string should be ggy train. Keep repeating these steps until every letter is crossed off.

With this approach, we only need 2 iterations before we find the word train!

Notice how I labelled the characters with their index numbers. This is the data we store each time we loop through the target word. After we’ve completed all of our loops, we should have a data structure like this:

Now we simply pick the longest row of matches! But wait, something is off. The last row only has one match. It could easily have more:

Our matches are incomplete. To complete them, we need to backtrace a bit. For each row of matches, let’s repeat what we did before but in reverse:

“junk” added to stress the point. You can ignore it.

Finally. Now let’s pick the longest set of matches (i.e. train)— wait, how do we know the longest set of matches will lead to this answer?

Given two strings, write a program that outputs the shortest sequence of character insertions and deletions that turn one string into the other.

Think about it for a minute. If we found 5 matches in train boy and buggy train, we will ALWAYS need exactly 4 deletions and 6 insertions to transform train boy into buggy train. Thus, if we have 4 matches, we will ALWAYS need 5 deletions and 7 insertions to transform one into the other. So more matches is ALWAYS better.

Now we have the data to create something like this:

From here it’s simply a matter of squishing it together and voila!

We have a diff!

I would talk about optimizing this diff algorithm… though I figured that only those who plan to create their own diff algorithm would like to hear me talk about that. But for those who are curious, you can take a peek at my code.

If you want more articles like these, please let me know by dropping a comment and/or recommending it!

If you’d like to know what I do, I’m currently a Digital Media & Design student at the University of Connecticut, but most of my graphic design and programming skills are self-taught. I created this article out of a desire to make programming as fun for you as it is for me. If you’d like to connect with me, you can check out my website or find me on many social platforms by the username krabbypattified. Code away!!

P.S. No one I know makes coding more fun than this guy. Even if you’re not a programmer — ahem — especially if you’re not a programmer, it’d be well worth your time to check out his YouTube videos. Here’s a video for people who already have experience with JavaScript, and here’s a great playlist for those who don’t.