The murder mystery of levenshtein algorithm

John Liu
6 min readMar 24, 2017

--

This appears to be an article on programming and computer algorithms. But it is actually a murder mystery.

Chapter 1 — the Algorithm

A good friend was doing an assignment with a levenshtein implementation in JavaScript. I didn’t realize he was studying programming again. And it was many years since I’ve seen that algorithm.

There is a saying about how no one ever invents new things now in computer science. I certainly think that’s not true. But curiosity did get the best of me and I went looking for the ‘best levenshtein’ algorithm.

https://www.npmjs.com/search?q=levenshtein*&page=1&ranking=popularity

Unlike in academia, in the real world, the way we implement levenshtein algorithm is thus:

> npm install fast-levenshtein

var levenshtein = require(‘fast-levenshtein’);

Students, welcome to the real world. It takes less time to have an algorithm at hand than writing up this story.

Now, this fast-levenshtein is a work of wonder.

https://www.npmjs.com/package/fast-levenshtein

It has 7.9 million downloads in one month.

Not only is it fast — beating the pants off other algorithms, it handles Unicode for breakfast, supports huge datasets and supports locale-sensitive string comparisons.

The Chinese says “I love you“, ”I call you” — distance 1

I’m surprised it even has two open issues — how can anyone have an issue with perfection?

But you are probably stuck at the 7.9 million downloads number. Just who needs a string-distance calculation algorithm that badly in the real world?

That’s a rhetorical question.

Students do. For their assignments.

Chapter 2 — the Assignment

Lets put ourselves in the shoes of the student for one chapter.

Student developers are no different to working developers. You have to juggle work - life balance. You need the beers to kill off your weak brain cells so the smart ones remain. You procrastinate. You said yes to too many assignments.

Now you are in trouble. The assignment is due. The client wants a demo on Monday. If you make him look bad, you don’t have budget for the next iteration.

So with 2 days left on your assignment, you reached the same conclusion as everybody else — let’s download this library from the Internet.

And holy smokes it is the bestest algorithm since the beginning of time. Nothing you did passes even half the unit tests that the professor gave you.

This fast-levenshtein completed them all before you hit the enter key to run the script.

The rapidly evolving JavaScript ecosystem hosted in the open on GitHub has ensured this algorithm will beat anything a human has ever written even if we go back to assembly and machine code.

You, have found the optimal algorithm, the best thing to solve your current problem. Congratulations for being really clever.

At work, we need more people like you.

You hit the submit assignment button. That’s one down, another 4 assignments to go.

Like your assignment, this chapter is also short.

Chapter 3 — The Spirit of the Algorithm

Now for this chapter, we need to take on the shoes not of the student. But of the client. Why does the professor want you to write this algorithm?

The algorithm compares two strings. How do we turn “kitten” to “sitting”? Change ‘k’ to ‘s’, change ‘e’ to ‘i’, add ‘g’ = 3.

But the strings may not all start at the same character — how do we turn the ‘dead’ into ‘undead’? Add ‘u’ and ’n’ = 2.

See human brains can do the calculation very quickly, but if you were to put it to a computer to do, it’s tricky. Do you swap, or do you insert? When do you iterate the index?

It’s a gimmick string algorithm, as humans we can understand it easily and it’s fun.

So the algorithm is interesting, it has plenty of recursion while taking the minimum steps. And while recursion is the easiest way to write the algorithm, if you increase the dataset text length the performance of the algorithm will absolutely crash through the floor while taking all your memory with it.

So you start to consider multi-threaded solutions or even dynamic programming solutions — there are loads of interesting solutions to this algorithm.

What’s easy to miss, or disguise, is what you can use the algorithm for. It turns out, the algorithm is not all gimmick — it has one critical real world use case.

The algorithm compares two strings. If you have a small number, that means there’s few changes you need to turn A into B. The strings can be the size of a paragraph or a book.

The algorithm compares two string. If you have a small number…

The algorithm detects plagiarism.

Finale

The student submits the assignment. The preconfigured unit tests runs… And is successful. The assignment is accepted.

The assignment deadline passes. And the professor runs the real, bigger set of unit tests.

Many student’s submissions will fail. Memory problems will take them out with a stack overflow. Locale problems. Unicode. Many solutions aren’t optimal and will not finish within the allocated time — for all the unit test know, it might have gone into an infinite loop.

Some student’s submissions will be successful. They will fly through the unit tests to the final boss…

And imagine the stage for the final battle.

The student’s algorithm, will be used to compare the student’s own code against that from all the well known npm repositories and each other, including previous years which the professor has conveniently recorded over time.

And it is an absolute bloodbath. We can’t look away. I’m certain, nor could the professor.

These algorithms are all returning minimum changes — in a sick, twisted, recursive and self-fulfilling way, the algorithms have entraped themselves and are found guilty.

If you didn’t understand the Spirit of the Algorithm. The best, optimal solution to the assignment, is also the direct Fail mark of the assignment.

There are some genuine gems that survived the algorithm bloodbath of 2017. Their reward is the halls of glory — where they sit with the winning student algorithms of 2016, 2015… all the way to 1965’s original algorithm.

The professor hands out the winning scores to students that has showed real genius in the field of algorithms, and silently incorporates the student’s algorithm into the test for next year.

There these algorithms will sit. Until they are called on in 2018, as part of the evolutionary programming algorithms to absolutely murder the 2018 entries in the bloodbath of years to come.

It is the The Matrix’s reset cycles. The Game of Thrones of our times, the Hunger Games of the ages. And it will continue as long as there are students and assignments.

To my friend

Who wrote his own algorithm. This short story is for you — hope you get a good score!

--

--