Implementing a Fuzzy Search Algorithm for The Debuginator
I’m working on a generic debug menu, mainly intended for games and other real-time applications: The Debuginator. It’s written as a Public Domain C99 single-header-library, and if you’re interested, it’s available here:
the-debuginator - A juicy feature-packed debug menu intended for games.github.com
Here’s what it currently looks like with a default implementation and theme:
One of the things it needs to do is to allow the user to filter, or search, among the items to find the one she’s looking for.
Some background first. The UI design for the menu is based on the debug menu I made over the course of a few hack-days at Fatshark for Warhammer: End Times — Vermintide. By the time the game was released, we had a few hundred items in it. Browsing to each item manually became a bit of a hassle, not to mention that there was too much for a user to simply browse through it to discover new cool or useful things.
So I added a search mechanism, and it worked well, but it was pretty basic. For The Debuginator, I wanted something along the lines of the kind of fuzzy search that has become popular among various text editors. I believe Sublime Text started the trend. If you google for how to implement such an algorithm, basically the only resource that comes up is ForrestTheWoods’ Reverse Engineering Sublime Text’s Fuzzy Match, which is great, but, after having implemented this kind of filter myself for The Debuginator, I think another post on the subject could be useful. :)
Before I get into the algorithm I chose, I want to explain the structure of the debug menu. It’s nothing fancy. There’s folders, and (leaf) items. It’s hierarchical, so folders can have folders. Folders can have both folders and items in it, just like a folder in Windows can have both files and folders in it.
Here’s a folder in the root (which happens to also be a folder, but it’s invisible) called Folder. It has two items in it (SimpleBool 1 & 2) and the folder Subfolder. You could say that items are defined by their full path; their title and each of its parents’ titles.
Scoring and Mechanics
There a few things I wanted the fuzzy search to do.
- Ignore case — treat both the search string and the items as lower case. I miiight make it an option in the future but you rarely, in my experience, care about case.
- Disregard order. You don’t need to type “sub” before “bool” to include “Folder/Subfolder/SimpleBool4 with..” in the result. This is important because sometimes you’re not sure if it’s “AI Debug Navmesh” or “Debug AI Navmesh” or “AI Navmesh Debug” (etc.).
- Not require spaces between words in the search filter. It ruins the flow for me when I’m searching. “subbool” should work. Sublime and VS Code does this right, but other tools don’t (case in point: the original debug menu).
- Work on at least 10000 items without a hitch.
- There should be a way to unfuzzy it a bit and enforce a search with consecutive characters.
And of course, it needed to score certain things higher (this is the core of any decent fuzzy search, I guess). In my case:
- It should match on the whole path, but the leaf item’s title should be worth more.
- The beginning of a word. A word is currently defined as the title of a folder or item, split by spaces. Also, when switching from alphabetical letters to numbers counts as a word switch. So “gamebool01” yields a higher score for “Game/Gamebool010” than for “Game/Gamebool100”.
- The end of a word. It’s not quite as intuitive, and it yields a lower score than the beginning of a word, but the idea is to match “1” higher for something like MyItem001 than MyItem010. Maybe I should only do it for numbers?
- The length of the match. Searching for “aa” should score higher for Cars/Saab than Cars/Toyota. (But they should both match, because both have two a’s in them).
- [NEW] The length of the word. Searching for “spawnz” should prefer “AI/Spawn Zombie” over “AI/Spawn Zombie Necromancer”. Without this, there’s no real good way of explicitly finding Spawn Zombie (necromancer will be given same score and may be found first), and with this bonus, if you do want to spawn the necromancer, you can always search for “spawnzn”.
Worth noting is that in contrast to many other fuzzy searches, I don’t want to sort the results based on the score, I simply want to select the best one. The reason for this is partly due to the way the code is structured, but more than that, it is most intuitive way, and honestly, the only reasonable way, for the way the UI of The Debuginator is layed out.
Here’s what it looks like in action:
It’s not overly complicated, but it took me a while to figure out, so I thought it’d be worth giving and overview of it.
To start with, I convert both the filter and all titles to lowercase to get case insensitive search. If the filter string has a space in it, I run the matching in exact mode. More on that later.
In its outermost scope, we loop over all of the items. We’re looking for the best leaf item.
As we’re iterating over each item, we build up a string that contains the current full path. This is basically implemented as a single char array that kind of works like a stack: When we iterate over an item we append its title to the path. If it’s a folder, we add a “/”. When we are done with an item, we pop the top part off the path.
The benefits of this are: Just a single stack allocation, and I get to store indices to where the words start and end (which is later used for scoring), and I don’t have to copy more strings than necessary.
Now to the meat.
How do you figure out how to best match that filter against an item’s full path? How do we select Help, and not Theme, when we type in “hedeb” (HElp DEBuginator)?
The algorithm must test multiple ways to match every part of the filter, and then choose the best ones. Here’s how it does it. See if you can follow along my animation!
(Side note: How do people do these animations? I struggled my way through an online pixel animation editor for this one, but there has to be a better way. Please tell me!)
It starts at the start of the filter, and takes the first character: h. I call this the filter part.
It iterates over the first item’s path, “debuginator/help”. It finds it at index 12, and marks it down as a match of length 1. (I was too lazy to do the full animation so I skipped a few letters :D ) It then expands the filter part to include the next character, “he”. It notices that it still matches, so it changes the match’s length to 2.
After that, it’ll try to expand it again, but fail (hed doesn’t match hel).
After that, it’ll try to find the next match for just “h” (that’s right, it starts from the beginning of the filter part), but there is no more matches starting after the h in Help.
So we’re left with a single match for the “h” in “hedeb”. Scoring is quite arbitrary, but here’s my current formula:
int match_score =
(is_word_break_start * DEBUGINATOR_SCORE_WORD_BREAK_START
+ is_word_break_end * DEBUGINATOR_SCORE_WORD_BREAK_END
+ is_match_in_item_title * DEBUGINATOR_SCORE_ITEM_TITLE_MATCH
+ match_length) * match_length
The scoring weights are exposed in case the user would like to tweak things differently. These are the current values:
#define DEBUGINATOR_SCORE_WORD_BREAK_START 10
#define DEBUGINATOR_SCORE_WORD_BREAK_END 5
#define DEBUGINATOR_SCORE_ITEM_TITLE_MATCH 10
The “is_” variables are simply bools masquerading as ints — zero for false, one for true. In this particular case, the match would get (1*10 + 0*5 + 1*10 + 2) * 2 - 2 = 42 points. This is added to the running total of the item’s score (which of course begins at zero).
path_segment_modifier is a recent addition (see my list of criterias above). It’s value is the binary logarithm of the length of the path segment (part). So for a segment “Debuginator” it’s 3, for “Spawn Zombie” it’s 3, for “Spawn Zombie Necromancer” it’s 4. I may end up revising this to have a little more “resolution”, but it’s a start. We subtract this number because we want to give a lower score the longer the part name is.
After the best match has been selected, a flag is set in an array I call “taken_chars”. The array is initialized to zero for every item. When a character from the path is “taken”, such as h and e, the corresponding index in the array is set to 1. The algorithm will not consider characters which have been previously marked as taken by a previous filter part.
So we’ve now matched “he” to the best of our abilities. We move on to the next filter part. But, you might be thinking:
But hey, couldn’t there be a better match that had the e in it? You’re not checking that!
That’s right, and that’s a simplification that I’m making. I could make a huge tree of all the permutations of ways to match the filter to the search string, but A) that’d be much more complex, and B) it’d take much longer to execute, and C) I’m pretty sure it wouldn’t improve the scoring that much in almost all cases.
By the way, remember how I say that the path is sort of stored like a stack? That means the match_in_item_title is trivial to calculate:
int match_in_item_title = match_index >= path_indices[current_path_index];
The algorithm now advances the filter part pointer to the “d” in “hedeb”, and does the same thing there. It’ll start searching from the beginning, finds a “d”, notices it can add “e” and “b”, and gets the score for that: (10 + 0 + 0 + 3) * 3 - 3 = 39. It continues searching but there are no more “d”s in sight.
The resulting score for the item is 39 + 34 = 73. If no other item gets a better score, this one becomes the “hot”, or selected, item.
Let’s say we tried to match “adr” with an item with the path “AI/Debug/Draw data”.
First list of matches:
The best match for the “a” is the very last character in the path, due to my +5 bonus for occurring at the end of a word and +10 for being in the actual title of the item (13 = 1 for match length, +5 + 10, minus 3 for part length).
For the next character, “d”, the best option is the second one (“dr” matches “draw”).
Total score: 54.
You can see how consecutive letters really pay off. Is it too much? I don’t think so, but only testing will see.
As mentioned earlier, the user can enter exact mode by adding a space to the filter. I felt this was the most intuitive way to do this. Exact mode is kind of like doing a “whole word” search, like you find in any traditional editors.
The underline works as an indicator to show you that you are in a different mode. The exact mode simply ensures that each space-separated word in the filter must exist in the item’s path. However, it does not have to match a complete word.
I haven’t got any exact numbers yet. There is some optimization left to do, but here’s something at least:
In a release build I can search through 250'000 items with no noticeable frame rate hit.
In debug, there is a slight hitch — you can see it in the video above where I search for “boolgae001”, especially when I’m deleting characters, for some reason.
In debug with 10'000 items, again, there’s no sign of any hitch.
Of course this is a bit arbitrary, but it’s good enough that I’m content with the solution for now. It is a debug menu after all. I’ve run the test on my laptop and it’s maybe 50% slower, which is still pretty good.
Make the search work properly with UTF 8. I’m not sure if there’s much I actually need to do — I’ve never implemented UTF-8 in C before. If it ends up requiring too big of a change to the code base, I’d probably just don’t care about it and leave it in it’s ASCII-only state. For a debug menu I think it’s ok. If someone needs it, a good approach might be to fork the project and support it there. Suggestions?
Anyways, currently it doesn’t work, but that could have more to do with how I read input in SDL:
You know, looking at it like this, it seems really easy! And it is pretty easy, now that it’s done. :) It took me a number of attempts to get it right though, not just to get it working but actually figuring out how I wanted it to work in the first place, with sufficient speed.
If you’re going to implement a fuzzy search, hopefully this has been of some use. As I mentioned, the project can be found in the link at the beginning. Just to be clear though, the menu is in an early stage, and even though it almost has all the core features it really needs, there’s still a few things that are unfinished, hard coded, or unoptimized.
Oh, and I should perhaps end with this: I think it works really well. My test cases aren’t that large yet, but so far, when I search for something, the item I want to find is the one that gets found. Maybe this is because I’m the one who’s typing and also the one who wrote the algorithm… We’ll see. :)