Reds Internals: Searching and better searching with Node.js and Redis

No, I didn’t forget a letter.

This is part 16 of my Node / Redis series. The previous part was Redis and Node.JS Slug out: In-process variables and remotely stored data.

Reds is a Node module that implements a Redis-based search engine. It uses a number of natural language processing algorithms to provide more than an exact-match search. It was originally written by TJ Holowaychuk a few years ago and, in typical TJ style, the module is quick, not overly complicated (just about 360 lines with JSDoc comments!) and works great.

Using Reds couldn’t be simpler. You first have to create a search with a namespace:

search = reds.createSearch('my-search-namespace');

Then you index your items:

search.index('This is my string of text. The text can be of long or short length.', 'id-to-indexed-thing');
search.index('This is my other text string.', 'id-to-other-indexed-thing');

Finally, you can query your items:

search.query('string').end(function(err, ids){ … });

How does this work internally?

Before we get into the nitty-gritty of how Reds stores things in Redis, we should first understand the processing that is applied to both the queries and the indexed items.

Let’s take two strings as examples:

This is my string of text. The text can be of long or short length.
This is my other text string.

The first thing that happens is splitting the string into words with a simple regexp. Then the module removes any “stop words.” Stop words are insignificant words in a search string that likely won’t matter: “a,” “in,” “of,” etc. After stripping the stop words, each word is stemmed; that is, the words are reduced to their simplest form by removing plurals, past tense, and so on. This produces an array of words that are fairly recognizable. Our two strings look like this after processing:

[ 'thi', 'string', 'text', 'the', 'text', 'long', 'short', 'length' ] // 'This is my string of text. The text can be of long or short length.'
[ 'thi', 'text', 'string' ] // 'This is my other text string.'
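A minimal sketch of these first two steps in plain JavaScript (illustrative only: the stop-word list and one-line stemmer below are crude stand-ins for the real stop words and Porter stemmer that the module uses):

```javascript
// Illustrative sketch of reds' pre-processing. The stop-word list and the
// toy "stemmer" here are stand-ins for the real ones.
const STOP_WORDS = new Set(['is', 'my', 'of', 'can', 'be', 'or', 'other']);

// Toy stemmer: strip a trailing "s". The real Porter stemmer also reduces
// "this" to "thi", which is why that odd-looking token appears above.
function stem(word) {
  return word.endsWith('s') ? word.slice(0, -1) : word;
}

function preprocess(str) {
  return str
    .toLowerCase()
    .split(/\W+/)                          // split into words
    .filter(w => w && !STOP_WORDS.has(w))  // drop empties and stop words
    .map(stem);
}

console.log(preprocess('This is my other text string.'));
// → [ 'thi', 'text', 'string' ]
```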

Next the words are converted to metaphone strings. Metaphones represent how the words “sound.” It’s a fairly complicated phonetic algorithm, but think of it as doing things like:

- removing silent E’s
- converting hard C’s to K’s
- simplifying diphthongs

After this conversion, the strings look less and less recognizable.

{ thi: '0',
string: 'STRNK',
text: 'TKST',
the: '0',
long: 'LNK',
short: 'SHRT',
length: 'LNK0' } // 'This is my string of text. The text can be of long or short length.'

{ thi: '0', text: 'TKST', string: 'STRNK' } // 'This is my other text string.'

Finally, the words are counted, so you end up with data that looks like this:

{ thi: 1, string: 1, text: 2, the: 1, long: 1, short: 1, length: 1 } // 'This is my string of text. The text can be of long or short length.'
{ thi: 1, text: 1, string: 1 } //'This is my other text string.'
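The counting step itself is just a frequency tally; a sketch:

```javascript
// Tally how many times each processed word appears; these counts become the
// scores in the Redis sorted sets.
function countWords(words) {
  const counts = {};
  for (const w of words) {
    counts[w] = (counts[w] || 0) + 1;
  }
  return counts;
}

console.log(countWords(['thi', 'text', 'string']));
// → { thi: 1, text: 1, string: 1 }
```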

Now we’ll insert everything into Redis in a couple of different ways.

First, we’ll index each ID with the metaphones that it contains:

> zrange my-search-namespace:object:id-to-indexed-thing 0 -1 WITHSCORES
1) "0"
2) "1"
3) "LNK"
4) "1"
5) "LNK0"
6) "1"
7) "SHRT"
8) "1"
9) "STRNK"
10) "1"
11) "TKST"
12) "2"

The fields (odd-numbered lines in the result above) are the metaphones and the values (even-numbered lines) are the number of times each metaphone appears.

Each metaphone also gets its own index, so TKST (the metaphone string for “text”) is indexed like this:

> zrange my-search-namespace:word:TKST 0 -1 WITHSCORES
1) "id-to-other-indexed-thing"
2) "1"
3) "id-to-indexed-thing"
4) "2"

The fields are the ids and the values are the number of times the metaphone is used in the indexed string.
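At index time, the writes that build these two sorted sets are roughly equivalent to issuing a pair of ZADDs per metaphone (a sketch for TKST; the same pattern repeats for the other metaphones, and the writes are batched):

```
ZADD my-search-namespace:object:id-to-indexed-thing 2 TKST
ZADD my-search-namespace:word:TKST 2 id-to-indexed-thing
```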

Searching is accomplished with a ZINTERSTORE of the relevant ‘word’ sets into a temp key. The query string is run through roughly the same process as indexing: the words are stopped, stemmed, and metaphoned, and the remaining metaphones make up the keys for searching. So, if you are searching for ‘short’ and ‘text’, the ZINTERSTORE looks like this:

ZINTERSTORE my-search-namespacetmpkey 2 my-search-namespace:word:SHRT my-search-namespace:word:TKST

This stored key ('my-search-namespacetmpkey') is the result set: we’ll fetch it with ZREVRANGE (because we want the most relevant results first) and finally delete the temp key (Reds uses ZREMRANGEBYRANK to remove the results).
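To make the scoring concrete, here’s a toy in-memory model of what the ZINTERSTORE + ZREVRANGE pair computes, using the scores from the ZRANGE listings above (by default Redis sums the scores of intersecting members):

```javascript
// Toy model of ZINTERSTORE (default SUM aggregation) followed by ZREVRANGE:
// keep only ids present in every word set, sum their scores, sort descending.
function intersectAndRank(...wordSets) {
  const merged = {};
  for (const id of Object.keys(wordSets[0])) {
    if (wordSets.every(set => id in set)) {
      merged[id] = wordSets.reduce((sum, set) => sum + set[id], 0);
    }
  }
  return Object.keys(merged).sort((a, b) => merged[b] - merged[a]);
}

// Scores taken from the sorted sets shown earlier.
const SHRT = { 'id-to-indexed-thing': 1 };
const TKST = { 'id-to-other-indexed-thing': 1, 'id-to-indexed-thing': 2 };

console.log(intersectAndRank(SHRT, TKST));
// → [ 'id-to-indexed-thing' ]
```

Only `id-to-indexed-thing` survives, because it is the only id that appears in both word sets.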

So, that’s how Reds works.

What now?

TJ exited the Node scene over a year ago. It was a big loss to the Node community. I contacted him at that point and he made me a collaborator on the Reds project. Since then, I’ve been answering questions and reviewing PRs on the project and mainly keeping it going in the same direction. I’ve been quite conservative on the project for a couple of reasons:

1) While the number of depending packages is low, it’s likely being included in end-user projects, judging by the download numbers.
2) It’s pretty good as-is — the code is clean, the performance is good. What do they say about fixing not-broke things?

Keeping point #1 in mind, I’ve resisted anything that greatly changes the way indexing is done; I wouldn’t want someone who has this in production to upgrade to the latest version only to find out that their existing index is no longer valid. With point #2 in mind, you can monkey patch the project very easily: if you don’t like the way something works, just change a function.

That being said, the project has a number of weaknesses that need to be addressed. First, it is set up for the English language only. The stemmer, stop words and metaphone are all English-specific; running other languages through it produces downright weird results. Also, the search results are based purely on the frequency of words; extending the project to weigh other factors is possible but clunky at the moment. Finally, I’ve gotten a PR that wants to add significant functionality (partial word searching!), but it also increases the size of the data stored in Redis by quite a lot. It would be great for some users but likely not for all.

This has led me to the conclusion that Reds should be refactored to accommodate a plug-in-like architecture. Out of the box, the module should continue on as normal (your index shouldn’t be affected), but if you are starting with a greenfield project, you can apply whatever works best for you. This is cleaner than monkey patching: it allows a more standardized interface and the sharing of plug-ins.

Pluggable parts

Three logical parts should be pluggable: the language processing, how data is written to the index and how data is removed from the index. Notice that I didn’t include the actual query: I feel that the way this module queries the data is very flexible and is the ‘heart’ of the project. You can make huge changes to the language processing, indexing and de-indexing but still keep the same query logic.

In my proposed version, the language processing is all contained in a single function, nlpProcess. You pass in a string (and optionally a key) and get back an object with the arrays needed for indexing. The optional key also gives you the metaphone keys needed for querying.

Indexing and de-indexing have their own functions which manage writing data to and removing data from Redis, respectively. Most of the time you’ll need to pair these functions, because de-indexing needs to clean up whatever was originally indexed.

Specifying these plugins is done with an optional configuration object on the reds.createSearch function. So, to use your own language-processing function you would do something like this:

reds.createSearch('my-search-namespace', { nlpProcess : myNlpProcessFn });

I’m curious to know what people think — you can check out my proposed version at GitHub.