How to Speed-Up MongoDB Regex Queries by a Factor of up-to 10

With NoSQL databases, it’s easy to create documents that contain array of items. For instance, imagine a movie database where every document contains a movie title and the cast.

{
title: 'Matrix',
cast: ['Keanu Reeves', 'Carrie-Anne Moss']
}

To query a movie with Carrie-Anne Moss, we’d simplyrun db.movies.find({cast: 'Carrie-Anne Moss'}) to get the matching document back.

Using a Regex for non-exact search queries

Unfortunately, this is not how users would input data in a search field. 
They could enter something like ‘Carrie Moss’ or ‘moss carrie-anne’ and an exact find() query would fall short here.

Regular expressions (regex) provide a way to match strings against a pattern and MongoDB comes with a regex engine built in.

Utilizing regexes the cast search could be implemented with a query like

db.movies.find({
cast: { $elemMatch: {$regex: /Moss/i, $regex: /Carrie-Anne/i}}
});

$elemMatch will return those records, where an array element matches both criterias — in contrast, using a plain $and (which is the default for a list of criterias) without $elemMatch would return movies with ‘Carrie-Anne Moss’ but also those where ‘Sandra Moss’ and ‘Carrie-Anne Fisherare starring together. This would be a superset of the information we want to retrieve.
Also note the ‘i’ that makes the regex case-insensitive. We have to add that, because we can’t rely on your users to use their shift-key as they should.

In your first tests this will work well but as soon as your database and userbase grows, you will find out that those regex queries

  1. consume a lot of CPU time
  2. are extremely slow

Why can’t we just add an Index?

Indexes are the first thing to consider when optimizing query performance with any database. The MongoDB documentation is pretty clear that we are out of luck in this case, because the regex is case-insensitive. And even if we’d create an array with lowercased actors, we couldn’t still not benefit from optimized queries because we can’t use the ^ anchor to mark the beginning of the text. Why? Because ‘Carrie-Anne Moss’ and ‘Moss Carrie Anne’. We simply don’t know how the string we are looking for begins.

So no regular indexes for us. But recent versions of MongoDB support text indexes as well.
Text indexes let you perform search queries in arbitrary strings. This should be exactly what’s needed for our cast query.

Text Indexes will Safe Us

Well — it’s not that easy. Text indexes in MongoDB come with a few caveats:

  • If you want to index multiple fields in a document, all of them will be queried in a text search query. Means: There is no way to select fields to match against. So if you maybe later add a list of directors per movie and put a text index on it, a search will seek directors and cast.
  • They are by default very broad. A search for ‘Sean Connery’ will give us all movies, containing some actor called ‘Sean’, all kinds of ‘Connerie’s and along with our beloved ‘Sean Connery’.

On the other hand, text search queries are quite fast and efficient.
Can we maybe use them to-pre qualify documents for an exact search?

So let’s start by adding this index to our collection:

db.movies.createIndex( { cast: "text" } );

After that, we could try our first search query:

db.movies.find( { $text: { $search: "Moss Carrie-Anne" } } );

As mentioned, this will return a result but also false positives for or use-case.

Combining Text Search with Regex Matching

You know that in a conditional statement likeif(somefunc() && someOtherFunc()) {}, someOtherFunc() won’t be evaluated if someFunc() returns false. This is often referred to as ‘short-circuit’. The same applies to MongoDB queries. This means, if we use and to logically connect two conditions, the second won’t execute if the first returns no data.

Additionally, databases are smart enough to reduce the second query to the result set of the first so if we take a query like{a: 1, b: 2} we would first find all records with a: 1 and then reduce the result to all records matching b: 2 as well.

Applying this knowledge, we can create a query that first uses a text search to broadly find a superset of our final result set and then perform the more expensive regex query to narrow down the result:

db.movies.find({
$and:[{
$text: {
$search: "Moss Carrie-Anne"
}},{
cast: {
$elemMatch: {$regex: /Moss/, $regex: /Carrie-Anne/}}
}]}
);

Let me repeat:

  • When we perform a simple search against a text index, we will get all documents with indexed text that contains the words we are looking for. This is too broad but already a superset of the result we want.
  • A regex query added with a logical and will only traverse through the superset derived from the text search query.
  • If the text search returns no results, the regex query won’t be executed at all

Especially for large datasets, this will drastically reduce the CPU load and also speed up your queries. In my tests queries executed 10 times faster, of course returning the same results as with regex queries alone.

By the way — this isn’t only relevant for MongoDB or even text vs. regex queries. In fact, choosing the order of your conditions wisely, can drastically boost performance with any database.

HTH :)