Optimizing Queries in RethinkDB

Anyone who’s been writing code for a couple of years has likely heard the quote “premature optimization is the root of all evil”. The idea is that it’s better to design and implement a clean solution that gets the job done, then benchmark and optimize later. While this is great advice, there comes the time, hopefully before going into production, when you need to address your applications performance. In this post I’ll be explaining how to optimize a query in RethinkDB.

RethinkDB

At ThredUP we use an API backed by RethinkDB to power our shop web UI. RethinkDB is a realtime JSON database that, at least on the surface, has some resemblance to other NoSQL databases such as MongoDB. One difference is the RethinkDB query language called ReQL. With ReQL you build up queries on the client using your language of choice (JavaScript, Ruby, Python, and Java are supported) using a functional style syntax. When writing a ReQL query in application code, it looks and feels like a normal function but then once the code executes, the query runs on the database server. In this blog post, I’ll be using JavaScript for code samples. The RethinkDB JavaScript driver has both a promise based and callback based API. I’ll be using the promise API. The code samples assume you’ve configured RethinkDB and have an open connection referenced by the variable conn.

Primary Keys

RethinkDB tables have a primary key field, which you can specify at the time the table is created. The primary key must be unique just like in any other type of database. If not specified, RethinkDB will automatically create a primary key field named id, which is a UUID string. The get method in the RethinkDB api is used to efficiently query by the primary key. This method takes the primary key value and returns a single document. For example, if you have a table of blog posts you could retrieve a single document by it’s id using a query like this:

r.table.get('00c0a336-b9dc-4e76-9392-23eec6839ce7').run(conn);

Queries on Non-Primary Keys

More often than not you need to query based on one or more fields that are not primary keys. If we want to look up blog posts by author, the get method doesn’t work because it only returns a single document and there may be many posts by a given author. A simple way to do this is to use the filter method, which takes a object literal “predicate” and returns any document for which it is true. For example, to query for blog posts by a given author:

r.table('posts').filter({ author: 'Kendra Mertz' })
.run(conn)
.then(cursor => cursor.toArray())
.then(posts => console.log(posts));

Extending the Blog Posts Example

Suppose we add tagging capability to our blog posts by adding a tags field, which is an array of strings. Now we want to look up all posts from a certain author and having a certain tag. The easiest way to do this is to change the query to use the row command along with eq and contains operators:

r.table('posts').filter(
r.row('author').eq('Kendra Mertz').and(
r.row('tags').contains('cat'))
).run(conn);

Even more powerful and arguably more readable way to achieve the same functionality is to use an anonymous function as the argument to filter:

r.table('posts').filter(
post => r.and(post('author').eq('Kendra Mertz'),
post('tags').contains('cat'))
).run(conn);

These two queries return the correct results, however it can be made much more efficient. But in order to do so, we’ll need a secondary index. It turns out the filter method does not actually uses indexes at all.

Secondary Indexes

There are three types of secondary indexes:

  1. Simple: on a single field
  2. Multi: on an array field
  3. Compound: combination of two or more fields

To optimize our filter query we only need a simple index on author. We can create it like so:

r.table('posts').indexCreate('author')
.run(conn)
.then(result => console.log('index created'))
.catch(err) => console.error(err));

Now instead of using filter, we can use the getAll method, specifying the index . Here’s an example, ignoring the tags criteria for now:

r.table('posts')
.getAll('Kendra Mertz', { index: 'author' })
.run(conn);

Now to include tags in our query we just chain the filter command:

r.table('posts')
.getAll('Kendra Mertz', { index: 'author' })
.filter(r.row('tags').contains('cat'))
.run(conn);

This gives correct results but there is a much more efficient way to implement this query. Your first thought might be to create a new multi secondary index on the tags field. However, this will not help performance because, as we learned above, filter does not use indexes at all. To make this query more efficient we need to use an index mapping function like so:

indexCreate('authorTags',
(post) => post('tags').map(tag => [ post('author'), 'tag' ]),
{ multi: true }
).run(conn);

Then the we can remove the filter and contains from our query and use the newly created index in a getAll:

r.table('posts')
.getAll(['Kendra Mertz', 'cat'], { index: 'authorTags' })
.run(conn);

Benchmarking

In Node.js development a quick and dirty way to benchmark performance is just to measure the wall clock time by using console.time and timeEnd:

console.time('get posts');
r.table('posts')
.getAll(['Kendra Mertz', 'cats'], { index: 'authorTags' })
.run(conn)
.then(() => {
console.timeEnd('get posts');
});
// get posts: .8765ms

Node core’s process.hrtime can be used instead if you need high resolution time (nanoseconds). This technique is great for getting a rough idea of performance, but sometimes you may need more statistically sound benchmarking. This is probably not something you want to take on yourself. Fortunately, John Dalton, creator of lodash, built the fantastic benchmark.js library. I created a repo on github to benchmark the RethinkDB queries from this article using benchmark.js. The repo sets up a test database with a single table containing 1000 blog posts, creates three different indexes, and benchmarks four different queries. The results, including ops/sec and mean run time, are logged to the terminal. The results show how significantly performance can be optimized by using the right index and RethinkDB query.

Results

  • The two queries using filter perform nearly identical and clock in around 95 ops/sec with 10 ms mean run times.
  • getAll with filter chained we get a big improvement of 620 ops/sec and 1.61 ms mean run time.
  • mapped multi index is nearly twice as fast, clocking in at 1,173 ops/sec and a .84 mean run time.