Creating Partial Cloudant Indexes

Filter data before it’s indexed to get the most out of your storage


Indexing is what makes database queries fast and scalable. Without an index, a database is forced to trawl through every record to calculate the answer to a query. A carefully designed index allows a query to be answered with a fraction of a work by jumping to the pertinent portion.

Indexing basics

Let’s dive in with the example of road safety data from the UK government’s open data portal. There is a record for each traffic accident that looks like this once de-normalised and imported into a Cloudant database:

There are 1,176,672 records in the database, so flying without an index is going to be an expensive operation. In fact, if you attempt to make a query without first creating an index, Cloudant will oblige but issue a warning in the returned data:

The fields you need to index are related to the fields that you use in the selector object in your query. If we are going to be making queries between two dates, then it makes sense to create an index on the Date field:

We are making a call to the _index endpoint of our Cloudant database, passing in a JSON object that describes the index we wish to create:

In this case we only need to index the Date field.

On receiving this request, Cloudant trawls through each document in the database and creates an index where each record is organised into Date order. On a large data set this can take some time—check the progress in the "Active Tasks" tab of your Cloudant dashboard or call the _active_tasksendpoint.

A range query against a Cloudant index.

Once the index is built, we can then query this index, passing in a date or a range of dates, and Cloudant will use the index to fulfill the request:

Building an index with a partial filter

The first index we created stored an entry for each of the 1m+ records. This may be exactly what we intended, but in some cases we may only ever be interested in a subset of the data.

Let’s say we are only interested querying accidents that occur during a weekend — there is little point indexing those that occur during the week. This is where Cloudant’s index definitions with a partial_filter_selector are useful.

The partial_filter_selector parameter instructs the indexer to filter the data prior to indexing, meaning that the index no longer holds data for every record in the database, only the ones that match the supplied selector.

Filter your data before writing to the index to optimise for the queries you know you’re going to ask.

At query-time, you can further winnow the data on the fields you chose to index:

Our query for 2015 accidents using this partial index now only contains accidents that occurred during the weekend.

For extra code readability, you can also include the original partial selector in your query-time selector:

Cloudant will only use a partial index to answer a query if you specify either the ddoc or use_indexfields at query time. See the documentation for full details.

Why partial indexes?

Partial indexes pre-filter the data before it is written to the index. This can make your indicies smaller, leaner and quicker for Cloudant to work with. Also, as Cloudant pay-as-you-go services charge per GB of data (which includes the data used to store your core JSON and index data), smaller indexes can help keep your costs down too!