Cloudant Fundamentals: Indexing
Primary or secondary? Type json or text? Understanding your options (part 9 of 10)
In part 7 of this series, we saw a warning in the search results:
“no matching index found, create an index to optimize query time”
This is Cloudant’s polite way of saying that your query is expensive and the database is having to walk through the whole data set to calculate the answer. In small databases this is not a problem, but in a production system, with the data growing all the time, an index is essential.
What is an index?
A database index is just like an index in a book or a biblical concordance. It is a sorted data structure that allows quick access to a portion of the data.
Our data looks like this:
Cloudant automatically creates an index on the
_id field so that it can retrieve data by
_id. If we are going to be making lots of queries on the
dob field, then it might make sense to instruct Cloudant to create a secondary index on that field.
This is achieved by writing to the POST /db/_index endpoint:
In the body of the JSON supplied to the
_index endpoint, we specify the array of fields that are to be indexed. Cloudant then creates the index on disk (ordered by
dob in this case), so that it can be used by future queries.
If we repeat our query, we should see the same results, but without the warning:
Our query is now using the index and you should see a performance improvement, especially with larger data sets.
Index types — json & text
Cloudant Query has two types of index —
type='json' are built using Cloudant's MapReduce technology. This creates single-use indexes in the order of the keys you supply. For example, if you index
dob then the index will be in
dob order and useful for queries that involve selectors on those fields in that order.
type='text' are powered by Cloudant's Search technology. This builds separate "concordances" for the fields you supply, making it more universal for the database at query time. (See the $text operator for matching arbitrary strings against blocks of text in your documents.)
Both index types have their subtleties, so make sure you’ve read the documentation and understand how it works before going into production!
The job of an index is to help the database to reduce the volume of data it’s working with to a manageable size. Imagine you have a query like this for a database of movies:
Which field or fields should you create the index on to best serve this query? It depends on whether which is the smaller:
- the number of movies in that database for a given year
- the number of movies that star an actor
In this case, I’d be inclined to create an index on
actor because there may be thousands of movies in a year, but an actor might be credited with only a few dozen movies in their entire career. So an index on
actor winnows the database to a smaller data set than an index on
Our index on
actor reduces a database of tens of thousands of records down to a few dozen. It is then simple for the database to apply the second clause of the
$and to this smaller data set.
- Use the POST /db/_explain to see how Cloudant picks an index for a specified query.
- Cloudant allows you to specify the index you want the database to use at query time. This is good practice as you remove any ambiguity in index selection.
- Consider creating partial Cloudant indicies if you are only ever interested in querying a subset of your database (i.e., you could eliminate preliminary blog posts from your index and only include published posts in a partial index).
- There’s always Cloudant’s MapReduce, Search and Geospatial indexes for a lower-level querying experience.
- Not all queries are “Cloudant-shaped”. If you are building leader boards or counting distinct values, you might want to consider rolling your own custom index outside of Cloudant.
In the final part of this series, we’ll look at performing aggregations using Cloudant.