Query Read Optimization in MongoDB

Published in

Major League

4 min readFeb 14, 2017

What is NoSQL ?

These are databases that provide mechanisms for saving and retrieving data in a different way than relational databases. Amongst their most notables features we can find:

Schema-less
Horizontal Scalability
High Speed

There are a lot of NoSQL DataBases, between those, one of the most popular and used in LateralView is MongoDB.

Why not a relational model ?

There is nothing wrong with a relational model since they can also scale horizontally, but the process may be more difficult and expensive.

Two of the big problems that one has to face in a relational model are the number of JOINS necessary for a extensive query and having to appeal to cache to speed up the process and avoid the execution of heavy and expensive operations.

How do we get the high speed ?

High speed is achieved in different ways:

Creating an index that support the queries.
Limiting the amount of query results to reduce network consumption.
Using projections to retrieve only the necessary data.

How do the indexes bring us a better response time ?

This is because they allow us to limit the number of documents to be inspected that satisfy our query. However, taking advantage of the creation of indexes may be counterproductive because:

By having more indexes, there are more plans that MongoDB can use to satisfy our query, and for that, the process to select the most efficient plan ends up taking longer.
The indexes at the same time give us performance in response time. Also they suppose a cost in memory, which is limited and, in case of surpassing, it will have to save the index in disk slowing down the query.

The best solution is to implement indexes in the most frequent and selective queries. These queries are the best excluding or filtering documents. The selectivity of a query could determine how efficient an index is, or if it is used or not.

How could we know the behavior of an specific index ?

One way of manually verifying the behavior of our queries is by using the explain method, that gives us an explainable object with the following information:

queryPlanner, which provides information about the used plan by MongoDB and information about the winning plan, this is, the most optimal.
executionStats, which give us statistics about the winning plan.

In the query planner, the information to be emphasized is the “stage” and “inputStage”, every inputStage indicates the stage that was performed previously, being its possible values:

COLLSCAN For a query above all documents in a collection.
IXSCAN For a query using some index.
FETCH For retrieve documents.
SHARD_MERGE For merging results from shards.

As to the executionStats, the most important data is:

nReturned: Number of documents returned.
executionTimeMillis: Amount of time to perform the query.
totalKeysExamined: Number of examined keys.
totalDocsExamined: Number of examined documents.

Being so, what we will look for is that the number of documents and keys examined are the most closest to the number of documents returned.

Example

To see it in more detail, lets verify the behavior of MongoDB in the following example:

Suppose we have a user collections where which one has an activation code, this code has the following characteristics:

It is alphanumeric.
It has a length of 4 characters.
It is used only once. After it is used, the field is deleted.
It is unique.

With all this information, we should probably have a query such as:

db.users.find({code: "7JA0"});

This way, the first thing you should think is to create a unique index in code.

db.users.createIndex({code:1}, {unique:true})

But, at the moment of having two users without a code field we will get an duplicate code error.

For that we have an extra option which is Partial Index.

db.users.createIndex(
    {code:1},
    {
        unique:true,
        partialFilterExpression: {code: {$exists:true}}
    }
)

In this way, we create a unique index in the code field which only indexes if the document matches the defined criteria (the code exists)

Suppose the following dataset:

{ "_id" : 2, "code" : "AB3F", "email" : "pepe@test.com" }
{ "_id" : 3, "email" : "juan@test.com" }
{ "_id" : 4, "email" : "mau@test.com" }
{ "_id" : 5, "email" : "julian@test.com" }
{ "_id" : 6, "email" : "ruben@test.com" }
{ "_id" : 7, "code" : "BB5H", "email" : "mar@test.com" }
{ "_id" : 8, "code" : "C48F", "email" : "leo@test.com" }
{ "_id" : 9, "code" : "D51M", "email" : "fede@test.com" }
{ "_id" : 10, "code" : "7JA0", "email" : "dani@test.com" }

And suppose we get the following query:

db.users.find({code: "7JA0"}).explain(true);

Lets see the result of the query without index:

Where the main points that one can find in this query are:

A COLLSCAN was performed (No index could be found that would improve the query).
All documents were examined to retrieve only one

On the other hand, if the partial index defined above is applied, the result would be different:

Where the main points that one can find in this query are:

An IXSCAN was performed.
The index code_1 defined previously was used.
Only one key and document were examined.
There wasn’t rejected plans.

Conclusion

As you can see, the use of indexes is important to improve performance on queries to the data base, also there are ways to check how well our query is being executed, and which fields we should evaluate to see if the index is efficient or not.