MongoDB Pagination, Fast & Consistent

Mosius
The Startup
Published in
9 min readOct 16, 2020
Image by gOrk barrette from Pixabay

When it comes to pagination, the simplest method that comes to mind is the skip-limit, but it is not always the best solution!

The Problem with the skip-limit

Although theskip-limit is suitable for some applications, it suffers from 2 major problems

1st. Poor Performance

As mentioned in MongoDB docs, it requires scanning all the previous docs before reaching the desired one, so it becomes slower as the offset increases. So the response times decline as page numbers increase.

2nd. Data Inconsistency

The problem occurs meanwhile the user is exploring data and any doc is being added or deleted simultaneously in the visited pages. it causes shifts in data which means we may face duplicated docs or not fetching some docs in the result. This problem is more likely to occur when we have a real-time system such as a feed.

What is Keyset Pagination?

The overall idea is to retrieve the next pages using the previous pages that have been fetched. In another word, you will tell what the last fetched item is and get the next page after that item. Easy, hun…?!

Any Downside?

By using the keyset method, we lose the ability to randomly access the pages. Simply saying, we actually can’t get page 3 without getting page 2 and getting page 2 without getting page 1. But the good thing is we don’t really care about that in most applications.

Let’s Explore Pages

Let’s use mongo-shell to find out the general idea. Then we go on by implementing a sample function in Python and JavaScript for more practical usage.

Data Preparation

I’ve inserted about 5k docs in the users collection with name, birth and height fields.

> db.users.count()
4940
> db.users.findOne()
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68c"),
"name" : "Ardra",
"birth" : ISODate("1970-11-04T12:03:32.988Z"),
"height" : 159
}

Fetching 1st & 2nd Page

We can fetch the first page as we did before. I’ve also limited each page to 3 docs for the sake of simplification.

> db.users.find().limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68c"),
"name" : "Ardra",
"birth" : ISODate("1970-11-04T12:03:32.988Z"),
"height" : 159
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68d"),
"name" : "Thea",
"birth" : ISODate("1985-05-16T20:41:33.861Z"),
"height" : 162
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68e"),
"name" : "Selia",
"birth" : ISODate("1976-05-12T07:01:39.035Z"),
"height" : 188
}

To fetch the second page we can use a natural feature of the ObjectId that makes it always incremental. So we can fetch the next page just by fetching the documents that their _id is greater than the last doc on the first page.

> db.users.find({ _id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b68e") } }).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68f"),
"name" : "Ellette",
"birth" : ISODate("1995-03-02T22:58:04.018Z"),
"height" : 174
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b690"),
"name" : "Odilia",
"birth" : ISODate("1975-04-24T23:08:09.316Z"),
"height" : 123
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b691"),
"name" : "Brandais",
"birth" : ISODate("1992-01-12T23:02:58.313Z"),
"height" : 184
}

That’s the basic idea!

Sorting and Tiebreaker

In many applications, we want to sort the query result based on some fields rather than _id. In that case, where we are sorting the result based on a non-unique field, there is an important point we should take care of. Consider we want to sort the result based on the user’s height.

> db.users.find().sort({ height: 1, _id: 1 }).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b6a4"),
"name" : "Carol-jean",
"birth" : ISODate("1981-11-30T18:34:29.613Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b791"),
"name" : "Kippie",
"birth" : ISODate("1991-03-06T22:05:31.554Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b795"),
"name" : "Pru",
"birth" : ISODate("1995-10-21T19:44:33.719Z"),
"height" : 121
}

Querying past the first page requires including a unique field (like _id in most cases) to act as a tiebreaker in cases where the height values are equal.

> db.users.find({
$or: [
{ height: { $gt: 121 } },
{
height: 121,
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b795") }
}
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty()
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7dd"),
"name" : "Jonis",
"birth" : ISODate("1971-03-19T00:19:56.970Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7e0"),
"name" : "Wendye",
"birth" : ISODate("1984-08-27T09:44:45.319Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7e3"),
"name" : "Wilie",
"birth" : ISODate("1974-09-15T10:51:23.528Z"),
"height" : 121
}

If we only queried for users with theheight greater than 121 (like what we did before with _id), we would end up skipping some docs that their height are 121 but they were not returned on the first page result.

At this point, you may be looking at the code example above and think:

“we can simplify the conditions to this!”

> db.users.find({
height: { $gte: 121 },
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b795") }
}).sort({
height: 1,
_id: 1
}).limit(3)

But that query would lead to a bug! When you are sorting by a non-unique key, your results will most likely return _id field out of order.

To clear it with an example, let’s consider querying against the following simplified data:

{ "id": "1", "height": 121 }
{ "id": "2", "height": 121 }
{ "id": "3", "height": 122 }
{ "id": "4", "height": 121 }

We would have this query and its result for the first:

> db.users.find().sort({
height: 1,
_id: 1
}).limit(3)
{ "id": "1", "height": 121 }
{ "id": "2", "height": 121 }
{ "id": "4", "height": 121 }

Now if we query for the second page there wouldn’t be anything as result:

> db.users.find({
height: { $gte: 121 },
_id: { $gt: "4" }
}).sort({
height: 1,
_id: 1
}).limit(3)

For the second page, we can’t expect any result, because we are looking for docs with _id greater than 4. As mentioned earlier, we need to prioritize height, and only utlize tiebreaker _id when the height is equal in target docs.

So far, we’ve almost covered many applications and we can confidentially include any condition in our queries to get a fast and consistent result, but…

*to sort based on more than 1 field, you can read my response here.

There’s an $or

In some cases where we need to declare an $or condition in the query’s root, we have to make sure that both the query and the pagination query are getting satisfied! For instance, imagine we want to find users whose names are shorter than 3 characters or whose names start with the letter X.

> db.users.find({
$or: [
{ $expr: { $lt: [{ $strLenCP: "$name" }, 3] } },
{ name: { $regex: /^X.*/i } }
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty()
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c0ae"),
"name" : "Di",
"birth" : ISODate("1988-03-27T23:22:10.843Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c58a"),
"name" : "Jo",
"birth" : ISODate("1997-05-07T09:44:10.037Z"),
"height" : 126
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c3d6"),
"name" : "Xylia",
"birth" : ISODate("1971-11-30T20:43:15.103Z"),
"height" : 127
}

To fetch the next page, we will put both $or operators in an $and operator.

> db.users.find({
$and: [
{
$or: [
{ $expr: { $lt: [{ $strLenCP: "$name" }, 3] } },
{ name: { $regex: /^X.*/i } }
]
},
{
$or: [
{ height: { $gt: 127 } },
{
height: 127,
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622c3d6") }
}
]
}
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty()
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c33f"),
"name" : "Ki",
"birth" : ISODate("1994-05-13T04:20:47.427Z"),
"height" : 132
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622bb3d"),
"name" : "Xenia",
"birth" : ISODate("1975-07-31T18:49:55.730Z"),
"height" : 133
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622bdfc"),
"name" : "Vi",
"birth" : ISODate("1979-09-21T19:46:21.595Z"),
"height" : 133
}

Checking Performance & Benchmark

Now let’s make these 2 methods into use and compare their performance in different cases. I also seeded the users collections with about 50B docs to make results look more like a real-world scenario.

Basic Pagination

Query: None / Sort: None / Limit: 25

With no query and no sorting, although in the beginning, the skip-limit seem to respond a bit faster, we can see its response time grows linearly in further pages. as we read before that's because it has to count(O(N)) from the beginning of the collection for each page. On the other hand, we can see keyset doesn’t really depend on the page number, as it can directly(O(1)) find the last fetched document in the collection(because of indexed _id) to retrieve the next page instantly.

Sorted Queries Pagination

Before we can continue, we need to create indexes for the users fields to be able to query them more efficiently. besides that, indexes also help us to speed up sorting our queries. Here I will create indexes for both height and birth fields.

db.users.createIndex({ height: 1 })
db.users.createIndex({ birth: 1 })
Query: { birth: { $gt: ISODate(“1998–01–01”), $lt: ISODate(“2000–01–01”) } } / Sort: { birth: 1} / Limit 25

As you may notice, after applying a query with sort, the response time has increased a little bit, but its shape is relatively the same.

Worse Performance and Fix

In a situation, where we need to sort the query result on a field that its values are just not identical, but they have a low variety(lots of records with the same value, height in our case), the keyset pagination performs much worse than the skip-limit.

Query: { height: 150 } / Sort: { height: 1 } / Limit: 25

But why? The reason is that the database doesn’t use the _id index when the height is the same in all the docs. So it begins to scan(O(N)) from the beginning of the list every time it wants to find the target _id (you can read more about handling collision in a hash table).

To fix the problem, we can create a compound index on both _id and height fields.

db.users.createIndex({ _id: 1, height: 1 })
Query: { height: 150 } / Sort: { height: 1 } / Limit: 25

And again, we have an excellent result. just note that creating more indexes causes poorer performance on inserting and it also occupies more space on disk or memory.

Show Me the Code

So far we have explored the data using mongo-shell . but now let’s implement a function where we can pass our original query to get a paginated query.

Python Implementation

We can use this function by passing the query to the first parameter, and the sort option as a list to the second parameter.

query = {'height': {'$gt': 150}}
sort = ['birth', -1]
limit = 10
query, next_key_fn = generate_pagination_query(query, sort)
users = list(db.users.find(query).limit(limit).sort([sort]))
next_key = next_key_fn(users)

We can then fetch the next page by passing the next_key to the third parameter of the function.

query, next_key_fn = generate_pagination_query(query, sort, next_key)
users = list(db.users.find(query).limit(limit).sort([sort]))

JavaScript Implementation

We can use this function by passing the query to the first parameter, and the sort option as a list to the second parameter.

const query = { height: { $gt: 150 } };
const sort = ['birth', -1];
const { paginatedQuery, nextKeyFn } = generatePaginationQuery(query, sort);
const users = await User.find(paginatedQuery).limit(3).sort([sort])
.toArray();
nextKey = nextKeyFn(users);

We can then fetch the next page by passing the nextKey to the third parameter of the function.

const { paginatedQuery, nextKeyFn } = generatePaginationQuery(query, sort, nextKey);
const users = await User.find(paginatedQuery).limit(3).sort([sort])
.toArray();

Final Word

Having algorithms that perform independently by the data size is beneficial value in products that need to scale up fast. Pagination is one of the basic stuff we can optimize to create such a product.

Besides the algorithm we may use, it is important to know what queries and sorting we require, so we can define appropriate indexes and enhance the response time.

Finally, I wish you enjoyed the article and find its information useful to use in your production-level applications.

Thank you for reading.

--

--

The Startup
The Startup

Published in The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +772K followers.

Responses (23)