MongoDB Pagination, Fast & Consistent
When it comes to pagination, the simplest method that comes to mind is the skip-limit
, but it is not always the best solution!
The Problem with the skip-limit
Although theskip-limit
is suitable for some applications, it suffers from 2 major problems
1st. Poor Performance
As mentioned in MongoDB docs, it requires scanning all the previous docs before reaching the desired one, so it becomes slower as the offset increases. So the response times decline as page numbers increase.
2nd. Data Inconsistency
The problem occurs meanwhile the user is exploring data and any doc is being added or deleted simultaneously in the visited pages. it causes shifts in data which means we may face duplicated docs or not fetching some docs in the result. This problem is more likely to occur when we have a real-time system such as a feed.
What is Keyset Pagination?
The overall idea is to retrieve the next pages using the previous pages that have been fetched. In another word, you will tell what the last fetched item is and get the next page after that item. Easy, hun…?!
Any Downside?
By using the keyset
method, we lose the ability to randomly access the pages. Simply saying, we actually can’t get page 3 without getting page 2 and getting page 2 without getting page 1. But the good thing is we don’t really care about that in most applications.
Let’s Explore Pages
Let’s use mongo-shell
to find out the general idea. Then we go on by implementing a sample function in Python and JavaScript for more practical usage.
Data Preparation
I’ve inserted about 5k docs in the users
collection with name
, birth
and height
fields.
> db.users.count()
4940> db.users.findOne()
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68c"),
"name" : "Ardra",
"birth" : ISODate("1970-11-04T12:03:32.988Z"),
"height" : 159
}
Fetching 1st & 2nd Page
We can fetch the first page as we did before. I’ve also limited each page to 3 docs for the sake of simplification.
> db.users.find().limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68c"),
"name" : "Ardra",
"birth" : ISODate("1970-11-04T12:03:32.988Z"),
"height" : 159
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68d"),
"name" : "Thea",
"birth" : ISODate("1985-05-16T20:41:33.861Z"),
"height" : 162
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68e"),
"name" : "Selia",
"birth" : ISODate("1976-05-12T07:01:39.035Z"),
"height" : 188
}
To fetch the second page we can use a natural feature of the ObjectId
that makes it always incremental. So we can fetch the next page just by fetching the documents that their _id
is greater than the last doc on the first page.
> db.users.find({ _id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b68e") } }).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b68f"),
"name" : "Ellette",
"birth" : ISODate("1995-03-02T22:58:04.018Z"),
"height" : 174
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b690"),
"name" : "Odilia",
"birth" : ISODate("1975-04-24T23:08:09.316Z"),
"height" : 123
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b691"),
"name" : "Brandais",
"birth" : ISODate("1992-01-12T23:02:58.313Z"),
"height" : 184
}
That’s the basic idea!
Sorting and Tiebreaker
In many applications, we want to sort the query result based on some fields rather than _id
. In that case, where we are sorting the result based on a non-unique field, there is an important point we should take care of. Consider we want to sort the result based on the user’s height
.
> db.users.find().sort({ height: 1, _id: 1 }).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b6a4"),
"name" : "Carol-jean",
"birth" : ISODate("1981-11-30T18:34:29.613Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b791"),
"name" : "Kippie",
"birth" : ISODate("1991-03-06T22:05:31.554Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b795"),
"name" : "Pru",
"birth" : ISODate("1995-10-21T19:44:33.719Z"),
"height" : 121
}
Querying past the first page requires including a unique field (like _id
in most cases) to act as a tiebreaker in cases where the height
values are equal.
> db.users.find({
$or: [
{ height: { $gt: 121 } },
{
height: 121,
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b795") }
}
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7dd"),
"name" : "Jonis",
"birth" : ISODate("1971-03-19T00:19:56.970Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7e0"),
"name" : "Wendye",
"birth" : ISODate("1984-08-27T09:44:45.319Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622b7e3"),
"name" : "Wilie",
"birth" : ISODate("1974-09-15T10:51:23.528Z"),
"height" : 121
}
If we only queried for users with theheight
greater than 121
(like what we did before with _id
), we would end up skipping some docs that their height
are 121
but they were not returned on the first page result.
At this point, you may be looking at the code example above and think:
“we can simplify the conditions to this!”
> db.users.find({
height: { $gte: 121 },
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622b795") }
}).sort({
height: 1,
_id: 1
}).limit(3)
But that query would lead to a bug! When you are sorting by a non-unique key, your results will most likely return _id
field out of order.
To clear it with an example, let’s consider querying against the following simplified data:
{ "id": "1", "height": 121 }
{ "id": "2", "height": 121 }
{ "id": "3", "height": 122 }
{ "id": "4", "height": 121 }
We would have this query and its result for the first:
> db.users.find().sort({
height: 1,
_id: 1
}).limit(3){ "id": "1", "height": 121 }
{ "id": "2", "height": 121 }
{ "id": "4", "height": 121 }
Now if we query for the second page there wouldn’t be anything as result:
> db.users.find({
height: { $gte: 121 },
_id: { $gt: "4" }
}).sort({
height: 1,
_id: 1
}).limit(3)
For the second page, we can’t expect any result, because we are looking for docs with _id
greater than 4
. As mentioned earlier, we need to prioritize height
, and only utlize tiebreaker _id
when the height
is equal in target docs.
So far, we’ve almost covered many applications and we can confidentially include any condition in our queries to get a fast and consistent result, but…
*to sort based on more than 1 field, you can read my response here.
There’s an $or
In some cases where we need to declare an $or
condition in the query’s root, we have to make sure that both the query and the pagination query are getting satisfied! For instance, imagine we want to find users whose names are shorter than 3 characters or whose names start with the letter X.
> db.users.find({
$or: [
{ $expr: { $lt: [{ $strLenCP: "$name" }, 3] } },
{ name: { $regex: /^X.*/i } }
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622c0ae"),
"name" : "Di",
"birth" : ISODate("1988-03-27T23:22:10.843Z"),
"height" : 121
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c58a"),
"name" : "Jo",
"birth" : ISODate("1997-05-07T09:44:10.037Z"),
"height" : 126
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622c3d6"),
"name" : "Xylia",
"birth" : ISODate("1971-11-30T20:43:15.103Z"),
"height" : 127
}
To fetch the next page, we will put both $or
operators in an $and
operator.
> db.users.find({
$and: [
{
$or: [
{ $expr: { $lt: [{ $strLenCP: "$name" }, 3] } },
{ name: { $regex: /^X.*/i } }
]
},
{
$or: [
{ height: { $gt: 127 } },
{
height: 127,
_id: { $gt: ObjectId("5f7f1f7d67f67ddaa622c3d6") }
}
]
}
]
}).sort({
height: 1,
_id: 1
}).limit(3).pretty(){
"_id" : ObjectId("5f7f1f7d67f67ddaa622c33f"),
"name" : "Ki",
"birth" : ISODate("1994-05-13T04:20:47.427Z"),
"height" : 132
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622bb3d"),
"name" : "Xenia",
"birth" : ISODate("1975-07-31T18:49:55.730Z"),
"height" : 133
}
{
"_id" : ObjectId("5f7f1f7d67f67ddaa622bdfc"),
"name" : "Vi",
"birth" : ISODate("1979-09-21T19:46:21.595Z"),
"height" : 133
}
Checking Performance & Benchmark
Now let’s make these 2 methods into use and compare their performance in different cases. I also seeded the users
collections with about 50B docs to make results look more like a real-world scenario.
Basic Pagination
With no query and no sorting, although in the beginning, the skip-limit
seem to respond a bit faster, we can see its response time grows linearly in further pages. as we read before that's because it has to count(O(N)) from the beginning of the collection for each page. On the other hand, we can see keyset
doesn’t really depend on the page number, as it can directly(O(1)) find the last fetched document in the collection(because of indexed _id
) to retrieve the next page instantly.
Sorted Queries Pagination
Before we can continue, we need to create indexes for the users
fields to be able to query them more efficiently. besides that, indexes also help us to speed up sorting our queries. Here I will create indexes for both height
and birth
fields.
db.users.createIndex({ height: 1 })
db.users.createIndex({ birth: 1 })
As you may notice, after applying a query with sort, the response time has increased a little bit, but its shape is relatively the same.
Worse Performance and Fix
In a situation, where we need to sort the query result on a field that its values are just not identical, but they have a low variety(lots of records with the same value, height
in our case), the keyset
pagination performs much worse than the skip-limit
.
But why? The reason is that the database doesn’t use the _id
index when the height
is the same in all the docs. So it begins to scan(O(N)) from the beginning of the list every time it wants to find the target _id
(you can read more about handling collision in a hash table).
To fix the problem, we can create a compound index on both _id
and height
fields.
db.users.createIndex({ _id: 1, height: 1 })
And again, we have an excellent result. just note that creating more indexes causes poorer performance on inserting and it also occupies more space on disk or memory.
Show Me the Code
So far we have explored the data using mongo-shell
. but now let’s implement a function where we can pass our original query to get a paginated query.
Python Implementation
We can use this function by passing the query
to the first parameter, and the sort
option as a list to the second parameter.
query = {'height': {'$gt': 150}}
sort = ['birth', -1]
limit = 10query, next_key_fn = generate_pagination_query(query, sort)
users = list(db.users.find(query).limit(limit).sort([sort]))
next_key = next_key_fn(users)
We can then fetch the next page by passing the next_key
to the third parameter of the function.
query, next_key_fn = generate_pagination_query(query, sort, next_key)
users = list(db.users.find(query).limit(limit).sort([sort]))
JavaScript Implementation
We can use this function by passing the query
to the first parameter, and the sort
option as a list to the second parameter.
const query = { height: { $gt: 150 } };
const sort = ['birth', -1];const { paginatedQuery, nextKeyFn } = generatePaginationQuery(query, sort);
const users = await User.find(paginatedQuery).limit(3).sort([sort])
.toArray();
nextKey = nextKeyFn(users);
We can then fetch the next page by passing the nextKey
to the third parameter of the function.
const { paginatedQuery, nextKeyFn } = generatePaginationQuery(query, sort, nextKey);
const users = await User.find(paginatedQuery).limit(3).sort([sort])
.toArray();
Final Word
Having algorithms that perform independently by the data size is beneficial value in products that need to scale up fast. Pagination is one of the basic stuff we can optimize to create such a product.
Besides the algorithm we may use, it is important to know what queries and sorting we require, so we can define appropriate indexes and enhance the response time.
Finally, I wish you enjoyed the article and find its information useful to use in your production-level applications.
Thank you for reading.