Cursor based pagination with Spring Boot and MongoDB

Say goodbye to offset pagination with few lines of code

Davide Pedone
6 min readNov 23, 2020
Photo by Pixabay on Pexels

According to Facebook “Cursor-based pagination is the most efficient method of paging and should always be used where possible”. I’m gonna have a quick look into how to use it in your Spring Boot based application.

Building a RESTful Web Service is quite easy with Spring Boot, all we have to do is map entities, create a repository and a controller. If the repository extends PagingAndSortingRepository, as MongoRepository does, Spring Data REST recognises some URL parameters that will influence the page size and starting page number. Let’s write some code and see what happens.

Define an Entity, a Repository and a simple Controller:

@Data
@Document(collection = "tweets")
public class Tweet {
private ObjectId id;
private String author;
private String content;
private Long createdAt;
}
public interface TweetRepository extends MongoRepository<Tweet, ObjectId> {
}
@GetMapping(value = "/paged")
public Page<Tweet> getPaged(Pageable pageable) {
return tweetRepository.findAll(pageable);
}

Now a GET request like these http://myapi.com/tweets?size=10&page=1 will be translated by Spring in the following query against the database:

db.tweets.find().skip(10).limit(10); to retrieve the result and:

db.tweets.count(); to retrieve pagination metadata, such as total elements and total page numbers.

This approach works fine for most use cases but there are some caveats you should be aware of.

  • missing elements: if the dataset changes during page navigation you could miss some element. Deleting an element in a delivered page will shift elements of the next page up
  • result set scan: as reported by MongoDB documentation the cursor.skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, cursor.skip() will become slower.
  • count: counting all elements has an impact on performance

A better approach: cursor based pagination (a.k.a. range queries)

Imagine a cursor as a bookmark to a document in the database. When we ask the server for the first page of data, we of course get the data, as well as an additional information about this bookmark. Passing the bookmark in the next query allows the server to simply continue where he left off in the previous request.

Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using cursor.skip() for pagination.

Choosing the “bookmark”

Again, MongoDB documentation comes to the aid:

Use this procedure to implement pagination with range queries:

Choose a field such as _id which generally changes in a consistent direction over time and has a unique index to prevent duplicate values

Query for documents whose field is less (or more) than the start value using the $lt (or $gt) and cursor.sort() operators

Store the last-seen field value for the next query.

Ok, we know that _id is unique and sortable so we can use this field as a bookmark (from now on continuationToken). Really? Maybe not.

ObjectId value increases over time but we can’t be sure that sorting a collection by _id reflects creation order of the documents. In fact, ObjectId only contains a 4 byte timestamp value and it is generated client side by the driver. Also, as happens in a real case scenario, we want our API to offer the ability to sort results not just by id. For our Tweets API we can choose to sort results by author or by creationDate but both these fields can’t guarantee uniqueness. An author could write more than one tweet, and two or more tweets could have been created at the same time.

Image from https://giphy.com/

The solution? Use two fields at a time (id+author, id+creationDate, …) as continuationToken.

Our new GET request will look like this: http://myapi.com/tweets?size=10&continuationToken=abc_def&sort=author where abc and def respectively represents the ObjectId and the author of the last entity of the previous response.

We need to produce a query like this:

db.tweets.find({
$or: [
{ author: { $gt: "def" } },
{ author: "def", _id: { $gt: "abc" } }
]})
.sort({ author: 1, _id: 1 })
.limit(11);

Basically we are searching all documents, sorted by author and id, that satisfy one of the following conditions:

  • author: {$gt: "def"}
  • author: "def" AND _id: {$gt: "abc"}

The first one allows us to move on to next author if all tweets by “def” were already returned, the latter acts as a tiebreaker in case more entities by author “def” need to be retrieved. We asked for a page with 10 elements but we queried the database for 11 elements. Doing so we are able to determine if there are more items that needs to be retrieved simply checking if the results set size is greater than page size.

Trade off

As we said before, the field used for range queries must be unique and adding the _id to the chosen sort field will guarantee uniqueness even if the sort field is null. So we can sort by any field right? Wrong.

Let’s assume we have a collection containing the following elements:

[
{
"_id": "1",
"birthday": "2000/1/1",
"name": "abc"
},
{
"_id": "4",
"birthday": "2001/1/1",
"name": "def"
},
{
"_id": "2",
"birthday": "2020/1/1",
"name": "ghi"
},
{
"_id": "3",
"name": "jkl"
}
]

and we want to retrieve all records of the collection sorted by birthday with a page size of 3. We can easily imagine the sort result by birthday and id. The first record is the one with _id:2, then _id:4, then _id:1 and then _id:3.

Here is our query:

db.persons.find({})
.sort({ birthday: -1, _id: -1 })
.limit(4);

We have asked for a page with size of 3 elements, result set contains (all) 4 records from database so we know that there is a next page. Ok, run the query to retrieve it:

db.persons.find({
$or: [
{ birthday: { $lt: "2000/1/1" } },
{ birthday: "2000/1/1", _id: { $lt: "1" } }
]})
.sort({ birthday: -1, _id: -1 })
.limit(4);

But none of the 4 documents matches these conditions, so we know from the first query that there are more than 3 document but we can’t retrieve them.

Lesson learned: NEVER sort by fields that may contain null values

Putting things together

To make things easier I’ve written a small library. Just add it to the list of our project dependencies and go back to IDE to see how to use it.

<dependency>
<groupId>it.davidepedone</groupId>
<artifactId>spring-cursor-pagination-mongodb</artifactId>
<version>1.1.1</version>
</dependency>

Create a TweetSearchFilter class to add filter capabilities to our controller

public class TweetSearchFilter {

private String author;
// getter/setter
}

And a TweetPaginationService class to customize the query (configSearchQuery) and provide information to base class to build the continuationToken.

public class TweetPaginationService extends CursorPaginationService<Tweet, TweetSearchFilter> {

public TweetPaginationService(MongoOperations mongoOperations) {
super(mongoOperations, List.of("createdAt", "author"), Tweet.class);
}

@Override
public void configSearchQuery(Query query, TweetSearchFilter filter, Principal principal) {
Optional.ofNullable(filter.getAuthor()).ifPresent(a -> {
query.addCriteria(where("author").is(a));
});
}

}

Finally update the TweetController:

@GetMapping
public ContinuationTokenSlice<Tweet> getAll(TweetSearchFilter tweetSearchFilter, CursorPageable cursorPageable) throws Exception {
return tweetPaginationService.executeQuery(tweetSearchFilter, cursorPageable);
}

Now a GET request to http://myapi.com/tweets?author=xyz&sort=createdAt&direction=DESC&size=10 will produce something like this:

{
"content": [
[....]
{
"id": "5ed6a45d0038622837ba60da",
"author": "xyz",
"content": "......",
"createdAt": 1590585400521
}
],
"continuationToken": "4LFv0w30RzTOTFRy2RzRz15Wu2in73AfwM00KFhqVKSxqF9-yDMAZG8Jx7NpDoUz6oo97Vjssum5k6h7nw9o0g==",
"numberOfElements": 10,
"size": 10
}

The continuationToken is encrypted and contains a hash of the previous query in order to avoid alteration of parameters and consequently unexpected results. The decrypted value is composed of 4 parts, separated by an underscore character: a5b2111b666cd3bd5689effc1eda29b6_5ed6a45d0038622837ba60da_createdAt_1590585400521.

The first part represents the MD5 hash of the previous query string excluding page size, the second part is the id of the last returned object, the third is the name of the sort field, the fourth is the sort field value of the last returned object and the latter is the sort value class. As explained before Id, sort field and sort value allow the library to build the next page query:

db.tweets.find({
$or: [
{ createdAt: { $lt: 1590585400521} },
{ createdAt: 1590585400521,
_id: { $lt: "5ed6a45d0038622837ba60da" }
}
]})
.sort({ createdAt: -1, _id: -1 })
.limit(10);

Full source code is available on github.

--

--