Pagination in Event Sourced systems
In this article we examine the difficulty of implementing pagination in a system that uses event sourcing. We assume that the reader is familiar with event sourcing. We explain the concept of pagination and then show how we can accomplish it by using events.
What is pagination?
Pagination is the process of breaking up a potentially infinite amount of data into finite ordered partitions, and being able to retrieve partitions in order.
For example, consider an infinite set of even numbers:
When we paginate this set, we may be interested in the first three numbers:
The problem is then how do we get the “next” three numbers.
We can solve this problem by indexing each number, like so:
2(0), 4(1), 6(2), 8(3),…
where the value (n) in parenthesis means that the element is n-th in the order of all elements. We may then ask for the next three elements, starting at index (3):
8(3), 10(4), 12(5)
This, however, presents a number of problems. One of the most interesting ones is what happens when we delete an element?
Let us remove the number “4” from the original sequence, now we are left with:
2(0), 6(2), 8(3),…
Clearly, the indices no longer “line-up” with the element’s number in the sequence. So, when we ask for the first three elements, starting at index (0), we may get:
Alternatively, we may be smart enough to figure out that the answer should be:
2(0), 6(2), 8(3)
In this case, it is clear that the value (n) in parenthesis has little to do with the actual index of the element. Why not, then, generate a random identifier instead and use that to tag the elements, like so:
2(331), 4(481), 6(933), 8(312),…
For the purposes of the article, we only use a 3-digit identifier. In a real system, we would use a long identifier to avoid collision, or ensure that collision does not occur through other means.
Consider, now, what happens when we remove the element “4” as before:
2(331), 6(933), 8(312)…
and ask for first three elements starting with (331):
2(331), 6(933), 8(312)
The answer is correct.
So it is that we have devised a strategy of storing values so that we may retrieve them in partitions regardless of additions or deletions.
Let us state the general principle of designing a pagination system:
When storing data for pagination, produce a unique identifier and store it with each piece of datum. When paginating the data, provide a unique identifier and the number of results desired.
In database systems this is called keyset pagination, and a full article on implementation can be found on Use the Index, Luke website.
Event Sourcing and Pagination
Event Sourcing is the practice of recording the state of the system as a sequence of events instead of one single state.
For more information about Event Sourcing, I recommend you seek out a six hour class on CQRS by Greg Young.
The problem with pagination we face is the following one.
We have the data in this form:
Event for C, Event for B, Event for A, Event for A …
But we need to do be able to answer questions of this form:
Give me the first three objects from the collection of (A,B,C,…). Assume, for the sake of brevity, that we are interested in alphabetical ordering. Following our general principle for designing pagination systems above, we must:
- Produce a unique identifier when storing each object
- Be able to answer a query for some number of objects given the identifier.
Let us tackle each issue separately.
Issue 1: Producing an identifier
How can we produce an identifier for each individual object? The answer must depend on the requirements of the system. Often, the identifier is already present in the business process, for example, when a customer opens an account, the bank creates an account number for the customer. All further events in the system are then linked to that account number.
In general, there are two possibilities:
- Re-use the business process identifier.
- Create an identifier.
We must note, that in some cases the implementation, like database, creates an identifier (e.g. by incrementing the value of some counter). While this is also a valid identifier, what the system designer has to keep in mind is that this identifier must be shared with the system’s user, as we will see below. Whether the designer chooses to depend upon specific piece of technology to create such an identifier or craft it themselves is irrelevant.
Once we have an identifier that is associated with each event in the system, the whole process becomes that of creating new records, updating or deleting from some information storage system or data structure.
These operations are exactly the INSERT, UPDATE and DELETE that one sees in most relational databases. And so the developer is either already very familiar with how to accomplish the task of persisting the changes of each event, or can easily learn by reading documentation for a particular database technology they decide to use.
Issue 2: Querying the system
The system must be able to answer the query to retrieve N-number of objects, starting with some object with identifier X.
The simplest implementation of such a system is one that uses B-trees. However, it is seldom that a program in today’s age has to implement their own data structure, nor, arguably, should they. Most industrial databases of today provide keyset pagination out of the box. So it is, that while the writing the data structure needed to support our kind of query is complex — using an existing system to accomplish the same task is simple.
While the solution we presented works good for retrieving N-records starting at some record X, we can ask the following question:
“How can one navigate directly to page 25 of some record set if we don’t know the identifier of the record that we should start querying at?”
This question seems to reveal a problem in our design. But let us consider the problem that this question is trying to solve. Why does the user need to navigate to page 25 of some collection of data? What business problem does it solve?
Having answered the questions above, it would be better to design another data store that will, at the same time, denormalize events to address these very specific requirements. All of this to say that keyset pagination is not the right answer to every problem.
In this article we looked at what pagination is. The two ways of designing the system for pagination. Having decided on the latter approach — keyset pagination, we presented the two requirements a system must fulfill to implement such a pagination. First, how to generate a unique identifier. Second, how to query for some number of records starting at the record identifier provided by the user. Finally, we addressed concern that the design was incorrect by arguing that keyset pagination was not meant to solve every problem and that another approach can be employed in parallel to rectify the situation.
By separating the way we store information about the behavior of the system (events) and the way we store the information for specific uses like pagination (relational database), we are able to simplify the whole significantly. Each component is now doing its own, arguably trivial, job. I encourage you to try event sourcing, and compare it to traditional database-centric approaches. It is much more pleasant to work with!