Elasticsearch, how we paginated over 10 000 items

Benoit Travers
6 min read · Jan 24, 2019

Ouest-France is a French daily newspaper, known as the most widely read francophone newspaper in the world. For the past year, we have been working on a new version of the website. The website is split into multiple microservices that are fully independent of each other:

  • An agnostic CMS that orchestrates incoming requests,
  • Block providers, which are content services such as article blocks (lists and details), weather forecast blocks, etc.,
  • A Page-Builder used by writers to set up all blocks through a user-friendly UI.

Our team is responsible for the main block provider, which delivers all content related to articles. All articles are stored in an Elasticsearch cluster, and we distinguish two kinds of requests:

  • Get one full article by its identifier, used for detail blocks. Those requests are pretty basic, so we will not dwell on them. Here is the result of the detail block.
  • Get a list of articles matching criteria configured by writers through the Page-Builder. Those lists are paginated, and some of them contain far more than 10 000 articles. Here is the result of the list block.

When we perform a search request on an Elasticsearch index, from + size of the request cannot be greater than index.max_result_window. By default, this value is set to 10 000 at index creation.
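To make the constraint concrete, here is a minimal sketch of a classic from/size search body (index criteria, sort field, and the helper itself are illustrative, not our actual code):

```python
# Classic from/size pagination body. Elasticsearch rejects the request when
# from + size exceeds index.max_result_window (10 000 by default).
MAX_RESULT_WINDOW = 10_000

def classic_search_body(page, page_size):
    """Build a plain from/size search body for the given 1-based page."""
    from_ = (page - 1) * page_size
    if from_ + page_size > MAX_RESULT_WINDOW:
        # Elasticsearch would answer with a "Result window is too large" error.
        raise ValueError("Result window is too large: use search_after instead")
    return {
        "from": from_,
        "size": page_size,
        "sort": [{"publicationDate": "desc"}],   # illustrative sort field
        "query": {"term": {"section": "sport"}},  # illustrative criteria
    }

body = classic_search_body(page=100, page_size=20)  # from=1980, still allowed
```

With a page size of 20, page 500 (from = 9 980) is the last page a classic query can serve; page 501 already falls outside the window.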

How to deal with lists larger than 10 000 items?

After some research, here are the different options we have:

  • Limit lists to 10 000 articles. Most of the time, this is the right answer: if you need to paginate over more than 10 000 results, your initial request is probably not precise enough. But in our context, Elasticsearch is not just a search engine. For SEO purposes, some lists of articles have to expose their whole history, which exceeds 10 000 articles.
  • Increase the value of index.max_result_window. This value exists to protect the Elasticsearch cluster's memory from large queries. Increasing it can lead to cluster latency and, worse, crashes. Since articles are the main service of the website, we cannot let that happen.
  • Use the Elasticsearch scroll API. For large requests, when latency is not a major concern, using the scroll API is good practice. Despite multiple warnings found on the web, we implemented this solution. As expected, it was very slow, taking tens of minutes, in addition to being costly because of the state Elasticsearch keeps between iterations. In short, this solution is not acceptable for real-time user requests.
  • Use a search_after request. This is the suitable solution for real-time user requests, but it requires knowing the last result of the previous page.

Implemented solution

In our articles service, we have two branches:

  • If from + size is less than or equal to 10 000, we perform a classic Elasticsearch query,
  • Otherwise, we use pre-calculated pages and perform a search_after query based on the last article of the previous page.
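The two branches above can be sketched as follows (a pure-logic sketch: the lookup of the pre-calculated page is stubbed out as a parameter, and the field names are assumptions):

```python
MAX_RESULT_WINDOW = 10_000

def build_search_body(page, page_size, last_sort_values=None):
    """Dispatch between classic pagination and search_after.

    last_sort_values: sort values of the last article of the previous page,
    read from the pre-calculated pages index (stubbed here).
    """
    from_ = (page - 1) * page_size
    if from_ + page_size <= MAX_RESULT_WINDOW:
        # Classic query: fresh results, computed on demand.
        return {"from": from_, "size": page_size,
                "sort": [{"publicationDate": "desc"}, {"id": "asc"}]}
    # Deep page: rely on the pre-calculated cursor of page - 1.
    if last_sort_values is None:
        raise LookupError(f"page {page - 1} has not been pre-calculated yet")
    return {"size": page_size,
            "sort": [{"publicationDate": "desc"}, {"id": "asc"}],
            "search_after": last_sort_values}
```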

In other words, pages within the first 10 000 items are fresh because they are computed on demand with a classic Elasticsearch request. The other pages are static and pre-calculated; they are not as fresh, but that is acceptable for SEO purposes.

The main challenge is to keep an “almost up to date” index holding, for each page, the information about its last article. For example, in order to display sport page 2000, the articles service needs to know the last article of sport page 1999, then performs the search_after query based on this article.

First of all, we need an Elasticsearch index containing all queries that yield more than 10 000 results. We created a service named paginator that manages those queries, which are needed to calculate and refresh pages.

Example of document in queries index
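The original embedded example is not reproduced in this copy; a document in the queries index could look roughly like this (the field names and values are assumptions based on the surrounding description):

```python
# Hypothetical document of the queries index: the _id is a hash of the
# request, and lastUse tracks when the query was last requested.
queries_doc = {
    "_id": "9e107d9d372bb6826bd81d3542a419d6",  # illustrative MD5 hash
    "request": {
        "query": {"term": {"section": "sport"}},
        "sort": [{"publicationDate": "desc"}, {"id": "asc"}],
    },
    "lastUse": "2019-01-24T10:00:00Z",
}
```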

Each Elasticsearch query has a predictable identifier, which is a hash of the request field (the MD5 of the stringified request).
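Such an identifier could be derived as follows; serializing with sorted keys so the stringified form is deterministic is an assumption on our part, not something the article specifies:

```python
import hashlib
import json

def query_id(request: dict) -> str:
    """MD5 of the stringified request, used as the document identifier."""
    # Sorting keys makes the stringified form, and thus the hash, stable
    # regardless of the order in which the request was built.
    stringified = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(stringified.encode("utf-8")).hexdigest()

request = {"query": {"term": {"section": "sport"}},
           "sort": [{"publicationDate": "desc"}]}
qid = query_id(request)  # same request always yields the same identifier
```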

Second, we need an index with all calculated pages. We created another service, named paginator-calc, that receives a query and performs a scroll query in order to compute all pages.

Each page document holds the query identifier, the page number, and the information about the last article needed for the search_after query. This service is not part of the paginator service because it may scale differently.

Example of document in pages index
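Again, the embedded example is missing from this copy; given the fields listed above, a pages-index document could look like this sketch (field names are assumptions):

```python
# Hypothetical document of the pages index: one document per pre-calculated
# page, keyed by query identifier and page number.
pages_doc = {
    "queryId": "9e107d9d372bb6826bd81d3542a419d6",  # illustrative hash
    "page": 1999,
    # Sort values of the last article of this page, fed to search_after
    # when the next page (2000) is requested.
    "lastArticleSort": [1548316800000, "art-39980"],
}
```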

Pages are now calculated. The last step consists in implementing the search_after query in the articles service when from + size is greater than 10 000.

Example of search_after request
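The embedded request is not shown in this copy; a request of this kind could look like the following sketch, with the exact field names being assumptions:

```python
# Hypothetical search_after request for page 2000: there is no "from" field,
# the cursor is the sort values of the last article of page 1999.
search_after_body = {
    "size": 20,
    "query": {"term": {"section": "sport"}},
    "sort": [
        {"publicationDate": "desc"},
        {"id": "asc"},  # tiebreaker for articles sharing a publication date
    ],
    "search_after": [1548316800000, "art-39980"],
}
```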

We add a second sort on the id field, which acts as a tiebreaker.

Overview of the interaction of the components discussed

How to register queries in paginator service

Actually, the solution is not complete yet. Some issues have not been addressed:

  • Pages are computed once, but writers continuously add content and the total number of pages keeps growing. Pages need to be refreshed,
  • The criteria for lists of articles are not frozen. Writers set them when configuring list blocks and edit them depending on their needs. In addition, the criteria are not known before the articles service is requested. In short, the queries index cannot be populated manually.

To overcome these issues, when the articles service performs an Elasticsearch request with more than 10 000 total hits, it sends the query to the paginator service. The paginator service then creates an entry in the queries index or updates the lastUse field of the existing document. If the last page associated with the query has not been computed (meaning either that no pages are computed yet, or that writers have added so much content that a new page appeared), the paginator service sends the query to the paginator-calc service for page calculation.
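The "is the last page computed?" decision can be sketched as pure logic (the service calls around it are stubbed out; this is our reading of the flow, not the actual implementation):

```python
import math

MAX_RESULT_WINDOW = 10_000

def needs_page_calculation(total_hits, page_size, last_computed_page):
    """True when deep pages exist but are not all pre-calculated.

    last_computed_page: highest pre-calculated page number for this query,
    or 0 when none has been computed yet.
    """
    if total_hits <= MAX_RESULT_WINDOW:
        return False  # classic pagination covers everything
    last_needed_page = math.ceil(total_hits / page_size)
    return last_computed_page < last_needed_page
```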

Last but not least: cleaning queries and pages

Queries are now created dynamically, depending on the website's admin configuration. When a query is no longer used, refreshing its associated pages is a waste of computational resources. That is why we implemented a mechanism to clear old queries and their associated pages.

A CRON job triggers the paginator service to retrieve old queries (based on the lastUse field). For each query, a message is sent to the paginator-calc service to delete all associated pages. Once the pages are deleted, paginator-calc sends a response message back to the paginator service so that it deletes the query.
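The stale-query lookup could be expressed as a range filter on the lastUse field; the retention period and the exact query shape below are assumptions:

```python
from datetime import datetime, timedelta, timezone

def stale_queries_body(retention_days=30, now=None):
    """Search body matching queries unused for retention_days or more."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Range query on the lastUse date field of the queries index.
    return {"query": {"range": {"lastUse": {"lt": cutoff.isoformat()}}}}
```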

Some improvements

Pages are computed each time a new page is created or the first time the query is registered, which means static pages (pages beyond the first 10 000 articles) are only “almost up to date”. To go further and refresh pages each time a new article matching the criteria is added, we could use Elasticsearch percolate queries.
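With the percolator, the registered list queries would be indexed in a field of type percolator, and each newly published article would be percolated against them to find which lists it affects. The request shape below follows Elasticsearch's percolate query, but the index layout and field names are assumptions:

```python
# Sketch: finding which registered list queries match a newly published
# article, using a percolate query against the stored queries.
new_article = {"section": "sport",
               "publicationDate": "2019-01-24T09:00:00Z"}

percolate_body = {
    "query": {
        "percolate": {
            "field": "query",         # field mapped as type "percolator"
            "document": new_article,  # matched against all stored queries
        }
    }
}
```

Each hit would identify a query whose pre-calculated pages should be refreshed.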

For a simplified and open-source version of this project, see https://github.com/btravers/elasticsearch-paginator.


Benoit Travers

French developer, currently working for Zenika Rennes.