Increase Elasticsearch scroll performance in your Java application

Markus Dybeck
PriceRunner Tech & Data
4 min read · Sep 21, 2020

Scrolling through a lot of documents in an Elasticsearch index might take some time, but fortunately there are a few things we can do about it. In this article I’ll go through two small but effective things we can do to increase scroll performance.

Scroll Batch Size

The first thing we should do is to find the optimal batch size for our environment. I’ll define the optimal batch size as the size where we get the most throughput for our query.

A small batch size might cause network communication overhead, while a big batch size will cause memory allocation overhead if our documents are big. The sweet spot is probably somewhere in the middle and will differ between environments, as it depends on network latency, what the data looks like, index size and more. Therefore we need to iterate through different batch sizes in order to find that sweet spot.

I’ve quite often assumed that a higher batch size is better, but that need not be the case: a batch size of 10 documents might give you higher throughput than 100, 1000 or even 10000 documents per batch. So start off by testing 10 documents per batch and then increase from there until the throughput degrades.
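That search can be sketched as a simple probe loop. In the sketch below, `findBestBatchSize` and `scrollAll` are hypothetical names: `scrollAll` stands in for running one complete scroll at the given batch size and returning the elapsed milliseconds, which in a real application would mean timing the actual Elasticsearch scroll requests.

```java
import java.util.function.IntToLongFunction;

// Sketch: try increasing batch sizes and stop as soon as throughput
// degrades; the previous size is then our sweet spot. `scrollAll` is a
// stand-in for a timed, complete scroll at the given batch size.
class BatchSizeProbe {
    static int findBestBatchSize(IntToLongFunction scrollAll, int totalDocs) {
        int bestSize = 10;
        double bestThroughput = 0;
        for (int size = 10; size <= 10_000; size *= 10) {
            long elapsedMillis = Math.max(1, scrollAll.applyAsLong(size));
            double throughput = totalDocs / (double) elapsedMillis; // docs per ms
            if (throughput <= bestThroughput) {
                break; // throughput degraded: the previous size won
            }
            bestThroughput = throughput;
            bestSize = size;
        }
        return bestSize;
    }
}
```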

If we’ve found the optimal batch size but aren’t happy with the throughput, we should take a look at our cluster and its resources. Is the CPU (and/or memory) operating at 100 %, or do we still have resources that we can use? If we do have plenty of resources left, or have a cluster with more than one node, then we have at least one performance booster left that we can start playing around with: slicing.

Sliced scroll

If we want to scroll through our documents with any kind of parallelism we need to perform a so-called sliced scroll. If the cluster has several nodes we will definitely benefit from slicing, as we can spread the workload over more CPUs.

It came to my attention that this wasn’t that straightforward to do in a Java application. Well, the slice query itself was straightforward, but fully utilizing it wasn’t.

In order to do regular scrolls (without slicing) we’ve created an iterator that initially makes a scroll request and then keeps on scrolling for each iteration. The usage in our Java code pretty much turns into
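(The embedded code example has not survived in this copy of the article.) A minimal self-contained sketch of such an iterator, where a `fetch` function stands in for the Elasticsearch client calls (the initial search with a scroll timeout, then scroll requests with the returned scroll id); the class and parameter names are illustrative:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.IntFunction;

// Sketch of a scroll iterator: the constructor issues the initial scroll
// request, each next() returns the current batch and fetches the next one.
// `fetch` maps a page number to a batch so the sketch runs without a
// cluster; a real version would call the Elasticsearch client instead.
class ScrollIterator<T> implements Iterator<List<T>> {
    private final IntFunction<List<T>> fetch;
    private int page = 0;
    private List<T> nextBatch;

    ScrollIterator(IntFunction<List<T>> fetch) {
        this.fetch = fetch;
        this.nextBatch = fetch.apply(page++); // initial scroll request
    }

    @Override
    public boolean hasNext() {
        return !nextBatch.isEmpty();
    }

    @Override
    public List<T> next() {
        List<T> current = nextBatch;
        nextBatch = fetch.apply(page++); // keep on scrolling
        return current;
    }
}
```

Draining it is then a plain `while (iterator.hasNext()) { process(iterator.next()); }` loop.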

This iterator can’t handle slicing out of the box, since each sliced scroll request needs to be initialized with the slice id and the maximum number of slices - “This is slice 1 of 2 for query x”. That means that each iterator will only be able to handle one slice of the scroll.

To benefit from slicing we can create a new iterator with the needed slicing parameters. If we want our scroll split into two slices, we can create two iterators.
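(The embedded snippet is missing here.) Conceptually each iterator is constructed with its slice id and the maximum number of slices; in the real Java client those two values would go into the initial search request, via a slice builder. The sketch below is self-contained: an in-memory list stands in for the index and a slice simply owns every maxSlices-th document, which is not how Elasticsearch assigns documents to slices but is enough to show the shape.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of a sliced scroll iterator: "this is slice `sliceId` of
// `maxSlices`". A real implementation would pass sliceId and maxSlices
// into the initial scroll request instead of slicing an in-memory list.
class SlicedScrollIterator<T> implements Iterator<List<T>> {
    private final List<T> docs;      // stand-in for the index
    private final int maxSlices;
    private final int batchSize;
    private int cursor;              // next document owned by this slice

    SlicedScrollIterator(List<T> docs, int sliceId, int maxSlices, int batchSize) {
        this.docs = docs;
        this.maxSlices = maxSlices;
        this.batchSize = batchSize;
        this.cursor = sliceId;
    }

    @Override
    public boolean hasNext() {
        return cursor < docs.size();
    }

    @Override
    public List<T> next() {
        List<T> batch = new ArrayList<>();
        while (cursor < docs.size() && batch.size() < batchSize) {
            batch.add(docs.get(cursor));
            cursor += maxSlices; // skip documents owned by the other slices
        }
        return batch;
    }
}
```

For a scroll split into two slices we create `new SlicedScrollIterator<>(docs, 0, 2, batchSize)` and `new SlicedScrollIterator<>(docs, 1, 2, batchSize)`.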

We can then iterate through each iterator in a separate thread, and we have achieved parallelism. But that is not very dynamic. A cleaner approach would be if we could write something like this:
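(The embedded snippet is missing here.) The shape of it: build a `Spliterator` over the sliced scrolls and hand it to `StreamSupport.stream` with the parallel flag set to true. A minimal sketch of that call site, with illustrative names:

```java
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Whatever spliterator we build over our sliced scrolls, the stream side
// looks the same: StreamSupport.stream(spliterator, parallel) with
// parallel set to true, then a terminal operation over the batches.
class ParallelScroll {
    static <T> void processBatches(Spliterator<List<T>> scrollSpliterator,
                                   Consumer<List<T>> processBatch) {
        Stream<List<T>> batches = StreamSupport.stream(scrollSpliterator, true);
        batches.forEach(processBatch);
    }
}
```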

Above we have created a Spliterator that we supply to a stream with parallelism set to true. The spliterator takes the number of slices as its last parameter and is then responsible for creating as many iterators as needed.

Below is the spliterator class; it has a quite simple and naive approach. First we construct the spliterator with a list of all the iterators; in our example we would have a list of two iterators. Then, if parallelism is asked for, we try to split our spliterator in half by creating a new spliterator with half of the iterators in our list.
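(The class itself has not survived in this copy of the article; the following is a reconstruction from the surrounding description, so names and details are assumptions. To stay self-contained it is constructed with the ready-made slice iterators and an estimated size, rather than creating the iterators from a slice count as the original presumably did.)

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;

// Reconstructed sketch: one scroll iterator per slice. trySplit hands
// half of the iterators to a new spliterator; tryAdvance pushes the next
// batch from every iterator that still has elements (the simple and
// naive approach described in the text). The estimated size would come
// from the scroll response's total hits.
class ScrollSpliterator<T> implements Spliterator<List<T>> {
    private final List<Iterator<List<T>>> iterators;
    private long estimatedSize;

    ScrollSpliterator(List<Iterator<List<T>>> iterators, long estimatedSize) {
        this.iterators = new ArrayList<>(iterators);
        this.estimatedSize = estimatedSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super List<T>> action) {
        boolean advanced = false;
        for (Iterator<List<T>> iterator : iterators) {
            if (iterator.hasNext()) {
                action.accept(iterator.next());
                advanced = true;
            }
        }
        return advanced;
    }

    @Override
    public Spliterator<List<T>> trySplit() {
        if (iterators.size() <= 1) {
            return null; // only one slice left, nothing to split
        }
        // Detach half of the iterators into a new spliterator.
        int half = iterators.size() / 2;
        List<Iterator<List<T>>> detached = new ArrayList<>(iterators.subList(0, half));
        iterators.subList(0, half).clear();
        long detachedSize = estimatedSize / 2;
        estimatedSize -= detachedSize;
        return new ScrollSpliterator<>(detached, detachedSize);
    }

    @Override
    public long estimateSize() {
        return estimatedSize;
    }

    @Override
    public int characteristics() {
        return SIZED | NONNULL;
    }
}
```

Note that accepting several batches in one tryAdvance call bends the Spliterator contract (one element per call); it works for forEach-style terminal operations, which is why the text calls the approach naive.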

tryAdvance is the equivalent of our iterator’s hasNext and next methods. There we simply check which iterators in our list have more elements, and then accept the given action with the next batch of elements from those iterators.

The scroll response returns the total number of hits for our request, and from that we can estimate a correct size for each iterator, hence the SIZED characteristic.

Conclusion

We now have a more dynamic way of handling sliced scrolling and can start experimenting with the number of slices and the batch size for each slice to achieve maximal throughput. The process is more or less the same as when finding the batch size for single-threaded scrolling.

I would recommend starting with the same max slices as the number of shards for your index, if you have your shards evenly spread over your cluster. Then you will use resources from all nodes and thereby probably increase your scroll speed.

A note from the docs about the number of slices

If the number of slices is bigger than the number of shards the slice filter is very slow on the first calls, it has a complexity of O(N) and a memory cost equals to N bits per slice where N is the total number of documents in the shard. After few calls the filter should be cached and subsequent calls should be faster but you should limit the number of sliced query you perform in parallel to avoid the memory explosion.

A note about thread pools

By default the common fork-join pool is used when a stream runs in parallel, so it could be useful to supply your own thread pool for this task. That can be done by wrapping the whole task in a pool.
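A sketch of that wrapping, with illustrative names; note that running parallel-stream work in the submitting ForkJoinPool is a widely used trick that relies on implementation behaviour (tasks forked from a pool worker run in that worker's pool) rather than a documented guarantee:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;

// Submit the whole parallel-stream task to our own ForkJoinPool so the
// scroll does not compete with other work in the common pool.
class CustomPoolExample {
    static long sumInPool(List<Integer> values, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The stream pipeline runs on this pool's workers, not the common pool.
            return pool.submit(() ->
                    values.parallelStream().mapToLong(Integer::longValue).sum()
            ).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```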

Happy scrolling,
Markus Dybeck.
