Elasticsearch scroll / search
There are multiple scenarios where we need to retrieve a large number of documents from Elasticsearch.
Elasticsearch gives us a few APIs for this, and I would like to go over them and give my recommendations.
The first API is the search API:
The SearchSourceBuilder object that is used for all search APIs has a size parameter, which defines how many results you want Elasticsearch to return (the default is 10). Using the standard search you can raise the page size up to 10,000 (this limit is set by the index.max_result_window index setting).
In addition to the page size there is the "from" parameter. Together these two parameters let you retrieve a slice of the full result set. So, for instance, if your search returns 10,000 hits, you can request from offset 1,000 with a page size of 500, and then request other pages as needed.
Example for this is:
SearchRequest searchRequest = new SearchRequest("my-index"); // placeholder index name
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
searchSourceBuilder.from(1000); // offset into the result set
searchSourceBuilder.size(500);  // page size
searchRequest.source(searchSourceBuilder);
SearchResponse search = elasticClient.search(searchRequest, RequestOptions.DEFAULT);
This API is simple, but depending on the amount of data it is not very efficient. The cluster has to run the full query (including sorting across nodes) just to skip to the offset and return one page. In addition, from + size cannot reach past the 10,000-record limit. If your dataset is big, you are better off using one of the following APIs.
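If you really do need from + size pages past that point, the limit can be raised per index (at the cost of memory and CPU for deep pages). A minimal sketch, assuming the Java high-level REST client and a placeholder index name:
UpdateSettingsRequest settingsRequest = new UpdateSettingsRequest("my-index"); // placeholder index
settingsRequest.settings(Settings.builder()
        .put("index.max_result_window", 20000)); // raise the 10,000 default
elasticClient.indices().putSettings(settingsRequest, RequestOptions.DEFAULT);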
The second API is the scroll API:
When we are going to do a lot of paging (because the data set is very large), we would like Elasticsearch to keep the result set organized on the server, so that fetching the next page does not require re-running the whole query. This works like a database cursor and is driven by a scrollId parameter. The scroll API lets you retrieve a very large result set (even the full index): after running the initial request you get back a scrollId, which you then pass back to Elasticsearch to get the next page. The disadvantage of this API is that you cannot jump to a specific page; you have to scroll through all the data in order.
Since the search context is kept on the server, you need to remember to clear the scroll once you are finished with it, and there is a limit on the number of scrolls that can be open at the same time.
Example for this is:
SearchScrollRequest scrollRequest = new SearchScrollRequest(options.getScrollId());
scrollRequest.scroll(TimeValue.timeValueMinutes(SCROLL_MINUTES)); // how long to keep the context alive
SearchResponse search = elasticClient.scroll(scrollRequest, RequestOptions.DEFAULT);
Here we create a SearchScrollRequest (to continue a previous scroll, we pass in its scrollId). We also need to specify how long we want the scroll context to be kept on the server.
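Putting it all together, here is a minimal sketch of the full scroll lifecycle (the index name and page size are placeholders): open the scroll with an initial search, loop until a page comes back empty, and then clear the scroll to free the server-side context:
SearchRequest searchRequest = new SearchRequest("my-index"); // placeholder index
searchRequest.scroll(TimeValue.timeValueMinutes(SCROLL_MINUTES));
searchRequest.source(new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery())
        .size(500));
SearchResponse search = elasticClient.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = search.getScrollId();
while (search.getHits().getHits().length > 0) {
    // process search.getHits() here ...
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueMinutes(SCROLL_MINUTES));
    search = elasticClient.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = search.getScrollId();
}
// Free the server-side search context once we are done.
ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
elasticClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);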
The third API is the search after API:
Search after works similarly to scroll, but it is stateless: no data is stored on the server for it to work. What it does need is a sort key. Search after sorts the data by that key and then retrieves everything after the last sort key value you have seen.
Example for this is:
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
searchSourceBuilder.sort("persistentId", SortOrder.ASC); // search after requires a sort
searchSourceBuilder.searchAfter(lastHit.getSortValues()); // sort values of the last hit of the previous page
search = elasticClient.search(searchRequest, RequestOptions.DEFAULT);
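To page through the whole result set, you feed the sort values of the last hit of each page into the next request. A sketch, assuming a placeholder index and a (near-)unique persistentId sort field (more on its uniqueness below):
SearchRequest searchRequest = new SearchRequest("my-index"); // placeholder index
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery())
        .size(500)
        .sort("persistentId", SortOrder.ASC);
searchRequest.source(sourceBuilder);
SearchResponse search = elasticClient.search(searchRequest, RequestOptions.DEFAULT);
while (search.getHits().getHits().length > 0) {
    SearchHit[] hits = search.getHits().getHits();
    // process hits here ...
    sourceBuilder.searchAfter(hits[hits.length - 1].getSortValues());
    search = elasticClient.search(searchRequest, RequestOptions.DEFAULT);
}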
The issue with this API is the sort key. If the key yields the same value for several documents, you can miss entries, since the next page will skip over them. So how do we make the sort key as unique as possible?
The best solution that we found was to pick a text field that is unique (usually not hard) and then hash that text field. The question, of course, is which hash function to use.
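As a sketch of how this fits together, the hash is computed once at index time and stored as a dedicated field that search after can sort on (computeHash is a hypothetical placeholder for one of the hash functions compared below, and the index name is made up):
long persistentId = computeHash(entityId); // hypothetical helper: one of the hashes below
IndexRequest indexRequest = new IndexRequest("my-index") // placeholder index
        .id(entityId)
        .source("entityId", entityId,
                "persistentId", persistentId);
elasticClient.index(indexRequest, RequestOptions.DEFAULT);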
Hash functions
We wanted to know, specifically with our data, whether there is a difference between the hash functions. I wrote a test that selects the text field from all documents (we started with a test of 500,000 documents). For each document I computed the hash and saved the value in a HashMap; whenever a value already existed in that HashMap, I recorded it in a second HashMap together with its duplication count.
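In code, the test boiled down to something like the following sketch (entityIds and computeHash stand in for the real data and the hash function under test):
Map<Long, String> seen = new HashMap<>();
Map<Long, Integer> duplications = new HashMap<>();
for (String entityId : entityIds) {
    long hash = computeHash(entityId); // the hash function under test
    if (seen.containsKey(hash)) {
        duplications.merge(hash, 1, Integer::sum); // count how often each value repeats
    } else {
        seen.put(hash, entityId);
    }
}
System.out.println("duplicated hash values: " + duplications.size());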
Java hashcode
As for code simplicity, this is the easiest. All you need to do is:
entityId.hashCode()
I was very surprised to find more than 100 duplicated hash keys, so this method is obviously not good enough.
XXHash32
final XXHash32 xxHash32 = new XXHash32(); // org.apache.commons.codec.digest.XXHash32
final byte[] bytes = entityId.getBytes(StandardCharsets.UTF_8);
xxHash32.update(bytes, 0, bytes.length); // hash the full byte array, not the char count
long persistentId = xxHash32.getValue();
MurmurHash3
int persistentIdInt = MurmurHash3.hash32(entityId.getBytes(StandardCharsets.UTF_8));
Neither of the last two hash functions produced any duplicates (further data sets still need to be tested for more differences). We went with MurmurHash3 since it is the most promising. For more information on the differences between hash functions see: