How to develop an Elasticsearch feature?

Elasticsearch is an open-source project that allows you to contribute. In this article, I will explain the problem we had about product sorting and how I solved it by contributing to Elasticsearch.

Ready?

Problem — Sort parent documents by the first child of nested arrays

I wanted to sort parent documents by first child of a nested array. For example I have a mapping and two documents(iPhone X and Samsung Galaxy S8) like these:

PUT products
{
"mappings": {
"product": {
"properties": {
"name": {
"type": "keyword"
},
"variants": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"price": {
"type": "integer"
},
"sortId": {
"type": "integer"
}
}
}
}
}
}
}
PUT products/product/1
{
"name": "iPhone X",
"variants": [
{
"name": "iPhone X 256GB",
"price": 6900,
"sortId": 1
},
{
"name": "iPhone X 64GB",
"price": 8300,
"sortId": 2
}
]
}
PUT products/product/2
{
"name": "
Samsung Galaxy S8",
"variants": [
{
"name": "
Samsung Galaxy S8 Altın",
"price": 3350,
"sortId": 1
},
{
"name": "
Samsung Galaxy S8 Gümüş",
"price": 3550,
"sortId": 2
},
{
"name": "
Samsung Galaxy S8 Siyah",
"price": 3400,
"sortId": 3
}
]
}

And lets try to sort documents descending with following query:

POST products/_search
{
"sort": [
{
"variants.price": {
"order": "desc",
"nested": {
"path": "variants"
}
}
}
]
}

It first brings iPhone X because iPhone X’s biggest price is 8300 and Samsung Galaxy S8's biggest price is 3550. This is the normal behavior of Elasticsearch.

However, I need to sort the products by the price of the nested document with lowest sortId in variants array.

First, I tried sorting with scripts but when I used nested filters, the result was not correct.

Elasticsearch has SortMode feature for these situations. If Elasticsearch had a FIRST SortMode, I could solve this problem.

Because of the problems above;

I opened an issue to Elasticsearch to propose the FIRST sort mode.

Prepare to develop Elasticsearch feature

Elasticsearch is a girl?

Elasticsearch is written in Java and I chose use IntelliJ for development. After cloning Elasticsearch project from its repo, I ran the following command to work on the project using IntelliJ.

gradle idea

Solution 1—FIRST SortMode

Elasticsearch is a really large project and it has complex project structure so it is very hard to start implementing directly. First, I dived deeply into the source code to find classes I need change. The file I first changed was following class:

server/src/main/java/org/elasticsearch/search/MultiValueMode.java

MultiValueMode.java class contains all sort modes and their implementations in it. Some sort modes are only applicable for specific types and some line Min and Max are applicable for every field type. FIRST mode also should be applicable for all field types.

To do that, I needed to override every MultiValueMode#pick method, which was pretty easy to implement. I just iterated DocIdSetIterator object through the last value. (Commits: 4079137, 1ef661)

If you want the FIRST value, why do you iterate to the last?
- Because “Elasticsearch reverse to put all parents after the children”

We discussed about in the issue.

After the implementation, I opened a pull request to Elasticsearch to send my code to review. Following query is the new sort query that solves our problem:

POST products/_search
{
"sort": [
{
"variants.price": {
"mode": "first",
"order": "desc",
"nested": {
"path": "variants"
}
}
}
]
}

This solution worked well for us but they want a more generic solution and decided to limit nested array using a new property. So they proposed max_children property, which is also fine to solve our problem because when we set max_children to1 and it was going to work like FIRST SortMode. So I started to work on the implementation of max_children.

Solution 2 — max_children limit

When I started to implement max_children, I faced with a big problem. Because Elasticsearch puts all values into DocIdSetIterator in reverse order I needed to iterate values from last to first, which was not possible. If I knew how many values in this iteration, I could do the implementation. #pick methods has startDoc and endDoc arguments and DocIdSetIterator has #cost method. When I use filters, these options failed for my implementation. So I had one option to solve the problem and it was using two loops.

  • First loop for populating all values to limit with max_children value. (Cost:total_value_count times)
  • Second loop to calculate the output of the first loop. (Cost:max_children times)

Total Cost: total_value_count + max_children times. And I changed every #pick method for every SortMode with 50ccea commit. Here is the example query:

POST products/_search
{
"sort": [
{
"variants.price": {
"order": "desc",
"nested": {
"path": "variants",
"max_children": 1
}
}
}
]
}

After the commit, we discuss with pull-request reviewer and we decide to change iteration last-to-first to first-to-max_children to get better performance. So with this change, we needed to reverse nested field in my document. Because FIRST mode became LAST. It works fine but it is hard to understand even with detailed documentation. Because of these complexities, reviewers decided to preserve the order of the nested documents at index time in version 6.5.

After this change in Lucene indexing, reverse value problem solved, so LAST mode became FIRST again.

;)

Finally, my commit about max_children accepted and merged. This feature will be available on Elasticsearch with the version 7.0.

In conclusion, it was a very good experience for me and I had fun and beneficial time while discussing my changes with Elasticsearch members. Elasticsearch members are very helpful to contributors and supports contributors. Thanks them for their patience.

Merged ❤

Issue and Pull Request:

First SortMode for Nested Fields # issue

Add max_children limit to nested sort # pull request

References:

Elasticsearch Sort Documentation

Elasticsearch Source Code