How to develop an Elasticsearch feature?
Elasticsearch is an open-source project that allows you to contribute. In this article, I will explain the problem we had about product sorting and how I solved it by contributing to Elasticsearch.
Problem — Sort parent documents by the first child of nested arrays
I wanted to sort parent documents by first child of a nested array. For example I have a mapping and two documents(iPhone X and Samsung Galaxy S8) like these:
PUT products
{
"mappings": {
"product": {
"properties": {
"name": {
"type": "keyword"
},
"variants": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"price": {
"type": "integer"
},
"sortId": {
"type": "integer"
}
}
}
}
}
}
}PUT products/product/1
{
"name": "iPhone X",
"variants": [
{
"name": "iPhone X 256GB",
"price": 6900,
"sortId": 1
},
{
"name": "iPhone X 64GB",
"price": 8300,
"sortId": 2
}
]
}PUT products/product/2
{
"name": "Samsung Galaxy S8",
"variants": [
{
"name": "Samsung Galaxy S8 Altın",
"price": 3350,
"sortId": 1
},
{
"name": "Samsung Galaxy S8 Gümüş",
"price": 3550,
"sortId": 2
},
{
"name": "Samsung Galaxy S8 Siyah",
"price": 3400,
"sortId": 3
}
]
}
And lets try to sort documents descending with following query:
POST products/_search
{
"sort": [
{
"variants.price": {
"order": "desc",
"nested": {
"path": "variants"
}
}
}
]
}
It first brings iPhone X because iPhone X’s biggest price is 8300 and Samsung Galaxy S8's biggest price is 3550. This is the normal behavior of Elasticsearch.
However, I need to sort the products by the price of the nested document with lowest sortId
in variants array.
First, I tried sorting with scripts but when I used nested filters, the result was not correct.
Elasticsearch has SortMode feature for these situations. If Elasticsearch had a FIRST
SortMode, I could solve this problem.
Because of the problems above;
I opened an issue to Elasticsearch to propose the FIRST
sort mode.
Prepare to develop Elasticsearch feature
Elasticsearch is written in Java
and I chose use IntelliJ
for development. After cloning Elasticsearch project from its repo, I ran the following command to work on the project using IntelliJ
.
gradle idea
Solution 1—FIRST
SortMode
Elasticsearch is a really large project and it has complex project structure so it is very hard to start implementing directly. First, I dived deeply into the source code to find classes I need change. The file I first changed was following class:
server/src/main/java/org/elasticsearch/search/MultiValueMode.java
MultiValueMode.java
class contains all sort modes and their implementations in it. Some sort modes are only applicable for specific types and some line Min
and Max
are applicable for every field type. FIRST
mode also should be applicable for all field types.
To do that, I needed to override every MultiValueMode#pick
method, which was pretty easy to implement. I just iterated DocIdSetIterator
object through the last value. (Commits: 4079137, 1ef661)
If you want the
FIRST
value, why do you iterate to the last?- Because “Elasticsearch reverse to put all parents after the children”
We discussed about in the issue.
After the implementation, I opened a pull request to Elasticsearch to send my code to review. Following query is the new sort query that solves our problem:
POST products/_search
{
"sort": [
{
"variants.price": {
"mode": "first",
"order": "desc",
"nested": {
"path": "variants"
}
}
}
]
}
This solution worked well for us but they want a more generic solution and decided to limit nested array using a new property. So they proposed max_children
property, which is also fine to solve our problem because when we set max_children
to1
and it was going to work like FIRST
SortMode. So I started to work on the implementation of max_children
.
Solution 2 — max_children limit
When I started to implement max_children
, I faced with a big problem. Because Elasticsearch puts all values into DocIdSetIterator
in reverse order I needed to iterate values from last to first, which was not possible. If I knew how many values in this iteration, I could do the implementation. #pick
methods has startDoc
and endDoc
arguments and DocIdSetIterator
has #cost
method. When I use filters, these options failed for my implementation. So I had one option to solve the problem and it was using two loops.
- First loop for populating all values to limit with
max_children
value. (Cost:total_value_count
times) - Second loop to calculate the output of the first loop. (Cost:
max_children
times)
Total Cost: total_value_count + max_children
times. And I changed every #pick
method for every SortMode
with 50ccea commit. Here is the example query:
POST products/_search
{
"sort": [
{
"variants.price": {
"order": "desc",
"nested": {
"path": "variants",
"max_children": 1
}
}
}
]
}
After the commit, we discuss with pull-request reviewer and we decide to change iteration last-to-first
to first-to-max_children
to get better performance. So with this change, we needed to reverse nested field in my document. Because FIRST
mode became LAST
. It works fine but it is hard to understand even with detailed documentation. Because of these complexities, reviewers decided to preserve the order of the nested documents at index time in version 6.5.0
.
After this change in Lucene indexing, reverse value problem solved, so LAST
mode became FIRST
again.
Finally, my commit about max_children
accepted and merged. This feature will be available on Elasticsearch with the version 6.5.0
.
In conclusion, it was a very good experience for me and I had fun and beneficial time while discussing my changes with Elasticsearch members. Elasticsearch members are very helpful to contributors and supports contributors. Thanks them for their patience.
Issue and Pull Request:
First SortMode for Nested Fields # issue
Add max_children limit to nested sort # pull request