Ours is Elasticsearch!

Originally published at reputationvip.io on July 4, 2016.

Welcome back! This article, the third in my series about Elasticsearch, will introduce some more advanced concepts.

In the last two articles, I guided you through the basic configuration of an Elasticsearch cluster, how to define mappings, how to perform CRUD operations, and finally, how to execute simple full-text search queries.

If these notions are not fresh in your mind, I recommend reading the previous articles:

Also, I would like to remind you that a GitHub repository is available for this set of articles. It contains every single query I use, along with a turnkey Elasticsearch cluster running under Docker. It is available here: https://github.com/quentinfayet/elasticsearch/tree/3.0

ROADMAP

As I just said, this article builds on the two previous ones. Without further ado, let's take a look at the menu.

  • Routing will be the first subject I talk about in this article. I will explain why and how you can configure the way your data is spread across the cluster.
  • Parent-child relationships and nested objects will be my second subject. Indeed, we haven’t gone through every Elasticsearch data type yet.
  • Scoring will be my third, more theoretical, subject. I will explain why the choice of a scoring function matters so much.
  • Compound Queries allow you to combine multiple queries, or to alter the results of other queries.
  • Scripting allows you to perform powerful operations within your queries.

Ok, Elasticsearch fans, get ready to dive even deeper into the fabulous world of Elasticsearch!

Routing

Here we are, back to some theoretical notions for a while.

The first notion I will talk about is routing.

The “classical” routing

In a “classical” configuration, Elasticsearch evenly dispatches the indexed documents among all the shards that compose an index.

You can notice that I used “all shards that compose your index”, and not “cluster”. Indeed, each node of the cluster may hold a shard of a given index, but an index may not be “sharded” on every node.

With this configuration, documents are spread across all shards, and each time you query an index, the cluster has to query all the shards that compose it, which may not be the desired behavior. Imagine now that we could tell the cluster, for each index, where to store the data, according to the routing value we give it. The performance of an index using this configuration may be higher.

This mechanism of choosing where to store documents is called routing.

Well, you may wonder how Elasticsearch decides where to store a document. By default, Elasticsearch calculates the hash value of each document’s ID, and on the basis of this hash, it will decide on which primary shard the document will be stored. Then, the shard redistributes the document over the replicas.
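To give an idea, this default routing used by Elasticsearch can be summarized by the following formula, where the routing value defaults to the document's ID:

shard_number = hash(_routing) % number_of_primary_shards

This is also why the number of primary shards of an index cannot be changed once the index has been created: changing it would break the routing of every document already indexed.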

I did talk a bit about the way Elasticsearch handles queries. Actually, the way it handles queries depends on the routing configuration. With the default routing configuration, Elasticsearch has to query each shard that composes your index (this first query actually gathers the scores of the documents). Once Elasticsearch gets the scores, it re-queries the shards considered relevant (the shards that contain documents matching the query). Finally, Elasticsearch merges the results and sends them back to you.

Routing Value

Manually defining the routing value

The first way to go with routing is to manually define a routing value. A routing value is a value that tells Elasticsearch on which shard to index a document (and thus, which shard to query).

If you define a routing value, you then have to provide this value each time you query the index.

To index a document by providing its routing value, the request would look like this:

The query type is PUT

http://localhost:9200/index/type/ID?routing=routingValue

In the query above, of course, index, type, and ID should be replaced with the document's index, type and ID. What's interesting is the routing parameter. Here, routingValue is its value, and should be replaced by the value of your choice. Note that this value can either be digits (integer, long, float, ...) or a string.
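For instance, with the game_of_thrones index used in the previous articles, such a request could look like the sketch below (the document body is simplified, and the exact fields depend on your own mapping):

curl -XPUT 'http://localhost:9200/game_of_thrones/character/Arya%20Stark?routing=Stark' -d '{
  "name": "Arya Stark",
  "house": "Stark"
}'

Here, the document is routed according to the value "Stark" instead of the hash of its ID.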

Using this method, you will also be required to provide the routing value, as an HTTP GET parameter, on each query you make to the cluster.

Elasticsearch may store documents that have different routing values on the same shard. As a result, if you don’t provide the routing value each time you query Elasticsearch, you may have results that come from totally different shards.

That is why this method is not the most convenient.

Semi-automated routing value

Routing is meant to group together documents that have something in common. For example, in my previous article, I stored documents that are Game of Thrones’ characters. I could use their house as a routing value. Documents that have the same house value would then be stored on the same shard.

Semi-automated routing is all about this: finding common ground, that is, a field shared by the documents that should be stored together, holding the same value.

Using this common field, we can tell the cluster to use it as the routing value.

Indicating which field should be used as the routing value is done when defining the mapping, which I talked about in the previous article.

When defining the mapping for a given type, you can add the _routing value, which is a JSON object that describes your routing rule. This JSON object is composed of two fields: the path field contains the name of the document's field whose value should be used as the routing value; the required field says whether or not the routing value is required when performing indexing operations.

For example, considering the game_of_thrones index with the character type, I could have added this to my JSON mapping object:

Routing example
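A minimal sketch of this mapping fragment (assuming the _routing object with the path and required fields described above; the required value is only an illustration):

{
  "character": {
    "_routing": {
      "required": true,
      "path": "house"
    }
  }
}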

The above JSON object tells Elasticsearch to use the house field of my character type document as the routing value.

Limits of routing

Well, maybe the title is a little exaggerated. There are no real “limits” to the routing features, just some facts that you should be aware of.

The most important amongst them is called a hotspot. A hotspot is a shard that handles way more documents than the other shards. With classical routing, Elasticsearch distributes the documents evenly among the shards, so hotspots won’t show up easily. But when using manual or semi-automated routing, the documents may not be spread evenly among the shards, which could result in some shards handling a lot more documents than others.

Unfortunately, there is no way to handle these hotspots automatically.

The solution lies on the client’s side. Indeed, your application (or whatever indexes documents into your index) has to identify the routing values that will receive too many documents. A solution may be to create a special index for these documents, and then use aliases to make it transparent to your application.

Parent-child relationships and nested objects

Through the two previous articles, I talked a lot about indexing, mapping, how to fill your database with documents, how to update them, how to perform CRUD operations on them, and finally, how to use basic full-text search features.

However, a question might have come to your mind: How to manage relations between documents?

In this chapter, I will go through the main principles of indexing more complex documents, and how to define the relations that bind them together. Also, I will introduce you to the nested types of Elasticsearch, and how to index non-flat data.

Nested types

The definition of the nested type is quite simple: It is an array of objects that lives inside a document.

As you know, Elasticsearch works on top of Apache Lucene. Yet, Apache Lucene doesn’t know anything about objects that live in other objects, so it is Elasticsearch’s job to flatten them.

In other words, if you try to index a document that contains an array of objects, by default, each field of each object in the array will be merged into a field of your top-level document, which contains an array of all the values corresponding to that field name across the objects of the array. Is that clear? I think it is not. Let me give you an example:

Let’s assume that our character type documents have a field named weapons that contains an array of objects, each of them being a weapon, as follows:

Nested document
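A sketch of what such a document could look like (the surrounding character fields are only illustrative; what matters is the weapons array):

{
  "name": "Arya Stark",
  "house": "Stark",
  "weapons": [
    { "type": "sword", "name": "Needle" },
    { "type": "axe", "name": "My beloved Axe" }
  ]
}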

As you can see, the document Arya Stark has a field named weapons, that contains two objects. The first one is a sword (the famous sword named "Needle"), and the second one is an axe named "My beloved Axe" (thanks to my imagination for this - very original - name).

If you index this document in the cluster, without specifying in the mapping that the field weapons is of nested type, then the indexed document will look like this:

Indexed nested document
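Roughly, the flattened representation would look like this (a sketch based on the document above):

{
  "name": "Arya Stark",
  "house": "Stark",
  "weapons.type": [ "sword", "axe" ],
  "weapons.name": [ "Needle", "My beloved Axe" ]
}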

As you can see, each field of the objects contained in the array has been flattened into an array of values, under a field whose name combines the array’s name and the object’s field name. The two fields named type in the weapons array, for example, result in a new field named weapons.type, which is an array of the values formerly contained in the type field of each weapon.

Here, we lost any relation between the fields, because the objects have been flattened.

The solution to this problem is, in the mapping, to define the field weapon as a field of nested type:

Defining a nested field
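A minimal sketch of this mapping fragment (the sub-field types are assumptions; adapt them to your own data):

"weapons": {
  "type": "nested",
  "properties": {
    "type": { "type": "string" },
    "name": { "type": "string" }
  }
}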

And, with this mapping, the nested object will not be altered in any way.

Parent-child relationships

As we just saw, the nested type is a way for a document to embed one or more “sub-documents”. But with nested documents, each sub-document lives inside its parent. In other words, you cannot change one of the sub-documents without reindexing the parent. Also, if you have a large number of sub-documents, it might become tricky to add, update or delete them.

Here comes the parent-child relationship. With this relationship, a document is bound to another, but it still remains a single entity.

Pay attention though: if the parent document has routing defined in its mapping, the child has to follow the same routing. Indeed, Elasticsearch maintains a map of parent-child relationships to make searches faster, which requires the parents and the children to be indexed on the same shard. This means that if you try to index an orphan child document, Elasticsearch will require you to specify the routing value.

Define the mapping

The first thing to do when we want to introduce a parent-child relationship is to write the mapping of both the parent and the child. In our case, the parent (the character type) is already defined, so we just have to define the child's mapping.

For example, let’s say that we want to store the animals that accompany our characters. We would store them in the same index, game_of_thrones, but under the animal type. What we need to do is to update our mapping to include the new type, and define the relationship between character and animal:

Parent / Child mapping
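A sketch of what the animal part of the mapping could look like (the dynamic value and the field types are assumptions):

"animal": {
  "_parent": {
    "type": "character"
  },
  "dynamic": "strict",
  "properties": {
    "type": { "type": "string" },
    "name": { "type": "string" }
  }
}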

As you can see, alongside the mapping of the character type, I added a new type, which is animal. I would like to draw your attention to the "_parent" field: I set the "type" field inside it to "character", which tells Elasticsearch that the parent type for animal is character.

You should already know the other fields, such as "dynamic" and "properties", so I will not say anything more about them here.

The animal type will carry two fields:

  • type which represents the type of the animal
  • name which represents the name of the animal

Index a child

Well, now that the mappings are defined, let’s index some children. For my example, I will talk about two famous animals in Game of Thrones: Nymeria, Arya’s direwolf, and Ghost, Jon’s direwolf.

To index a child and build the relationship with its parent, you need to specify its parent’s ID in the POST request:

Request type is POST

http://localhost:9200/index/type/?parent=ID

The parent parameter specifies the parent's ID.

The tricky case of routing

Previously, we activated routing on our game_of_thrones character type, using house as the routing value. And you may remember that I told you that when it comes to parent-child relationships and routing, things might get delicate. Indeed, both parents and children have to be stored on the same shard. When default routing is used, everything is simple, because Elasticsearch uses a hash of the parent's ID to build the routing value, resulting in the children and the parents being stored on the same shard.

Because we defined house as the routing value, we need to use it when indexing children, by specifying the routing parameter in the request's URL.

Now, let’s index Arya and Jon’s direwolves:
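Here is a sketch of what these two indexing requests could look like (the routing value "Stark" assumes that both parent characters have their house field set to "Stark" in our dataset, and the animal fields are the type and name defined above):

curl -XPOST 'http://localhost:9200/game_of_thrones/animal/?parent=Arya%20Stark&routing=Stark' -d '{
  "type": "direwolf",
  "name": "Nymeria"
}'

curl -XPOST 'http://localhost:9200/game_of_thrones/animal/?parent=Jon%20Snow&routing=Stark' -d '{
  "type": "direwolf",
  "name": "Ghost"
}'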

You may have noticed that I’ve used %20 in the URL. That's because the IDs of the parent documents are the characters' names. Thus, there is a space between the first and last name, which is represented by %20 in the URL.

Query a document with children

Elasticsearch provides two special filters: has_child, to query a document by searching data in its children, and has_parent, to query a document by searching data in its parent. Note that while has_child gives you the parent documents, has_parent gives you the child documents.

Retrieve the parent document

Let’s start by retrieving the parent document, with the has_child filter.

As a single type of parent document can have several types of child documents, the has_child query has to know which type of child you are interested in searching data in.

For example, let’s say that we want to get the character whose animal's name is Nymeria (so, we are looking for Arya Stark).

Our query will be the following:

Retrieving document with child properties
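A sketch of this query (the term value is lowercased because term queries are not analyzed, assuming the name field uses the default analyzer):

{
  "query": {
    "has_child": {
      "type": "animal",
      "query": {
        "term": {
          "name": "nymeria"
        }
      }
    }
  }
}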

It is a simple query, with the has_child filter applied on a term query based on the name field of the animal type.

The query is available at queries/DSL/query_has_child.json, let's run it:
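Assuming you run it against the game_of_thrones index of the Docker cluster, the command would be along these lines (the other query files mentioned in this article can be run the same way):

curl -XGET 'http://localhost:9200/game_of_thrones/character/_search' -d @queries/DSL/query_has_child.json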

As the answer, we got the following JSON:

As you can see, we successfully retrieved Arya Stark from her direwolf, Nymeria.

Retrieve the child document

On the other hand, we can retrieve the child documents by querying the parent’s data. For example, let’s say that we want to retrieve every animal that is a child of a character document whose house field is "Stark".

Retrieving a child document with parent’s properties
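A sketch of the corresponding query (has_parent expects the parent type in the parent_type field; the term value is lowercased for the same reason as before):

{
  "query": {
    "has_parent": {
      "parent_type": "character",
      "query": {
        "term": {
          "house": "stark"
        }
      }
    }
  }
}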

You should recognize that the structure is the same as the previous query, except that the filter we used is has_parent. We set the parent type to character, and the query is still a simple term query on the house field.

Let’s run the query:

As the answer, we got the following JSON:

We got two documents, corresponding to the two animal documents we indexed earlier. Indeed, both of them belong to a character document whose house field is set to "Stark".

The Scoring

Great. With everything we talked about, I think you are ready to start using Elasticsearch. However, using only the basic features of Elasticsearch might not be enough.

That’s why I am going to talk a bit about the scoring. Remember, we already talked about it in the very first article, and also in the second one. Scoring is the way Elasticsearch (and thus Apache Lucene) determines the relevance of a document against a given query.

If your goal is to use Elasticsearch at the very best of its capabilities, then you should know about scoring. I am now going to talk about the Apache Lucene scoring mechanism and the TF/IDF algorithm (though we already did talk about it).

Scoring factors

So the question is simple: How does Elasticsearch (Apache Lucene) calculate the score of a document against a query?

Well, there are a lot of factors that influence the final score. The score depends on the document, but also on the query (and so, comparing document scores across different queries doesn’t make much sense).

Before I go over the factors, I want you to know that most of what follows was not covered in my previous articles.

Last warning: this part will be a bit mathematical and theoretical; however, it is totally fine if you don’t perfectly understand everything I go through now. It took me some time to understand it well too.

Inverse Document Frequency

Inverse Document Frequency first. Maybe you remember that I’ve already talked about it in the first article. The inverse document frequency, IDF for short, is a per-term formula that gives a measure of how rare a term is. The higher the IDF, the rarer the term is across the documents.

Let’s have a look at the formula:

$$idf(i) = \log\left(\frac{|D|}{|\{d : i \in d\}|}\right)$$

Here, i is the term, |D| represents the total number of documents for a given type, and |{d : i ∈ d}| simply is a complicated way to represent the number of documents in which the term appears.

But this is only the theoretical formula. In practice, this formula has a weakness: what happens when |{d : i ∈ d}| = 0?

In other words, what if the term doesn’t appear in any document? It would result in dividing by zero, and this is simply… not possible.

So in practice, we add 1 to this value. The final formula is:

$$idf(i) = \log\left(\frac{|D|}{1 + |\{d : i \in d\}|}\right)$$

Oh… I see you! You’d like an example! Well, I’m in a good mood today, so let’s go!

Let’s consider the 3 following documents:

  • Document 1: “Hello, my name is Arya”
  • Document 2: “Robb is part of the Stark family”
  • Document 3: “The Stark family is not really lucky…”

Also, we will consider that a term is a word. For example, “Hello” is a term. We want to calculate the IDF value of the term “Arya” against the 3 documents.

In this case, |D| is 3 (indeed, we have 3 documents).

Also, |{d : i ∈ d}| is equal to 1, because the term “Arya” can be found in only 1 document (Document 1).

So, the IDF value for the term “Arya” against these documents is:

$$idf(\text{Arya}) = \log\left(\frac{3}{1 + 1}\right) = \log(1.5)$$

And that’s it! Quite simple, isn’t it?

Term Frequency

The IDF by itself is not enough. Indeed, IDF gives a score for a term against all documents, so the score is not relevant without moderation. Here comes the term frequency (TF). That’s why we talk about TF/IDF, and not only TF or IDF.

The term frequency of a term, as its name suggests, is the frequency of the term in a given document. Some scientists invented a complicated formula to describe it, which is:

$$tf(i, d) = \frac{n_{i,d}}{\sum_{k} n_{k,d}}$$

Behind this complicated formula, it is simply a frequency calculation: the number of times the term appears in the document, divided by the total number of terms in the document.

Here, i is the term and d the document. So, n(i,d) is the number of times the term appears in the document, and the sum of n(k,d) over all terms k is the sum of occurrences of every single term in the document (thus, the total number of terms in the document).

Let’s come back to the 3 documents we used to calculate the IDF. Our query is still “Arya”.

  • Document 1: “Hello, my name is Arya”: tf = 1/5 = 0.2 (we don’t consider a comma as a term)
  • Document 2: “Robb is part of the Stark family”: tf = 0/7 = 0
  • Document 3: “The Stark family is not really lucky…”: tf = 0/7 = 0

From there, we can calculate the TF/IDF for each document, as the TF/IDF is simply the product tf(i, d) × idf(i):

  • Document 1: “Hello, my name is Arya”: 0.2 × log(1.5)
  • Document 2: “Robb is part of the Stark family”: 0 × log(1.5) = 0
  • Document 3: “The Stark family is not really lucky…”: 0 × log(1.5) = 0

Document Boost

Document Boost is something that I have never talked about. This is an artificial way to influence the scoring value for a document. The Document Boost is simply a boost value that can be given to a document during indexing.

Field Boost

In the same idea as the Document Boost, the Field Boost is a boost value that can be given to a specific field during indexing.

Coordination Factor

The coordination factor is quite simple: The more searched terms the document contains, the higher the coordination factor is. Without the coordination factor, the combined weight value of the matching terms in a document would evolve in a linear way. Using the coordination factor, the weight value is being multiplied by the number of matching terms in the document, and then divided by the number of terms in the query. Let’s reconsider our documents. Let’s imagine that our query would be “Arya Stark family”. Also, we consider each term has a weight of 1.

  • Document 1: “Hello, my name is Arya”
  • Document 2: “Robb is part of the Stark family”
  • Document 3: “The Stark family is not really lucky…”

Without the coordination factor, the weight scores would be:

  • Document 1: “Hello, my name is Arya”: Weight Score = 1
  • Document 2: “Robb is part of the Stark family”: Weight Score = 2
  • Document 3: “The Stark family is not really lucky…”: Weight Score = 2

As you can see, the weight score is just the sum of the weights of the query terms present in the document.

Now, with the coordination factor:

  • Document 1: “Hello, my name is Arya”: Weight Score = 1 × 1/3 ≈ 0.33
  • Document 2: “Robb is part of the Stark family”: Weight Score = 2 × 2/3 ≈ 1.33
  • Document 3: “The Stark family is not really lucky…”: Weight Score = 2 × 2/3 ≈ 1.33

As you can see, the evolution of the score is not linear anymore. Indeed, Document 2 has a score of about 1.33 because it contains 2 of the 3 terms of the query, while Document 1 has a score of about 0.33 because it contains only one term of the query.

Query Normalization Factor

As I said, it makes no sense to compare the score of a document against a given query with the score of the same document against another query.

But, the query normalization factor is an attempt from Elasticsearch to “normalize” a query, so that the score of a given document can be compared against different queries.

Be careful though: the query normalization factor is not perfectly reliable, and you still have to be very careful when comparing the score of a document across different queries.

As it is not very important, I won’t talk about it here.

Field-length norm

Basically, it is the length of the field we are searching in. The shorter the field, the higher the weight. In other words, a term found in a field with small length will be given more weight than the same term found in a longer field.

The calculation is quite simple:

$$norm(d) = \frac{1}{\sqrt{\sum_{k} n_{k,d}}}$$

Here, d is the document, and the sum under the square root is the total number of terms in the document.

As you can see, the calculation doesn’t depend on the term, but on the document (on the field, actually).

Let’s calculate the field-length norm for our 3 documents:

  • Document 1: “Hello, my name is Arya”: 1/√5 ≈ 0.45
  • Document 2: “Robb is part of the Stark family”: 1/√7 ≈ 0.38
  • Document 3: “The Stark family is not really lucky…”: 1/√7 ≈ 0.38

As you can notice, the more terms in the document, the lower the field-length norm.

The final scoring function

With all of this, we can finally build the scoring function, which is:

$$score(q, d) = queryNorm(q) \times coord(q, d) \times \sum_{t \in q} \left( tf(t, d) \times idf(t)^2 \times boost(t) \times norm(t, d) \right)$$

Basically, the function is not really complicated: it takes the scoring factors that are specific to the document and the query (the query normalization factor and the coordination factor) and multiplies them to get a coefficient. This coefficient is then applied to the sum, over the query terms, of the product of the scoring factors specific to each term (TF, IDF, boost and field-length norm).

Compound Queries

Well, after this little theoretical part, it is time for us to get back to the essence of Elasticsearch: full-text search.

I would like to introduce to you the compound queries. In the previous article, we went through some basic queries available in Elasticsearch, such as the term query or the match query.

But what if we want to combine multiple queries, to perform more precise searches?

The boolean query

The first compound query I want to introduce is the boolean query. The boolean query allows you to combine multiple queries using boolean logic.

With the three keywords should, must and must_not, you will be able to define rules to include or exclude a document from the results.

Each of the keywords may be present multiple times in a single query, as the query is processed as a stream: Each keyword is applied on the result of the previous keyword.

Let me give you a simple example: let’s say that we want to retrieve characters that match the term "Stark" in their biography field (a match query), but whose age field must not be between 20 and 30.

Let’s take a look at the request:

Boolean Range request
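A sketch of this boolean query (the field names biography and age follow the example above):

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "biography": "Stark"
        }
      },
      "must_not": {
        "range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}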

As you can see, I indicated the “bool” type for the query, followed by as many boolean clauses as I want. My first clause is the match query I talked about before, matching the term “Stark” on the biography field. Then, I indicated a must_not clause, and I used a new query type: the range query, which allows us to specify a range to match (the range can be applied to integer or date fields, but also to string fields).

The query is available at queries/DSL/compound_query_bool.json, let's run it:

And, the result of the query, once executed, is the following:

I got 6 results. You can notice that the results don’t include the character “Robb Stark”, as his age in our dataset is set to 22 (so he is excluded by the must_not clause).

The Function Score Query

The Function Score Query is one of the most powerful compound queries in Elasticsearch. Basically, it allows you to define a new scoring function! Isn’t it amazing?

As we saw right before, the relevance scoring function of Elasticsearch is a mix between a lot of mathematical coefficients.

But, what if the default relevance scoring function is not relevant to you? What if you’d like to define your own relevance scoring function? Well, that is possible!

The Function Score Query is one of the most complex queries in Elasticsearch, as it is really complete and allows you to manipulate the score in a lot of different ways.

To perfectly understand how it works, we should first take a look at the query’s structure:
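Schematically, a function_score query looks like the following skeleton (the ellipses stand for your own content):

{
  "query": {
    "function_score": {
      "query": { ... },
      "boost": "...",
      "functions": [
        { ... }
      ],
      "max_boost": ...,
      "score_mode": "...",
      "boost_mode": "...",
      "min_score": ...
    }
  }
}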

So, several points to be discussed:

  • the query field is the place where you define your query.
  • the boost field is the place where you define a boost value that will be applied to the whole query.
  • the functions field is the place where you will define the functions that calculate the score. As you may have noticed, this field is an array, which means that you can define one or more relevance scoring functions.
  • the boost_mode field defines the type of boost mode (multiply, replace, ...) that you will use.
  • the max_boost is an optional field, and as its name suggests, it allows you to define the maximum boost the functions can produce.
  • the score_mode is an optional field that allows you to define the score mode (max, multiply, ...).
  • the min_score is an optional field that allows you to define the minimal score.

As I told you, several relevance scoring functions can be defined, and you can choose which one to apply according to a given filter.

The possible scoring functions

As there are five different scoring functions available, I am not going to introduce them all. I will just go through one of them: the script_score function.

The script_score function

The first scoring function I am introducing is the script_score function. It allows you to manipulate the score with scripting.

If you don’t remember scripting well, I talked a bit about it in the second article, here.

When defining a script_score function, different variables will be available to you:

  • _score is the score calculated with the default algorithm of Elasticsearch (Apache Lucene)
  • doc allows you to access the document's fields.

Also, you will be able to define some parameters, with the params field.

Let’s have a look at the structure:

Scripting a scoring function
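Inside the functions array, a script_score entry could look like this sketch (the lang and params fields are optional):

"functions": [
  {
    "script_score": {
      "script": "...",
      "lang": "groovy",
      "params": { ... }
    }
  }
]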

Be careful though: the result of the script is not going to be the final score of the document by default. If you want this score to be the final score, you’ll have to set the boost_mode to "replace".

Let’s take an example: I want the final score of my query to be the age of my character divided by two. Quite simple. My query will be a match query on the house field, for "Lannister".

The query would then be the following:
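Here is a sketch of the full query, matching the description that follows:

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "house": "Lannister"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "doc['age'].value / 2"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}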

Well, as you can see, it is quite simple. Under the query field, I defined a match query on the house field of the document. Then, inside the functions array, I added an object that contains my script_score script. This script takes the value of the field age of the document, and divides it by two. Finally, I set the boost_mode to "replace", so that the final score is the score calculated by my script.

The query is available at queries/DSL/compound_query_score_script.json, let's run it:

As you can notice, the score of each document is equal to half the age of the corresponding character.

Scripting

For the last part of this article, I decided to talk about scripting. Indeed, Elasticsearch comes with a very interesting scripting feature that allows you to return custom values, or to perform some operations such as custom scoring.

If you remember well, we already talked about scripting in the previous article, where I introduced a way to perform dynamic calculation of a virtual field.

Until Elasticsearch 1.3, the scripting language used was MVEL (an expression language for the Java platform), but Groovy is now used. Other languages can also be used, such as Mustache, JavaScript or Python (the last two require plugins to work).

If you are working with the Docker cluster I provided, the scripting functions are already enabled in the elasticsearch.yml file. Otherwise, you will have to enable them with the following configuration line:

script.engine.groovy.inline.search: on

There are three ways to load a script in an Elasticsearch query:

  • Using inline script by inserting the script line directly into the query
  • Using a file that contains the script, and indicating its name to the query
  • Using a special index named .scripts

As I already introduced the first way to perform scripting, I will now take a quick tour of the other two methods.

The sandboxed environment

When talking about data warehouses, or more generally databases, security is a crucial point. Yet, scripting introduces a lot of security concerns, as it allows operations to be performed that are outside the control of Elasticsearch.

What if you are using a language that contains a huge security flaw, allowing anybody to easily take control of your cluster?

That’s why the Elasticsearch scripting feature only works with sandboxed languages. A sandbox is a special environment used to run untrusted scripts, so that their scope is limited.

Scripting with file

When you need to use the same script at different points of your application, in different Elasticsearch queries, it might be useful to store your script in a single file.

Script files have to be stored in a specific folder: config/scripts in the Elasticsearch directory.

I will use a very simple example. We just took a look at custom scoring functions, and especially the function_score query. We used an inline script under the script_score field. We can use a file instead of this inline script.

The script has to be written in a file (we will name it score.groovy) located in the config/scripts directory:

Scripting from within a file
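The content of score.groovy is simply the Groovy expression we used earlier:

doc['age'].value / 2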

Then, we can run our query by indicating the script_file field (simply filled with the script filename):
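The functions part of the previous query then becomes the following sketch (note that the filename is given without its .groovy extension):

"functions": [
  {
    "script_score": {
      "script_file": "score"
    }
  }
]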

The query is available at queries/DSL/compound_query_score_scripting_file.json, let's run it:

The result is the same as the one above, when we executed the query with the inline script.

Also, if you are using a language other than Groovy, you can indicate the name of this language in the lang field, under the script_score field. If you need to introduce some params, the params field can contain an object in which each field is a param name and the corresponding value is the param value.

Scripting with index

As I said, you can store scripts directly into a dedicated index named .scripts. Thus, there is a special REST endpoint to manage the scripts, which is _scripts. A script is identified by its ID, and stored under a specific lang:

http://localhost:9200/_scripts/lang/ID

If you remember well, we disabled automatic index creation in our elasticsearch.yml file. As a consequence, the special .scripts index cannot be created automatically, so we need to do it manually:
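Creating it is nothing more than a standard index creation request:

curl -XPUT 'http://localhost:9200/.scripts'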

The previous line creates the .scripts index used by Elasticsearch to index scripts. We can now work with indexed scripts.

For example, I want to store the previous script. The language I used is Groovy, and I will give it the ID score.

I stored the script in a file named script.json, available at queries/DSL/script.json.

The request to index the script is the following:
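A sketch of this request (assuming script.json wraps the Groovy expression in a script field, which is the format expected by the _scripts endpoint):

curl -XPOST 'http://localhost:9200/_scripts/groovy/score' -d @queries/DSL/script.json

where queries/DSL/script.json would contain something like:

{
  "script": "doc['age'].value / 2"
}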

The response returned by the cluster is the same as if we were indexing an ordinary document:

A simple GET request will return our script:

Additional information is present, such as the language in which the script is written.

Once the script is indexed in the cluster, we can run the previous query by using script_id instead of script_file, providing the ID we gave to the script (score). We should also fill the lang field with the language of our script (groovy).
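The functions entry would then look like this sketch:

"functions": [
  {
    "script_score": {
      "script_id": "score",
      "lang": "groovy"
    }
  }
]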

Scripting security

Scripting is a powerful feature of Elasticsearch. But with great power comes great responsibility. Indeed, scripting is powerful, but even sandboxed environment cannot stop all attempts to attack a cluster.

If you are concerned with security in Elasticsearch (which is essential if you plan to run Elasticsearch in production), a good start is this article on Elastic’s blog.

On the other hand, if your interest is more about security research, a good start would be to look at some pull requests done on the Metasploit framework, like this one.

As I am also interested in security, let’s have a bit of fun and look at how one of these exploits works.

First of all, as described in the pull request, this security breach on Elasticsearch has a CVE (Common Vulnerabilities and Exposures) code, which is CVE-2015–1427.

The CVE database has a page on which we can read the complete description of this breach: here.

The description is as follow:

The Groovy scripting engine in Elasticsearch before 1.3.8 and 1.4.x before 1.4.3 allows remote attackers to bypass the sandbox protection mechanism and execute arbitrary shell commands via a crafted script.

Well, the good news for us, as Elasticsearch users, is that this breach has been fixed, as the affected Elasticsearch versions are the ones below 1.4.3.

As you can also see, the vulnerability has been recognized by Elasticsearch, and can be found on the list on the official website, here.

This security breach is a good example showing that a sandboxed environment doesn’t protect your cluster from everything. Like any software or application, a sandboxed environment may contain a security flaw.

This breach is related to the Groovy script sandboxed environment, which contains a vulnerability that allows an attacker to execute shell commands on your cluster. Even if the shell commands are executed as the same user running Elasticsearch, an attacker may use other exploits to perform privilege escalation and get root privileges.

The related topic on PacketstormSecurity shows a Python script that runs the famous script:

As you can see, this is a simple request, coming along with a script_fields object that describes a field named lupin. The content of this field is the malicious code:

java.lang.Math.class.forName("java.lang.Runtime").getRuntime().exec("ls -l").getText()

The principle is simple (even if it may change a bit according to the host operating system): the script (written in Java) gets the runtime instance of the JVM and performs a simple exec() on it, which executes shell commands on the host.

I simply put an ls -l, which lists the content of the current directory, but you can imagine more complex operations, such as downloading a script from a remote server that would perform privilege escalation, or opening a backdoor on the host system.

Conclusion

Routing, relationships, scoring theory, compound queries and scripting are some advanced features of Elasticsearch. All of them demonstrate the pliability of Elasticsearch, and its capacity to respond to a lot of different use cases.
