What happens when a document is indexed in Elasticsearch?

Phase 02 — indexing, mapping and analysis — Blog 06

Arun Mohan

Published in elasticsearch, Dec 9, 2017

Introduction

With this blog, we enter phase 02 of this blog series, named “indexing, analysis and mapping”. In this phase you will learn in more detail about document indexing and the internal processes involved in it, such as analysis and mapping. This short first blog of the phase 02 series introduces the general process that takes place when a document is indexed in Elasticsearch.

Indexing a document and beyond

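Let us index a document like the one below into Elasticsearch:

# The Content-Type header is required from Elasticsearch 6.0 onwards
curl -XPUT 'localhost:9200/testindex0201/testtype/1' -H 'Content-Type: application/json' -d '{
  "name": "Arun Mohan",
  "age": 31
}'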

Now we can open the head plugin (more about the head plugin here) and view the indexed document in the index named “testindex0201”. So we indexed the document in Elasticsearch in a snap, but this simple process raises quite a few questions. Some of them are:

Where does the document I just created reside on the disk?
If it resides on the disk, can I change the location?
Does Elasticsearch store the data as it is, or does it modify the data during the indexing process?
How is searching on documents so fast in Elasticsearch?

Let us see the answers to these questions in the coming sections.

1. Where do documents get stored on the disk?

In this blog we are talking about the .deb package installation of Elasticsearch. The default data path for Elasticsearch is the folder “/var/lib/elasticsearch/”. One thing to note here is that, before the 5.x versions, a folder named after the cluster to which the node belongs was created under this path, and the data was stored inside it. From 5.x this was deprecated and the cluster name is no longer part of the path. So in effect, the data paths for the versions before 5.x and for 5.x would look like below:

Before 5.x: /var/lib/elasticsearch/<cluster name>/nodes/
From 5.x: /var/lib/elasticsearch/nodes/

2. Changing the location of the data

We have seen where on the disk Elasticsearch stores the data. This brings up a basic question: can we change the location of the data? The need for such customisation arises in situations where the default data path has too little storage and we decide to mount another volume exclusively for data storage.

Elasticsearch does provide the option of customising the data path. It can be configured in the elasticsearch.yml file located at

/etc/elasticsearch/elasticsearch.yml

Here the custom path is to be set as the value of the “path.data” setting.

3. Data indexing process

The below diagram shows a high level flow of the data indexing process in Elasticsearch.

As you can see from the above picture, the document is not indexed as such in Elasticsearch. Rather, the Analyzer component performs some actions on it and splits it into tokens/terms. These terms are then stored on the disk as the inverted index. So let us briefly go through the Analyzer part of the above diagram (we will revisit Analyzers in detail in the next blog).

To start on Analyzers, it is good to have a small recap of what the input documents look like. Documents are JSON objects, each having its own set of key value pairs. In the example given above, we have two keys named “name” and “age”, along with their values. When a document is to be indexed in Elasticsearch, the Analyzer takes the text value of each key and splits it at certain delimiters (there are default delimiters such as the space, the full stop etc). The outputs of this splitting are called tokens. Certain filters are then applied to each token (the standard filtering procedure includes lower casing all the split tokens). So in effect, the value of a key becomes a series of tokens once the Analyzer is done analysing it. These tokens, after analysis, are called terms. The terms are then stored against that field (key) in an inverted index. Let us see what this inverted index is in detail in the coming section.
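The two analysis steps described above (splitting into tokens, then filtering) can be sketched in a few lines of Python. This is only a rough illustration and not how Elasticsearch implements its analyzers: the function name analyze is hypothetical, and the splitting rule (every run of word characters is one token) is a simplifying assumption.

```python
import re


def analyze(text):
    """Turn a field value into a list of terms (simplified sketch)."""
    # Tokenise: every run of word characters becomes one token, so spaces,
    # full stops and other punctuation act as delimiters.
    tokens = re.findall(r"\w+", text)
    # Token filter: lower case every token. The filtered tokens are the terms.
    return [token.lower() for token in tokens]


terms = analyze("Arun Mohan")
print(terms)  # ['arun', 'mohan']
```

Here the value of the “name” field from our example document ends up as the two terms “arun” and “mohan”.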

4. Elasticsearch speed and Inverted Index

As discussed in the above section, the “terms” generated by the Analyzer are sent to an inverted index. Now it is time to have a detailed look at this term, the “inverted index”.

The inverted index is the primary reason for the robustness and speed of search in Elasticsearch. It is best explained with an example. Consider two documents like below:

Document 1

{
  "name": "this is a cat"
}

Document 2

{
  "name": "there is a cat and a dog"
}

After analysis, the terms in the documents would be as follows:

Document 1

Field     Terms
"name"    "this", "is", "a", "cat"

Document 2

Field     Terms
"name"    "there", "is", "a", "cat", "and", "dog"

Now let us combine the above two tables of terms for the field “name” into a single one like below:

Terms     Documents
this      1
is        1, 2
a         1, 2
cat       1, 2
there     2
and       2
dog       2

The above table is called the inverted index for the field “name”. The advantage of using an inverted index is that the search term can be looked up in the “terms” column and, if there is a match, finding the documents where the search term is present is a simple look up in the corresponding “documents” column. For example, say there are one million documents. In the conventional method, we have to traverse each document, and the value of each field in it, to retrieve the matching search results. With an inverted index, we only search a list of terms, and since there is no duplication of terms, once we find a match we look up which documents contain the term in the “documents” column and return these documents as the result. Thus a lot of search time is saved compared to the conventional method.
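The idea can be tried out with a small Python sketch built from the two example documents. It is a toy model under simplifying assumptions, not the on-disk structure Elasticsearch uses: the names analyze, build_inverted_index and search are hypothetical, and the analysis step only lower cases the text and splits it on whitespace.

```python
from collections import defaultdict


def analyze(text):
    # Simplified analysis: lower case the text and split it on whitespace.
    return text.lower().split()


def build_inverted_index(docs, field):
    """Map each term of the given field to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index[term].add(doc_id)
    return index


def search(index, term):
    # A single look up in the "terms" column, no scan over the documents.
    return sorted(index.get(term.lower(), set()))


docs = {
    1: {"name": "this is a cat"},
    2: {"name": "there is a cat and a dog"},
}
name_index = build_inverted_index(docs, "name")

print(search(name_index, "cat"))  # [1, 2]
print(search(name_index, "dog"))  # [2]
```

Note that the term “a” appears twice in document 2 but is stored only once in the index, which is why the list of terms stays short.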

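Note that an inverted index is built per field and not per document. As in the example above, the terms of all the documents are combined into a single inverted index for the field “name”. So if a data set contains n documents and the number of fields in these documents is m, then conceptually there are m inverted indices for the data set, one per field, each covering all n documents.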

An inverted index can be further optimised by leaving out common stopwords (like “the”, “is” etc) when the analyzer is configured to remove them, which makes the list of terms shorter. This makes the search even faster.
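As a rough illustration of stopword removal, the sketch below drops a few common words from a list of terms before they would be added to the inverted index. The stopword list and the function name remove_stopwords are assumptions made up for this example. In Elasticsearch, whether stopwords are removed, and which ones, depends on how the analyzer is configured.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "is", "the"}


def remove_stopwords(terms, stopwords=STOPWORDS):
    """Keep only the terms that are not in the stopword list."""
    return [term for term in terms if term not in stopwords]


terms = "there is a cat and a dog".split()
kept = remove_stopwords(terms)
print(kept)  # ['there', 'cat', 'dog']
```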

Conclusion

In this blog, I have provided a brief picture of the indexing process in Elasticsearch. In the next blog of the phase two series, we will look in greater depth at the analysis process in Elasticsearch. We will see the components of Analyzers, the types of Analyzers and tokenisers, and much more, with practical examples.