Elasticsearch and its internal workings

Shivanshu Goyal · Published in Geek Culture · Jun 21, 2020 · 8 min read

The key to Elasticsearch is the inverted index, and it is what sets Elasticsearch apart from traditional database systems and other NoSQL stores such as MongoDB and Cassandra. All data in Elasticsearch is stored internally in Apache Lucene as an inverted index. Although the data lives in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs.
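
To make the idea concrete, here is a toy sketch in plain Python (not Elasticsearch code) of what an inverted index is: a map from each term to the documents, and the positions within them, where that term occurs.

```python
from collections import defaultdict

# Toy corpus: document id -> text
docs = {
    1: "the old library keeps rare books",
    2: "new books arrive at the library weekly",
}

# Inverted index: term -> {doc_id: [positions]}
inverted_index = defaultdict(lambda: defaultdict(list))

for doc_id, text in docs.items():
    for position, term in enumerate(text.lower().split()):
        inverted_index[term][doc_id].append(position)

# Which documents contain "books", and where?
print({d: p for d, p in inverted_index["books"].items()})  # {1: [5], 2: [1]}
```

Storing positions alongside document ids is what later makes phrase queries possible, a point the write path section returns to.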

Basic terminology:

  1. Fields: Fields are the smallest individual unit of data in Elasticsearch. For example: title, author, date, summary, team, score, etc.
  2. Documents: Documents are JSON objects that are stored within an Elasticsearch index and are considered the base unit of storage. In the world of relational databases, documents can be compared to a row in a table.
  3. Mappings: A mapping is like a schema in the world of relational databases: it defines the fields of a document and how each one is indexed. (Mapping types, which allowed multiple document types per index, were deprecated in Elasticsearch 7.x.)
  4. Index: These are the largest units of data in Elasticsearch. They are analogous to databases in the relational world and can be thought of as logical partitions of documents.
  5. Shards: These are Lucene indices, and sharding is the key to scaling Elasticsearch: an index is split into multiple partitions, and each partition can reside on a different node for better availability and scalability.
  6. Segments: A shard is further divided into multiple segments. Each segment is an inverted index that stores the actual data, and a segment is immutable. Segments of similar size are periodically merged into a larger segment to keep searches efficient. This whole process is completely transparent to users and handled automatically by Elasticsearch.
  7. Replicas: Replicas, as the name implies, are Elasticsearch fail-safe mechanisms and are basically copies of your index’s shards. This is a useful backup system for a rainy day — or, in other words, when a node crashes. Replicas also serve read requests, so adding replicas can help to increase search performance.
  8. Analyzers: Analyzers are also key to an optimized index and efficient search. An analyzer is applied while indexing a document: it breaks text down into its constituent terms. The standard analyzer is the default analyzer in Elasticsearch; it uses grammar-based tokenization and lowercases terms, and it can be configured to remove common English stop words.
  9. Nodes: Each instance of Elasticsearch is a node. If one server runs 2 instances of Elasticsearch, there are 2 nodes. A node has the crucial task of storing and indexing data. There are different kinds of nodes in Elasticsearch:

a) Data node- stores data and executes data-related operations such as search and aggregation.

b) Master node- in charge of cluster-wide management and configuration actions such as adding and removing nodes.

c) Client node- forwards cluster requests to the master node and data-related requests to data nodes.

d) Ingest node- used for pre-processing documents before indexing by running ingest pipelines; this can offload work that might otherwise be done in Logstash. Each node is uniquely identified by its name (or id), and by default every node is eligible to become a master node.

10. Cluster: A cluster comprises one or more Elasticsearch nodes. Each cluster has a unique name, and every node that wants to join the cluster must use it.

11. Translog: Lucene commits are too expensive to perform on every individual change, so each shard copy also writes operations into its transaction log, known as the translog. Translog entries are only cleared once the data has been persisted to disk by a Lucene commit; any operation written since the last translog fsync can be lost in case of a JVM failure or shard crash.

a) index.translog.sync_interval- How often the translog is fsynced to disk and committed, regardless of write operations. Defaults to 5s. Values less than 100ms are not allowed.

b) index.translog.durability- Whether or not to fsync and commit the translog after every index, delete, update, or bulk request. This setting accepts the following parameters: (i) request (ii) async

c) index.translog.flush_threshold_size- once the translog reaches this size, a flush is triggered. Defaults to 512mb (a settings sketch follows this list).

12. Flush: This process performs a Lucene commit, persisting the in-memory data to disk and clearing the committed operations from the translog.
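
As a rough illustration of the translog settings above, they can be set per index. The sketch below uses Python’s requests library against a hypothetical local cluster; the host and index name are placeholders, and the values are simply the documented defaults, with durability switched to async for illustration.

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

settings = {
    "settings": {
        "index.translog.durability": "async",          # fsync on a timer instead of per request
        "index.translog.sync_interval": "5s",          # how often the translog is fsynced
        "index.translog.flush_threshold_size": "512mb" # flush once the translog grows past this
    }
}
print(requests.put(f"{ES}/myindex", json=settings).json())

# An explicit flush commits in-memory data to disk and clears the committed translog entries.
requests.post(f"{ES}/myindex/_flush")
```

With async durability, operations acknowledged since the last fsync can be lost on a crash, which is exactly the trade-off described in point 11.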

Let’s talk about the relation between node, index, and shard.

Let’s create an Elasticsearch cluster with 3 nodes (or instances) and an index ‘MyIndex’ with 6 shards and 1 replica (a creation sketch follows this walkthrough).

An Index is distributed into 3 nodes with 6 shards and 1 replica

Let’s add 3 more nodes to the cluster. Now each node holds one primary shard and one replica. This rebalancing is completely transparent to users and is handled by Elasticsearch.

The Index is distributed into 6 nodes

Now, if we add 6 more nodes to the cluster, each node will hold only one shard, either a primary or a replica.
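
The walkthrough above can be sketched with Python’s requests library against a hypothetical local cluster: create the index with 6 primary shards and 1 replica, then ask the _cat API where the shard copies landed. (Index names must be lowercase, so ‘MyIndex’ becomes ‘myindex’ here.)

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

body = {
    "settings": {
        "number_of_shards": 6,    # primary shards; fixed once the index is created
        "number_of_replicas": 1   # one replica per primary; can be changed later
    }
}
print(requests.put(f"{ES}/myindex", json=body).json())

# Show which node each primary (p) and replica (r) shard copy was allocated to.
print(requests.get(f"{ES}/_cat/shards/myindex?v").text)
```

Re-running the _cat/shards call after nodes join or leave shows the automatic rebalancing described above.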

Distributed search

Elasticsearch searches in parallel because the data for an index is distributed across multiple shards. The client (coordinating) node fans the request out to one copy of each shard of the index, then merges the per-shard results before returning them to the client. This distributed search is what keeps Elasticsearch fast even with billions of records in its data store (a minimal search sketch follows the diagram below).

Parallel requests on multiple shards of an index
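
A minimal search sketch, again with Python’s requests and placeholder index and field names: the client (coordinating) node that receives this request fans it out to one copy of each shard and merges the per-shard results into the final ranked list.

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

query = {
    "query": {"match": {"title": "inverted index"}},  # 'title' is a placeholder field
    "size": 10                                        # top documents to return after merging
}
resp = requests.post(f"{ES}/myindex/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```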

How does Elasticsearch handle failure?

Suppose there are 4 nodes in an Elasticsearch cluster. An index ‘MySecondIndex’ is created with 2 shards and 1 replica. Each node will have either one primary shard or one replica shard, as shown in the diagram below:

MySecondIndex has 2 primary shards with 1 replica each

Now, let’s say Node1 goes down because of a hardware failure. Elasticsearch promotes the replica of shard S1 on Node3 to primary, and a new replica is created on Node2 or Node4. If a primary fails while an indexing operation is in flight, the node hosting the primary informs the master, and the operation waits (up to 1 minute by default) for the master to promote one of the replicas to be the new primary. The master node coordinates all of these steps. After this process, the cluster will look like this:

The replica is promoted to primary shard on Node3 and S1 is re-replicated on Node2
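
One way to watch this failover from the outside is the cluster health API: while a replica is missing, the cluster reports yellow, and it returns to green once the copy has been rebuilt. A quick sketch, assuming a local cluster:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

health = requests.get(f"{ES}/_cluster/health").json()
# 'green'  -> all primary and replica shards allocated
# 'yellow' -> all primaries allocated, some replicas missing
# 'red'    -> at least one primary shard is unassigned
print(health["status"], health["unassigned_shards"])
```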

Write path

At the index level:

  1. A replication group is identified using routing logic (based on the hash of the document’s routing value, which is the document id by default); the primary shard is then identified within that replication group.
  2. The write operation is forwarded to the primary shard, which coordinates indexing the document end to end.
  3. The primary shard validates the content and structure of the document.
  4. Executes the operation locally.
  5. Forwards the operation to each replica in the current in-sync copies set. If there are multiple replicas, this is done in parallel.
  6. Once all replicas in the in-sync set have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client. Note: the wait_for_active_shards setting controls how many shard copies must be active before the operation proceeds; it accepts any positive integer up to the total number of copies (primary + replicas), or all. The default value is 1 (a sketch follows this list).
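
As a sketch of step 6, wait_for_active_shards can also be passed per request. With 2 copies of each shard (1 primary + 1 replica), requiring 2 means the primary only proceeds if a replica copy is active as well. The host, index, and document below are placeholders.

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

doc = {"title": "Elasticsearch internals", "score": 42}

# Require both the primary and one replica to be active before the operation proceeds.
resp = requests.put(
    f"{ES}/myindex/_doc/1",
    params={"wait_for_active_shards": 2},
    json=doc,
)
print(resp.json()["result"], resp.json()["_shards"])
```

Note that this is a pre-flight check made before the operation starts, not a guarantee that every copy acknowledged the write; the _shards section of the response reports how many copies actually succeeded.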

At the shard level:

  1. The operation is added to the shard’s translog; with index.translog.durability set to async, the client gets a successful acknowledgment without waiting for the translog to be fsynced.
  2. Periodically the translog is fsynced to disk (every 5s by default with async durability), and a refresh writes the buffered operations into a new immutable segment in the shard.
  3. Each segment is an inverted index: a map with the term as the key and the list of documents the term appears in as the value. Indexing a document involves multiple steps, and the analyzer plays a vital role in this process (see the analyzer sketch after this list):
  4. The document is split into tokens.
  5. All the tokens are lowercased. (‘Keep’ becomes ‘keep’, ‘TajMahal’ becomes ‘tajmahal’, etc)
  6. All stopwords are removed. (Ex: remove keywords like ‘a’, ‘the’, ‘to’, etc)
  7. Then, token normalization is done. (i.e. English stemming and synonyms process- ‘Rainfall’ becomes ‘rain’, ‘jumped’ becomes ‘jump’, ‘books’ becomes ‘book’, weekend and Sunday mean the same thing).
  8. If we just store the list of document ids as the value for each term, we won’t be able to serve phrase searches precisely (e.g. the phrase “books library”). To support this, Elasticsearch stores the term’s position in the document along with the document id.
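
The analysis steps above can be observed directly with the _analyze API. A sketch using the built-in english analyzer (which lowercases, drops stop words, and stems), against a hypothetical local cluster:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

body = {
    "analyzer": "english",
    "text": "The QUICK brown foxes jumped over the lazy dogs"
}
resp = requests.post(f"{ES}/_analyze", json=body)
print([t["token"] for t in resp.json()["tokens"]])
# roughly: ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog'] (exact tokens vary by version)
```

The lowercasing, stop-word removal, and stemming described above are all visible in the output.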

Read path

  1. If the request is a lookup of a single document, a replication group is identified using the same routing logic (based on the hash of the document’s routing value, the document id by default).
  2. Check the translog and return it from there if it is not committed to disk yet.
  3. Any shard copy (primary or replica) can serve a read request, which makes reads very fast; more replica shards can be added to serve reads more efficiently.
  4. All shards of the index are searched in parallel, and their results are collected in a priority queue of size N (defaults to 10), a data structure that keeps the best-matched documents from all shard results. The priority (or relevance) of a document is calculated with TF (term frequency) * IDF (inverse document frequency); a toy scoring sketch follows this list.
  5. Once the entire search is done, the documents from the priority queue are returned to the client.
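
A toy sketch of the scoring idea in plain Python: score each matching document with tf * idf and keep only the top N in a fixed-size priority queue (heapq), which is roughly what each shard does before the coordinating node merges results. (Newer Elasticsearch versions actually default to Lucene’s BM25 similarity; plain tf*idf is used here only because it is what is described above.)

```python
import heapq
import math

# Toy posting list: term -> {doc_id: term frequency in that document}
postings = {"rain": {1: 3, 2: 1, 3: 7}}
num_docs = 100   # total documents in the (toy) shard
N = 10           # size of the priority queue, like the default of 10

def score(term, doc_id):
    tf = math.sqrt(postings[term][doc_id])                    # dampened term frequency
    idf = 1 + math.log(num_docs / (1 + len(postings[term])))  # rarer terms weigh more
    return tf * idf

top_n = []  # min-heap of (score, doc_id); the worst of the best sits at the root
for doc_id in postings["rain"]:
    heapq.heappush(top_n, (score("rain", doc_id), doc_id))
    if len(top_n) > N:
        heapq.heappop(top_n)  # evict the lowest-scoring document

print(sorted(top_n, reverse=True))  # best-matched documents first
```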

Some important points to remember while using Elasticsearch in production:

  1. It provides optimistic concurrency control. A version can be passed in the request while updating any document. It does not lock any shards or documents while updating.
  2. Documents in a segment are immutable and cannot be changed in place; an update soft-deletes the existing document and indexes a new version, and the deleted copies are purged later in the background during segment merges. Because merges temporarily need extra space, it is wise to use at most about half of the available disk capacity on a machine.
  3. Always keep more shards than available instances (or nodes), so that the index can scale under load by adding more nodes. Remember, the number of primary shards for an index cannot be modified once it is created (replica counts can be changed later). A common recommendation is about 2 shards per node.
  4. Elasticsearch performs better with bulk operations. Try to index or search your documents in bulk where possible (see the _bulk sketch after this list).
  5. If an exact match is required, use filters instead of scored queries; filters are more efficient because they skip relevance scoring, and their results can also be cached.
  6. A cluster with 3 dedicated master-eligible nodes is the preferred setup (an odd number lets the cluster keep a quorum and avoid split-brain).
  7. Disable the _all field in the index and use the copy_to option to copy only the fields you actually need into a catch-all field. By default, every field’s data was copied into the _all field, which is effectively a blacklist approach; the whitelist approach of copying only what you need is recommended for an efficient index and saves a lot of space. (The _all field has been removed entirely in newer Elasticsearch versions.)
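
A sketch of point 4: the _bulk endpoint takes newline-delimited JSON, pairing an action line with a document line, so many documents are indexed in one round trip. The host, index, and documents are placeholders.

```python
import json
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

docs = [
    {"title": "doc one", "score": 1},
    {"title": "doc two", "score": 2},
    {"title": "doc three", "score": 3},
]

# Build newline-delimited JSON: an action line followed by a source line per document.
lines = []
for i, doc in enumerate(docs, start=1):
    lines.append(json.dumps({"index": {"_index": "myindex", "_id": str(i)}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"  # the body must end with a newline

resp = requests.post(
    f"{ES}/_bulk",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False if every item succeeded
```

The response also contains a per-item result list, so partial failures can be detected without failing the whole batch.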

Thanks for reading!
