A look at the internals of Elasticsearch

Abhinav Piprotar
Sixt Research & Development India
5 min read · Feb 10, 2022

Elasticsearch is a popular distributed search and analytics engine and is used widely here at Sixt. It is also used by companies like Netflix, GitHub, and Stack Overflow. In this post we will talk about the distributed aspects, data storage, and read and write operations of Elasticsearch.

Elasticsearch Cluster:

A node is an individual instance of Elasticsearch, and a group of nodes forms an Elasticsearch cluster.

Data Node: Responsible for storing the data and the inverted index.

Client Node: If a node is not configured as either a master or a data node, it becomes a client node and coordinates incoming requests with the other nodes.

Master Node: It is in charge of managing cluster-wide operations like creating/deleting an index, adding/removing nodes from the cluster, assigning shards to nodes, etc.

Cluster State is a data structure that contains cluster-level information: cluster settings, the nodes in the cluster, the indices with their settings and mappings, and the shards of those indices along with the nodes where the shards reside. Only the master node can update the cluster state, and it then broadcasts the updated state to all other nodes.

Can the master node become a bottleneck? Since the cluster state is stored by every node, all nodes have the necessary information about the other nodes, indices and shards. A client can thus connect to any node, and that node coordinates with the nodes holding the required data to serve the request. The master node does not need to be involved in read and write operations, so it will not become a bottleneck as traffic grows.
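
To see this in practice, the node roles and a trimmed view of the cluster state can be inspected over the REST API. Below is a minimal sketch in Python, assuming a local, unsecured cluster at http://localhost:9200 (the host and the lack of authentication are assumptions; the endpoints themselves are standard):

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# Which node is the elected master, and what role does each node have?
print(requests.get(f"{ES}/_cat/nodes?v&h=name,node.role,master").text)

# A filtered view of the cluster state: index metadata and the routing table.
state = requests.get(f"{ES}/_cluster/state/metadata,routing_table", timeout=10).json()
print(list(state["metadata"]["indices"].keys()))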

Inverted Index:

Elasticsearch is built on top of Apache Lucene, a full-text search-engine library that uses an inverted index data structure to provide low-latency search. A document is the unit of data in Elasticsearch. To create an inverted index, the document's text is split into separate words called tokens, and a sorted list of all unique tokens together with the documents they appear in is created.

Doc1: Red shirt is expensive.
Doc2: White shirt is cheap.

The tokens are also normalized to a standard form to improve searchability. For instance, the default standard analyzer in Elasticsearch splits the text on word boundaries and then lowercases all the tokens. The same analyzer is applied to queries on the documents so that they match the indexed tokens and return relevant results.
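
To make the idea concrete, here is a toy inverted index built in Python over the two documents above. It only mimics what the standard analyzer does (split on word boundaries, lowercase); real Lucene analysis is far richer:

import re
from collections import defaultdict

docs = {
    "Doc1": "Red shirt is expensive.",
    "Doc2": "White shirt is cheap.",
}

def analyze(text):
    # Rough stand-in for the standard analyzer: split on word
    # boundaries and lowercase every token.
    return [t.lower() for t in re.findall(r"\w+", text)]

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in analyze(text):
        inverted_index[token].add(doc_id)

# A query goes through the same analyzer before the lookup.
print(sorted(inverted_index["shirt"]))   # ['Doc1', 'Doc2']
print(sorted(inverted_index["cheap"]))   # ['Doc2']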

Inside a Shard:

Documents are stored and indexed in shards, and shards are allocated to nodes in a cluster. A shard can be either a primary or a replica shard. The number of primary shards in an index is fixed at the time of index creation and thus determines the maximum amount of data the index can hold. Replica shards serve read requests and provide redundancy in case of hardware failure. Unlike the number of primary shards, the number of replica shards can be changed at any time.
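
For illustration, this is roughly how the shard counts are fixed at index creation and how the replica count can be changed later, using the standard index APIs (the index name products and the host are assumptions):

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# The primary shard count is fixed here and cannot be changed later.
requests.put(f"{ES}/products", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}
})

# The replica count, however, can be updated at any time.
requests.put(f"{ES}/products/_settings", json={
    "index": {"number_of_replicas": 2}
})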

Write Operations:
A Lucene index, or shard, is a collection of segments plus a commit point (a file that lists all segments).

When a write request arrives at a coordinating node, the node routes it to the appropriate primary shard. The shard is determined by the formula below:

shard = hash(routing_string) % number_of_primary_shards

The document ID is used as the default routing string.
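
A simplified simulation of this routing, assuming an index with 3 primary shards. Elasticsearch actually hashes the routing value with murmur3; an MD5-based stand-in is used here purely to illustrate the modulo step:

import hashlib

NUMBER_OF_PRIMARY_SHARDS = 3  # fixed at index creation

def route(routing_string: str) -> int:
    # Stable stand-in for Elasticsearch's murmur3 hash of the routing value
    # (the document ID by default).
    h = int.from_bytes(hashlib.md5(routing_string.encode()).digest()[:4], "little")
    return h % NUMBER_OF_PRIMARY_SHARDS

for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", route(doc_id))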

The node hosting the concerned primary shard receives the request from the coordinating node. The request is appended to the translog (transaction log) and the document is added to the in-memory buffer. If the request succeeds on the primary shard, it is sent in parallel to the nodes hosting the replica shards.

On a regular interval (1 second by default), the shard is refreshed, i.e. the in-memory buffer is written to a new segment without an fsync, and the new segment is opened so that it becomes visible to search.
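
The refresh interval is an index setting, and a refresh can also be triggered manually, which is handy in tests that need a just-indexed document to be searchable immediately. A sketch, reusing the assumed products index and host:

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# Index a document; it sits in the in-memory buffer until the next refresh.
requests.post(f"{ES}/products/_doc/1", json={"name": "Red shirt", "price": 40})

# Force a refresh so the document becomes visible to search right away.
requests.post(f"{ES}/products/_refresh")

# Or relax the default 1 second refresh for write-heavy indexing.
requests.put(f"{ES}/products/_settings", json={"index": {"refresh_interval": "30s"}})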

The translog is fsynced to disk at a default interval of 5 seconds so that acknowledged operations are not lost.

Every 30 minutes, or when the translog becomes too big, the filesystem cache is fsynced to disk. This is called the flush process. During a flush, the in-memory buffer is cleared after its contents are written to a new segment, a new commit point is created with all segments fsynced to disk, and the old translog is deleted and a new one is created.
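
Both behaviours can be tuned per index, and a flush can be requested explicitly; whether you should change the defaults depends on your durability requirements. A sketch using the standard index settings and _flush APIs (index name and host assumed):

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# "async" fsyncs the translog on a timer (5 seconds by default) instead of
# on every request, trading a small durability window for throughput.
requests.put(f"{ES}/products/_settings", json={
    "index": {"translog.durability": "async"}
})

# A flush can also be requested explicitly.
requests.post(f"{ES}/products/_flush")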

Deletes And Updates:
Segments in Elasticsearch are immutable, so nothing can be deleted from them. Every commit point has a .del file which lists the documents that have been deleted. A delete request marks the document as deleted in the .del file, so a deleted document can still match a query but is removed from the final result.

In case of an update, the old version of the document is marked as deleted in the .del file and a new version of the document is created in a new segment.
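
A toy model of this behaviour, with immutable "segments" and a set of tombstoned document IDs playing the role of the .del file:

# Immutable "segments": once written, their contents never change.
segment_1 = {"1": {"name": "Red shirt", "price": 40}}
segment_2 = {}
deleted = set()  # plays the role of the .del file

# An update writes a new version to a new segment and tombstones the old one.
segment_2["1"] = {"name": "Red shirt", "price": 35}
deleted.add(("segment_1", "1"))

def search(term):
    hits = []
    for seg_name, seg in [("segment_1", segment_1), ("segment_2", segment_2)]:
        for doc_id, doc in seg.items():
            # Matching still scans every segment; tombstoned versions are
            # filtered out of the final result.
            if term in doc["name"].lower() and (seg_name, doc_id) not in deleted:
                hits.append(doc)
    return hits

print(search("shirt"))  # only the updated version is returned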

Segments Merging:
Refreshing at a regular interval of 1 second creates many segments. Segments consume memory and CPU cycles, and each search request has to search through every segment, which can slow down searches. Elasticsearch therefore selects smaller segments and merges them into a bigger segment in the background, without affecting read and write operations.
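
Merging happens automatically, but it can also be requested explicitly with the standard _forcemerge API, which is usually only sensible for indices that will no longer receive writes (index name and host assumed):

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# Count the current segments per shard.
print(requests.get(f"{ES}/_cat/segments/products?v").text)

# Ask Lucene to merge down to a single segment; for actively written
# indices, background merging should be left to do its job.
requests.post(f"{ES}/products/_forcemerge?max_num_segments=1")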

Read Operation:
Unlike write operations, which happen on a single shard (and its replicas), a search request needs to look in every shard (primary or replica) of an index to see if there are any matching documents. Results from multiple shards need to be combined into a single sorted list before they can be returned. Search is therefore performed in two phases:

Query Phase:
The coordinating node sends the request to all the shards in the index (either primary or replica copies). Each shard performs the search and builds a sorted list of its results. It returns a lightweight list of document IDs and other relevant values, such as the score, to the coordinating node. The coordinating node creates a global sorted list by combining the results from each shard; only the top matching documents are retained, based on the requested result size.
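
In essence, the coordinating node does a k-way merge of the per-shard result lists. A small illustration of that merge step in Python (the scores and document IDs are made up):

import heapq

SIZE = 3  # requested result size

# Lightweight (score, doc_id) lists returned by each shard, already sorted.
shard_results = [
    [(2.4, "d7"), (1.1, "d2")],               # shard 0
    [(3.0, "d9"), (0.9, "d4")],               # shard 1
    [(2.8, "d1"), (2.6, "d5"), (1.7, "d3")],  # shard 2
]

# Merge the sorted lists and keep only the top SIZE documents globally.
merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
global_top = list(merged)[:SIZE]
print(global_top)  # [(3.0, 'd9'), (2.8, 'd1'), (2.6, 'd5')]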

Fetch Phase:
The coordinating node sends a multi-get request to the shards that hold the documents in the global sorted list. Those shards return the enriched documents to the coordinating node, and the results are then returned to the client.
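
The fetch phase corresponds to a multi-get for just those top-ranked document IDs. Continuing the sketch with the standard _mget API (index name, host, and IDs are assumptions):

import requests

ES = "http://localhost:9200"  # assumed local, unsecured cluster

# Fetch the full documents for the globally top-ranked IDs only.
top_ids = ["d9", "d1", "d5"]
resp = requests.post(f"{ES}/products/_mget", json={"ids": top_ids})
for doc in resp.json()["docs"]:
    print(doc.get("_source"))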
