Elasticsearch Architecture — 2

Emre DALCI
8 min readJan 16, 2023

--

This article is the second part of the Elasticsearch architecture series.
(Refer to the first part of the article) In this article, we will talk about the high-level architecture of Elasticsearch.

High-Level Design

Segments

Shards create immutable file segments to hold the document onto a durable system (Refer to why are segments immutable). Each segment is an inverted index.

When a document is written to a shard, it is held in memory and committed to the disk on reaching a certain threshold. This avoids frequent writes to the disk thereby improving write performance. The downside is that any uncommitted writes to the shard will be lost on a failure. In order to overcome a potential data loss, Elasticsearch writes all uncommitted changes to a transaction log. In the event of a failure, data can be replayed from the transaction log.

Document in the transaction log is not searchable immediately unless a refresh operation is performed. By default, Elasticsearch does an index refresh every second. A refresh causes the document in the transaction log to be converted into an in-memory segment which then becomes searchable.

on refresh

Refresh does not commit the document to make it durable. Flushing is the process of making sure that the document is permanently stored in the shard.

on flushing

Segment Merging

Once a segment is created, its contents cannot be modified. Deleting a document is achieved by simply marking the document as deleted. Similarly updating a document requires deleting the old document and creating an updated document.

Also, with the automatic refresh process creating a new segment every second, it doesn’t take long for the number of segments to explode. Having too many segments is a problem. Each segment consumes file handles, memory, and CPU cycles. More important, every search request has to check every segment in turn; the more segments there are, the slower the search will be.

In order to optimize the number of segments, Elasticsearch runs merge processing behind the scene. Merging helps to combine similar size segments into larger segments.

This is the moment when those old deleted documents are purged from the filesystem. Deleted documents are not copied over to the new bigger segment.

Shards(Lucene Index)

Shards are the software components holding data, creating the supporting data structures (like an inverted index), managing queries, and analyzing the data in Elasticsearch. They are instances of Apache Lucene, allocated to an index during the index creation.

The shards are distributed across the cluster for availability and failover. Having multiple shards help with increasing the search speed as the operation can be distributed across multiple nodes.

In order to guard against data loss, Elasticsearch allows configuring replicas. This is an index-level setting and can be changed at any time. As duplicate copies of shards, the replica shards serve redundancy and high availability in an application. By serving the read requests, replicas enable distributing the read load during peak times. The replica of a respective shard is not co-located on the same node as the shard as it defeats the purpose of redundancy. For instance, if the node crashes, you would lose all the data from the shard and its replica if they were co-located. Hence the shards and their respective replicas are distributed across different nodes in the cluster.

Nodes And Cluster

A node is briefly clarified as an instance of an Elasticsearch server. When we start the Elasticsearch application, we are essentially initializing a node. By default, this node joins a cluster, a one-node cluster to be precise. We can create a cluster of any number of nodes based on our data requirements. We can also create multi-clusters.

Nodes in the cluster can be differentiated based on the specific type of operations that they perform. Although a node can perform any or all cluster operations, such a configuration could negatively impact the stability of the cluster. The standard practice is to designate different sets of nodes for different cluster functions.

Routing a Document to a Shard

When indexing a document, it is stored on a particular primary shard. How does Elasticsearch know which shard a document is stored?

Elasticsearch uses a routing algorithm to distribute the document to the underlying shard when indexing. Routing is a process of allocating a home for a document to a certain shard with each of the documents stored in one and only one primary shard. Retrieving the same document will be easy too as the same routing function will be employed to find out the shard where that document belongs to. The routing algorithm is determined by a simple formula:

shard = hash(routing) % number_of_primary_shards

The output of the routing algorithm is a shard number. It is calculated by hashing the routing value (default value is document id) and finding out the remainder of the hash when divided with the number of shards. The documents are evenly distributed so there is no chance of one of the shards getting overloaded.

The formula directly depends on the number_of_shards variable. This explains why the number of primary shards can be set only when an index is created and never changed: if the number of primary shards ever changed in the future, all previous routing values would be invalid and documents would never be found.

How Primary and Replica Shards Interact

Here is the sequence of steps necessary to successfully create, index, or delete a document on both the primary and any replica shards:

  1. The client sends a create, index, or delete request to the coordinator node(Node 1).
  2. The node uses the routing algorithm to determine where the document belongs (shard 0 for the below image). It forwards the request to Node 3, where the shard 0 is currently allocated.
  3. Node 3 executes the request on the primary shard. If it is successful, it forwards the request in parallel to the replica shards on Node 1 and Node 2. (If there is one replica, this is done in sync.)

4. Once all replicas have successfully performed the operation, they respond to the primary shard.

5. When the response is received by the coordinator node, it acknowledges the successful completion of the request to the client.

These indexing stages (coordinating, primary, and replica) are sequential. To enable internal retries, the lifetime of each stage encompasses the lifetime of each subsequent stage. For example, the coordinating stage is not complete until each primary stage, which may be spread out across different primary shards, has been completed. Each primary stage will not be complete until all replicas finish indexing the docs locally and responded to the replica requests.

Distributed Search Execution

A CRUD operation deals with a single document that has unique routing values. This routing means that Elasticsearch knows exactly which shard in the cluster holds that document.

Search requires a more complicated execution model because Elasticsearch doesn’t know which documents will match the query. They could be on any shard in the cluster. A search request has to consult a copy of every shard in the index or indices we’re interested in to see if they have any matching documents.

In the above image, Node A is the coordinator node, which receives the request from the client. It is chosen as a coordinator node for no specific reason other than demonstration purposes. Once it is chosen as the (coordinator) active role, it creates a replication group with a set of shards and replicas on individual nodes in a cluster that consists of the data.

Node A then formulates the query request to send to other nodes, requesting them to carry out the search. (If Node A has a role as a data node, it will also carry out the search in its own store to fetch the results.) Upon receiving the request, the respective node performs the search request on its shard. It then extracts the top set of results and responds back to the active coordinator with the results. The active coordinator then merges the data and sorts it before sending it to the client as a final result.

Conclusion

In this blog series, we have looked into the architecture of Elasticsearch. Elasticsearch is a great tool that can be utilized to solve search and analytics data needs. It can handle massive amounts of data, is highly scalable, and can support millisecond response times. It is architected to handle node and data center failures, they provide high availability and reliability for your production workload.

References

--

--