In depth guide to running Elasticsearch in production

This article first appeared on my blog in February 2020.

If you are here, I do not need to tell you that Elasticsearch is awesome, fast and mostly just works.
If you are here, I also do not need to tell you that Elasticsearch can be opaque, confusing, and seems to break randomly for no reason. In this post I want to share my experiences and tips on how to set up Elasticsearch correctly and avoid common pitfalls.
I am not here to make money so I will mostly just jam everything into one post instead of doing a series. Feel free to skip sections.

The basics: Clusters, Nodes, Indices and Shards

Elasticsearch is a management framework for running distributed installations of Apache Lucene, a Java-based search engine. Lucene is what actually holds the data and does all the indexing and searching. ES sits on top of this and allows you to run potentially many thousands of lucene instances in parallel.

The highest level unit of ES is the cluster. A cluster is a collection of ES nodes and indices.

Nodes are instances of ES. These can be individual servers or just ES processes running on a server. Servers and nodes are not the same. A VM or physical server can hold many ES processes, each of which will be a node. Nodes can join exactly one cluster. There are different Types of node. The two most interesting of which are the Data Node and the Master-eligible Node. A single node can be of multiple types at the same time. Data nodes run all data operations. That is storing, indexing and searching of data. Master -eligible nodes vote for a master that runs the cluster and index management.

Indices are the high-level abstraction of your data. Indices do not hold data themselves. They are just another abstraction for the thing that actually holds data. Any action you do on data such as INSERTS, DELETES, indexing and searching run against an Index. Indices can belong to exactly one cluster and are comprised of Shards.

Shards are instances of Apache Lucene. A shard can hold many Documents. Shards are what does the actual data storage, indexing and searching. A shard belongs to exactly one node and index. There are two types of shards: primary and replica. These are mostly the exact same. They hold the same data, and searches run against all shards in parallel. Of all the shards that hold the same data, one is the primary shard. This is the only shard that can accept indexing requests. Should the node that the primary shard resides on die, a replica shard will take over and become the primary. Then, ES will create a new replica shard and copy the data over.

At the end of the day, we end up with something like this:

A more in-depth look at Elasticsearch


node.master: true

On cluster start or when the master leaves the cluster, all master-eligible nodes start an election for the new master. For this to work, you need to have 2n+1 master-eligible nodes. Otherwise it is possible to have a split-brain scenario, with two nodes receiving 50% of the votes. This is a split brain scenario and will lead to the loss of all data in one of the two partitions. So don’t have this happen. You need 2n+1 master-eligible nodes.

How nodes join the cluster

Basically, Elasticsearch nodes talk with each other constantly about all the other nodes they have seen. Because of this, a node only needs to know a couple other nodes initially to learn about the whole cluster.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

This isn’t really a constant process: nodes only share information about other nodes they have discovered when they’re not part of a cluster. Once they’ve joined a cluster they stop this and rely on the cluster’s elected master to share any changes as they occur, which saves a bunch of unnecessary network chatter. Also in 7.x they only really talk about the master-eligible nodes they have seen — the discovery process ignores all the master-ineligible nodes.


Lets look at this example of a three node cluster:

In the beginning, Node A and C just know B. B is the seed host. Seed hosts are either given to ES in the form of a config file or they are put directly into elasticsearch.yml.

As soon as node A connects to B, B now knows of the existence of A. For A, nothing changes.

Now, C connects. As soon as this happens, B tells C about the existence of A. C and B now know all nodes in the cluster. As soon as A connects to B again, it will also learn of the existence of C.

Segments and segment merging

Again, a segment is an actual, real file you can look at in the data directory of your Elasticsearch installation. This means that using a segment is overhead. If you want to look into one, you have to find and open it. That means if you have to open many files, there will be a lot of overhead. The problem is that segments in Lucene are immutable. That is fancy language for saying they are only written once and cannot be changed. This in turn means that every document you put into ES will create a segment with only that single document in it. So clearly, a cluster that has a billion documents has a billion segments which means there are a literal billion files on the file system, right? Well, no.

In the background, Lucene does constant segment merging. It cannot change segments, but it can create new ones with the data of two smaller segments.

This way, lucene constantly tries to keep the number of segments, which means the number of files, which means the overhead, small. It is possible to force this process by using a force merge.

Message routing

If you are searching, then the ES node that gets the request will broadcast it to all shards in the index. This means primary and replica. These shards then look into all their segments for that document.

If your are inserting, then the ES node will randomly select a primary shard and put the document in there. It is then written to that primary shard and all of its replicas.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

The word “shard” is ambiguous here and although I think you understand this correctly it’d be easy for a reader to misinterpret how this is written. Each search only fans out to one copy of each shard in the index (either a primary or a replica, but not both). If that fails then the search will of course try again on another copy, but that’s normally very rare


So how do I run Elasticsearch in production?




ES is written in Java. Java uses a heap. You can think of this as java-reserved memory. There is all kind of stuff that is important about heap which would triple this document in size so I will get down to the most important part which is heap size.

Use as much as possible, but no more than 30G of heap size.

Here is a dirty secret many people don’t know about heap: every object in the heap needs a unique address, an object pointer. This address is of fixed length, which means that the amount of objects you can address is limited. The short version of why this matters is that at a certain point, Java will start using compressed object pointers instead of uncompressed ones. That means that every memory access will have additional steps involved and be much slower. You 100% do not want to get over this threshold, which is somewhere around 32G.

I once spend an entire week locked into a dark room doing nothing else but using esrally to benchmark different file systems, heap sizes, FS and BIOS settting combinations of Elasticsearch. Long story short here is what it had to say about heap size:

The naming convention is fs_heapsize_biosflags. As you can see, starting at 32G of heap size performance suddenly starts getting worse. Same with throughput:

Long story short: use 29G of RAM or 30 if you are feeling lucky, use XFS, and use hardwareprefetch and llc-prefetch if possible.

FS cache

Most people run Elasticsearch on Linux, and Linux uses RAM as file system cache. A common recommendation is to use 64G for your ES servers, with the idea that it will be half cache, half heap. I have not tested FS cache. However, it is not hard to see that large ES clusters, like for logging, can benefit greatly from having a big FS cache. If all your indices fit in heap, not so much.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

Elasticsearch 7.x uses a reasonable amount of direct memory on top of its heap and there are other overheads too which is why the recommendation is a heap size no more than 50% of your physical RAM. This is an upper bound rather than a target: a 32GB heap on a 64GB host may not leave very much space for the filesystem cache. Filesystem cache is pretty key to Elasticsearch/Lucene performance, and smaller heaps can sometimes yield better performance (they leave more space for the filesystem cache and can be cheaper to GC too).




Generally, you can put all your data disks into a RAID 0. You should replicate on Elasticsearch level, so losing a node should not matter. Do not use LVM with multiple disks as that will write only to one disk at a time, not giving you the benefit of multiple disks at all.

Regarding file system and RAID settings, I have found the following things:

  • Scheduler: cfq and deadline outperform noop. Kyber might be good if you have nvme but I have not tested it
  • QueueDepth: as high as possible
  • Readahead: yes, please
  • Raid chunk size: no impact
  • FS block size: no impact
  • FS type: XFS > ext4

Index layout


  • for write heavy workloads, primary shards = number of nodes
  • for read heavy workloads, primary shards * replication = number of nodes
  • more replicas = higher search performance

Here is the thing. If you write stuff, the maximum write performance you can get is given by this equation:


The reason is very simple: if you have only one primary shard, then you can write data only as quickly as one node can write it, because a shard only ever lives on one node. If you really wanted to optimize write performance, you should make sure that every node only has exactly one shard on it, primary or replica, since replicas obviously get the same writes as the primary, and writes are largely dependent on disk IO. Note: if you have a lot of indexing this might not be true and the bottleneck could be something else.

If you want to optimize search performance, search performance is given by this equation:

node_throughput*(number_of_primary_shards + number_of_replicas)

For searching, primary and replica shards are basically identical. So if you want to increase search performance, you can just increase the number of replicas, which can be done on the fly.


30G of heap = 140 shards maximum per node

Using more than 140 shards, I had Elasticsearch processes crash with out-of-memory errors. This is because every shard is a Lucene instance, and every instance requires a certain amount of memory. That means there is a limit for how many shards you can have per node.

If you have the amount of nodes, shards and index size, here is how many indices you can fit:

number_of_indices = (140 * number_of_nodes) / (number_of_primary_shards * replication_factor)

From that and your disk size you can easily calculate how big the indices have to be

index_size = (number_of_nodes * disk_size) / number_of_indices

However, keep in mind that bigger indices are also slower. For logging it is fine to a degree but for really search heavy applications, you should size more towards the amount of RAM you have.

Segment merging

There is a force_merge API that allows you to merge segments down to a certain number, like 1. If you do index rotation, for example because you use Elasticsearch for logging, it is a good idea to do regular force merges when the cluster is not in use. Force merging takes a lot of resources, and will slow your cluster down significantly. Because of this it is a good idea to not let for example Graylog do it for you, but do it yourself when the cluster is used less. You definitely want to do this if you have many indices though. Otherwise, your cluster will slowly crawl to a halt.

Cluster layout

Finally, master nodes are ideal candidates for seed nodes. Remember that seed nodes are the easiest way you can do node discovery in Elasticsearch. Since your master nodes will seldomly change, they are the best choice for this, as they most likely already know all other nodes in the cluster.

//EDIT, 25th Feb 2020:
Dave Turner mentioned in the comments:

Master-eligible nodes are the only possible candidates for seed nodes in 7.x, because master-ineligible nodes are ignored during discovery.


Master nodes can be pretty small, one core and maybe 4G of RAM is enough for most clusters. As always, keep an eye on actual usage and adjust accordingly.


  • number of segments
  • heap usage
  • heap GC time
  • avg. search, index, merge time
  • IOPS
  • disk utilization



Devops / AWS Freelancer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store