Paolo Ragone
Jan 12, 2017 · 8 min read

Elasticsearch is an awesome product. It’s a great search engine that sits on top of the great Apache Lucene project and makes index management a breeze.

Its very simple initial setup can mislead you into thinking that operations will be simple as well, but in reality they can be quite challenging. The biggest challenge is understanding how to scale it, how it will behave under load, how it will behave when there’s an issue and a node is down. The non-happy case.

The TL;DR of this article would be that as long as you’re operating under the capacity of your cluster Elasticsearch will scale almost linearly with the size of your data and throughput; but as soon as you hit the capacity of your cluster it will degrade exponentially. Understanding why this happens will help you do proper capacity planning.

There are a couple of basic concepts you need to grasp to properly understand how to scale Elasticsearch:

Index Sharding

One of the great features of Elasticsearch is that it’s designed from the ground up to be horizontally scalable, meaning that by adding more nodes to the cluster you’re capable to grow the capacity of the cluster (as opposed to vertical scalability that requires you to have bigger machines to be able to grow your capacity). We’ll see later what capacity actually means.

Elasticsearch achieves horizontal scalability by sharding its index and assigning each shard to a node in the cluster. The index is the internal structure that Elasticsearch utilises to store and locate documents in an efficient way (think of a book’s index)

Figure 1: Sharding of the document index and assignment to nodes: 5 shards

This allows each node to have to deal with only part of the full document index. Furthermore, Elasticsearch also has the concept of replicas, which are copies of shards. This allows fault tolerance and redundancy, as well as an increased throughput.

Figure 2: Shards and replicas — 5 shards, 1 replica

For example, Figure 2 shows a cluster that can handle the loss of any one node as another node will have a replica of any of its primary shards.

Distributed Search

Now that we know how the index is managed, we need to understand how searches are handled.

In this distributed environment, searches are done in two phases:

  • Query Phase: A new search is received, and it’s transformed into a set of searches (one on each shard). Each shard returns its matching documents, the lists are merged, rank, and sorted
    → The result of this phase is the list of documents ids that will be returned to the user
  • Fetch Phase: Get the documents by id from their owning shards and return to the client

Important to know also is that in Elasticsearch any node can receive a search request. That node acts as the coordinator for that search and sends instructions to the other nodes on what they need to do to be able to fulfill it.

Figure 3: How Distributed search works

In Figure 3 we see how distributed search works and how each node managed the search on the shards it has and reported back to the coordinator with the results.

Each of these searches is performed by a Thread in the “Search” thread pool of each node. How many threads are available in each node is configured in Elasticsearch on the setting (see more info here), and can be monitored with a call to the /_cat/thread_pool endpoint (as described in the docs)

Now that we have a good understanding of how document index is managed, and how a distributed search is done, we can learn about how to scale Elasticsearch

Defining Scalability

When thinking about scalability in Elasticsearch, you have 3 different dimensions to consider:

  • Index size: Being able to manage huge indexes (in the order of hundreds of Gigabytes or Petabytes)
  • Throughput: Being able to manage a lot of concurrent searches under a certain response time.
  • Cluster size: The number of nodes in the system

These three dimensions of scale are somewhat orthogonal, and an increase on one of them comes at the expense of the others.

In order to better understand scalability, let’s develop some intuitions on how this system behaves.

Scaling on index size

To understand scalability in terms of index size, there are two metrics that I find useful: # of documents per shard; and # of documents per node.

The # of documents per shard gives us a sense of how long will a search on a shard take. And the # of documents per node gives us an idea of the memory requirements of each node.

To understand how these metrics change with different values of our parameters let’s assume that we have 10,000,000 documents in the index. If we have 5 shards and 2 replicas, each shard will roughly have 2,000,000 documents in it, and in total there will be 3 copies of each shard (1 primary and 2 replicas). This means that there are effectively 15 (5 * (1+2)) shards in the cluster, so if our cluster has 5 nodes, each of them will have an average of 3 shards in it (across primary and replica shards), or in other words, each node will handle search across some 6,000,000 documents.

Let’s explore how this looks on different scenarios:

Table 1: Docs per node for different settings of # of shards and replicas

On Table 1 we can see these two metrics: documents per shard, and documents per node (counting rows in primary and replica shards) for different values of the parameters. Interesting things to note are:

  • Changing the number of shards does not change how many documents a node effectively manages, but it does change the number of documents per shard. So changing the number of shards allows us to trade off search response time with search concurrency
  • Changing the replication factor does affect the number of documents a node has to manage. So, changing the number of replicas allows us to trade off between resilience and memory requirements on the nodes
  • Increasing the number of nodes in the cluster means that each node manages less documents (horizontal scalability of the index, provided there are enough shards to give something to each node)

Scaling on search time and throughput

To understand scalability in terms of search time and throughput, there are a couple of metrics that are relevant: shard search time and # of concurrent searches.

Shard Search Time: is the time that each node takes to perform a search in one shard. These are the searches that are performed by the Search Thread Pool. The time this takes depends on many variables, but most importantly it depends on both the shard size (the more documents the slower), and the type of search, so a term search will roughly have O(n) growth, where n is the number of documents, while a filter, may be closer to O(log(n)).

# of Concurrent Searches: is how many searches are there in a given node at a given time.

Table 2: Full response time model and how it‘s affected by the different variables

This table contains a model I developed to exemplify the behaviour of Elasticsearch and how it scales. You can play with this model yourself here. Save a copy to your Google Drive and start playing with the data in the blue rows.

Let me explain what each row is:


  • # of documents: How many documents are there in the index
  • # of shards: # of shards in the Index
  • # of replicas: # of replicas in the Index
  • # of cluster nodes: How many nodes make up the cluster
  • searches/second: The throughput, how many searches are we performing per second
  • search thread pool size: The size of the “Search” thread pool as configured in Elasticsearch
  • result consolidation time: This is an estimation of how long the second phase (Fetch Phase) of the distributed search process takes. Normally it’s very fast unless you have a lot of documents per search or very big ones
  • Shard search time params (indep & slope): I’m assuming that the shard search time is a linear function of the number of documents in the shard, these are the parameters for a linear model time= slope * (#docs) + indep


  • effective # of cluster nodes: One thing to keep in mind for a big cluster is how many nodes are you using. If you have 10 nodes, but only 3 shards and 1 replica, you might as well just have 6 nodes as there’s no way you’ll use more than that in any search. That’s what this number is, how many nodes can Elasticsearch effectively use
  • docs per shard: # of documents per shard
  • shard search time (ms): how long we’re estimating that searching one shard will take (see: Shard search time params input above).
    Important note: if the computer is CPU bound, this number will decrease with concurrency and can significantly degrade in performance. Keep an eye on the shard search time to figure out if this is happening by querying the _stats endpoint.
  • shards per node: average of how many shards will each node have
  • docs per node: average of how many documents each node will manage
  • shard searches / second: Over the whole cluster, how many shard searches will be performed per second
  • shard searches / node / second: On each node on the cluster, how many shard searches will be performed per second
  • shard searches / thread / second: On each search thread on each node on the cluster, how many shard searches will be performed per second
  • max searches per thread per second: Given how long it takes to do a shard search, how many searches could each thread perform in a second.
  • average buys threads: Given the throughput of searches, how many of the search threads will be busy at any given time.
  • average shard search wait time (ms): If the “average busy threads” is bigger than the number of threads, searches will back up and will have to wait. This is the average wait time of each search
  • estimated query response time (ms): Finally, this is a guestimate of the final response time of the query

It’s a rather complex system and there can be a lot of unexpected consequences, and this is the biggest takeaway:

Response times are linear until they become exponential.

Use the model to play around and understand a bit better it’s behaviour, make a copy in your Google Drive and have a play. This is by no means a replacement to proper Load & Performance Testing, but by understanding the system better you’ll be in a better place to design your tests and experiments.

More from hipages

At hipages we appreciate technology, and we’re showing it to the world in our technology blog in Medium. Feel free to reach out and visit our Github page, our Careers page (in case you also love tech ;), and, of course, our site at

hipages Engineering

Sharing hipages' love for technology with the world

Thanks to Enrico Stahn

Paolo Ragone

Written by

CTO, GM Technology for Air New Zealand, Father, Systems Eng., Mathematician, Woodworking enthusiast

hipages Engineering

Sharing hipages' love for technology with the world

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade