Sizing Your ELK Cluster (Elasticsearch, Logstash, Kibana) for High Performance

Suresh Pawar
Aug 29, 2023


Introduction:

Setting up a high-performance ELK cluster to handle large volumes of logs and metrics in production can be challenging. There are many considerations around hardware specifications, cluster topology, indexing strategy, and more. In this post, I’ll share some best practices based on my experience deploying robust ELK clusters.

What is the ELK Stack?

Introduction to Logging and Analytics

(Figure: ELK stack modules)

As web applications and services grow in complexity, it becomes increasingly important to collect and analyze log data and metrics from different sources. This helps organizations gain insights into system performance, debug issues, and monitor user behavior.

The ELK (Elasticsearch, Logstash, Kibana) stack is one of the most popular open-source tools for logging, monitoring, and analyzing large volumes of machine data in real-time.

Elasticsearch

Elasticsearch is the search and analytics engine at the core of the ELK stack: a distributed, RESTful engine capable of handling petabytes of data.

Elasticsearch makes it easy to store and search your data, and to scale out as data volumes grow: nodes can be added to or removed from a cluster with little or no downtime.
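For a concrete feel, the snippet below (runnable from Kibana Dev Tools or curl against a local cluster) indexes a sample log document and searches it back; the index name logs-demo and the field names are made-up examples:

    POST logs-demo/_doc
    {
      "@timestamp": "2023-08-29T10:15:00Z",
      "level": "ERROR",
      "message": "user login failed for user 42"
    }

    GET logs-demo/_search
    {
      "query": { "match": { "message": "login failed" } }
    }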

Logstash

Logstash is an open source, server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to your desired destination such as Elasticsearch.

Logstash provides functionality to collect log data from different sources such as log files, databases, and APIs, and then parses, transforms, and enriches the data before indexing it into Elasticsearch.
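As a rough sketch, a minimal Logstash pipeline for web-server access logs might look like the configuration below; it assumes Filebeat ships logs to port 5044 and Elasticsearch runs on localhost, so the input, grok pattern, and index name would all need adapting to your own sources:

    input {
      beats {
        port => 5044                      # e.g. Filebeat shipping access logs
      }
    }
    filter {
      grok {
        # parse Apache/Nginx combined-format access log lines
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
      date {
        # use the timestamp from the log line as the event time
        match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
      }
    }
    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "weblogs-%{+YYYY.MM.dd}"  # one index per day
      }
    }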

Kibana

Kibana lets users visualize Elasticsearch data with charts and graphs. It provides customizable dashboards for monitoring metrics, reports, and logs held in Elasticsearch.

With Kibana, you can quickly generate bar, line and scatter plots on your Elasticsearch data. It also enables you to search, view, and interact with the data that has been indexed into Elasticsearch by Logstash.

Benefits of ELK Stack

  • Scalable and distributed architecture for high volume data
  • Real-time search and analytics
  • Built-in visualization and dashboarding
  • Easy to set up and manage
  • Open source and free to use
  • Large community support
  • Effective for monitoring application logs, server metrics and more.

Cluster Architecture

A few node roles are essential: master-eligible nodes (cluster coordination), data nodes (shard storage and search), and ingest nodes (document pre-processing before indexing). Replica shards are critical for high availability, so plan for at least one replica of every primary shard.
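In recent Elasticsearch versions, a node's role is declared in its elasticsearch.yml. The excerpts below sketch dedicated-role nodes (each setting goes in a different node's config file, not all in one):

    # elasticsearch.yml on a dedicated master-eligible node
    node.roles: [ master ]

    # elasticsearch.yml on a data node
    node.roles: [ data ]

    # elasticsearch.yml on an ingest node
    node.roles: [ ingest ]

Small clusters often combine roles on the same nodes; dedicated masters become worthwhile as the cluster grows.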

Understanding Usage Patterns

The first step is understanding your unique usage patterns — the volume and velocity of incoming data, types of data being indexed, anticipated growth, and most common search/visualization patterns. This will give you an idea of hardware requirements.

Key metrics to examine include daily data volume, the number of unique fields, average document size, and the number of concurrent users and searches. Profiling your existing infrastructure can provide these baseline stats.
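If you already have a cluster (even a small pilot), the _cat and stats APIs are a quick way to pull these baselines; the column selection below is just one reasonable choice:

    GET _cat/indices?v&h=index,docs.count,pri.store.size,store.size&s=store.size:desc

    GET _cluster/stats?human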

Hardware Sizing

For Elasticsearch, more CPU and RAM generally means better performance, but both under- and over-provisioning are inefficient. A common guideline is to give the JVM heap no more than 50% of a node's RAM (and keep it below roughly 30 GB), leaving the rest for the operating system's file-system cache.

Also consider SSDs over HDDs for better throughput. Separate data and OS disks where possible.
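As a sketch, on the 12 GB data nodes sized later in this post, the heap could be pinned at 6 GB with a small file under config/jvm.options.d/ (the file name is arbitrary):

    # config/jvm.options.d/heap.options
    # Heap at most 50% of RAM; the remainder is left to the OS file-system cache.
    -Xms6g
    -Xmx6g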

For Logstash and Kibana, size based on expected event throughput rather than stored data volume; 4–8 GB of RAM each usually suffices.

Shard Allocation

Define an optimal number of primary and replica shards based on hardware, number of nodes, and ingest rate. More shards mean more parallelism for indexing and search, but every shard consumes heap and other resources, so avoid over-sharding.
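Shard and replica counts are normally set through an index template so every daily index picks them up automatically. The sketch below (template and index-pattern names are made up) matches the worked example later in this post, with 2 primary shards and 1 replica per index:

    PUT _index_template/daily-logs
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "index.number_of_shards": 2,
          "index.number_of_replicas": 1
        }
      }
    }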

Putting All Numbers Together

Here are the key numbers and specifications put together based on the information provided:

Daily log size from all sources: 70 GB

Index shards per day:

  • 70 GB of new data per day
  • 1 index per day, with 1 replica of every primary shard for high availability (i.e. two copies of the data)
  • Per the Elastic documentation, each shard should ideally hold around 10–50 GB
  • Splitting the daily index into 2 primary shards gives 35 GB per shard (70 GB / 2 shards), comfortably within the 10–50 GB guideline

Say we want a data retention period of 60 days.

Total data over retention period:

  • 70 GB daily data
  • 60 day retention
  • So total primary data = 70 GB × 60 days = 4,200 GB (replicas will double the on-disk footprint)

Number of shards over retention period:

  • 2 primary shards per day × 60 days = 120 primary shards, holding the 4,200 GB of data
  • With 1 replica per primary, the cluster carries 240 shard copies in total
  • Per the Elastic documentation, aim for at most about 20 shards per GB of JVM heap
  • With 2 data nodes (and a replica never placed on the same node as its primary), each node holds about 120 shard copies, i.e. one full copy of the data

Specifications for each data node:

  • Each data node holds about 120 shard copies (one full copy of the data, on the order of 4,200 GB on disk)
  • At roughly 20 shards per GB of heap, 120 shard copies need about 6 GB of JVM heap
  • Per the Elastic documentation, the heap should be no more than 50% of a node's RAM, leaving the other half for the OS file-system cache
  • So total RAM required per data node = 6 GB (heap) + 6 GB (OS cache) = 12 GB

In summary:

  • Daily data: 70 GB
  • Primary shards per daily index: 2 (about 35 GB each)
  • Retention period: 60 days
  • Total primary data over retention: 4,200 GB
  • Primary shards over retention: 120 (240 shard copies including replicas)
  • Shard copies per data node: about 120 (each of the 2 data nodes holds one full copy of the data)
  • RAM required per Elasticsearch data node: 12 GB (6 GB JVM heap + 6 GB for the OS file-system cache)
(Figure: the cluster specification)
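Once the cluster is running, these numbers are easy to sanity-check against reality; the requests below show how many shards and how much disk each node is actually carrying, and the largest shards in the cluster:

    GET _cat/allocation?v

    GET _cat/shards?v&s=store:desc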

Additionally, older indices should be moved from the hot tier to warm/cold tiers (typically via index lifecycle management, ILM) so the hot tier stays fast for the most recent, most frequently queried data.
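A minimal ILM policy along those lines might look like the sketch below. It assumes the data nodes carry data_hot / data_cold roles (so ILM's automatic tier migration applies), and the policy name, phase ages, and 60-day delete are examples chosen to match the retention used above:

    PUT _ilm/policy/logs-retention
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": { "set_priority": { "priority": 100 } }
          },
          "cold": {
            "min_age": "7d",
            "actions": { "set_priority": { "priority": 0 } }
          },
          "delete": {
            "min_age": "60d",
            "actions": { "delete": {} }
          }
        }
      }
    }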

This summary provides a general framework and reference point, but the optimal specifications would depend on each individual use case and requirements. Some key points:

  • Daily data volume, retention period and RAM availability can vary widely for different implementations.
  • Additional factors like data type, indexing approach, query volume/complexity would influence shard sizing and node capacity.
  • Hot/cold tiering strategies may differ; there could be multiple tiers, automatic migration rules, etc.
  • Infrastructure constraints around physical servers, network bandwidth, etc. also play a role.
  • Ongoing monitoring and adjustments may be needed as the cluster and data volumes evolve over time.

To recap:

  • This acts as a starting template based on the information provided
  • But each implementation would need to validate these assumptions and tune the specs to its unique needs
  • Continuous monitoring and adjustments are important as usage patterns change
  • Goal is to right-size the cluster for optimal performance, scalability and cost efficiency given individual constraints

Take these reference numbers as a guide, but ultimately define specifications tailored to your own data, infrastructure, and operational requirements. An iterative approach that weighs all of the relevant factors is recommended.
