Tuning of ELK stack for the expected throughput

Ravi Gurram
4 min read · Jul 2, 2019


Problem Statement

The requirement for the log management platform was to collect large volumes of data for analysis. The initial target was for the stack to handle about 2GB of data per hour, with a month’s worth of data available for analysis at any given time, which works out to roughly 1.5 to 2TB overall. That volume might not be huge by Big Data standards, but for a startup, managing this data was a huge challenge. The ELK stack was the preferred choice for harvesting, collecting, filtering and storing this data.

  • Harvesting :- Filebeat
  • Collecting/Filtering :- Logstash
  • Storage/Indexing :- Elasticsearch

Initially, the entire stack was stood up with the basic defaults, in the hope that the default configuration would suffice for our use case.

Cluster Setup

The cloud provider used to set up our initial cluster was DigitalOcean, and only the Filebeat instances would run within the customer premises.

  • Total Elasticsearch nodes: 7
  • Pure data nodes: 2
  • Pure ingestion nodes: 0
  • Mixed data/ingestion nodes: 2
  • Mixed data/ingestion/master eligible nodes: 2
  • Pure master node: 1
  • We also use the “pure master node” for running our special scripts and instrumentation code, which we will discuss later. We designate this node as our “coordinating node”
  • Total Logstash nodes: 1

Elasticsearch node configuration: CentOS 7.5 x64, 16GB RAM, 6 vCPUs, 1TB disk

Logstash node configuration: CentOS 7.5 x64, 8GB RAM, 4 vCPUs, 150GB disk

Versions: Elasticsearch 6.1.2, Logstash 6.0.0, Filebeat 6.0.1, Kibana 6.1.2
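For reference, the node-role mix above is expressed through three flags in each node’s elasticsearch.yml (6.x). The sketch below shows a mixed data/ingest node; it is illustrative, not a copy of our actual configuration:

```yaml
# elasticsearch.yml (sketch) -- role flags for a mixed data/ingest node (6.x)
node.master: false   # not master-eligible
node.data: true      # stores and indexes shards
node.ingest: true    # can run ingest pipelines
# a pure master node would instead set node.master: true and the other two to false
```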

Filebeat is hosted on a node borrowed from another product suite, with little or no control over its specifications. Suffice it to say that it is a FreeBSD system with enough horsepower, and the Filebeat instance was taking less than 2% of CPU and less than 0.5% of memory.

Every hour, an external program would generate an event file ranging in size from 500MB to 2GB, containing anywhere from 2.5 million to 4.6 million records. The data rate waxed and waned with the time of day, between roughly 600 and 1200 EPS. The old event file would be moved into a different folder with an hour stamp. The requirement was to make events available for analysis in Elasticsearch in near real-time, so Filebeat was configured to read events from the current file as data was written to it (and eventually renamed).
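A minimal sketch of such a Filebeat input is below; the paths and the Logstash endpoint are hypothetical, and the section layout is from the 6.x config format (which still used filebeat.prospectors):

```yaml
# filebeat.yml (sketch) -- paths and hosts are illustrative, not our actual values
filebeat.prospectors:
  - type: log
    enabled: true
    paths:
      - /data/events/current.log      # the live file the external program appends to
    close_renamed: false              # keep the harvester open after the hourly rename

output.logstash:
  hosts: ["logstash01:5044"]          # hypothetical Logstash beats endpoint
```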

Logstash was set up with the following (a minimal pipeline sketch appears after this list):

  • A beats input plugin
  • A complex filter/transformation script
  • Elasticsearch output plugin with bulk indexing
  • Enabled metrics, initially writing them to a file and eventually to our metrics collection setup (more on that later)
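A bare-bones sketch of that pipeline is shown here, with an illustrative port, hosts, and index name; the real filter block is a complex transformation script and is only stubbed out:

```
# pipeline.conf (sketch) -- port, hosts and index name are illustrative
input {
  beats {
    port => 5044
  }
}

filter {
  # the real deployment runs a complex filter/transformation script here
  # (grok, mutate, date, etc.)
}

output {
  elasticsearch {
    hosts => ["http://es-data-1:9200", "http://es-data-2:9200"]
    index => "events-%{+YYYY.MM.dd}"
  }
}
```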

Observations…

  1. All the nodes were barely making a dent in their CPU and memory consumption, except for the Elasticsearch nodes, which were set to lock up almost all of the available memory.
  2. At Logstash we were recording a paltry 90 EPS, which was disheartening, because getting from 90 EPS to the expected 1200 EPS would be a monumental task (see the monitoring-API sketch after this list).
  3. One side effect of the low EPS was that within 24 hours Filebeat would have 24 open files and be processing events from all 24 files in parallel. Filebeat opens a new harvester for every file it detects in the target folder, and eventually it picks up all 24 files. At any given point Logstash would be receiving events from every hour of the day, and “near real-time” was a pipe dream.
  4. This was just the “ingestion performance”; the same cluster was also supposed to handle the search load.
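For reference, Logstash exposes running event counts through its monitoring API (port 9600 by default in 6.x), and sampling it periodically is one way to confirm a figure like 90 EPS. The host and usage below are illustrative:

```
# Logstash 6.x monitoring API: cumulative event counts for the node
curl -s 'http://localhost:9600/_node/stats/events?pretty'
# Sampling the "out" counter twice, N seconds apart, and dividing the delta
# by N gives an approximate events-per-second (EPS) figure.
```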

As usual, the first reaction was to add more nodes to improve EPS, but throwing more resources at a problem where the existing nodes were not even consuming 30% of their resources was not a sane thing to do.

It was a frustrating experience to have scoured Google and tried out every combination of settings, suggestions, and advice on performance-enhancing techniques for the Elastic stack, only for the impact on performance to be barely noticeable. The best we achieved was 110 EPS. It seemed the next option was to try voodoo.

One thing to note: many posts suggest turning swapping off on the Elasticsearch nodes, but in our scenario that was not an option, as the same nodes also host a Spark cluster. One of the reasons for choosing Elastic was its tighter native integration with Spark, which reduces data movement when querying data from Elasticsearch into Spark; the need was for the Spark application to take advantage of the fact that the data was available on the same Elasticsearch node.
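For context, the commonly recommended swap settings we chose not to apply look roughly like this (standard Elasticsearch guidance, shown only for reference; the Spark workload sharing these hosts still needed swap):

```
# Typical recommendations for Elasticsearch hosts (NOT applied in our setup):
sudo swapoff -a                    # disable swap on the host entirely
# or, in elasticsearch.yml, lock the JVM heap in memory instead:
#   bootstrap.memory_lock: true    # requires raising the memlock ulimit
```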

Realization

The final realization was that, to reach the goal of 1200 EPS, there were umpteen configuration parameters that needed to be tuned. The existing documentation for each parameter says what the parameter does, but almost never provides the context in which its effect makes sense. The need of the hour was to understand the cause and effect of each and every parameter, and that called for a series of controlled, step-by-step performance tests, which would take considerable effort and time. This article is an attempt to document that methodical journey and the processes/procedures that were employed to reach our goal of >1200 EPS.

The 35,000-foot view of the process:

a) Setup a comprehensive metrics collection framework

b) “Tune for ingestion only”, which includes,

  • Tune stand-alone Filebeat
  • Add Logstash with no filter and tune the pipeline
  • Add filters to Logstash and tune the pipeline
  • Add Elasticsearch and tune the entire pipeline end-to-end

c) “Tune for search”

The next set of articles would cover each of these steps in detail.

BluSapphire is the first and only unified cyber defense platform with intelligent response automation. It gets rid of silos by converging network-, system-, and endpoint-based multi-vector analysis. Our platform readily integrates with existing tools to deliver comprehensive, advanced cyber defense.
