Splunk Operator for Kubernetes (SOK) — Improvements on the indexing tier

Gareth Anderson

What was our goal in moving from bare metal to Kubernetes?

What were the results?

Indexing tier — how can we measure performance?

Search head level — how can we measure performance?

Performance comparisons

Indexer performance — primary cluster

Indexer performance — primary cluster — Legacy to K8s summary

Indexer performance — ES cluster

Indexer performance — ES cluster — Legacy to K8s summary

Indexer performance — other cluster

Indexer performance — other cluster — K8s summary

Conclusions

What was our goal in moving from bare metal to Kubernetes?

After running bare metal indexer clusters for a number of years, we had “large” machines (96 logical processors, 192GB RAM); however, we rarely exceeded 30% CPU usage.

At the Splunk level, we could add more parallelIngestionPipelines to improve ingestion performance or adjust batch_search_max_pipeline for better search performance (specifically for batch searches). However, neither of these settings appeared to make much difference.
The parallelIngestionPipelines option has diminishing returns above 2, and we saw minimal benefit from increasing it on the indexing tier. The batch search setting only affects batch searches, which represented a fraction of the overall search workload.
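For reference, both settings are standard Splunk configuration on the indexers: parallelIngestionPipelines lives in server.conf and batch_search_max_pipeline in limits.conf. A minimal sketch with example values only (not our exact configuration):

    # server.conf (indexers) -- example value only
    [general]
    parallelIngestionPipelines = 2

    # limits.conf (indexers) -- example value only; affects batch searches only
    [search]
    batch_search_max_pipeline = 2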

Increasing ingestion volume simply resulted in the queues filling on the indexers, even though the hardware seemed under-utilised at all levels (CPU, memory and I/O).

Moving to Kubernetes (K8s), and therefore to the Splunk Operator for Kubernetes (SOK), was an attempt to improve utilisation of our hardware by running multiple indexers (K8s pods) on each bare metal server.
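With SOK, the number of indexer pods is driven by the replicas field on the IndexerCluster custom resource, and the K8s scheduler places those pods onto the bare metal nodes. The snippet below is an illustrative sketch only; the names are placeholders and the exact field names should be checked against the operator version in use:

    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    metadata:
      name: example-idxc        # placeholder name
      namespace: splunk         # placeholder namespace
    spec:
      replicas: 8               # e.g. 4 bare metal nodes x 2 pods per node
      clusterManagerRef:
        name: example-cm        # placeholder ClusterManager custom resource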

I have written another article, Splunk Operator for Kubernetes (SOK) — Lessons from our implementation, about what we learnt while implementing SOK; this article focuses on the indexing tier improvements.

What were the results?

The results varied depending on the search workload involved.

With an extremely heavy search workload we saw a 30% increase in ingestion per server with no decrease in search performance (search performance appeared to be better, but it is difficult to measure). In this scenario we were able to run 2 pods per server.

On the indexers with a lower search workload we saw a 50% or greater increase in ingestion per server. One cluster approached 1TB/day of ingestion per machine without completely filling the indexing queues, although there was a small impact on search performance. In this scenario we were able to run 4 pods/server.

While we tested 4 pods/server on the heavier workload, and 8 pods/server on the lighter workload, neither experiment ended well.

Indexing tier — how can we measure performance?

The first challenge is, how do you measure the Splunk indexing tier’s performance?

I came up with this list, and I welcome feedback in the comments or on Splunk community slack:

  • Indexing queue fill % (TCP input and replication queues in particular)
  • GB per day of data ingestion per indexer
  • Searches per day for each indexer
  • OS-level and K8s pod performance (CPU/memory/I/O stats)
  • Events/second benchmark
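The dashboards mentioned below do this properly; as a rough illustration of the first two measures, a minimal SPL sketch against the indexers' metrics.log could look like this (the queue names are examples, not an exhaustive list):

    # Indexing queue fill %, 5 minute blocks, measured with max
    index=_internal sourcetype=splunkd source=*metrics.log group=queue
        name IN (parsingqueue, aggqueue, typingqueue, indexqueue)
    | eval fill_perc=round((current_size_kb/max_size_kb)*100,2)
    | timechart span=5m max(fill_perc) by host

    # GB of ingestion per indexer per day
    index=_internal sourcetype=splunkd source=*metrics.log group=thruput name=index_thruput
    | bin _time span=1d
    | stats sum(eval(kb/1024/1024)) as GB by _time, host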

When using SmartStore we also checked whether we were comparing with or without a full cache. When the cache is not at capacity there are fewer evictions (deletions); combined with minimal SmartStore downloads, this means a lighter I/O workload on the server.

The Alerts for Splunk Admins app from SplunkBase contains the dashboards and reports mentioned within this article.

In particular the dashboard indexer_max_data_queue_sizes_by_name was used for many of the screenshots, along with splunk_introspection_io_stats to check I/O level stats.
The report IndexerLevel — events per second benchmark was used to approximate events/second coming back from the indexing tier without including any search head level overheads.
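The report's own SPL ships with the app; as a very rough stand-in, the audit log already records event counts and runtimes for completed searches, so a sketch like the one below gives a comparable events/second figure (note that it includes search head time, which the report deliberately excludes):

    index=_audit action=search info=completed
    | eval events_per_second=if(total_run_time>0, event_count/total_run_time, null())
    | stats avg(events_per_second) as avg_eps median(events_per_second) as median_eps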

Search head level — how can we measure performance?

These criteria are my initial attempt to measure indexer performance from a search point of view. Again, feedback is welcome in the comments or on Splunk community slack:

  • Find searches that have not changed in the past 90 days
  • Filter out those using multisearch, append, join/subsearches
  • Use index=_introspection sourcetype=search_telemetry to determine indexer execution times using phases.phase_0.elapsed_time_aggregations.avg
  • Further narrow down to indexes with a semi-consistent ingestion volume

I also created the report IndexerLevel — savedsearches by indexer execution time to help find searches matching these criteria.
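The report in the app does the per-saved-search breakdown; a simplified sketch of the underlying idea just trends the phase_0 (indexer-side) elapsed time so it can be compared before and after a migration:

    index=_introspection sourcetype=search_telemetry
    | spath output=phase0_avg path=phases.phase_0.elapsed_time_aggregations.avg
    | timechart span=1d avg(phase0_avg) as avg_phase0_time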

In retrospect, the filtering may not have been required; subsearches do not have their performance recorded in the introspection data at the time of writing. Additionally, I am unsure whether the phase0 statistics from search_telemetry were different when using the multisearch search command, or whether the statistics change when using subsearches.

Once I had enough searches matching the criteria, I built a dashboard to compare them as we moved indexers into K8s and see whether performance improved or degraded.

Performance comparisons

There were three unique indexer cluster setups, each with a distinct Splunk search and ingestion workload.

The primary and the ES (or security) cluster existed prior to the K8s project, so pre- and post-migration results are provided for both. The “other” cluster was built on K8s and therefore there is nothing to compare it to.

A comparison was also run in terms of attempting to run “more” pods per node than the initial setup (2–4 pods per node). This was inspired by the HPE, Intel, and Splunk Partner to Turbocharge Splunk Applications article where they were able to run 12 pods per node.
While I was skeptical that our search workload would work well with more than 4 pods per node, I did quickly find the limits of pods per node with our current hardware.

The sections below detail the measurements from the various indexer clusters.

Indexer performance — primary cluster

Performance prior to K8s

This section summarises the workload of the “main” indexer cluster while running on bare metal servers.

Splunk profile

  • 1.2 — 1.4 million searches/day
  • Ingestion of 270GB — 350GB/day/indexer
  • Indexing queues sometimes blocked, replication delays of up to 15 seconds
  • Data ingestion delays averaging 20-30 seconds (HF tier -> Indexers)
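One common way to approximate this HF-to-indexer delay (not necessarily how the figure above was produced) is to compare index time with the event timestamp; <your_index> below is a placeholder:

    index=<your_index>
    | eval ingestion_lag=_indextime-_time
    | timechart span=5m avg(ingestion_lag) as avg_lag_seconds perc95(ingestion_lag) as p95_lag_seconds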

Hardware setup

  • 96 logical processors — Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
  • 384GB RAM
  • 28TB cache per-node (RAID 0 NVMe)
  • CPU trend of 20 — 45%
  • I/O trend of 1500 — 3000 IOPS
There are spikes to 100% but only for 1 minute; the majority of the time the queue is empty or close to empty on all indexers
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
The replication queue spikes to 10 at most around 13:00, the replication duration never exceeds 2, the replication queue count is mostly 0 through the day
Replication queues, 9AM — 6PM

Performance on K8s

This section summarises the workload of the “main” cluster built for K8s. It has the same configuration as the primary cluster but a newer generation of hardware with the same number of logical processors, amount of memory and total disk space.

This setup has 2 K8s pods per node/bare metal server.

Splunk profile

  • 1.2 — 1.4 million searches/day
  • Ingestion of 215GB — 250GB/pod/day, 500GB/day/server
  • Close to zero indexing queue fill, no replication queue issues
  • Data ingestion delays averaging 13 seconds (HF tier -> Indexers)

Hardware setup

  • 96 logical processors — Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
  • 384GB RAM per server
  • 44 logical processors/168GB RAM per pod
  • 11.2TB cache per-pod (RAID 0 NVMe)
  • CPU trend of 20 — 35% (OS level)
  • I/O trend of 800 — 1000 IOPS (OS level)
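Splitting a 96-CPU/384GB node into two pods of this size is expressed through standard Kubernetes requests and limits, which the SOK custom resources expose via a resources block. An illustrative fragment only, not our actual manifest:

    spec:
      resources:
        requests:
          cpu: "44"           # example: roughly half a 96-CPU node
          memory: 168Gi
        limits:
          cpu: "44"
          memory: 168Gi

With requests of this size at most two indexer pods fit on a 96-CPU/384GB node, so the scheduler enforces the 2 pods/server layout by itself, assuming the nodes are dedicated to indexer pods.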

Note that the SmartStore cache was not full. Additionally the newer generation CPU has a slower clock speed but has improved performance.

Indexing queue fill size showing close to zero fill measured with maximum
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
No results found in terms of replication queue fill size
Replication queues, 9AM — 6PM

Summary — Legacy (prior to K8s) compared to K8s

K8s had a Splunk level data ingestion of 250GB/pod/day, or 500GB/day/server.
Non-K8s had a Splunk level data ingestion of 330GB/day/server.

The results on K8s were an approximately 50% increase in ingested data, and there were also fewer indexing queue issues on the K8s cluster.
Note that this K8s hardware has a newer CPU spec and newer NVMe disks, so this is not a fair comparison.

SmartStore downloads did block the indexing/replication queues; this occurred on both K8s and bare metal.

Performance on K8s with 2 pods down

This setup is identical, but we had 1 node down (2 pods), resulting in a higher workload for the remaining pods.
This comparison is useful as we’re closer to the “upper limits” of what can be done with this search/ingestion workload.

The ingestion per day per pod was close to 270GB/indexer pod (or 540GB/server/day).

Indexing queue measurement using max showing blocks of time at 100% with most of the time semi-filled or no fill
Indexing queues, 9AM — 6PM, 10 minute blocks, max fill %, 2 pods down

Queues were measured using the maximum value (max); the pods actually performed very well, and minimal difference was found in search/indexing performance.

Replication duration of up to 3 seconds, up to 4 replication issues at a point in time
Replication queues, 9AM — 6PM, 2 pods down

Summary — K8s with 2 pods down

No issues were noticed, excluding SmartStore downloads blocking the indexing and replication queues.
Manually measured searches appeared to be 5–10% slower; however, this is not objective enough to draw a conclusion.

  • CPU trend of 40 — 50% (OS level)
  • I/O trend of 1500 — 3000 IOPS (OS level)
  • SmartStore cache was not filled; SmartStore had active downloads in this test.

After the SmartStore cache was filled, minimal difference was found in the months after these screenshots. Heavy SmartStore downloads did block the indexing queues; however, the downloads appear to have less impact since Splunk version 9.1.3 / SOK 2.5.0.

Performance on K8s with 4 pods/server

This is the same hardware setup as described previously. Instead of running 2 pods on a node, we tested running 4 pods on a node to attempt to further utilise the hardware we had available.

Splunk profile

  • 1.2 — 1.4 million searches/day
  • Ingestion of 144GB/pod/day, 576GB/day/server
  • Indexing queue heavily filled
  • Data ingestion delays averaging 76 seconds (HF tier -> Indexers)

Hardware summary

  • 96 logical processors — Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
  • 384GB RAM per server
  • 22 logical processors/90GB RAM per pod
  • 5.6TB cache per-pod
  • CPU trend of 50 — 98% (potential CPU throttling due to heat issues)
  • I/O trend of 1000 — 3000 IOPS
  • SmartStore cache was not filled; ingestion delay spikes were much higher than previously measured.
Parts of the diagram show a maximum measured fill of 100% later in the day and this 100% stays for a number of minutes on multiple servers. The morning wasn’t as much of an issue
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %

Summary — Legacy to K8s — all scenarios

Legacy or prior to K8s achieved 330GB/server/day.
K8s with 2 pods/server resulted in 500GB/server/day — 50% more data per server than legacy.
K8s with 2 pods/server, with 2 pods down in the cluster, resulted in 540GB/server/day — 63% more data/server than legacy.
K8s with 4 pods per server resulted in 576GB/server/day — 75% more data/server than legacy.

4 pods per server did not work well in terms of ingestion, and search performance was likely degraded as well.

2 pods per server appears to be the preferred setup for this hardware/search workload combination and did not result in any measurable decrease in search performance.

Performance on K8s — newer vs older generation hardware — 2 pods per server

The indexer clusters for K8s had a mix of older and newer hardware as nodes, therefore it was possible to directly compare the differences for an identically configured indexer cluster.

Splunk profile

  • 1.4 million searches per day
  • Newer hardware — 235GB/day/pod, 470 GB/day/server
  • Older hardware — 210GB/day/pod, 420 GB/day/server
  • Close to zero indexing queue fill, minimal replication queue issues
  • Data ingestion delays averaging 13 seconds (HF tier -> Indexers)

Hardware summary

  • 96 logical processors
  • Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz — newer hardware
  • Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz — older hardware
  • 384GB RAM per server
  • 44 logical processors/168GB RAM per pod
  • CPU trend of 25 — 50% (OS level) — newer hardware
  • CPU trend of 35 — 55% (OS level) — older hardware
  • I/O trend of 1000 — 3000 IOPS (OS level) — both types of hardware
  • 2 disks (RAID 0 NVMe) — newer hardware
  • 4 disks (RAID 0 NVMe) — older hardware
  • 11.2TB of cache per-pod
  • SmartStore cache not filled

Summary — K8s newer vs older generation hardware

The newer hardware utilised less CPU for a workload with more data; additionally, search performance was slightly faster on the new hardware.

Indexer performance — ES cluster

The indexer cluster that hosts the security-related indexes had identical hardware to the primary cluster; however, it had a different search workload.

Prior to K8s

This section summarises the workload while running on bare metal.

Splunk profile

  • 350K searches/day
  • Ingestion of 190GB — 200GB/day/server
  • Indexing queues mostly free, close to zero replication queue issues

Hardware summary

  • 96 logical processors
  • 384GB RAM
  • CPU trend of less than 10%
  • I/O trend of 1500 IOPS
  • 28TB cache per-server

Performance on K8s

This section summarises the workload of the ES cluster built for K8s; the bare metal hardware mentioned above was used for the K8s nodes.
4 pods/node were configured.

Splunk profile

  • 350K searches/day
  • 125GB/pod/day, 500GB/server/day
  • A peak of 200GB/pod/day, 800GB/server/day
  • Indexing queues slightly filled, close to zero replication queue issues

Hardware summary

  • 24 logical processors/90GB RAM per pod
  • CPU trend of 20 — 35% — newer hardware
  • CPU trend of 25 — 40% — older hardware
  • I/O trend of 1500 — 3000 IOPS
  • 1 disk (no RAID) — newer hardware
  • 2 disks (RAID 0) — older hardware
  • 5.2TB cache per-node
  • SmartStore cache filled
This diagram of maximum fill percent shows minimal usage except a 100% fill at approximately 10:45AM
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
The graph is mostly empty excluding a spike to close to 40 seconds around 10:45AM; similar to the left hand side graph, the count is zero excluding 10:45AM
Replication queues, 9AM — 6PM

On K8s with 4 pods/node — 16 pods down

During a patching procedure it was requested that the servers come back online after hours, so we had 4 servers (16 pods) down, and this tested the limits of the remaining pods.

The graph measures queue fill percent with maximum, most of the indexers had a large % fill, some hovering close to 100% for more than 5 minutes at a time
Indexing queues, 12:30PM — 6PM, 10 minute blocks, max fill %, 16 pods down (4 servers)
The count doesn’t go above 1 for replication issues, the duration is usually 1–2 seconds with a spike to 4 seconds for 1 issue
Replication queues, 12:30PM — 6PM, 16 pods down (4 servers)

Summary — Legacy to K8s — ES

Legacy or prior to K8s we indexed 200GB/server/day.
K8s with 4 pods/server achieved 500GB/server or 125GB/pod/day.
K8s with 4 pods/server during downtime achieved 720GB/server or 180GB/pod/day.

Data ingestion increased by 2.5X per server, the indexer queues did not appear to fill, and there was no noticeable difference in search performance.
Even with 4 physical nodes down the impact on ingestion/search performance was minimal.

Hardware summary

CPU utilisation of 40 — 60% with spikes close to 100%.
Disk service times were slower, but no noticeable search performance difference.

phase0 response times in the search_telemetry data decreased by approximately 10% — 20% on the new hardware. Only 5 searches were sampled as this was quite a time-consuming exercise.

Note that SmartStore downloads can push the servers to 100% CPU as was seen some months down the track.

Indexer performance — other cluster

Note this cluster was created after the K8s project started so there is no previous cluster to compare to.

K8s setup with 4 pods/node

Splunk profile

  • 220K searches/day
  • 125GB/pod/day, 500GB/server/day
  • 190GB/pod/day, 760GB/server/day later in the year
  • Indexing queues were lightly used, close to zero replication queue issues
  • Search pattern — 1 to 7 days lookback for most searches

Hardware summary

  • 24 logical processors/90GB RAM per pod
  • CPU trend of 15 — 25%
  • CPU trend of 15 — 45% later in the year
  • I/O trend of 1200 — 7000 IOPS
  • 5.2TB cache per-node
Indexing queue fill size measured with maximum; there is 1 block of 100% for a few minutes on a single indexer, otherwise only spikes to 100% for a very brief time
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
The replication issues count is around 30 but the time doesn’t exceed 1 second at any point during the day.
Replication queues, 9AM — 6PM

The graphs below show the impact of SmartStore uploads/downloads on the replication queues:

There is a spike in SmartStore downloads from approximately 8:50AM to 9:00AM around 0.75K megabytes or 750MB/s
SmartStore 8AM — 12PM
There are two spikes in downloads around 11AM one above 2K MB per second and the second a bit below 4K MB per second
SmartStore 8AM — 12PM
Around 8:50–9 there are spikes in replication queue issues approaching 15, with a much larger spike around 11AM approaching 80. The replication queue time shows close to 60 on the 8:50 issue and closer to 1–2 on the later issues
Replication queues, 8AM — 12PM

K8s setup with 4 pods/node — heavy usage

This setup was identical but under a “heavier” workload in terms of the data ingested per pod per day.

Splunk profile

  • 220K searches/day
  • Ingestion of 255GB/pod/day, 1020GB/server/day

Hardware summary

  • 24 logical processors per pod
  • 90GB RAM per pod
  • 5.2TB cache per-node
In this graph some indexers show a spike to 100%, only sustained for 2 indexers for 2 minutes at most. The queues are mostly empty measured with maximum
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
Multiple spikes of between 10 and 90, with durations mostly around 0.1 seconds and some with a duration of 2 seconds
Replication queues, 9AM — 6PM

Splunk profile

  • Minimal SmartStore downloads
  • Replication queue delays of 1 — 2 seconds

Hardware summary

  • CPU trend of 15 — 45%
  • I/O trend of 900 — 8000 IOPS

K8s setup with 8 pods/node

This setup utilised the same hardware as the “other” cluster but instead of attempting to run 4 pods/node we tested running 8 pods/node in an attempt to further utilise the hardware.

Splunk profile

  • 220K searches/day
  • Ingestion of 112GB/pod/day, 896GB/server/day
  • Indexing queue heavily filled

Hardware summary

  • 2.8TB cache per-node
  • 12 logical processors/70GB RAM per pod
  • CPU trend of 36 — 99%
  • I/O trend of 700 — 3000 IOPS
Queues on at least 5 indexers are at 100% for more than 20 minutes in the afternoon, in the morning there are regular spikes to 100% and points where it stays at 100% for at least 5 minutes or more
Indexing queues, 9AM — 6PM, 5 minute blocks, max fill %
Prior to 11AM there are replication issues as high as 125 with a few longer duration issues around 45 seconds at 11AM. The issues stop after this
Replication queues, 9AM — 6PM

I suspect the replication issues were actually worse during the afternoon but the graph did not reflect this.

Summary — K8s other

K8s 500GB/day/server, 4 pods/server — minimal queuing issues.
K8s 896GB/day/server, 8 pods/server — queues were blocked.
K8s 1020GB/day/server, 4 pods/server — minor queuing issues.

8 pods/server is clearly too much for this hardware / search combination.

There was a mix of the “newer” and “older” hardware within this cluster; a manual comparison of 6 searches showed that the difference in performance at the indexing tier (phase0 response times) was under 3%. I suspect this is due to the searches running over less data compared to the ES cluster.

Conclusions

More is not always better when it comes to the number of K8s indexer pods to run on a bare metal server. Heavier search workloads require more hardware for the pods to run well; thus the primary cluster has 2 pods/server and the other clusters have 4 pods/server.

In our environment, moving to K8s has allowed more data ingestion per day and resulted in better utilisation of our hardware.

In terms of whether this would work in other companies, the main question would be, do you have the appropriate hardware?
Alternatively, do you have the ability to obtain hardware that can run multiple pods/server?
Finally, are you comfortable learning and implementing K8s?

There are some additional conclusions that I have tested during the creation of the newer indexer clusters that apply to both K8s and non-K8s indexer cluster builds:

  • Cluster size matters — more buckets/cluster results in more recovery time from restarts or failures
  • Building smaller clusters, even with identical configuration, results in more cluster managers but fewer issues in our experience. The improvements were found to be:
  • Reduced recovery time for the cluster after an indexer restart or failure
  • Less chance of an indexer-by-indexer restart if the search/replication factor is not met and searchable rolling restart is in use — knowledgebase link
  • Fewer replication queue issues
  • Finally, Cluster Managers prefer faster CPUs. This applies to K8s and non-K8s since parts of the CM are single-threaded

I have also written a related article, Splunk Operator for Kubernetes (SOK) — Lessons from our implementation


Gareth Anderson

SplunkTrust member, working as a technical lead on technologies including Splunk, Kubernetes and Linux.