Splunk Operator for Kubernetes (SOK) — Lessons from our implementation

Gareth Anderson

What is the Splunk Operator for Kubernetes (K8s)?

The SOK is a Splunk-built K8s operator that simplifies getting Splunk indexer clusters, search head clusters and standalone instances (heavy forwarders/deployment servers/standalone search heads) running within Kubernetes.

Additionally, the SOK is Splunk-supported and can therefore be used in production environments.

If you are interested in the benefits that we managed to achieve at the indexing tier using the SOK, I have written another article — Splunk Operator for Kubernetes (SOK) — Improvements on the indexing tier.

What is a Kubernetes Operator?

Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Quoting the Custom Resources documentation:

Custom resources are extensions of the Kubernetes API.

The CustomResourceDefinition API resource allows you to define custom resources.

The Kubernetes controller keeps the current state of Kubernetes objects in sync with your declared desired state.
This contrasts with an imperative API, where you instruct a server what to do.

Further information is available on the Kubernetes operator pattern page.

Splunk Operator

I will use the term “Splunk Operator” to refer to the Splunk Operator for Kubernetes or SOK.

With the Splunk operator you request an indexer cluster as a Kubernetes custom resource, and the operator creates the required Kubernetes StatefulSet, services and secrets for that cluster.

If a pod goes down the K8s StatefulSet will re-create the pod.

If an indexer container inside a pod stops responding for long enough, K8s will restart the pod due to the failure of the liveness probe. You can tweak the probe settings through the Custom Resource yaml files.
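
To illustrate, the probe settings live in the custom resource spec. The field names below (livenessInitialDelaySeconds, readinessInitialDelaySeconds) exist in recent SOK releases but should be verified against your installed CRD version, and the values are purely illustrative.

# Illustrative fragment of an IndexerCluster custom resource; verify the exact
# field names with: kubectl explain indexercluster.spec
spec:
  livenessInitialDelaySeconds: 600
  readinessInitialDelaySeconds: 30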

One note at this point is that the Splunk operator creates custom resources. A command such as:

kubectl get all -A | less

does not return custom resources. In the above example, the -A flag is the same as --all-namespaces.

The command:

kubectl api-resources

shows all available resource types, including the custom resources added by the operator. If you would like to get a Standalone instance you would use:

kubectl get standalone
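
For example, to list the SOK-managed objects across the common custom resource kinds in one command (the splunk-operator namespace is an assumption for your environment), or to discover which kinds the operator has registered:

# List the common SOK custom resource kinds in the splunk-operator namespace
kubectl get clustermanager,indexercluster,searchheadcluster,standalone -n splunk-operator
# Discover every kind registered by the operator
kubectl api-resources --api-group=enterprise.splunk.com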

Splunk operator — the app framework

As per the AppFramework documentation, the “App Framework” is a Splunk operator feature.

Effectively, a .tar.gz file of a Splunk application is uploaded into an S3 bucket.

Every 600 seconds (or at your configured interval), the S3 bucket is checked for updates to any file within the specified location.

If updates are found they are deployed via the cluster manager, deployer or directly to the Splunk instance depending on the configured scope.
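
As a rough sketch of how this is wired up in the custom resource (the bucket, path, endpoint and secret names are placeholders; check the AppFramework documentation for the exact fields supported by your operator version):

spec:
  appRepo:
    appsRepoPollIntervalSeconds: 600   # how often the S3 location is polled
    defaults:
      volumeName: app-repo
      scope: cluster                   # cluster scope = pushed via the cluster manager
    appSources:
      - name: indexer-apps
        location: indexer-apps/        # prefix within the bucket
    volumes:
      - name: app-repo
        storageType: s3
        provider: aws
        path: example-apps-bucket/     # placeholder bucket name
        endpoint: https://s3.us-east-1.amazonaws.com
        secretRef: s3-credentials      # placeholder secret holding the access keys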

Splunk operator — challenges

The operator works well most of the time. I hit a few issues in version 2.2.0 which may be fixed in a later version of the operator; some of these issues still persisted after our upgrade to 2.5.2.

Most of these issues rarely occur. I've listed the issues we have found while running multiple indexer clusters using the SOK in the hope that it helps others.

Challenge 1 — The non-functioning operator

  • The operator can get “stuck”; for example, if it believes the cluster manager is pushing a new bundle out, it will continue to watch until the bundle has rolled out
  • If the bundle is stuck (and it can stay in that state for days), the app framework downloads and other Splunk operator activities are paused until the task in question completes
  • In the above scenario restarting the cluster manager is often a quick solution; you can also restart the operator pod itself (see the commands below)
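
Both of those restarts can be done by deleting or rolling the relevant workload and letting Kubernetes re-create it; the pod and namespace names below follow the usual SOK naming pattern but are placeholders for your environment.

# Restart a stuck cluster manager pod (the StatefulSet re-creates it)
kubectl delete pod splunk-example-cluster-manager-0 -n splunk-operator
# Restart the operator itself
kubectl rollout restart deployment/splunk-operator-controller-manager -n splunk-operator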

How can you detect a stuck operator?

Check the logs of the manager container in the Splunk operator pod. Assuming you are using the splunk-operator namespace, you can run:

kubectl logs -n splunk-operator deployment/splunk-operator-controller-manager -c manager

I prefer to think of the operator as a finite state machine: it cannot move into the next state until it finishes what it is doing, and it does not appear to have timeouts for activities (in 2.2.0).

For detection, you can use the splunk-otel-collector-chart to index the logs of any pod to Splunk, including the operator pod.

An example of installing the OTEL collector using helm:

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
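
Adding the repo on its own installs nothing; a minimal install pointing the collector at a Splunk HEC endpoint looks roughly like the following (the endpoint, token and index are placeholders, and the values keys can differ between chart versions, so check the chart's values file):

helm install splunk-otel-collector \
  --set="clusterName=my-cluster" \
  --set="splunkPlatform.endpoint=https://hec.example.com:8088/services/collector" \
  --set="splunkPlatform.token=REPLACE_WITH_HEC_TOKEN" \
  --set="splunkPlatform.index=k8s_logs" \
  splunk-otel-collector-chart/splunk-otel-collector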

Challenge 2 — The App Framework does not restart standalone instances

  • Applications that are deployed to the cluster manager (or a standalone search head) may require a restart. However, the app framework does not restart a standalone server or cluster manager (tested in version 2.5.2), so this is something you need to handle yourself (see the sketch after this list)
  • On another note, indexer-level (cluster scoped) applications appear to work as expected; the operator triggers an apply bundle command on the cluster manager
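
One way to handle this yourself is to exec into the pod and restart Splunk once the App Framework has placed the updated app; the namespace, pod and container names below follow SOK's usual naming but are placeholders:

# Restart Splunk inside a standalone pod after an app update
kubectl exec -n splunk-operator splunk-example-standalone-0 -c splunk -- \
  /opt/splunk/bin/splunk restart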

Challenge 3 — The App Framework does not support deletion

  • Application deletion in version 2.5.2 was not supported by the App Framework. If you remove an application from the S3 bucket you must manually remove it from the pods (github issue)

Challenge 4 — the App Framework deploys immediately

  • If you want to control the timing of the restart window, you will need to do this outside the operator in version 2.5.2. If a new file appears or an existing file is updated in the S3 bucket, it will be deployed in the next polling window; restarts occur regardless of the time of day

Challenge 5 — Unexpected restarts

  • The operator can also trigger unexpected restarts if you do not have a strong understanding of K8s. For example, changing the memory limits of the cluster manager custom resource updates the StatefulSet and re-creates the pod. This is required because the limits change happens at the K8s level; in-place resizing of limits is alpha in K8s 1.27, so this may change in the future
  • Since the CM has restarted, this in turn may trigger a rolling restart of the indexer cluster under some circumstances.
    Furthermore, if searchable rolling restart mode is enabled on the indexing tier, the restart may proceed one indexer pod at a time, as the replication/search factor won't be met on restart of the cluster manager. This is explained in the Splunk knowledgebase article Searchable Rolling Restart Restarts Less Peers per Round than Expected

Challenge 6 — No workload management pools

  • Due to the Splunk containers not running systemd, the workload management pools are no longer applicable

Challenge 7 — Only one Premium app is supported

  • Version 2.5.2 of the Splunk operator supports Splunk Enterprise Security (ES); however, ITSI is not supported in this version

Challenge 8 — The Splunk monitoring console changes indexers to “new” on pod restart

  • Due to the allocation of a new IP address after an indexer pod restarts, the monitoring console marks the indexer instance as “new”, even if it was previously configured in the MC settings. Due to this issue the distsearch.conf file will ignore the “new” indexers until the apply button is pressed in the MC's settings menu
  • To resolve this I created the report “MonitoringConsole — one or more servers require configuration automated” in Alerts for Splunk Admins to fully automate the “apply” of the K8s indexers appearing as new in the monitoring console
  • The report “MonitoringConsole — one or more servers require configuration” detects the issue but makes no attempt to resolve it

Challenge 9 — Server level restarts did not gracefully shutdown Splunk

  • This may be specific to my environment and/or Kubernetes version, but I found that rebooting an indexer node gave an inconsistent shutdown process for the indexer pods
  • Indexer pod 1 would sometimes shut down cleanly, and sometimes it would be killed
  • The remaining pods would generally receive a kill signal rather than a clean shutdown
  • To resolve this issue I created systemd unit files and an offline script to run a Splunk offline on the indexers prior to OS shutdown (a sketch of the offline script follows below)
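
The offline script itself is straightforward in concept: before the OS shuts down, issue a splunk offline against every indexer pod scheduled on the local node. A rough sketch, assuming the main container is named splunk, the namespace is splunk-operator, pod names contain “indexer”, and the admin password is readable from a local file:

#!/bin/bash
# Offline every indexer pod running on this node prior to OS shutdown
NS=splunk-operator
NODE=$(hostname)
for pod in $(kubectl get pods -n "$NS" --field-selector spec.nodeName="$NODE" -o name | grep indexer); do
  kubectl exec -n "$NS" "${pod#pod/}" -c splunk -- \
    /opt/splunk/bin/splunk offline -auth "admin:$(cat /etc/splunk/admin_password)"
done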

Challenge 10 — Moving the cluster manager to a new node

  • We initially had stability issues with our new hardware, and this resulted in extended downtime for the cluster manager, which we wanted to avoid
  • Due to our use of the kubernetes-sigs local static storage provisioner, the storage underneath the cluster manager was not replicated. We found that simply relocating the cluster manager pod to a new K8s node with new PVCs did not trigger the App Framework to deploy the clustered applications
  • Due to this issue a script was created to remove and re-create the cluster manager custom object, which triggers the App Framework (a sketch follows below). While this works to restore the cluster manager, it does result in a rolling restart of the indexers; as described in challenge 5 of this article, the restart may be performed one indexer at a time
  • If you have a replicated K8s filesystem you may be able to avoid the indexer cluster restart; I have an article on the options
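
In outline, the script does something like the following (resource and namespace names are placeholders; note this triggers the rolling restart described above, and runtime metadata such as resourceVersion and the status section should be stripped from the saved yaml before re-applying):

# Save, delete and re-create the cluster manager custom resource
kubectl get clustermanager example-cm -n splunk-operator -o yaml > cm.yaml
# edit cm.yaml to remove the status section and runtime metadata
kubectl delete clustermanager example-cm -n splunk-operator
kubectl apply -f cm.yaml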

Challenge 11 — Search heads outside the K8s cluster cannot access indexers inside the K8s cluster

Additional notes

  • The Splunk operator only supports indexers using SmartStore, as per its documentation.
  • Splunk software versions are tied to the releases of the operator; each release is tested against particular Splunk and K8s versions. I tested Splunk 9.1.4 with SOK 2.5.2 and it did not work as expected; we expect a future operator release will support a Splunk version newer than 9.1.3.
  • I have written an article about storage options for K8s pods that may be of interest if you are choosing the underlying storage for your solution.

Splunk operator — what we learnt

  • This one seems obvious in retrospect, but do not co-locate the cluster manager on the same K8s node as the indexer pods it manages. This setup does not fail gracefully when an indexer node goes down or reboots.
    If possible, locate the cluster manager on a K8s node serving a different indexer cluster, or on an unrelated node.
  • If you need to debug the ansible startup of the pods, add extra flags to the environment:
extraEnv:
  - name: ANSIBLE_EXTRA_FLAGS
    value: "-vvv"
  • Use an alternative port number if you cannot get an SSL-enabled port working on the default port numbers. For example, we used port 9998 on the indexer pods for an SSL-enabled S2S listener because the SOK configures port 9997 as a non-SSL S2S listener within the pod. We let istio (our K8s load balancer) listen on port 9997 for incoming SSL-enabled S2S traffic from outside K8s
  • Use “defaults:” to reference the Splunk ansible defaults; this allows you to customize config files such as $SPLUNK_HOME/etc/system/local/server.conf for SSL configuration or other items
  • Use K8s secrets to store SSL files; you can then add the required volumes in the relevant custom resource yaml file to mount them in the pod. Combining this with the defaults: option, you can configure the SSL files to use on startup of the pod (see the sketch after this list)
  • Configuration sometimes works differently when compared to non-K8s instances for no obvious reason. For example, “site0” for CMs doesn't work in the operator but works on non-K8s CMs
  • Another issue was found while changing the indexercluster custom resource. Any change that requires a restart of the indexers, for example changing K8s resource limits, results in a rolling restart.
    A Splunk offline occurs on the first indexer as expected; however, in my combination of Kubernetes version and Splunk operator version the pod is not deleted after going offline. Indexer (pod) 1 in the cluster would go offline and then come back up again with no changes.
    This pattern continues in an infinite loop until the pod is manually deleted, at which point the limit settings are updated on that pod; the issue then repeats on pod 2, pod 3 et cetera.
    The workaround is to delete each pod manually after its indexer goes offline (the StatefulSet will re-create them), or a faster (and higher-impact) option is to delete the entire StatefulSet to force all indexer pods to restart with the updated configuration (the operator will re-create the StatefulSet automatically).
  • When scaling up indexer pod numbers, in some cases an indexer pod requires a manual restart because the CM does not allow it to join the cluster. I did not find a reason for this, but it happened in operator version 2.5.2 and previous versions
  • Always set resource limits within the custom resource configuration; if you do not, the Splunk operator applies default limits, including an 8GB memory limit in 2.5.2
  • Random issues can occur when a new indexer pod is created during scale-up, for example the peer-apps directory failing to be created for no obvious reason. Removing the PVC and pod usually resolves this issue
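
To illustrate the combination of K8s secrets and the defaults option mentioned in the list above, a sketch follows. The secret name, mount path and server.conf attributes are illustrative, and the defaults block uses the splunk-ansible conf syntax, so verify the details against the splunk-ansible and SOK documentation for your versions.

spec:
  # Mount an existing K8s secret holding the certificates; declared volumes
  # are made available under /mnt/<volume-name> inside the pod
  volumes:
    - name: certs
      secret:
        secretName: splunk-indexer-certs
  # Use the splunk-ansible defaults to write SSL settings into
  # $SPLUNK_HOME/etc/system/local/server.conf on pod startup
  defaults: |-
    splunk:
      conf:
        - key: server
          value:
            directory: /opt/splunk/etc/system/local
            content:
              sslConfig:
                serverCert: /mnt/certs/server.pem
                sslRootCAPath: /mnt/certs/ca.pem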

Kubernetes limits

There is a lot of online content about Kubernetes limits and best practices; I believe we can simplify the discussion down to:

  • Setting the CPU request provides reserved CPU and scheduling priority
  • Setting the CPU limit allows a burst of CPU usage with no guarantee (unless it is equal to the request setting); it also stops excessive CPU usage through cgroup throttling
  • Setting the memory limit results in the OOM killer triggering if the limit is reached, while the memory request is used for scheduling the pod onto an appropriate node (an illustrative example follows this list)
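
Applied to an SOK custom resource, the requests and limits sit under spec.resources; the numbers below are purely illustrative and should be sized for your own indexers:

spec:
  resources:
    requests:
      cpu: "12"       # reserved CPU and scheduling priority
      memory: 32Gi    # used by the scheduler to place the pod
    limits:
      cpu: "24"       # burst ceiling, enforced via cgroup throttling
      memory: 48Gi    # exceeding this triggers the OOM killer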

For a deeper dive into how the memory and CPU limits work there are many online articles. In particular, the Red Hat cgroups deep-dive series and Martin Heinz's blog both cover this in depth.
Other resources include Daniele Polencic's article on CPU requests and limits in Kubernetes and the LearnK8s article Setting the right requests and limits in Kubernetes.

Kubernetes networking

Troubleshooting network connectivity can become complicated due to the use of overlay networks. Try to avoid MTU differences on any device in the path between endpoints in the solution.

In our case the MTU on all servers was 9000 but a router between our search heads & K8s indexers was set to an MTU of 1500 resulting in packet fragmentation.

The fragmented packets had an extremely high drop rate, as high as 20%, while non-fragmented packets tended to have a drop rate of 0.5–2%.
This high drop rate resulted in timeouts of 80 seconds when search heads were attempting to talk to the indexing tier.

We did not see the same issue with non-K8s instances, as they used TCP and appeared to avoid the fragmentation issue; flanneld was using VXLAN (UDP encapsulation) and those packets were fragmented.
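
Two quick checks that would have surfaced this earlier (the host name is a placeholder): compare the MTU of the physical and overlay interfaces on each node, and trace the effective path MTU towards the remote endpoint.

# Compare the MTU of the physical and overlay (e.g. flannel VXLAN) interfaces
ip link show | grep -i mtu
# Discover the path MTU towards a remote search head or indexer
tracepath searchhead.example.com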

An excellent article to further understand Kubernetes networking is Tracing the path of network traffic in Kubernetes.

Finally, we used istio as our K8s load balancer. We had the option of using the istio service mesh; however, due to an issue in the Splunk operator you must disable the mTLS feature.
We chose not to use the service mesh as it provides minimal advantage for Splunk with mTLS disabled.
Later research showed it may improve load balancing between pods and services, but I have not completed this testing.

Conclusion

The Splunk operator for Kubernetes is an excellent development by Splunk. We have utilised it to achieve improvements in the indexing performance of our environment.

We are now looking at moving search heads inside the Splunk operator as well where it makes sense.

While we had various challenges, I'm sharing them to make it easier for anyone else who uses the Splunk operator for Kubernetes. On a day-to-day basis, the vast majority of the issues mentioned here will not occur.

If you do encounter issues or want to request enhancements, please log them on the Splunk Operator for Kubernetes GitHub or via the Splunk support portal as appropriate.


Gareth Anderson

SplunkTrust member, working as a technical lead on technologies including Splunk, Kubernetes and Linux.