Operator SDK - Fixing Metrics in an Old Version
Hello everyone, in this post I will show you a workaround for an issue in an old version of Operator SDK (v0.18.2, to be precise).
We write our own Operators for many of our platform projects. When we started writing them, the current SDK version was v0.18.2, and we developed several operators with it.
The issue occurs while scraping an operator's metrics when you run the operator with more than one pod.
Your metrics look like this:
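Roughly, the picture on the Prometheus targets page is something like this (endpoint addresses are made up for illustration):

```
Endpoint                          State    Error
http://10.42.0.11:8383/metrics    UP
http://10.42.1.12:8383/metrics    DOWN     connection refused
http://10.42.2.13:8383/metrics    DOWN     connection refused
```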
It looks like two of our pods cannot serve metrics, which causes the "connection refused" errors we see in Prometheus.
Describing the Issue - The Metrics Bug
When you run an operator created by Operator SDK (v0.18.2), some auto-generated resources are deployed to collect metrics from the operator.
These resources are:
- ServiceMonitor (Prometheus scrape)
- Service
Let’s inspect those resources. Our operator’s name is “platform-operator”.
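The generated ServiceMonitor looks roughly like this (trimmed to the relevant fields; the "-metrics" suffix in the name follows the SDK's naming convention):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: platform-operator-metrics
  namespace: platform
  labels:
    name: platform-operator
spec:
  endpoints:
    - port: http-metrics
  selector:
    matchLabels:
      name: platform-operator
```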
As you can see, the ServiceMonitor resource selects the Service that carries the label "name: platform-operator".
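The generated metrics Service, again trimmed to the relevant fields (8383 is the default operator metrics port in this SDK version):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: platform-operator-metrics
  namespace: platform
  labels:
    name: platform-operator
spec:
  ports:
    - name: http-metrics
      port: 8383
      targetPort: 8383
  selector:
    name: platform-operator
```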
In the Service resource, we can see that it selects pods using the selector "name: platform-operator".
This means that our Prometheus scraper sends requests to every pod that has the "name: platform-operator" label.
I said that the bug occurs when you run the operator with more than one pod. But why?
To understand this, we need to take a look at the running operator pods and the source code.
When you run an operator with more than one pod (let's say 3), it selects one pod as the leader, and the other pods run in a waiting state (waiting to become the leader).
The operator creates a ConfigMap as a lock, which ensures that only one pod runs as the leader at a time.
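You can see the lock in the operator's namespace; roughly, it looks like this (trimmed), with an owner reference pointing to the leader pod:

```yaml
# kubectl get configmap platform-operator-lock -n platform -o yaml (trimmed)
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-operator-lock
  namespace: platform
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: platform-operator-6c2b12hh3
```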
And the other pods (which are not the leader) keep printing this log:
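Roughly, the repeated log line looks like this:

```
INFO	leader	Not the leader. Waiting.
INFO	leader	Not the leader. Waiting.
INFO	leader	Not the leader. Waiting.
```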
So far so good. We understand that two pods are running in a waiting state. But this does not explain the "connection refused" errors: the pods are healthy and running successfully.
So let's try to send a request to the metrics endpoint inside those pods.
First, we send a request to the leader pod:
```
kubectl exec -it -n platform platform-operator-6c2b12hh3 -- curl localhost:8383/metrics
```
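The endpoint answers with the usual Prometheus exposition text; the first few lines as an illustration:

```
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3.21
```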
We successfully got metrics.
Next, let's send the same request to one of the other operator pods:
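Assuming one of the non-leader pods is named platform-operator-7f9d45abcd (a made-up name for illustration):

```
kubectl exec -it -n platform platform-operator-7f9d45abcd -- curl localhost:8383/metrics
curl: (7) Failed to connect to localhost port 8383: Connection refused
command terminated with exit code 7
```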
Okay, we got the same error that Prometheus was reporting.
But why? Why are those pods refusing connections?
Let’s take a look at our operator’s source code.
In the auto-generated cmd/manager/main.go file, there is a piece of code like this:
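Roughly, the relevant part of main.go looks like this (abbreviated; unrelated setup is omitted):

```go
// Become the leader before proceeding.
ctx := context.TODO()
err = leader.Become(ctx, "platform-operator-lock")
if err != nil {
	log.Error(err, "")
	os.Exit(1)
}

// ... manager creation and controller registration omitted ...

// Add the metrics Service and ServiceMonitor.
addMetrics(ctx, cfg)

// Start the manager.
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
	log.Error(err, "Manager exited non-zero")
	os.Exit(1)
}
```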
We can see that the operator first tries to become the leader and only then runs addMetrics(ctx, cfg). Do not forget that the rest of our operator only starts working after the line err = leader.Become(ctx, "platform-operator-lock") returns.
The addMetrics function is responsible for creating the Service and ServiceMonitor resources:
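Roughly, it does the following (abbreviated from the scaffold; error handling and custom resource metrics are omitted):

```go
// Abbreviated from the v0.18.2 scaffold. "metrics" and "k8sutil" come from
// the operator-sdk packages; metricsPort is the 8383 constant defined in main.go.
func addMetrics(ctx context.Context, cfg *rest.Config) {
	operatorNs, err := k8sutil.GetOperatorNamespace()
	if err != nil {
		log.Info("Could not get the operator namespace", "error", err.Error())
	}

	// Expose the metrics port (8383) through a Service object.
	servicePorts := []v1.ServicePort{
		{Port: metricsPort, Name: metrics.OperatorPortName, Protocol: v1.ProtocolTCP,
			TargetPort: intstr.IntOrString{Type: intstr.Int, IntVal: metricsPort}},
	}
	service, err := metrics.CreateMetricsService(ctx, cfg, servicePorts)
	if err != nil {
		log.Info("Could not create metrics Service", "error", err.Error())
	}

	// Create ServiceMonitor resources so the prometheus-operator scrapes the Service above.
	services := []*v1.Service{service}
	_, err = metrics.CreateServiceMonitors(cfg, operatorNs, services)
	if err != nil {
		log.Info("Could not create ServiceMonitor object", "error", err.Error())
	}
}
```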
We know that two pods are continuously printing the log "Not the leader. Waiting.".
So let's take a look inside the leader.Become function.
Inside this function there is a loop that continuously tries to acquire the lock, and here we can see our log, "Not the leader. Waiting.":
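Paraphrasing the relevant loop from the SDK's leader package (setup such as getting the namespace, the client, and the owner reference is omitted):

```go
// Paraphrased: cm is the lock ConfigMap owned by this pod,
// maxBackoffInterval is a package constant.
backoff := time.Second
for {
	// Try to create the lock ConfigMap; whoever succeeds becomes the leader.
	err := client.Create(ctx, cm)
	switch {
	case err == nil:
		log.Info("Became the leader.")
		return nil
	case apierrors.IsAlreadyExists(err):
		// Another pod holds the lock: wait with backoff and try again.
		log.Info("Not the leader. Waiting.")
		time.Sleep(wait.Jitter(backoff, 0.2))
		if backoff < maxBackoffInterval {
			backoff *= 2
		}
	default:
		return err
	}
}
```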
This explains why the other pods refuse connections. They have not actually started! They are still stuck in this loop at the line err = leader.Become(ctx, "platform-operator-lock"), waiting to become the leader, so they never reach the code that exposes the metrics endpoint.
Solution
There is no built-in fix for this in Operator SDK v0.18.2, so you may consider upgrading your operator to a newer version. But if, as in our case, upgrading is not possible any time soon, you can implement the workaround we did.
We thought that, since only the leader pod can serve metric requests, we should somehow redirect all metric requests to the leader pod.
With this in mind, we decided to add an additional label to the leader pod and to add the same label to the selector of the Service resource.
To overcome the problem, we implemented the logic below:
After the addMetrics call, we get the leader pod, add the "leader=true" label to it, and update the pod. After that, we update the Service by adding the "leader=true" label to its selector field, as shown in the sketch below.
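A minimal sketch of that logic, assuming the POD_NAME environment variable is populated via the downward API (as in the default scaffold) and using the controller-runtime client; the helper name labelLeaderPod is ours, and the actual implementation is in the repository linked at the end:

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelLeaderPod marks the current (leader) pod with "leader=true" and narrows
// the metrics Service selector to that label. It is meant to be called right
// after addMetrics(ctx, cfg), i.e. only on the pod that won leader election.
func labelLeaderPod(ctx context.Context, c client.Client, namespace string) error {
	// The default scaffold injects the pod name via the downward API.
	podName := os.Getenv("POD_NAME")

	// Fetch our own pod and add the extra label.
	pod := &corev1.Pod{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podName}, pod); err != nil {
		return err
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels["leader"] = "true"
	if err := c.Update(ctx, pod); err != nil {
		return err
	}

	// Add the same label to the metrics Service selector so Prometheus only
	// scrapes the leader pod. "platform-operator-metrics" follows the SDK's
	// "<operator-name>-metrics" naming convention.
	svc := &corev1.Service{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: "platform-operator-metrics"}, svc); err != nil {
		return err
	}
	if svc.Spec.Selector == nil {
		svc.Spec.Selector = map[string]string{}
	}
	svc.Spec.Selector["leader"] = "true"
	return c.Update(ctx, svc)
}
```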
In this way, we ensured that metric requests go only to the leader pod.
Thank you for reading so far.
You can find the implementation of the workaround code in this repository: