Intermittent Connection Refused Errors After New Service Deployment in Kubernetes

Matt Sneller
The Emburse Tech Blog
4 min read · Oct 5, 2023

We had what appeared to be an odd connectivity issue with a new service deployed in Kubernetes. We’d see “Connection refused” errors on about 30% of the calls made to the service. Note that we have retries configured in our ingress proxy, so the real error rate in Production was much lower. The application uses Python 3.9 with Gunicorn as a WSGI HTTP server. The following details the troubleshooting steps, the solution, and how to avoid the issue. This service was previously hosted in Heroku and did not experience this error there.

We have four webhook pods and two webhook worker pods, for a total of six pods. The worker pods just process jobs from a Redis cache, so there is no web interface. Hindsight is 20/20, and this was a clue that we initially missed: a roughly 30% failure rate … two of our six pods (2/6 ≈ 33%) have no web interface …

The Error

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111

This was very easy to reproduce: just curl any endpoint for the service, and about 30% of the time you’ll receive a connection refused error.

curl https://webhooks.mydomain.com

This problem existed in both our test and prod environments, so it had to be a configuration issue or something different about how the traffic was handled in AWS.

Troubleshooting

Gunicorn

We started by looking at the app, hypothesizing that we had a threading or connection keep-alive issue. We ruled this out by hitting the endpoint directly on the pod that was running the app, using this command:

kubectl exec -it < webhooks pod name > -- curl http://localhost:8080

Pod logs also showed that the Gunicorn worker processes and threads were not crashing. Tweaking settings and adding an Nginx proxy did not help.
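
For context, the web pods run Gunicorn listening on port 8080 inside the container, along the lines of the Deployment snippet below. This is a rough sketch: the image name, module path, and worker count are illustrative, not the service’s actual configuration.

containers:
  - name: webhooks
    # Illustrative image and command; Gunicorn binds to 0.0.0.0:8080 so the
    # pod serves traffic on the same port the Service targets.
    image: registry.example.com/webhooks:latest
    command: ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "app:app"]
    ports:
      - name: http
        containerPort: 8080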

Gloo Proxy

In Kubernetes (K8s), we use Gloo Edge, which is a K8s ingress controller and API gateway. We needed to rule out an issue with the Gloo Virtual Service or some other proxy problem. To do this, we hit the K8s internal service for our webhooks service directly:

kubectl exec -it < app-pod > -- curl http://webhooks.app.svc.cluster.local:8080

Hitting the internal service, which bypasses Gloo entirely, resulted in the same 30% error rate, so it’s not Gloo.
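
For readers unfamiliar with Gloo Edge, the Virtual Service we were ruling out is a custom resource that maps the public domain to an upstream backed by the K8s Service, roughly like the sketch below. This is based on the Gloo Edge documentation; the names and the discovered-upstream naming convention are illustrative, not our actual configuration.

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: webhooks
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - 'webhooks.mydomain.com'
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              # Gloo discovery typically names upstreams <namespace>-<service>-<port>;
              # this assumes the webhooks Service in the app namespace on port 8080.
              name: app-webhooks-8080
              namespace: gloo-system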

Service

It’s not Gunicorn, and it’s not Gloo. Did we have a bad K8s node or some bad pods? Probably not, because other apps in the cluster were running without issue, but just in case, we replaced all the K8s nodes. Same problem, so it had to be something with the service. We also scaled the webhooks deployment from 4 to 10 pods. Hmm, our error rate changed with the pod count … so let’s check the endpoints in K8s (kubectl get ep). Maybe we have a routing issue? At this point, a team member noticed that the labels on the webhooks pods and the webhooks-worker pods were the same.

kubectl get pods --show-labels | grep webhooks
webhooks-86c79d5f77-99fwx app.kubernetes.io/component=webhooks,app=webhooks
webhooks-86c79d5f77-dxlqs app.kubernetes.io/component=webhooks,app=webhooks
webhooks-86c79d5f77-lcjsr app.kubernetes.io/component=webhooks,app=webhooks
webhooks-86c79d5f77-vcffb app.kubernetes.io/component=webhooks,app=webhooks
webhooks-worker-9d9667f58-hqrqt app.kubernetes.io/component=webhooks,app=webhooks
webhooks-worker-9d9667f58-n62v4 app.kubernetes.io/component=webhooks,app=webhooks

In Kubernetes, Services use labels to determine which pods are part of the service and should receive traffic. Look at the Selector: below.

kubectl describe service webhooks
Name: webhooks
Namespace: app
Labels: app=webhooks
app.kubernetes.io/component=webhooks
Annotations: <none>
Selector: app.kubernetes.io/component=webhooks,app=webhooks
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 172.20.188.1
IPs: 172.20.188.1
Port: http 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.0.0.129:8080,10.0.0.230:8080,10.0.0.236:8080 + 1 more...
Session Affinity: None
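
For reference, the manifest behind that output declares the selector roughly like this (a sketch reconstructed from the kubectl describe output above, not the actual file):

apiVersion: v1
kind: Service
metadata:
  name: webhooks
  namespace: app
  labels:
    app: webhooks
    app.kubernetes.io/component: webhooks
spec:
  type: ClusterIP
  selector:
    # Any pod in this namespace carrying BOTH of these labels becomes an
    # endpoint of the Service, whether or not it actually listens on 8080.
    app: webhooks
    app.kubernetes.io/component: webhooks
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP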

That’s it! The service is sending traffic to webhooks and webhooks-worker pods. We can confirm this by listing the pods that are included in the webhooks endpoint:

kubectl get endpoints webhooks -o json | jq '.subsets[].addresses[].targetRef.name'
"webhooks-86c79d5f77-vcffb"
"webhooks-86c79d5f77-lcjsr"
"webhooks-86c79d5f77-dxlqs"
"webhooks-86c79d5f77-99fwx"
"webhooks-worker-9d9667f58-hqrqt"
"webhooks-worker-9d9667f58-n62v4"

Okay, so we need to fix the labels in the webhooks-worker deployment YAML so that they differ from the webhooks labels.

How did this happen?

For this service we are using Helm templating to generate all the YAML needed to deploy webhooks, which consists of the deployments (webhooks and webhooks-worker), the Gloo Virtual Service, the K8s Service, a service account, cron jobs, and secrets. The base template layout that was used includes a helper template that defines default selector labels with a generic name.

{{/*
Selector labels
*/}}
{{- define "webhooks.baseSelectorLabels" -}}
app: {{ include "webhooks.name" . }}
app.kubernetes.io/component: {{ include "webhooks.name" . }}
{{- end }}

"webhooks.name" will always be webhooks and in the deployment template, we had the following:

spec:
  template:
    metadata:
      labels:
        {{- include "webhooks.baseSelectorLabels" $ | nindent 8 }}
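
Rendered, both the webhooks and webhooks-worker Deployments therefore stamp identical labels onto their pod templates, which is exactly the pair the Service selector matches:

# Rendered pod template labels for BOTH deployments before the fix
labels:
  app: webhooks
  app.kubernetes.io/component: webhooks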

The fix is to make them match the deployment name, like:

spec:
  template:
    metadata:
      labels:
        app: {{ $deploymentName }}
        app.kubernetes.io/component: {{ $deploymentName }}
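
With that change, the two Deployments render distinct pod labels, so only the web pods match the Service selector. The output below is illustrative, assuming $deploymentName is set to each deployment’s name:

# webhooks deployment: still matches the Service selector
labels:
  app: webhooks
  app.kubernetes.io/component: webhooks

# webhooks-worker deployment: no longer matches the Service selector
labels:
  app: webhooks-worker
  app.kubernetes.io/component: webhooks-worker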

Lessons learned

We have a few lessons here.

  1. Thoroughly review the YAML created by Helm templates by running
    helm template webhooks -f base.values.yaml -f prod.values.yaml . > output.yaml
  2. Focus on what is new or changed in the environment. In this case, the app worked correctly in Heroku, so we should have looked closer at the new components: the Kubernetes configuration, Service, and deployments.
  3. Look deeper at failure rates. A consistent failure percentage (here, roughly 2 in 6) usually points at something structural rather than random.
  4. Review service endpoints, for example with the quick checks shown below.
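
These two commands, both used earlier in this post, would have surfaced the mismatch immediately:

# Which pods is the Service actually routing to?
kubectl get endpoints webhooks -o json | jq '.subsets[].addresses[].targetRef.name'

# Does that list match only the pods you expect to serve web traffic?
kubectl get pods --show-labels | grep webhooks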


DevOps Architect at Emburse with 23+ years of experience in technology organizations