Debugging k8s Issues: Intermittent Outbound TLS Issues with Linkerd

James Choa
Xendit Engineering
Jun 2, 2022

When you have eliminated the impossible, whatever remains, however improbable, must be the truth

— Sherlock Holmes

Debugging infrastructure issues

Software engineering is a complex craft. Every line of code has a chance of introducing a new bug, and when those bugs show up, software engineers debug their apps to figure out what’s causing the problem and fix it. Infrastructure engineers are no strangers to debugging either. Instead of going through application source code line by line, we go through the whole system and investigate each component piece by piece until the problem is found and a fix is made. Just like any other tech company, Xendit has its fair share of technological issues. Today I’ll be sharing the story of one of the many issues my team and I have solved in our infrastructure.

The Problem

Xendit was in the process of mass migrating its services over to Kubernetes. It’s not a trivial process, and we ran into plenty of issues along the way. One of the most common reports we received from our developers was that their services would occasionally throw the following error: “Client network socket disconnected before secure TLS connection was established.” This was interesting because it didn’t happen often, but it was happening across all our services. The devs also brought forward evidence showing that the issue occurred far more frequently for apps hosted in Kubernetes than in the old web server deployments. With enough evidence to show that this wasn’t just the internet being unreliable, we got on with investigating the problem.

Maybe it’s the network load?

The initial theory was that the services were suffering from network congestion. Kubernetes packs multiple apps onto the same nodes in a cluster. While this allows for efficient resource utilization, it also means that an app under heavy load can starve its neighbors of resources. This is commonly called the noisy neighbor problem.

Digging through our metrics, we managed to find evidence to support this theory. Our production clusters process a lot of customer traffic on a daily basis with some spikes occurring every now and then. Have we found the problem and is it time to work on the solution?

Production traffic

Turns out the answer was no. When we dug further, we realized that the problem was also present in our staging clusters. These clusters mirror what we have in production but are not exposed to customer traffic. They don’t receive much traffic outside of our load tests, yet the errors showed up even during moments of calm.

Staging traffic
Errors still present in our staging environment

With the evidence above, it was clear that this theory was incorrect. Time to move on to the next one.

Maybe this happens only to cross-region traffic?

Our services are used by customers from around the world, which means plenty of cross-region traffic and the complications that come with it. Perhaps these issues came from that cross-region traffic? Once again we looked into our observability stack to prove (or disprove) this theory.

Errors are present even in same region calls

Apparently, this wasn’t the case either. We had a look at traces from apps communicating with endpoints in the same region, and the problem was still present there. More alarmingly, we realized that traffic originating from Kubernetes would frequently fail when calling apps in the old web server environment, but traffic in the opposite direction was fine!

At this point, it became clear that something in our Kubernetes clusters was causing these issues. The question was: which part? Could it have been the CNI plugin? The underlying node OS? What about its networking configuration? All were possible, but a fault in any of those should have caused far more serious problems than this. There were a lot of places to look, and we would have spent a lot of time figuring out which one was the real cause. And then a teammate asked…

Maybe it’s Linkerd?

Naaah, it’s not Linkerd. It’s used by a lot of companies out there, so it must be reliable. There’s no way it’s Linkerd. I had it installed in the clusters at my previous role and we never encountered this kind of issue. Surely it’s something else specific to Xendit’s infra, like our networking configuration, right? But you know what, let’s try it out anyway. Since the problem was present in staging, we tested there by simply removing the linkerd-proxy sidecar from one of our services and waiting for a few minutes.
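For anyone who wants to run the same experiment, here’s a minimal sketch of how it can be done, assuming the sidecar was added by Linkerd’s automatic proxy injection. The deployment name and image below are made up for illustration; the only part that matters is the linkerd.io/inject annotation on the pod template:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service        # illustrative name, not one of our real services
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        # Tell the Linkerd proxy injector to skip this workload so that
        # newly created pods come up without the linkerd-proxy sidecar.
        linkerd.io/inject: disabled
    spec:
      containers:
        - name: example-service
          image: example-service:latest   # placeholder image
```

Since injection happens when a pod is created, the existing pods have to be recreated (for example with kubectl rollout restart deployment/example-service) before the sidecar actually goes away.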

Errors stopped after removing the linkerd-proxy sidecar

It was Linkerd! But then again, it could have been a fluke. To confirm, we removed the linkerd-proxy from three more services in staging and left them alone overnight. By the next morning, we got the same results from all of them. To my teammate: I’m sorry I doubted you!

The explanation and fix

Since mTLS is mandatory at Xendit, we can’t just remove Linkerd from our clusters, and it would be incredibly expensive in terms of engineering time to move over to a different service mesh. Fortunately, the Linkerd documentation offers the config.linkerd.io/skip-outbound-ports annotation to help with this issue. It tells Linkerd that outbound connections destined for the listed ports should bypass linkerd-proxy entirely. We lose out on TCP metrics for that traffic, but it’s a necessary trade-off to ensure the reliability of our services.
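Here’s roughly what that looks like. The annotation goes in the same pod-template metadata as in the earlier sketch, and the port here (443) is only an example; the real list is whatever external ports a service dials out to:

```yaml
template:
  metadata:
    annotations:
      # Outbound connections to these ports bypass linkerd-proxy entirely,
      # so they are no longer affected by the intermittent TLS disconnects.
      # The trade-off: Linkerd stops reporting TCP metrics for this traffic.
      config.linkerd.io/skip-outbound-ports: "443"
```

The annotation accepts a comma-separated list, so multiple ports can be skipped at once if a service talks to more than one external dependency.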

TCP dump of traffic between two pods: traffic is clearly flowing between them
Even though this is a regular curl request over port 80, the payload on the wire is not plaintext HTTP

But why was this happening in the first place? Unfortunately, we may never know. I opened an issue on Linkerd’s GitHub page, but it was closed without a deeper investigation or fix being implemented. Hopefully, the Linkerd team will take another look at this in the future. Until then, we’ll just have to rely on this workaround.

Conclusions

I quoted Sherlock Holmes at the start of this post because it felt very apt for the issue we encountered. We put too much trust in Linkerd, so it was never considered a possible cause of the problem. Yet in the face of cold hard evidence (and a bit of faith in a teammate’s intuition), there’s no denying that Linkerd was the culprit in this case. Before you carry on to other things, I’d like to impart some lessons that might help you in your own investigations:

  • Always have an observability stack in place (e.g. Datadog, New Relic, kube-prometheus). It is incredibly important for finding the source of any problem in a complex microservice infrastructure.
  • Start with a hypothesis for your problem, then dig through your monitoring tools for evidence that supports or refutes it.
  • Eliminate the impossible; whatever hypothesis remains, however improbable, must point to the root cause of the problem.

I hope you found this post interesting and learned something from it. Stick around though! Our team has many more stories to share, and we hope you’ll come back to read them once they’re up!

Oh, by the way, my team is hiring! If you’re interested in a career as an Infrastructure Engineer, drop by our careers page and apply for the position! You can also send me an email or message me on LinkedIn if you have more questions.


James Choa
Xendit Engineering

Currently a staff infrastructure engineer at Xendit with an obsession for Kubernetes and penguins.