DNS issues in Kubernetes. Public postmortem #1

Amet Umerov
Preply Engineering Blog
4 min read · May 4, 2020

In this article, I’ll share our experience with postmortems at Preply.

Here’s an example of one of our latest production DNS incidents, written up as a postmortem. The article could be helpful for those who want to know more about postmortems or want to prevent DNS issues in the future.

My name is Amet Umerov and I’m a DevOps Engineer at Preply.com. Let’s get started!

A little bit about postmortems and processes at Preply

A postmortem describes a production outage or paging event including a timeline, description of user impact, root cause, action items, and lessons learned.

Seeking SRE. By David N. Blank-Edelman

At our weekly dev meetings with pizza, we share information between technical teams. One of the important parts of these meetings is postmortem sharing. Sometimes it’s accompanied by an additional presentation with slides and a more detailed analysis of the incident. Of course, we have a blameless culture, but we don’t clap for postmortems :) The main reason we write and present postmortems to the team is that we believe sharing knowledge can prevent problems in the future.

The individuals involved in a post mortem must feel that they can give this detailed account without fear of punishment or retribution. No finger pointing! Writing a post mortem is not a punishment — it is a learning opportunity for the entire company.

Keep CALMS & DevOps: S is for Sharing

DNS issues in the Kubernetes cluster. Postmortem

Date: 2020–02–28

Authors: Amet U., Andrii S., Igor K., Oleksii P.

Status: Complete

Summary: Partial DNS outage (time to mitigate: 26 min) for some services inside the Kubernetes cluster

Impact: 15000 events dropped for services A, B, and C

Root causes: Kube-proxy failed to delete an old entry from the conntrack table, so some services were still being routed to nonexistent pods

E0228 20:13:53.795782       1 proxier.go:610] Failed to delete kube-system/kube-dns:dns endpoint connections, error: error deleting conntrack entries for UDP peer {100.64.0.10, 100.110.33.231}, error: conntrack command returned: ...
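
The log above shows kube-proxy shelling out to the conntrack command to clear stale UDP entries for the kube-dns endpoints, and failing. As a minimal sketch, assuming conntrack-tools is installed on the node and that 100.64.0.10 is the kube-dns ClusterIP from the log, this is roughly how the same entries could be inspected and flushed manually:

# Run as root on the affected node.
# List UDP conntrack entries whose original destination is the kube-dns ClusterIP
conntrack -L -p udp --orig-dst 100.64.0.10

# Delete them so new DNS packets are DNATed to the currently existing CoreDNS pods
conntrack -D -p udp --orig-dst 100.64.0.10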

Trigger: Due to low load inside the Kubernetes cluster, CoreDNS-autoscaler decreased the pod count from 3 to 2

Resolution: A regular deploy to the Kubernetes cluster triggered the creation of new nodes; CoreDNS-autoscaler added more pods, and the conntrack table was rewritten automatically

Detection: Prometheus monitoring detected a high 5xx rate on services A, B, and C and paged the on-call engineers

5xx rate in Kibana

Action Items

+================================+==========+=========+============+
| Action Item                    | Type     | Owner   | Task       |
+================================+==========+=========+============+
| Disable autoscaler for CoreDNS | prevent  | Amet U. | DEVOPS-695 |
+--------------------------------+----------+---------+------------+
| DNS cache server on prod       | mitigate | Max V.  | DEVOPS-665 |
+--------------------------------+----------+---------+------------+
| Conntrack usage monitoring     | prevent  | Amet U. | DEVOPS-674 |
+--------------------------------+----------+---------+------------+
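
The CoreDNS autoscaler is typically the cluster-proportional-autoscaler running as its own Deployment. As a rough sketch of the first action item (DEVOPS-695), assuming the autoscaler runs as a coredns-autoscaler Deployment with a ConfigMap of the same name in kube-system (names vary between setups):

# Check the current scaling parameters (ConfigMap name is setup-specific)
kubectl -n kube-system get configmap coredns-autoscaler -o yaml

# Stop the autoscaler so it no longer changes the CoreDNS replica count
kubectl -n kube-system scale deployment coredns-autoscaler --replicas=0

# Pin CoreDNS to a fixed replica count
kubectl -n kube-system scale deployment coredns --replicas=3

The trade-off is that CoreDNS replicas then have to be adjusted manually as the cluster grows.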

Lessons learned

What went well:

  • Monitoring detection was quick. The reaction was rapid and organized
  • We didn’t reach any limits on nodes

What went wrong:

  • We still don’t know the real root cause; it seems to be a specific bug in conntrack
  • All action items only fix the consequences, not the root cause
  • We knew about possible DNS issues but didn’t prioritize them

Where we got lucky:

  • Another deploy triggered CoreDNS-autoscaler, and the conntrack table was rewritten
  • The issue didn’t affect all services

Timeline (EET)

+=======+==========================================================+
| Time  | Action                                                   |
+=======+==========================================================+
| 22:13 | CoreDNS-autoscaler scaled down pods 3->2                 |
+-------+----------------------------------------------------------+
| 22:18 | On-call engineers started to get calls from VictorOps    |
+-------+----------------------------------------------------------+
| 22:21 | On-call engineers started troubleshooting                |
+-------+----------------------------------------------------------+
| 22:39 | On-call engineers reverted latest release of the service |
+-------+----------------------------------------------------------+
| 22:40 | 5xx errors were gone, the situation stabilized           |
+-------+----------------------------------------------------------+
  • Time to Detect: 4 min
  • Time to Engage: 21 min
  • Time to Fix: 1 min

Supporting information

  • CoreDNS logs:
I0228 20:13:53.507780       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"kube-system", Name:"coredns", UID:"2493eb55-3dc0-11ea-b3a2-02bb48f8c230", APIVersion:"apps/v1", ResourceVersion:"132690686", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set coredns-6cbb6646c9 to 2
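
To pull similar supporting information from a cluster, a quick sketch (again assuming the kube-system namespace and a coredns-autoscaler Deployment name, which may differ in your setup):

# Scaling events recorded against the CoreDNS Deployment
kubectl -n kube-system get events --field-selector involvedObject.name=coredns

# Current replica count and the autoscaler's recent decisions
kubectl -n kube-system get deployment coredns
kubectl -n kube-system logs deployment/coredns-autoscaler --tail=50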

To minimize CPU utilization, the Linux kernel uses connection tracking (conntrack), which keeps a table of NAT records. When the next packet arrives from the same source pod to the same destination, it isn’t translated again: to save CPU, the destination IP address is taken from the existing conntrack entry. That is also why a stale entry can keep steering traffic to a pod that no longer exists.

How conntrack works
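
To see this on a node, a short sketch (conntrack -L needs root; the same counters are exported by node_exporter as node_nf_conntrack_entries and node_nf_conntrack_entries_limit, which is one way to implement the conntrack usage monitoring action item):

# Show a few tracked connections; the reply tuple reflects the DNAT to a concrete pod IP
conntrack -L | head -n 5

# Current number of tracked connections vs. the kernel limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max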

Summary

So, that was an example of one of our postmortems, with some useful links. We hope you find it helpful.

In this particular case, the information we shared can be useful for other companies. That’s why we’re not afraid of making mistakes, and that’s why we decided to make one of our postmortems public. Here are a few more interesting public postmortems:

Subscribe to the Preply Engineering Blog for more interesting articles about engineering at Preply. Stay tuned!
