Debugging a Strange Kubernetes & Firebase Connection Reset Issue

Yitaek Hwang
Jun 2, 2020 · 9 min read

A deep dive into Kubernetes, Firebase, and Cloud NAT networking

Image Credit: Modified from Silicon Angle

Background — ECONNRESET

At Leverege, we provide an IoT platform to build and manage large scale IoT applications. From asset tracking to remote monitoring, millions of sensors communicate with our platform every day. The Leverege IoT platform is made up of microservices running on Google Kubernetes Engine (GKE) and uses a combination of Firebase, TimescaleDB, and BigQuery for data processing. In particular, we use Firebase to store realtime and state information about our devices.

Our networking problem began with occasional connection resets and timeouts on our API server as our automotive use case began to scale significantly. At first, the connection issues only manifested during times of high traffic, so we suspected a performance issue on our servers. But even after allowing the servers to scale to more replicas and running more child processes to handle requests, ECONNRESETs continued.

When we ran a load test at five times the production load, the resets occurred more frequently, and at a certain point the server could no longer keep up with new messages: unacknowledged messages thrown back into the queue by retry logic started to outnumber fresh data. Instead of implementing a dead letter queue to mitigate the symptoms, we decided to fix the underlying networking issue.

Debugging Kubernetes Networking

The first thing we tried was simply choosing a bigger machine for our node pool. For cost-saving reasons, the API servers previously ran on preemptible n1-standard-2 machines on GKE. As seen in the graph below, TCP throughput on n1-standard-4 is almost double that of n1-standard-2. As we suspected, vertical scaling improved performance slightly, but the ECONNRESETs persisted.

Graph Taken from GCP Blog

Next, Googling “Kubernetes” and “connection reset” led us to a blog post on kube-proxy subtleties. The blog describes an issue with Kubernetes services when pods are serving large files to others in the same cluster. Since our server was returning relatively large JSON responses, the issue described on the blog looked relevant.

Before getting to the issue itself, the post's author, Yongkun Gui, gives an excellent overview of Kubernetes networking:

  • Pod-to-Pod: Inside Kubernetes, every pod gets its own IP address. With Container Network Interface (CNI) plugins like Weave, Calico, etc., pods can reach each other even when they are running on different hosts.
  • Services: Since pods are ephemeral, relying on a pod's IP address breaks as soon as the pod dies. Instead, we use Kubernetes Services (e.g. ClusterIP, NodePort, LoadBalancer), which act as an L4 load balancer in front of the selected pods. The component in charge of keeping the iptables rules up to date on each node is kube-proxy.

For healthy connections, when a client pod sends a packet to the Service (dst: 192.168.0.2:80), the iptables rules installed by kube-proxy rewrite the destination to the server pod's IP (dst: 10.0.1.2:80). Likewise, when the server handles the packet and responds to the client pod (10.0.0.2), the reply is recognized and its source is mapped back to the Service (192.168.0.2:80) before the packet is forwarded to the client.

Good Packet Flow — Diagram from Kubernetes Blog
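
These translation rules are visible on the nodes themselves. If you want to peek at them, something along these lines works on a node running kube-proxy in iptables mode (the KUBE-SEP chain hash is just an illustrative placeholder; yours will differ):

# Entry points keyed by Service ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | head
# A per-endpoint chain holding the DNAT rule that rewrites to a pod IP
sudo iptables -t nat -L KUBE-SEP-XXXXXXXXXXXXXXXX -n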

However, let's say something goes wrong when the server pod returns the packet to the Service (step 3). If conntrack is out of capacity or the packet falls outside the TCP window, the packet is marked INVALID; without a corresponding iptables rule to drop it, it is delivered to the client with an unexpected source IP.

Now, to the client pod, this is a problem: the packet carries the server pod's IP (10.0.1.2) rather than the Service IP (192.168.0.2), so it sends an RST packet to drop the connection. The server pod, on the other hand, receives the RST over what looks like a valid pod-to-pod connection and agrees to close it.

Connection Reset Flow — Diagram from Kubernetes Blog

This certainly looked like a promising lead. To see if this was indeed the cause of our ECONNRESETs, we deployed the example code linked in the blog: https://github.com/tcarmet/k8s-connection-reset

After scaling the app in the repo to 50+ replicas, so that pods were scheduled across different nodes, and stressing it with a heavy load, we did see some connection reset by peer logs. However, the resets from the k8s-connection-reset app were far fewer than the ECONNRESET errors our servers were producing.
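
The scaling itself was nothing fancy; roughly the following, with the deployment name depending on the manifests in that repo:

# Spread the test app across many nodes (replace the deployment name with the one from the repo)
kubectl scale deployment connection-reset-app --replicas=50
kubectl get pods -o wide   # confirm the pods landed on different nodes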

Before fully abandoning this experiment, we ran into another interesting blog post by the engineering team at lastminute.com. The main difference there was manually shrinking the conntrack capacity and running a variation of the server designed to trigger invalid packets.

After SSHing into the underlying GKE nodes, we reduced the conntrack hash table size as suggested by the blog:

sysctl -w net.netfilter.nf_conntrack_buckets=150
sysctl -w net.netfilter.nf_conntrack_max=600
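
As a sanity check, the table's current usage and limits can be read back on the node before and after shrinking it (the node name and zone below are placeholders):

# SSH to the node and inspect the conntrack table
gcloud compute ssh gke-node-name --zone us-east1-b
sysctl net.netfilter.nf_conntrack_count    # entries currently tracked
sysctl net.netfilter.nf_conntrack_max      # table capacity (600 after the change above)
sysctl net.netfilter.nf_conntrack_buckets  # hash bucket count (150 after the change above)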

To trigger the capacity error, we deployed the boom-server from the GitHub repo and began to see more frequent Connection reset by peer errors.

We could also see the same issue by running sudo conntrack -S on the nodes and watching the insert_failed counter climb.
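
A rough way to keep an eye on that counter while the load test runs (output formatting may differ slightly between conntrack-tools versions):

# insert_failed is reported per CPU; on a healthy node it stays at or near zero
sudo conntrack -S | grep -o 'insert_failed=[0-9]*'
# or keep the full stats on screen during the test
watch -n 1 'sudo conntrack -S'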

After validating the connection reset issue, we set out to implement the fix. The solution is either to explicitly drop packets that conntrack marks INVALID, or to make conntrack more liberal so that out-of-window packets are not marked INVALID in the first place. The following sysctl does the latter:

sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
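
The other half of the workaround, dropping packets that conntrack has already marked INVALID so they never reach the client pod, comes down to a single iptables rule; newer kube-proxy versions install an equivalent rule themselves, as shown further below:

# Drop forwarded packets that conntrack considers INVALID
iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP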

For clusters running Kubernetes 1.14 or below, the following DaemonSet fixes the issue by applying the conntrack patch to all nodes:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
        - name: startup-script
          image: gcr.io/google-containers/startup-script:v1
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          env:
            - name: STARTUP_SCRIPT
              value: |
                #! /bin/bash
                echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
                echo done

Because the fix was merged into newer versions of Kubernetes, our solution was to simply upgrade the cluster to Kubernetes 1.16. After the upgrade, we validated the fix by checking that conntrack was indeed dropping invalid packets:

$ sudo iptables -S | grep -i KUBE-FORWARD
...
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
...
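
For completeness, the upgrade itself can be driven from the console or from gcloud; roughly the following, where the cluster and node pool names are placeholders and the exact 1.16 patch version depends on what GKE offers at the time:

# Upgrade the control plane first, then each node pool
gcloud container clusters upgrade my-cluster --master --cluster-version 1.16
gcloud container clusters upgrade my-cluster --node-pool default-pool --cluster-version 1.16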

The bad news?

The problem still persisted. There were no more insert_failed packets, but the same number of ECONNRESETs kept hitting our server.

We also found a few more blog posts suggesting changes to NF_NAT_RANGE_PROTO_RANDOM_FULLY and patched iproute2/netfilter packages to address a subtle race condition (see those posts for more detailed explanations). But the symptoms they described did not seem relevant to us, so we decided to look elsewhere.

Debugging Firebase

Perhaps we were too overzealous with the Google searches to begin with. Using APM tools, we were able to narrow the source of the timeouts and subsequent resets down to Firebase operations. When we ramped up the load, we saw some unbelievable P99 durations when fetching data from Firebase (10–15 minutes of response time, according to our probes!). The problem was still network-related, but it now looked much less likely to be a core Kubernetes issue.

Crazy P99 Duration to Firebase

A few months before we noticed the ECONNRESETs, our team had actually hit an issue where a firebase-admin library update caused a significant slowdown in writes to the database. Moving from @firebase/database version 0.5.12 to 0.5.13 caused delays of 5–10,000 ms in update time.

The issue was fixed by the Firebase team, but it also led us to shard our single device database into a sharded cluster of 10+ Firebase databases to overcome the limit of 1,000 write operations per second. Since then, we had seen some periods of high Firebase load (>85%) at 5x the production load, but ECONNRESETs persisted even during times of low Firebase load (<25%).

Firebase Load Percentages

Next, we used the profiler to check for bottlenecks in Firebase tasks. We expected to see slow execution speeds or long pending times indicating that we were overloading Firebase or doing inefficient lookups that downloaded more data than we needed.

firebase database:profile --project <my-awesome-project> --instance <my-awesome-db-1>

But the results showed fast read/write times with almost no pending time.

This suggested that the problem was not Firebase itself, but the connection to Firebase. After reading through more Stack Overflow posts and digging through the Firebase docs, we found the option to enable debug logging.

Finally, the logs confirmed our suspicion: Firebase continuously saw unhealthy connections and closed them.

Connection created -> Websocket connecting to database -> Closing unhealthy connection after timeout -> Closing realtime connection -> Shutting down all connections -> Websocket is being closed -> Websocket connection was disconnected.

Cloud NAT

The culprit behind our strange network issue turned out to be the minimum ports per VM setting on our Cloud NAT gateway. Initially, Cloud NAT was ruled out as an obvious suspect because we erroneously assumed that having Stackdriver logging on would surface any errors in the Error Reporting console. We also falsely assumed that the NAT gateway would autoscale and manage its settings on its own.

But when we clicked into the Stackdriver logs, we saw a barrage of allocation_status: "DROPPED" entries for connections to a Google IP (Firebase). Since we had used the default settings when we provisioned the NAT via Terraform, the minimum ports per VM instance was just 64, while we were attempting to open far more connections to Firebase than that.
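
Once you know Cloud NAT logs under its own resource type, those drops are easy to pull out of Stackdriver; a query along these lines (field names as per the Cloud NAT logging docs) surfaces them:

# Show recent NAT port-allocation drops
gcloud logging read \
  'resource.type="nat_gateway" AND jsonPayload.allocation_status="DROPPED"' \
  --limit=20 --format=json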

Cloud NAT Settings

Each Cloud NAT IP allows roughly 64,000 ports to be opened (2¹⁶ ports minus the reserved low-numbered ones), and the minimum ports per VM setting determines how many VMs the NAT gateway can support. So the default setting allows 64,000/64 = 1,000 VMs, but the pods on each node collectively get only 64 ports for simultaneous connections out to external IP addresses. As we were nowhere close to running 1,000 GKE nodes, we could safely raise the minimum ports setting to a much higher value.
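
We manage the gateway through Terraform (the min_ports_per_vm field on google_compute_router_nat), but the equivalent one-off change via gcloud looks roughly like this, with the NAT, router, region, and port count being placeholders to suit your setup:

# Raise the per-VM port reservation on the NAT gateway
gcloud compute routers nats update my-nat \
  --router=my-router --region=us-east1 \
  --min-ports-per-vm=4096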

Immediately after raising the minimum ports per VM, P99 duration dropped to 100–200 ms, and the ECONNRESETs disappeared 🎉🎉🎉

Previously, we had not run into this issue because we had a single Firebase database instance (one IP address). After sharding, though, each server was potentially reaching out to n different IP addresses to fetch data. Our APM tool was also exporting data to an external endpoint, consuming even more of the available ports.

Conclusion

To networking experts, our mistake may seem silly: it should have been obvious to check the status of the NAT gateway in the first place. However, because we investigated Kubernetes and Firebase first, we learned a ton about networking within Kubernetes, found and fixed a known connection-reset issue with Kubernetes Services, and became familiar with Firebase's debugging tools.

Some key takeaways:

  • Kubernetes 1.14 and below have a known networking issue in which invalid packets are forwarded to pods, causing intermittent connection resets under high load.
  • Firebase has a relatively unknown option to enable debug logging to surface more information.
  • Stackdriver's Error Reporting tool does not include error logs from some managed services (in our case, Cloud NAT). Users are responsible for monitoring those logs themselves.
  • Cloud NAT's default setting for minimum ports per VM instance may be too low for high-traffic applications that need external internet access. Until Google fully integrates Firebase into GCP or provides a private connector, calls to Firebase are treated as external traffic.
