Debugging a Strange Kubernetes & Firebase Connection Reset Issue
A deep dive into Kubernetes, Firebase, and Cloud NAT networking
Background — ECONNRESET
At Leverege, we provide an IoT platform to build and manage large-scale IoT applications. From asset tracking to remote monitoring, millions of sensors communicate with our platform every day. The Leverege IoT platform is made up of microservices running on Google Kubernetes Engine (GKE) and uses a combination of Firebase, TimescaleDB, and BigQuery for data processing. In particular, we use Firebase to store realtime and state information about our devices.
Our networking problem began with occasional connection resets and timeouts on our API server as our automotive use case began to scale significantly. At first, the connection issues only manifested during times of high traffic, so we suspected a performance issue on our servers. But even after allowing the servers to scale to more replicas and running more child processes to handle requests, ECONNRESETs continued.
When we ran a load test at five times the production load, the resets occurred more frequently, and at a certain point the server could no longer process new messages, as unacknowledged messages thrown back into the queue by retry logic began to outnumber fresh data. Instead of implementing a dead letter queue to mask the symptoms, we decided to fix the underlying networking issue.
Debugging Kubernetes Networking
The first thing we tried was simply choosing a bigger machine for our node pool. For cost-saving reasons, API servers previously ran on n1-standard-2 preemptible machines on GKE. As seen in the graph below, TCP throughput on n1-standard-4 is almost double that of n1-standard-2. As we suspected, vertical scaling improved performance slightly, but the ECONNRESETs persisted.
Next, Googling “Kubernetes” and “connection reset” led us to a blog post on kube-proxy subtleties. The post describes an issue with Kubernetes services when pods serve large files to others in the same cluster. Since our server was returning relatively large JSON responses, the issue described in the post looked relevant.
Kubernetes Networking 101
Before summarizing the findings from this blog post, it helps to review the excellent overview of Kubernetes networking that its author, Yongkun Gui, gives:
- Pod-to-Pod: Inside Kubernetes, every pod gets its own IP address. With Container Network Interface (CNI) plugins like Weave, Calico, etc., pods can ping each other even when they are running on different hosts.
- Services: Since pods are ephemeral, relying on a pod IP address does not work once the pod dies. Instead, we use Kubernetes services (e.g. ClusterIP, NodePort, LoadBalancer), which act as an L4 load balancer for selected pods. The underlying component in charge of updating iptables on Kubernetes is called kube-proxy.
For healthy connections, when a client pod sends a packet to the service (dst: 192.168.0.2:80), kube-proxy invokes iptables rules to map the destination to the server pod IP (dst: 10.0.1.2:80). Likewise, when the server handles the packet and responds to the client pod (10.0.0.2), kube-proxy recognizes this, maps the source back to the service (192.168.0.2:80), and forwards the packet to the client.
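As a rough illustration of the mechanics above, here is a simplified sketch of the kind of DNAT rule kube-proxy programs. Real clusters use dedicated KUBE-SERVICES/KUBE-SEP-* chains with probabilistic backend selection; the addresses are the example ones from this section.

```shell
# Simplified sketch, not the literal rules kube-proxy generates:
# rewrite traffic aimed at the service VIP to a chosen backend pod.
iptables -t nat -A PREROUTING -d 192.168.0.2/32 -p tcp --dport 80 \
  -j DNAT --to-destination 10.0.1.2:80
# Reply packets are un-NATed automatically via the conntrack entry,
# which is why a conntrack entry marked INVALID breaks the reverse
# translation described in the next paragraph.
```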
However, let’s say that something goes wrong when the server pod returns the packet (step 3) to the service. If conntrack is out of capacity or the packet falls outside the TCP window, the packet is marked INVALID; but without a corresponding iptables rule to drop it, the packet is delivered to the client with an unexpected source IP.
Now, to the client pod, this is a problem: the packet shows the server pod’s IP (10.0.1.2) rather than the service IP (192.168.0.2), so it sends an RST packet to tear down the connection. The server pod, on the other hand, receives that RST (which is a valid pod-to-pod packet from its perspective) and agrees to close the connection.
This certainly looked like a promising issue for us. To see if this was indeed the cause of our ECONNRESETs, we deployed the example code linked in the post: https://github.com/tcarmet/k8s-connection-reset
After scaling the app in the repo to 50+ replicas (to ensure pods were scheduled on different nodes) and stressing it with a heavy load, we did see some connection reset by peer logs. However, the reset messages from the k8s-connection-reset apps were far fewer than the general ECONNRESET messages from our servers.
Chasing Kubernetes Connection Reset — Part II
Before fully abandoning this experiment, we ran into another interesting blog post by the engineering team at lastminute.com. The main difference here was manually shrinking the conntrack capacity and running a variation of the server to trigger invalid packets.
After ssh-ing into underlying GKE nodes, we reduced the conntrack hash table size as suggested by the blog:
sysctl -w net.netfilter.nf_conntrack_buckets=150
sysctl -w net.netfilter.nf_conntrack_max=600
To trigger the capacity error, we deployed the boom-server from the GitHub repo and began to see more frequent Connection reset by peer errors.
We could also confirm the issue by running sudo conntrack -S on the nodes and watching the insert_failed counter climb.
After validating the connection reset issue, we set out to implement the fix. The solution is to explicitly tell conntrack to drop invalid packets or be more liberal and not mark them as invalid.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
For clusters running Kubernetes 1.14 and below, the following DaemonSet fixes the issue by applying the conntrack patch to all nodes:
- name: startup-script
  env:
    - name: STARTUP_SCRIPT
      value: |
        echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
Because the fix was merged into newer versions of Kubernetes, our solution was simply to upgrade Kubernetes to 1.16. After the upgrade, we validated the fix by checking that conntrack was dropping invalid packets:
$ sudo iptables -S | grep -i KUBE-FORWARD
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
The bad news?
The problem still persisted: there were no more insert_failed events, but the same number of ECONNRESETs hit our server.
We also found a few more blog posts suggesting the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag and patched iproute2/netfilter packages to address a subtle race condition (see the posts below for more detailed explanations). But the symptoms described in those posts did not seem relevant to us, so we decided to look elsewhere.
- A reason for unexplained connection timeouts on Kubernetes/Docker
- How we investigated packet loss due to a Kernel race condition
- 5 - 15s DNS lookups on Kubernetes?
Perhaps we were overzealous with the Google searches to begin with. Using APM tools, we were able to narrow down the source of timeouts and subsequent resets to Firebase operations. When we ramped up the load, we saw some unbelievable P99 durations when fetching data from Firebase (10–15 minutes of response time, according to our probes!). The problem was still network-related, but it was now much clearer that a core Kubernetes issue was unlikely.
A few months before we noticed the ECONNRESETs, our team had actually hit an issue where a firebase-admin library update caused a significant slowdown in writes to the database. Moving from @firebase/database version 0.5.12 to 0.5.13 introduced delays of 5,000–10,000 ms in update time.
ref.update is very slow/unusable · Issue #2487 · firebase/firebase-js-sdk
The issue was fixed by the Firebase team, but it also led us to shard our single device database into a cluster of 10+ Firebase database instances to overcome the 1,000 write operations/second limit per instance. Since then, we had seen some periods of high Firebase load (>85%) at 5x the production load, but ECONNRESETs persisted even during times of low Firebase load (<25%).
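The shard count falls out of simple arithmetic: divide the peak write rate by the per-instance limit and round up. A quick sketch, where the peak rate is a hypothetical number for illustration:

```shell
# Hypothetical peak write rate; the 1,000 writes/sec figure is the
# per-instance limit mentioned above.
PEAK_WRITES_PER_SEC=9500
LIMIT_PER_INSTANCE=1000
# Ceiling division: number of shards needed to stay under the limit.
SHARDS=$(( (PEAK_WRITES_PER_SEC + LIMIT_PER_INSTANCE - 1) / LIMIT_PER_INSTANCE ))
echo "shards needed: $SHARDS"
```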
Next, we used the profiler to check for bottlenecks in Firebase tasks. We expected to see slow execution speeds or long pending times, indicating that we were overloading Firebase or doing inefficient lookups that downloaded more data than we needed.
firebase database:profile --project <my-awesome-project> --instance <my-awesome-db-1>
But the results showed fast read/write times with almost no pending time.
This suggested that the problem was not Firebase itself, but the connection to Firebase. After reading through more Stack Overflow posts and digging through the Firebase docs, we found the enableLogging option.
Finally, the logs confirmed our suspicion: Firebase continuously saw unhealthy connections and closed them.
Connection created -> Websocket connecting to database -> Closing unhealthy connection after timeout -> Closing realtime connection -> Shutting down all connections -> Websocket is being closed -> Websocket connection was disconnected.
The culprit behind our strange network issue turned out to be the minimum ports per VM setting on our Cloud NAT gateway. Initially, Cloud NAT was ruled out as an obvious suspect since we erroneously assumed that having Stackdriver logging on would surface any errors to the Error Reporting console. We also wrongly assumed that the NAT gateway would autoscale and manage its settings on its own.
But when we clicked into the Stackdriver logs, we saw a barrage of allocation_status: "DROPPED" entries for connections to a Google IP (Firebase). Since we had used the default settings when provisioning the NAT via Terraform, the minimum ports per VM instance was just 64, while we were attempting to open far more concurrent connections to Firebase.
Each Cloud NAT IP allows up to 64,000 ports to be opened (2¹⁶ ports minus reserved ranges), and minimum ports per VM determines how many VMs the NAT gateway can support. The default setting therefore allows 64,000/64 = 1,000 VMs per NAT IP, but each VM, and every pod on it, gets only 64 ports for concurrent outbound connections. As we were nowhere close to using 1,000 GKE nodes, we could safely raise the minimum ports per VM.
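The arithmetic above is easy to sanity-check:

```shell
PORTS_PER_NAT_IP=64000   # usable ports per Cloud NAT IP (approximate)
MIN_PORTS_PER_VM=64      # Cloud NAT default
# VMs one NAT IP can serve at the default allocation:
MAX_VMS=$(( PORTS_PER_NAT_IP / MIN_PORTS_PER_VM ))
echo "VMs per NAT IP at 64 ports/VM: $MAX_VMS"
# Raising the per-VM allocation trades VM capacity for connection headroom:
echo "VMs per NAT IP at 1024 ports/VM: $(( PORTS_PER_NAT_IP / 1024 ))"
```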
Immediately after raising the minimum ports per VM, P99 duration dropped to 100–200ms, and the ECONNRESETs disappeared 🎉🎉🎉
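For reference, the change is a one-line update to the NAT gateway. The gateway, router, and region names below are placeholders; the same knob is exposed as min_ports_per_vm on Terraform's google_compute_router_nat resource.

```shell
# Raise the per-VM port allocation on an existing Cloud NAT gateway
# (resource names and region are hypothetical).
gcloud compute routers nats update my-nat \
  --router=my-router \
  --region=us-central1 \
  --min-ports-per-vm=1024
```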
Previously, we had not run into this issue because we had a single Firebase database instance (one IP address). After sharding, though, each server was potentially connecting to n different IP addresses to fetch data. Our APM tool was also exporting data to an external source, consuming more of the available ports.
To networking experts, our mistake may seem silly; it should have been obvious to check the status of the NAT gateway in the first place. However, because we investigated Kubernetes and Firebase issues first, we learned a ton about networking within Kubernetes, found and fixed known connection issues with services, and became familiar with Firebase's debugging tools.
Some key takeaways:
- Kubernetes 1.14 or below has an inherent networking issue that is caused by invalid packets being forwarded to pods, causing intermittent connection resets under high load.
- Firebase has a relatively unknown option to enable debug logging to surface more information.
- StackDriver’s Error Reporting tool does not include error logs from some managed services (in our case, Cloud NAT). Users are responsible for monitoring those logs.
- Cloud NAT’s default setting for minimum ports per VM instance may be too low for high-traffic applications requiring external internet access. Until Google fully integrates Firebase into GCP or provides a private connector, calls to Firebase are treated as external traffic.
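On the Cloud NAT takeaway above: rather than waiting for Error Reporting, the NAT logs can be queried directly for dropped port allocations. A sketch of such a query, where the filter fields reflect our understanding of the Cloud NAT log format:

```shell
# List recent Cloud NAT port-allocation drops from Stackdriver/Cloud Logging
# (requires an authenticated gcloud session; fields per the NAT log schema).
gcloud logging read \
  'resource.type="nat_gateway" AND jsonPayload.allocation_status="DROPPED"' \
  --limit=10
```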