Learned it the hard way: Don’t use Cilium’s default Pod CIDR

Isala Piyarisi
8 min read · Jun 6, 2024

I have been involved with eBPF for quite a while. When my lead approached me about live migrating all of our clusters from Azure CNI to Cilium CNI, I jumped at the opportunity. Even though it turned out to be one of the most demanding tasks I have ever undertaken, I enjoyed every second of it.

However, that’s a story for another time. The goal of this article is to share one of my very own k8s.af stories in the hope of saving someone a few days of hair-pulling.

Why Cilium?

Our primary objectives for choosing Cilium were:

  1. Better Network Isolation: Some of our clusters are shared with customer workloads, and we needed to isolate those workloads and control egress traffic effectively.
  2. Transparent Encryption using WireGuard: Again, with shared clusters, we wanted to embrace a zero-trust approach.
  3. Observability: Cilium comes packed with a vast array of observability features, allowing us to monitor Kubernetes workloads without any extra instrumentation.
  4. Service Mesh Features: Cilium can provide service mesh features like retries and circuit breaking without requiring a sidecar.
  5. Efficient and Lightweight Network Stack: Better performance with lower hardware costs? Sign us up!
  6. Cluster Meshing: We wanted to future-proof our infrastructure.

After the migration, everything was smooth sailing (well mostly…). We started shipping features we built on top of Cilium and received great feedback from our customers.

And everything was good until…

The incident

It was a typical Monday morning, and as part of our daily release cycle, the SRE team promoted the latest updates for our core service to the staging environment.

However, as soon as the promotion pipeline completed, our uptime monitoring solution started triggering events, indicating that the staging environment was unreachable. The SRE team immediately began investigating the root cause of the issue.

Before diving deeper, let me explain our network architecture (simplified version of it) using a quick diagram.

Two clusters, one is public-facing, and the other is private. The public-facing cluster is exposed via a firewall. Some services in the private cluster talk to the public cluster via a load balancer.

After initial debugging, the SRE team concluded that the issue was with the connection between the firewall and Cluster 1’s ingress service: all the services within Cluster 1 were running, and pods in Cluster 2 that hit Cluster 1’s load balancer directly were still working.

Through both Wireshark and Hubble, they confirmed that the “SYN” packets from the firewall were reaching the service, but no corresponding “SYN-ACK” was being sent back.
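For reference, this is roughly the kind of Hubble query you would keep open while reproducing such a failure. The namespace and port below are placeholders, not our actual service:

```bash
# Watch live TCP flows for the ingress service (namespace and port are placeholders).
# In a healthy handshake you see SYN followed by SYN-ACK; during the incident,
# only the inbound SYN flows ever showed up.
hubble observe --namespace ingress-nginx --protocol tcp --port 443 --follow
```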

After some unsuccessful attempts to resolve the issue, the SRE team promoted the release to production, since a few critical fixes needed to go out as quickly as possible. The rationale was that an application-level change couldn’t possibly break the infrastructure, and that this was an isolated event.

However, as soon as the release was promoted to production, the production load balancer also stopped responding. The SRE team quickly rolled back the changes, resolving the issue in production. Surprisingly, rolling back the changes in the staging cluster didn’t fix the issue there.

Disaster continues

Since the issue on staging remained unresolved, I was brought into the war room along with a network specialist from Azure support.

By this point, the SRE team had gathered a lot of data through extensive experimentation. They had tried deploying VMs in different subnets within our VNet, from the local node pool’s subnet to remote subnets belonging to other clusters, and even in the load balancer subnet. Everything worked as expected, except for traffic passing through the firewall.

They had even recreated both the firewall and load balancers from scratch, but the issue persisted.

When I joined the effort, I thoroughly reviewed all the data they had collected and ran some further tests, deploying different types of workloads and analyzing the Hubble and Wireshark logs for any clues that could shed light on the root cause.

When the Azure network engineer joined us, our SRE team briefed them on all the steps they had taken so far. After analyzing the gathered data, the engineer suggested deploying a VM inside the firewall’s subnet and attempting to initiate a TCP connection to the troubled Kubernetes cluster.

Our SRE team quickly set up a temporary VM within the firewall subnet and tried telnetting to the load balancer IP. The test reproduced the same issue we had been facing: the telnet connection never established. They then moved on to investigating whether there was a problem with the network peering between the firewall subnet and the load balancer or node pool subnets.

Out of curiosity, I asked another SRE member to run the telnet again while I monitored the Hubble logs. To my surprise, the SYN packet came in as soon as the telnet was initiated, but no “SYN-ACK” was ever returned.

This behavior seemed odd, so I asked him to open a port on the temporary VM while I tried connecting to it from our Kubernetes cluster. There was no response, and even after installing Wireshark on the temporary VM, we couldn’t see the incoming SYN packet. Interestingly, connecting to any destination outside the firewall subnet worked without issues.

Puzzled, I asked them to perform the same test but using the temporary VM we had created in the node pool’s subnet, and voila! It worked.

The Root Cause

Finding this behavior strange, I interrupted the members of the main incident response team, who were checking on the peering, and told them that the issue seemed to be with egress, not ingress. That pointed us towards a routing table issue within the AKS-managed nodes.

So our SRE team SSHed into one of the Kubernetes nodes and ran the `ip route` command, which showed several routes added by Cilium to enable cross-node communication.
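If you don’t have SSH access to the nodes, one possible alternative for dumping a node’s routing table is an ephemeral debug pod. This is only a sketch; the node name and image are placeholders, and it assumes your cluster allows `kubectl debug` on nodes:

```bash
# Spawn a debug pod on the node and read the host routing table.
# (Node name and image are placeholders; the node's root filesystem is mounted at /host.)
kubectl debug node/aks-nodepool1-12345678-vmss000019 -it --image=busybox -- \
  chroot /host ip route
```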

Cilium Node Routing

At a high level, when running Cilium in cluster-scoped (cluster-pool) IPAM mode, a CIDR range must be provided from which Cilium assigns IPs to pods. By default, this CIDR range is 10.0.0.0/8.

When a new node joins the cluster, Cilium allocates a unique subnet from the given CIDR block to that node. All pods on the node receive IP addresses from this assigned subnet range.

For example, if the CIDR range 10.0.0.0/8 is used:

  • Node A might get the subnet 10.1.0.0/16
  • Node B might get the subnet 10.4.0.0/16

To facilitate cross-node communication, Cilium sets up IP routes so that traffic can be correctly directed between nodes. When a pod with the IP 10.1.5.13 (on Node A) wants to communicate with a pod with the IP 10.4.63.38 (on Node B), the packet is sent to Node A’s network interface. From there, based on the IP routing table, the packet is routed to Node B because Node B owns the 10.4.0.0/16 subnet.
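To make that concrete, here is an illustrative sketch of what `ip route` might show on Node A in that example. The addresses come from the example above, and the exact devices and next hops depend on Cilium’s routing mode, so treat every value as a placeholder:

```bash
$ ip route   # on Node A (illustrative output only)
default via 10.240.0.1 dev eth0
10.1.0.0/16 via 10.1.0.1 dev cilium_host proto kernel    # Node A's own pod CIDR
10.4.0.0/16 via 10.240.0.5 dev eth0                      # route to pods on Node B
```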

The sleeping dragon

Under normal circumstances, this works really well. Unfortunately for us, however, one of the subnets assigned to a node overlapped with the firewall’s subnet range. This meant that while the SYN packets reached the pod successfully, when the pod attempted to respond, the reply was forwarded to the node’s network interface.

From there, due to the IP routing rules on the node, the packet was then routed to another node. This occurred because the firewall’s VM IP address fell within the subnet range of the second node. However, since the second node did not have a pod with the exact IP address of the firewall’s VM, the packet was lost, disappearing into the void.

To verify this hypothesis, the SRE team ran `kubectl delete node` on the conflicting node, and as soon as it was removed, external connectivity through the firewall started working again.
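In hindsight, a quicker way to find the culprit would have been to list the pod CIDR Cilium allocated to each node and compare it against our subnets. The snippet below is only a sketch against the CiliumNode objects; the exact field path may vary slightly between Cilium versions, and the node name is a placeholder:

```bash
# List each node's allocated pod CIDR(s) as reported by Cilium.
kubectl get ciliumnodes \
  -o custom-columns=NODE:.metadata.name,PODCIDRS:.spec.ipam.podCIDRs

# Any node whose pod CIDR overlaps the firewall subnet is the one to remove:
kubectl delete node <conflicting-node-name>
```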

But why did this issue pop up almost 8 months after deploying Cilium? It all boiled down to auto-scaling. From what we observed, once a CIDR block was assigned to a node, it wasn’t reused even after that node was removed. So the Cilium operator was slowly climbing through the massive CIDR range we had given it, consuming one block at a time, until it reached the range used by our firewall’s subnet.

On that fateful Monday morning, as soon as the SRE team promoted the release from dev to staging, it triggered a node scale-up, which spawned a node with a conflicting CIDR range and brought down the entire communication path.

When the pod with the IP 10.4.76.34 tries to respond to a request that came from the firewall, the response packets get rerouted to Node 19 instead of the firewall.

The fix

To fix the issue in staging, we tried updating the clusterPoolIPv4PodCIDRList to a CIDR range that didn’t conflict with any of our existing internal subnets. Even after running the Helm upgrade, nothing changed. So we triggered a node scale-up and lucked out: the newly spawned node was created with a subnet from the new CIDR range.
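For completeness, this is roughly what the Helm change looks like. The release name, CIDR, and mask size below are placeholders rather than our real values, and as noted, the change only affects nodes that join after the upgrade:

```bash
# Point Cilium's cluster-pool IPAM at a range that cannot overlap any VNet subnet.
# (All values are placeholders; adjust the release name, namespace, and range to your setup.)
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set ipam.mode=cluster-pool \
  --set ipam.operator.clusterPoolIPv4PodCIDRList='{172.16.0.0/12}' \
  --set ipam.operator.clusterPoolIPv4MaskSize=24
```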

Running a quick workload that I had created to test cross-node communications using two DaemonSets confirmed that having two CIDR ranges did not break anything. Then the SRE team quickly whipped up a script to gracefully drain and remove the existing nodes while scaling up a new node pool until all the nodes with the bad CIDR range were fully removed. By running that script, we fully recovered our staging cluster.
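The rotation script amounted to something like the loop below. The label selector is a placeholder for whatever identifies the old node pool in your cluster, and it assumes the replacement pool has already been scaled up:

```bash
# Gracefully rotate out every node that was allocated a conflicting pod CIDR.
# (The label selector is a placeholder for the old node pool.)
for node in $(kubectl get nodes -l agentpool=oldpool \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  kubectl delete node "$node"
done
```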

After running some more tests, and since our production cluster had hit the same issue, we decided to permanently fix it there as well by updating the clusterPoolIPv4PodCIDRList, even though the Cilium docs warn against changing it. We got buy-in from the stakeholders and ran the migration.

Conclusion

Despite extensive testing, complex systems like Cilium, with nearly 2,000 configurable values, can still let misconfigurations slip through, leading to unexpected failures.

This incident taught us the importance of methodically troubleshooting network issues and of understanding the low-level networking that cloud abstractions usually hide from us. After working through this issue, we unanimously agreed that, even though it was a very challenging and extremely draining experience, it gave us a valuable learning opportunity.
