What the ARP is going on?
We recently migrated to an immutable infrastructure model running in Amazon Web Services. In general that’s been great and has eliminated whole classes of issues we used to have, but as with any big change it hasn’t been without its own problems. Sometimes you come across problems that feel more like magic than they do computing.
The Problem
When we bring a new service online during a deployment, it connects to Redis, an in-memory database. Sometimes, though, the service was unable to connect to Redis, leaving it unable to start. The Auto Scaling group managing the service would automatically terminate the instance and replace it. This added extra time to our deployments, and in some cases deployments timed out waiting for the service to become healthy.
When we paused the automated process and investigated, everything looked fine other than the service failing to start. We’d restart the service and it would connect with no problems. As we scaled up our development environments for new games, this became more frequent and some big questions were asked: what if this happened in production and took our games offline? It was possible, and it was becoming a big risk as we pushed our new immutable deployments towards production.
The Investigation
This seemingly only happened when a new instance started, and only intermittently, making it difficult to reproduce. In a lucky attempt to recreate the problem, we managed to get an instance that was unable to talk to Redis. We ran tshark, a command line tool that creates packet captures that can be analysed in Wireshark, to see why our requests timed out. The request was being sent correctly from our instance but we never received a reply; when the instance was recreated, the request could be made successfully.
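For anyone wanting to try the same thing, a capture along these lines does the job (a sketch, not our exact command; it assumes the default Redis port of 6379 and an eth0 interface, so adjust for your own setup):
# capture traffic to and from the Redis port into a file Wireshark can open later
sudo tshark -i eth0 -f "tcp port 6379" -w /tmp/redis-handshake.pcap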
Having ruled out simple network configuration issues such as subnets or security groups, we turned our eye to Redis. Using tshark again to monitor the network traffic from the Redis side, we noticed something curious.
Before I go on, a brief aside about the basics of TCP/IP. When you need to make a connection to another computer, you first need to complete a TCP handshake. This consists of sending a request to communicate (SYN), the other side sending back an acknowledgement combined with its own request (SYN/ACK), and then the original sender confirming that acknowledgement (ACK). Once all three steps complete, a TCP connection is established.
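If you have a capture file to hand, those handshake packets are easy to pick out with a display filter; for example, against the hypothetical capture from earlier:
# show SYNs, SYN/ACKs and bare ACKs (packets with no payload) from the capture
tshark -r /tmp/redis-handshake.pcap -Y "tcp.flags.syn == 1 || (tcp.flags.ack == 1 && tcp.len == 0)"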
In our case we could see the SYN being sent by the client and arriving at the Redis instance, and the SYN/ACK being sent back by the Redis instance, but on the client that SYN/ACK was never received.
There are a couple of puzzling things about this scenario. We know the client can send data to the server correctly, because the SYN is received, but when the server tries to send its SYN/ACK back it gets lost in the pipes. It seemed like something was going wrong at a more fundamental level, so we dug deeper.
In Layer 2 of the OSI model there’s a protocol called the Address Resolution Protocol (ARP), which maps an IP address to a MAC address; this mapping is what’s used to physically deliver traffic around a local network. ARP broadcasts are sent out asking any machine that will listen whether it knows the MAC address for a given IP address. The answers are stored in the ARP cache, which is used both to address outgoing traffic and to respond to other machines’ broadcasts. In theory this keeps everybody on the local network up to date with a quick lookup table of IP-to-MAC mappings.
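You can poke at this cache on any Linux box; a rough sketch, assuming an eth0 interface, a made-up neighbour address of 10.0.1.42 and that arping is installed:
# list the current ARP/neighbour cache entries on eth0
ip neigh show dev eth0
# force a fresh ARP resolution for a specific neighbour
sudo arping -I eth0 -c 3 10.0.1.42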
When AWS brings a new instance online, it assigns it an IP address from the available range. Because of the way we partition our networks, any given subnet allows for 254 possible IP addresses; with about a hundred of those already in use, that leaves a pool of around 150 free IPs.
We suspected that the ARP cache mechanism had some flaws that caused our traffic to get lost. It was designed for classical networks, such as data centres, where IP addresses rarely change. For us, though, when cycling a large number of instances in and out, it’s not unlikely that a new instance ends up with the same IP address as a previous one while the Redis machine still holds an ARP cache entry for that IP. When Redis responds, the data gets sent to the wrong MAC address, losing the message in the pipes.
We confirmed this theory using the command watch -n 1 -d arp -a to watch the ARP cache, and cross-referenced the entries against the affected instances. Sure enough, our Redis instance held an ARP cache entry with an incorrect MAC address for our client’s IP address.
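The cross-referencing itself is straightforward; something like this, with a hypothetical client IP of 10.0.1.42:
# on the client instance: note its real MAC address
ip link show eth0
# on the Redis instance: see which MAC the ARP cache currently holds for that IP
ip neigh show | grep 10.0.1.42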
The Fix?
We were close: we knew what kind of issue we had, but we needed to find a way to fix it. I stumbled across an article written by clever.com that describes a very similar problem, and which follows on to a really interesting and detailed Stack Overflow post.
This scared me a bit. It becomes a problem that interacts with a handful of OS configuration options whose purpose isn’t clear without a lot of research, and with potentially unknown and far-reaching consequences. It’s every Op’s nightmare to be woken up to a problem you wouldn’t have a hope in hell of fixing even on a good day.
Our first attempt, in that case, was to go with the relatively hacky approach outlined by clever.com and introduce a cron job that clears the ARP cache every five minutes to ensure that there are no stale entries. It works by taking a list of every entry in the ARP cache and forcefully flushing each one:
ip neigh | awk '{print $1}' | xargs -IX arp -v -i eth0 -d X
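Wired into cron it looks something like the following; the script path is just an example name for wherever you save the one-liner above:
# /etc/cron.d/flush-arp: flush the ARP cache every five minutes, as root
*/5 * * * * root /usr/local/bin/flush-arp.sh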
We ran this in our development environments for several months and it worked as intended. But it’s not a great fix; it’s something we explicitly enabled just for Redis in development, to limit the blast radius in case it caused issues we didn’t expect.
Fixing It, Again!
In search of the one true fix I stumbled across another solution. Amazon had already encountered this issue: a post on the AWS Forum outlined a similar issue someone had with a MongoDB instance. It seems a change to the Linux kernel in 2014 causes the problem by tweaking the defaults around what is retained in the cache and for how long, making it much more likely that stale information is kept inside the ARP cache.
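You can see the relevant defaults on your own kernel with a quick check (values will vary by distribution and kernel version):
# show the neighbour-cache garbage collection thresholds and stale timeout
sysctl -a 2>/dev/null | grep 'net.ipv4.neigh.default.gc'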
So by changing the thresholds for the ARP cache you force it to be cleared much more often; this is already the default in Amazon Linux. To do this yourself, update /etc/sysctl.conf with:
net.ipv4.neigh.default.gc_thresh1 = 0
and restart the instance.
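If you want to confirm the change without waiting for a reboot, the same value can be set and read back at runtime with sysctl; this is just a convenience, not part of the original fix:
# apply the new threshold immediately
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=0
# confirm what the kernel is now using
sysctl net.ipv4.neigh.default.gc_thresh1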
Having run this fix in development environments for a few months now, we’ve seen all our ARP-related issues disappear. It’s made for a much more stable system and fewer sleepless nights for our Ops team. I think the moral of the story is that there is a lot of hidden complexity layered under your feet across all of computing, each layer with many more moving parts than you ever think about. Sometimes these parts break, and it almost always pays to step outside your comfort zone and peer under the covers.