The mysterious answer
TL;DR: it was DNS, it always has been and it will always be… but forever in your blind spot.
Dear reader, let me tell you a story about a riddle that drove some people (us) crazy for a non-trivial amount of time, and how the right logs led them to the solution.
Part 1: The mighty web service
Once upon a time there was a web service that was used by many other tools. Due to the importance of this service, a team of brave engineers took on the task of providing high availability by implementing a robust real-time synchronised intercontinental active-active setup with transparent location-based DNS routing (try to read that out loud).
After much effort, our brave engineers accomplished their goal. The web service had active instances both in Europe (eu-service.adidas.acme) and the US (us-service.adidas.acme), and users could do the same operations and obtain the same results in both of them. But most importantly, users didn’t need to care, because the domain service.adidas.acme would take them to the nearest healthy one automatically. Oh, the wonders of our time!
Part 2: Danger at the doors
But the joy of success didn’t last long, for another team appeared with dark omens. “We experienced a strange error when connecting to your service, but we retried and it worked”, they said, and other voices followed them soon after. There was no pattern, no reason in sight, but it was clear that something wasn’t right (sorry, I needed to make at least one rhyme).
Most of the operations related to this service were relatively long (seconds or minutes) and involved several consecutive API calls. The failure would randomly appear in the middle of this process, even though the calls seemed to be well formed. Indeed, they always succeeded on retry.
After some time trying to replicate the error on purpose without success, the engineers decided to record all the network traffic and wait. When a new error appeared, they read the individual connections and found an unpleasant surprise: the client initially connected to the IP address of the European server, but then switched to the IP of the US one in one of the subsequent calls.
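The post doesn’t show the exact capture command, but nothing fancy is needed; assuming the service speaks HTTPS, something along these lines on the client host would do the job:

# Illustrative capture: record the client's HTTPS and DNS traffic until the
# error reappears, then inspect it with Wireshark or tcpdump -r.
tcpdump -i any -n -w service-debug.pcap 'port 443 or port 53'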
One small detail about the robust real-time synchronised intercontinental active-active setup with transparent location-based DNS routing™ is that, when you started an operation, you had to do it fully on one instance; you couldn’t just switch in the middle. But then again, the process usually finished before the user could cross the ocean, so this was never a design concern.
Part 3: The mysterious answer
They then made a loop to query the service’s DNS domain until something changed (a minimal sketch of such a loop follows).
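The exact command isn’t part of the story; this sketch assumes dig, GNU date and a one-second interval:

# Poll the public record once per second and log timestamped answers.
# Illustrative only; the output below is trimmed to the QUESTION/ANSWER sections.
while true; do
  echo '--------------------------------------------------------------------'
  printf '[%s] ' "$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)"
  dig service.adidas.acme A
  sleep 1
done >> dns-answers.log

It didn’t take long to get this: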
--------------------------------------------------------------------
[2021-10-18T23:37:39.545Z] ;; QUESTION SECTION:
;service.adidas.acme.              IN  A

;; ANSWER SECTION:
service.adidas.acme.     10  IN  CNAME  eu-service.adidas.acme.
eu-service.adidas.acme.   9  IN  A      11.11.11.11
--------------------------------------------------------------------
[2021-10-18T23:37:40.496Z] ;; QUESTION SECTION:
;service.adidas.acme.              IN  A

;; ANSWER SECTION:
service.adidas.acme.      9  IN  CNAME  us-service.adidas.acme.
us-service.adidas.acme.  59  IN  A      22.22.22.22
--------------------------------------------------------------------
[2021-10-18T23:37:41.429Z] ;; QUESTION SECTION:
;service.adidas.acme.              IN  A

;; ANSWER SECTION:
service.adidas.acme.      8  IN  CNAME  eu-service.adidas.acme.
eu-service.adidas.acme.   7  IN  A      11.11.11.11
For hours and hours, the query got the same DNS answer, the European instance. Then, for a single moment, the answer switched to the US one. After that, answers continued as usual. But why?
According to the design, the answer to the DNS query should only change if 1) the user asks from a different geographic location or 2) the healthchecks for the nearest server fail. Since the queries were being made from the same spot, the obvious suspect was an unhealthy instance. However, the monitoring tools showed no signs of errors.
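The post doesn’t include the actual Route53 configuration, but the design described above maps onto geolocation records with healthchecks attached, roughly like this sketch (the hosted zone and healthcheck IDs are placeholders):

# Sketch of the routing policy: one geolocation record per region, each with
# its own healthcheck; Route53 answers with the record that matches the
# location of the querying resolver, as long as its healthcheck passes.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZEXAMPLE123 \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "service.adidas.acme", "Type": "CNAME", "SetIdentifier": "europe",
        "GeoLocation": {"ContinentCode": "EU"}, "TTL": 10,
        "HealthCheckId": "placeholder-eu-healthcheck-id",
        "ResourceRecords": [{"Value": "eu-service.adidas.acme"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "service.adidas.acme", "Type": "CNAME", "SetIdentifier": "north-america",
        "GeoLocation": {"ContinentCode": "NA"}, "TTL": 10,
        "HealthCheckId": "placeholder-us-healthcheck-id",
        "ResourceRecords": [{"Value": "us-service.adidas.acme"}]}}
    ]
  }'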
“The maximum resolution that we see in the CloudWatch alarms is one minute. Could it be that the healthcheck declares the instance unhealthy for long enough to force a DNS failover, but not long enough to trigger the alarm?” said a concerned and very confused engineer (me).
“Look! Seems like the healthcheck configuration is too aggressive and we may be causing ourselves a small denial of service in peak usage moments!” he added, very sure of his own idea (yes, I said that).
So they deactivated the healthchecks to confirm the hypothesis, but the mysterious answer kept appearing (I was shocked). This left the team with only one “possible” scenario: that Amazon’s Route53, the most advanced dynamic DNS platform in the industry, was occasionally misreading the location of the query (yeah, sure…).
Part 4: The ubiquitous question
In a final attempt to understand the situation, the engineers did what (now we know) they should have done way earlier: activate the Route53 query logs (a sketch of how they are enabled follows).
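Enabling them is a one-off step: Route53 writes one log line per query answered at its edge locations into a CloudWatch Logs group, which must live in us-east-1 and have a resource policy that allows Route53 to write to it. A sketch with placeholder IDs:

# Create the destination log group (query logging requires us-east-1)
aws logs create-log-group --region us-east-1 \
  --log-group-name /aws/route53/service.adidas.acme
# Turn on query logging for the hosted zone (placeholder zone and account IDs)
aws route53 create-query-logging-config \
  --hosted-zone-id ZEXAMPLE123 \
  --cloud-watch-logs-log-group-arn \
  arn:aws:logs:us-east-1:111122223333:log-group:/aws/route53/service.adidas.acme

And there, plain as day, they saw the following: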
2021-10-18T23:37:39Z service.adidas.acme A FRA53-C1
2021-10-18T23:37:40Z service.adidas.acme A SEA19-C1
2021-10-18T23:37:50Z service.adidas.acme A HIO50-C2
2021-10-18T23:37:53Z service.adidas.acme A FRA53-C1
2021-10-18T23:38:02Z service.adidas.acme A LAX3-C4
2021-10-18T23:38:05Z service.adidas.acme A FRA53-C1
Route53 was receiving multiple queries for the service domain from a DNS resolver near Frankfurt (FRA), but then it received a couple of queries from resolvers in Seattle (SEA), Hillsboro (HIO) and Los Angeles (LAX) at the same time as the errors occurred.
A couple of absurd hypotheses later, the brave engineers looked at the default DNS nameservers of the cluster where the faulty processes were running. The primary resolver was in Germany and the secondary in the US: conceptually, something like the resolv.conf sketched below.
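The addresses here are made up, but the shape of the nodes’ configuration was essentially this:

$ cat /etc/resolv.conf
# Pushed to the nodes by a legacy configuration (illustrative addresses)
# Primary: corporate resolver in Germany, so Route53 normally sees an EU source
nameserver 10.10.0.2
# Secondary: corporate resolver in the US, used when the primary is slow or unreachable
nameserver 10.200.0.2
options timeout:2 attempts:2

Most likely, every time the primary resolver was slow or dropped a packet, the stub resolver quietly retried against the secondary, Route53 saw a query arriving from the US, and for a few seconds the client was sent to the US instance.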
“Why?” they asked in tears. But they all knew in their hearts that it was another legacy configuration.
Moral of the story
It’s always DNS. For real. We all know it, but we don’t accept it in its full and ever-changing extent. And it gets uglier when you have layers of resolvers, such as in a cluster with a local CoreDNS instance that by default forwards to the underlying instance’s nameservers (maybe provided by a random DHCP server in the network).
Since we were developing a server system, we focused on that side of the configuration and didn’t look at the client-side DNS configuration until the very end.
Hopefully we won’t make this mistake in the future.
Acknowledgement
Huge shoutout to our colleagues Julio Arenere and Andrés Calimero for their efforts, which led to the detection of the mysterious DNS answer.