Performance issues with RDS Aurora on EKS due to CoreDNS defaults

Daniel Möller
Nov 6 · 7 min read

Quinyx’s web-based system provides support for Scheduling, Time reporting, Communication, Task Management, Budgeting and Forecasting.

At Quinyx we have a mix of PHP and Java backends where most of the traffic is served by PHP.

During 2019, Quinyx migrated all production workloads to EKS, and the MariaDB databases were replaced by AWS RDS Aurora to handle the increased demand from our customers.

Problems:

  • #1 DNS Lookup timeouts and DNS errors
  • #2 Avg HTTP latency high during load in EKS compared to running Quinyx on EC2 instances

Problem #1: DNS Lookup timeouts

DNS errors cause 500s towards our customers, which makes us sad.

Red evil spikes with 500 errors in production

During September 2019 some initial DNS issues were addressed in our EKS clusters:

kubectl patch deployment coredns -n kube-system --patch '
{
  "spec": {
    "template": {
      "spec": {
        "volumes": [{"name": "tmp", "emptyDir": {}}],
        "containers": [{
          "name": "coredns",
          "volumeMounts": [{"name": "tmp", "mountPath": "/tmp"}]
        }]
      }
    }
  }
}'

This was not enough; we still had DNS-related 500 errors in production.

Ouch.. this should not happen :(

After further troubleshooting, the root cause was identified as conntrack race-condition bugs in the Linux kernel. This topic has been explained in detail in several great articles.

As of 2019-11-06, no official EKS AMI images have been released that address the issues.

The options to fix the issue were identified as the following:

  • Build our own AMI image using kernel 4.19 with 2 of the 3 conntrack kernel issues patched.
    - a lot of work and a high risk of other failures; only fixes 2 out of 3 issues.
  • Set NF_NAT_RANGE_PROTO_RANDOM_FULLY.
    - the implementation is not crystal clear and carries high risk.
  • Stop using Alpine Docker images and set specific DNS settings to avoid the race condition.
    - we run Alpine for a reason; this is also a hacky and unstable fix.
  • Run local DNS servers on each Kubernetes node.
    - could work 🤔
  • Wait for EKS to release new patched node AMIs.
    - HTTP 500 errors do not wait for anyone!

Running CoreDNS on each node was the best option for us, and this approach is being worked on in the Kubernetes project as node-local-dns.

DNS Lookup timeouts: node-local-dns to the rescue!

This is a cartoonish version of the default DNS layout in EKS.

Default EKS DNS layout

The nodes could also represent containers.

Running node-local-dns means we run one CoreDNS instance on each node and have each pod on that node do DNS lookups against the local CoreDNS.

One CoreDNS on each node; DNS queries from pods are handled by the local CoreDNS
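As a sketch of how pods can be pointed at the node-local cache: node-local-dns conventionally binds a link-local address (169.254.20.10 by default). One explicit way to use it is a pod-level dnsConfig like the one below (the pod name and address here are illustrative; in practice node-local-dns usually intercepts the kube-dns service IP transparently, so no per-pod change is needed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical pod
spec:
  dnsPolicy: "None"        # ignore the cluster default and use dnsConfig below
  dnsConfig:
    nameservers:
      - 169.254.20.10      # link-local address the node-local CoreDNS listens on
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"
  containers:
    - name: app
      image: alpine:3.10
```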

Our setup is now similar to the one described in this article:
https://aws.amazon.com/blogs/containers/eks-dns-at-scale-and-spikeiness/

Result: node-local-dns

Pros:

  • Low latency on cache hits within the same node
  • No kernel conntrack race condition, since we can now configure the local CoreDNS to use TCP upstream with force_tcp

Cons:

  • Increased resource usage on each node
  • No HA inside each node (during node-local-dns restarts DNS queries can be lost).
  • A non-default setup, increases complexity

With node-local-dns we are now 500-free. But wait.. there is more.

Problem #2: Avg HTTP latency high during load in EKS

Increased latency when load is high, that's bad 🤔

Most of the HTTP traffic at Quinyx hits PHP services, each backed by its own RDS Aurora cluster. Let us focus on one service and one Aurora cluster for this example.

This service does not use any type of connection pool similar to HikariCP.

The app is read-heavy and the Aurora replicas scale between ~2–11 replicas on a 24h cycle.

#SQL during 24h

The Aurora reader endpoint is used for read requests from the service.
AWS RDS replicas distribute load with DNS round-robin (RR).

Example DNS request #1:

;; ANSWER SECTION:
foo-example.rds.amazonaws.com. 1 IN CNAME instance-1.rds.amazonaws.com.
instance-1.rds.amazonaws.com. 5 IN A 10.9.5.101

Example DNS request #2:

;; ANSWER SECTION:
foo-example.rds.amazonaws.com. 1 IN CNAME instance-2.rds.amazonaws.com.
instance-2.rds.amazonaws.com. 5 IN A 10.9.5.102

Cool, DNS round-robin spreads the load across our two example replicas with a TTL of 1 second.
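The "rotation distribution" behavior can be pictured as the resolver cycling through the replica list on successive queries (an illustrative toy model, not how the AWS resolver is actually implemented):

```python
from itertools import cycle

# The two example replica IPs from the dig answers above
replicas = ["10.9.5.101", "10.9.5.102"]
rotation = cycle(replicas)

# Four successive uncached lookups alternate between the replicas
answers = [next(rotation) for _ in range(4)]
print(answers)  # ['10.9.5.101', '10.9.5.102', '10.9.5.101', '10.9.5.102']
```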

This worked fine for our service running on EC2 with nscd as a local cache; DNS was never a bottleneck.

In problem #1 we looked at the DNS setup in K8S:

Also looking at the default CoreDNS config:

.:53 {
    errors
    health
    kubernetes cluster.local {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
}

Nothing odd, except cache 30, which means:
cache [TTL] [ZONES...]

TTL max TTL in seconds. If not specified, the maximum TTL will be used, which is 3600 for NOERROR responses and 1800 for denial of existence ones. Setting a TTL of 300: cache 300 would cache records up to 300 seconds.

OK, all good: caches up to 30 seconds. The docs continue:

If you want more control:

cache [TTL] [ZONES...] {
    success CAPACITY [TTL] [MINTTL]
    denial CAPACITY [TTL] [MINTTL]
    prefetch AMOUNT [[DURATION] [PERCENTAGE%]]
}

TTL and ZONES as above.

success, override the settings for caching successful responses. CAPACITY indicates the maximum number of packets we cache before we start evicting (randomly). TTL overrides the cache maximum TTL. MINTTL overrides the cache minimum TTL (default 5), which can be useful to limit queries to the backend.

Hmm, a minimum TTL of 5s.. is that really correct?

By now all clues have been highlighted, have you figured it out yet? 🕵️‍♀️


We have demonstrated that the Aurora RDS reader endpoint uses a 1-second TTL on a CNAME record, while the CoreDNS default minimum TTL is 5s. We have also identified that we run 2 or more CoreDNS servers in an EKS cluster, which cache DNS records for all pods.
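The clamping behavior can be sketched in a few lines (a simplified model of what the CoreDNS cache plugin's settings imply, not its actual implementation):

```python
def effective_cache_ttl(record_ttl: int, cache_max: int = 30, min_ttl: int = 5) -> int:
    """Simplified model: the cache plugin clamps a record's TTL between
    MINTTL (default 5) and the cache's maximum TTL (here 30, from `cache 30`)."""
    return min(max(record_ttl, min_ttl), cache_max)

# Aurora's 1s reader-endpoint CNAME ends up cached for 5 seconds:
print(effective_cache_ttl(1))     # 5
print(effective_cache_ttl(3600))  # 30
```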

When the swarm of hundreds of pods looks up the AWS RDS Aurora reader endpoint, a cached response from a CoreDNS server is returned: a record up to 4s stale.
This means that, in theory, the maximum number of different replicas seen in DNS responses during a 5s period equals the number of CoreDNS pods running.

2 CoreDNS pods == 2 replicas in use during a 5s window.

TL;DR:
CoreDNS servers return the same replica records to all nodes during a 5s window due to caching of the response
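A toy simulation (hypothetical numbers, not our actual traffic) illustrates why the replica diversity collapses to the number of CoreDNS pods:

```python
import random

def distinct_replicas_in_window(num_coredns: int, num_replicas: int, num_lookups: int) -> int:
    # Within one 5s cache window, each CoreDNS pod holds a single cached
    # answer for the reader endpoint (min TTL 5s > the record's 1s TTL).
    cached = [random.randrange(num_replicas) for _ in range(num_coredns)]
    # Every pod lookup lands on some CoreDNS pod and gets its cached record back.
    seen = {cached[random.randrange(num_coredns)] for _ in range(num_lookups)}
    return len(seen)

# 500 lookups, 11 replicas available, but only 2 CoreDNS pods:
print(distinct_replicas_in_window(num_coredns=2, num_replicas=11, num_lookups=500))  # at most 2
```

However many pods do lookups, the number of distinct replicas they can see per window is bounded by the number of caches in front of them.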

This is shown here with number of sessions on one of many AWS RDS Replicas:

High # of sessions vs low # of sessions for periods of time.. odd

All pods use the same subset of replicas for a short period of time, then switch to a new set as the CoreDNS cache expires. This problem is roughly similar to the thundering herd problem.

At Quinyx, the actual DNS setup was as follows:

Initial Quinyx DNS setup

The layout used 2 Quinyx DNS servers (for legacy reasons). Even though 4 CoreDNS servers were used, the maximum diversity of reader replicas during a 5s window was ≤2 different records, since all lookups were funneled through the 2 caching Quinyx DNS servers upstream.

AWS does state this in their documentation:

Each time you resolve the reader endpoint, you’ll get an instance IP address that you can connect to, chosen based on the rotation distribution.

The DB connection to each read replica might not be distributed evenly in the following scenarios:

- If a client caches DNS information, you might see a discrepancy in the distribution of the connections. This occurs when the client connects to the same Aurora replica using cached connection settings. DNS caching can occur anywhere from your network layer, through the operating system, to the application container.

Problem #2: The fix

To avoid using only a subset of reader replicas and the resulting poor request distribution, the DNS was reconfigured as follows:

New Quinyx DNS EKS setup

Each CoreDNS instance on each node now caches the replica record for 1s instead of 5s, which matches the original TTL of the RDS CNAME entry.

The following addition to the node-local-dns CoreDNS config, based on the “official” config, fixed the problem in the new setup:

amazonaws.com:53 {
    errors
    cache 1 {
        success 9984 1
        denial 9984 1
    }
    reload
    loop
    bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__
    forward . __REGION_AWS_DNS__ {
        force_tcp
    }
    prometheus :9253
}

We set __REGION_AWS_DNS__ to the IP of our VPC DNS (e.g. VPC CIDR 10.3.0.0/16 has its DNS at 10.3.0.2).
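The VPC resolver address is always the base of the VPC CIDR plus two, which can be computed like this (the helper name is ours):

```python
import ipaddress

def vpc_dns_ip(cidr: str) -> str:
    # AWS reserves the network address + 2 of the VPC CIDR for the VPC DNS resolver
    return str(ipaddress.ip_network(cidr).network_address + 2)

print(vpc_dns_ip("10.3.0.0/16"))  # 10.3.0.2
```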

Result: Modified CoreDNS config

With the modified CoreDNS config, which gives each node an individual cache and stores the RDS reader CNAME for only 1s, the load was evenly distributed:

SQL load on a single replica shows good distribution across time

The average HTTP response time decreased due to all available replicas being used with good spread:

With even distribution across reader replicas, avg HTTP response decreased

Further improvements

The above is not perfect, and things we will follow up on are:

  • The limit of 1024 packets per second per ENI towards the AWS DNS needs to be carefully avoided and monitored.
  • Preserve original TTLs on .amazonaws.com.
  • Allow restarts of node-local-dns pods without affecting other pods' DNS resolution.
