Dynamic DNS Resolution in Nginx

By: Yongjie Lim

The SRE team here at TrueCar recently wrapped up a major refactor of our application routing infrastructure. By moving off AWS Classic Load Balancers and onto AWS Application Load Balancers (ALBs) with path-based routing, we were able to deprecate the use of Consul for service discovery, which reduced operational complexity and risk. In addition, ALBs support modern web technologies like WebSockets and HTTP/2, allowing TrueCar to deliver a fast, high-quality user experience to our consumers and dealers.
We will be writing more about our infrastructure in the future and covering this in detail!

One of the changes in our new routing setup required forwarding incoming requests to a load balancer address via Nginx’s proxy_pass functionality.

server {
    listen 80;
    server_name *.example.com;

    location / {
        proxy_pass https://internal-balancer-loader-2000-1525531520.us-west-2.elb.amazonaws.com;
    }

    # Additional logic
}

Which worked great!

[25/Apr/2019:07:17:57 +0000] "GET / HTTP/1.1" 200 10.0.158.12:443
[25/Apr/2019:07:18:26 +0000] "GET / HTTP/1.1" 200 10.0.172.157:443
[25/Apr/2019:07:18:27 +0000] "GET / HTTP/1.1" 200 10.0.158.12:443
[25/Apr/2019:07:18:27 +0000] "GET / HTTP/1.1" 200 10.0.172.157:443
[25/Apr/2019:07:18:28 +0000] "GET / HTTP/1.1" 200 10.0.158.12:443

…until we started seeing timeout errors like this one:

2019/04/25 07:53:13 [error] 93#93: *68 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.17.0.1, server: _, request: "GET / HTTP/1.1", upstream: "https://10.0.172.157:443/", host: "localhost:32769"
[25/Apr/2019:17:06:36 +0000] "GET / HTTP/1.1" 504 10.0.158.12:443, 10.0.172.157:443

What Happened?

When you set up a website served behind AWS load balancers, you create a CNAME record pointing to the load balancer's DNS entry, because the underlying IPs of the load balancer nodes are subject to change. By default, the load balancer's DNS records have a time-to-live (TTL) of 60 seconds; past that, there's no guarantee that the resolved IPs are still valid. Unfortunately for us, Nginx resolves the address defined in the proxy_pass directive just once, at startup. So while the load balancer's underlying IPs change, Nginx continues to use whatever IPs the load balancer's DNS resolved to when Nginx started up.

We’ve Got a Fix!

To solve this, we made two changes to our Nginx configuration:

First, we added a resolver directive pointing Nginx at the DNS server in our VPC, either the Amazon-provided DNS server or a self-managed one. The Amazon-provided DNS server resides at the base of the VPC's IPv4 network range, plus two; for a VPC with a 10.0.0.0/16 CIDR block, that is 10.0.0.2.

location / {
    resolver 10.0.0.2;

    proxy_pass https://internal-balancer-loader-2000-1525531520.us-west-2.elb.amazonaws.com;
}

The resolver directive honors the TTL of the load balancer's DNS record, ensuring that the cached addresses stay up-to-date.
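If honoring the record's TTL re-resolves too often (or not often enough) for your workload, the resolver directive also accepts a valid= parameter to override the TTL, and ipv6=off to skip AAAA lookups when the upstream is only reachable over IPv4. A sketch, reusing the example resolver address from above:

# Cache resolved addresses for 30 seconds regardless of the
# record's TTL, and skip AAAA (IPv6) lookups.
resolver 10.0.0.2 valid=30s ipv6=off;

These are tuning knobs, not requirements; the plain resolver 10.0.0.2; shown above is enough to fix the stale-IP problem.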

Second, to get around the “only resolve on startup” issue, we changed the proxy_pass directive to use a variable for the DNS value:

location / {
    resolver 10.0.0.2;

    set $elb_dns internal-balancer-loader-2000-1525531520.us-west-2.elb.amazonaws.com;
    proxy_pass https://$elb_dns;
}

Nginx evaluates the value of the variable per-request, instead of just once at startup. By setting the address as a variable and using the variable in the proxy_pass directive, we force Nginx to resolve the correct load balancer address on every request.
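Putting both changes together, a complete sketch of the server block looks like this (the resolver address and load balancer hostname are specific to our VPC; yours will differ). One caveat worth knowing: when proxy_pass contains a variable and no URI part, Nginx forwards the original request URI to the upstream unchanged, which is exactly the behavior we want here:

server {
    listen 80;
    server_name *.example.com;

    location / {
        # Amazon-provided DNS server at the VPC network base plus two.
        resolver 10.0.0.2;

        # Using a variable forces per-request DNS resolution,
        # honoring the load balancer record's 60-second TTL.
        set $elb_dns internal-balancer-loader-2000-1525531520.us-west-2.elb.amazonaws.com;
        proxy_pass https://$elb_dns;
    }

    # Additional logic
}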

[25/Apr/2019:08:50:11 +0000] "GET / HTTP/1.1" 200 10.0.158.12:443
[25/Apr/2019:08:50:13 +0000] "GET / HTTP/1.1" 200 10.0.190.181:443
[25/Apr/2019:08:50:15 +0000] "GET / HTTP/1.1" 200 10.0.190.181:443
[25/Apr/2019:08:50:17 +0000] "GET / HTTP/1.1" 200 10.0.158.12:443
[25/Apr/2019:08:50:20 +0000] "GET / HTTP/1.1" 200 10.0.190.181:443

And that’s it! We hope that the tips above will help some of you with configuring Nginx to proxy_pass to dynamic DNS addresses.