Nginx debugging and major issues - Solved

Kushal Singh
Urban Company – Engineering

--

We use Nginx as a reverse proxy in front of our backend and frontend servers. In our micro-service architecture, all the services run as Docker containers deployed in an ECS cluster. Nginx redirects traffic to our internal load balancer on a specific port, which then forwards it to the service-specific containers. Details of our deployment architecture are covered in a different post.

Many a time we faced issues where API requests would start taking longer and our servers would become unreachable. 5xx errors would spike on our dashboards, and the same errors showed up in the Nginx server's error logs.

In all such situations the only resort was to restart Nginx. And magically all the issues would get fixed.

With basic debugging we found that the upstream servers were all healthy and receiving traffic properly. This encouraged us to dig deeper at the Nginx level and fix the issues.

Following are a few scenarios we faced and action steps that were taken:

Error 1: “no live upstreams while connecting to…”

With the default configuration, Nginx behaves as a load-balancer and performs passive health-checks on the upstream servers.

When an upstream server fails to respond or times out for certain requests, Nginx marks it as unreachable and stops redirecting any further traffic to it.

Solution


upstream sm_url {
    server LOAD_BALANCER_DOMAIN_NAME:PORT max_fails=0;
}

# fail_timeout is the time interval during which, if the health-check fails
# max_fails times, the upstream server is considered unhealthy.
# Defaults: fail_timeout=10s, max_fails=1.

Setting max_fails to 0 disables this health-check, and all traffic is always routed to the upstream servers.

Note: after any change in the Nginx configuration, don't forget to test the changes:

sudo nginx -t

Error 2: “upstream timed out (110: Connection timed out) while connecting to upstream,…”

1) First of all, it is important to log the upstream ip address on which the connection is failing.

Change: add $upstream_addr to the log format in confs/ngnix_log.conf

Sample log config
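The exact log config appears as an image in the original post; below is a minimal sketch of what such a log_format can look like. The format name "upstream_log" and the file paths are illustrative; $upstream_addr, $request_time, and $upstream_response_time are standard Nginx variables.

```nginx
# Illustrative log_format; $upstream_addr records the ip:port the request
# was actually proxied to, which is exactly what we need for Error 2.
log_format upstream_log '$remote_addr [$time_local] "$request" $status '
                        'upstream: $upstream_addr '
                        'request_time: $request_time '
                        'upstream_response_time: $upstream_response_time';

access_log /var/log/nginx/access.log upstream_log;
```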

Corresponding error log:

2018/03/17 07:59:34 [error] 17028#17028: *77553273 upstream timed out (110: Connection timed out) while connecting to upstream, client: 42.107.148.218, server: www.urbanclap.com, request: "POST /api/v1/providers/cd/notify HTTP/1.1", upstream: "http://x.x.x.x:80/api/v1/providers/cd/notify", host: "www.urbanclap.com"

2) Next, we need to check the load balancer's ip addresses that should be receiving this traffic.

root@app6.nginx[Prod] nslookup LOAD_BALANCER_DOMAIN_NAME
Server:  172.31.0.2
Address: 172.31.0.2#53

Non-authoritative answer:
Name:    LOAD_BALANCER_DOMAIN_NAME
Address: 52.74.x.x
Name:    LOAD_BALANCER_DOMAIN_NAME
Address: 54.255.x.x

The output above shows that the upstream address from the error log (52.220.146.83) is different from the ip addresses currently allocated to the load balancer. This means Nginx is not redirecting traffic to the load balancer's actual ip address.

After debugging, we found the following points:

a) AWS ELB keeps more than one server mapped to a load balancer. As traffic changes, it can scale these servers up or down, and when that happens the load balancer's DNS -> ip mapping changes.

b) Nginx resolves the DNS -> ip mapping only at reload/restart time; for subsequent requests it uses the cached result until the next reload/restart.

c) A resolver (i.e. DNS server) can be specified in nginx.conf; by default Nginx uses the one in /etc/resolv.conf.

Solution

To make Nginx resolve the load-balancer's domain name to the right ip address, we first tried changing the Nginx config based on the following sources:

Approach 1 (didn't work):

Reference1, Reference2

These changes might be effective for your config structure, but they didn't work for us. Since we had already spent enough time debugging these issues, we went ahead with a very simple approach:
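For reference, the usual shape of that approach (a sketch assumed from such references, not our exact config) is to declare a resolver and use a variable in proxy_pass, which forces Nginx to re-resolve the name at request time instead of caching it at startup. The resolver ip below is the VPC DNS server from the nslookup output, and $sm_backend is a made-up variable name.

```nginx
# valid= caps how long Nginx caches the DNS answer.
resolver 172.31.0.2 valid=30s;

server {
    listen 80;

    location / {
        # Using a variable in proxy_pass defers DNS resolution
        # to request time rather than config-load time.
        set $sm_backend LOAD_BALANCER_DOMAIN_NAME;
        proxy_pass http://$sm_backend$request_uri;
    }
}
```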

Approach 2 (worked):

Run a periodic script that compares the load balancer's current ip addresses with the ones Nginx is using. If they aren't the same, reload Nginx. Also, send a Slack notification whenever such an event happens.

#!/bin/bash
PATH=/usr/sbin:/usr/bin:/sbin:/bin

# Resolve the load balancer's current ip addresses
# (NR != 1 skips the resolver's own address on the first line).
ips=`nslookup xx-xxx-xxx-xxxx.ap-southeast-1.elb.amazonaws.com | grep "Address" | awk -F': ' '{ if (NR != 1) print $2}'`
echo $ips > /tmp/latest

# Sort both the latest and the previously seen ip lists so that
# ordering differences in DNS answers don't matter.
a=`cat /tmp/latest | tr " " "\n" | sort | tr "\n" " "`
b=`cat /tmp/old | tr " " "\n" | sort | tr "\n" " "`

if [ "$a" == "$b" ]; then
    echo "Same"
else
    echo "Different"
    text='ALBs ip is changed. Old: '$b' New: '$a' Reloading Nginx.'
    echo $text
    # Notify Slack, remember the new ip list, then reload Nginx.
    curl -X POST --data-urlencode 'payload={"channel": "#alerts-devops", "username": "alb-ip-change(prod-nginx)", "text": "'"$text"'", "icon_emoji": ":rube:"}' https://hooks.slack.com/services/xxxxx/yyyyy/zzzzzz
    echo $a > /tmp/old
    service nginx reload
fi
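The heart of the script is a set-comparison: both ip lists are sorted before comparing, so a mere reordering of DNS answers doesn't trigger a false reload. Here is a standalone sketch of just that part, with made-up ip lists standing in for the nslookup result and the contents of /tmp/old:

```shell
#!/bin/sh
# Made-up ip lists for illustration; in the real script these come from
# nslookup (current) and /tmp/old (the list seen at the last reload).
old="52.76.100.149 54.251.189.195"
new="54.251.189.195 54.254.230.33"

# Normalise each whitespace-separated list by sorting its entries.
a=$(echo "$old" | tr ' ' '\n' | sort | tr '\n' ' ')
b=$(echo "$new" | tr ' ' '\n' | sort | tr '\n' ' ')

if [ "$a" = "$b" ]; then
    echo "Same"        # no reload needed
else
    echo "Different"   # in the real script: notify Slack, then reload Nginx
fi
```

How often the script runs is up to cron; given that the ips change only every few days, any interval of a minute or two is more than enough.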

Alerts from above script:

alb-ip-change(prod-nginx):
ALBs ip is changed. Old: 52.76.100.149 54.251.189.195 New: 54.251.189.195 54.254.230.33 Reloading Nginx.

Frequency of alerts/ip changes:

We get one of these alerts, i.e. an ip change followed by an Nginx reload, every 3–5 days.

This setup has worked quite well for us, and so far we haven't seen any similar issue!
