Troubleshooting HTTP 503 errors returned when using a Classic Load Balancer

Sumit
Published in Tensult Blogs
5 min read · Feb 18, 2019

A load balancer distributes incoming application traffic across multiple EC2 instances in multiple Availability Zones, thereby increasing the fault tolerance of your applications. Elastic Load Balancing detects unhealthy instances and routes traffic only to healthy instances.

This post covers the troubleshooting steps for identifying and resolving HTTP 503 Service Unavailable errors that show up in the load balancer access logs, in CloudWatch metrics, or when you connect to your load balancer's DNS name from a browser. The most common cause of a 503 is that no backend instances are registered in any of the Availability Zones the load balancer is configured for, or that the registered backend instances are failing health checks.

Let's see where you can check the current healthy host count for your load balancer.

Step 1.

In the EC2 dashboard of your AWS account, go to the option named ‘Target Groups’ under the side-menu ‘Load Balancing’.

EC2 dashboard

Step 2.

Once the target group dashboard opens, select the particular target group from the list (marked as 1 in the reference image below). Then click on the Monitoring tab below and lastly, click on the graph which says Healthy Hosts.

Target Groups dashboard

Step 3.

Make sure the statistic for the healthy host graph is set to 'minimum'. If there are no registered backend instances, you will see a graph with no data points, similar to the image below.

Graph with no registered backend instances

If you do have registered instances and the data points still sit at zero, as shown in the image below, then the registered backend instances are failing health checks.

Graph received while the instances fail health-checks
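The two graph readings above can be turned into a small rule of thumb. The sketch below is only an illustration: the function name `diagnose_healthy_hosts` and the message strings are hypothetical, and it assumes you have already collected the 'minimum' healthy host values for a time window as a plain list of numbers.

```python
from typing import Optional, Sequence


def diagnose_healthy_hosts(minimums: Optional[Sequence[float]]) -> str:
    """Interpret 'minimum' healthy-host data points for one time window.

    The function name and return strings are illustrative, not an AWS API.
    """
    if not minimums:
        # No data points at all: nothing is registered behind the load balancer.
        return "no registered instances"
    if max(minimums) == 0:
        # Data points exist but never rise above zero: every health check fails.
        return "all registered instances failing health checks"
    if min(minimums) == 0:
        # The count dropped to zero at least once during the window.
        return "healthy host count dropped to zero at times"
    return "at least one healthy instance throughout"


print(diagnose_healthy_hosts([]))         # empty graph
print(diagnose_healthy_hosts([0, 0, 0]))  # flat line at zero
```

The first two branches correspond to the two graphs described above; the remaining branches are an added convenience for windows where the count only dips to zero occasionally.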

Note that the most common reason for health check failures is that the load balancer did not receive a 200 response code from the backend for the health check target page request. If the backend instance responds with any non-200 response code, the health check fails and the load balancer marks the instance as unhealthy.

For layer 7 listeners, the load balancer expects an HTTP 200 OK response in order to pass the health check. For layer 4 listeners, the load balancer marks an instance as healthy once the TCP handshake has completed.

You can determine the status code returned by the backend instance by running curl or netcat against the health check target page from an instance in the same subnet as the load balancer, or from any other host if the security groups and network ACLs allow it. Run the command below from such an instance and analyze the output:

curl -vo /dev/null http://<IP address>/index.html

Here, the IP address of the backend instance can be its private or public IP, depending on your setup, and the health check URL is index.html. Below is a sample output of this command. The instance has responded with a non-200 response code, confirming that the health checks are failing.

The fix for this issue is to make sure that the health check target page exists at the configured location and that the backend instance responds to requests for it with HTTP 200 OK.
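If you would rather script this check than read curl output by hand, here is a minimal Python sketch using only the standard library. It is an approximation of what a health check does, not the load balancer's actual implementation; the function name `http_health_check` and the defaults (port 80, `/index.html`, a 5 second timeout) are assumptions chosen to match the example above.

```python
import http.client
from typing import Tuple


def http_health_check(host: str, port: int = 80, path: str = "/index.html",
                      timeout: float = 5.0) -> Tuple[bool, str]:
    """Send one GET to the health check page and report (passed, detail).

    Hypothetical helper; host, port, path and timeout are placeholders.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        # Refused connections, timeouts and malformed replies all land here.
        return False, f"no HTTP response: {exc}"
    finally:
        conn.close()
    # Only an exact 200 passes; redirects (301/302) and errors all fail.
    return status == 200, f"HTTP {status}"
```

Calling `http_health_check("<IP address>")` from a host that can reach the backend returns a pair such as `(False, "HTTP 404")` when the target page is missing.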

For a TCP health check, run the command below:

$ nc -zv <IP address> 80

If the result is anything other than a successful TCP connection, such as 'connection timeout' or 'connection refused', the TCP connection has failed and therefore the health checks are failing.

The fix for this issue is to correct the firewall configuration on your backend instance if it is blocking access, and to make sure that the service is running and listening on the correct port and IP address. Below is a sample output which we receive when the TCP connection is successful.

Successful TCP connection
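The same TCP test can be scripted. The sketch below is a rough Python equivalent of the nc command, using the standard `socket` module; the function name `tcp_health_check` and the returned strings are hypothetical, and the 5 second timeout is an arbitrary assumption.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a TCP handshake and describe the outcome.

    Hypothetical helper; the returned strings are illustrative labels.
    """
    try:
        # create_connection completes the TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "connection refused"
    except socket.timeout:
        # No answer at all, often a firewall or security rule dropping packets.
        return "connection timeout"
    except OSError as exc:
        # Anything else, such as an unreachable network.
        return f"connection failed: {exc}"
```

A result of "connection refused" usually points at the service not running or listening on a different port, while "connection timeout" more often points at a firewall, security group, or network ACL.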

Another issue you may face is a connection timeout. If the connection to the backend instance times out on the health check target page request, the backend instance did not respond within the currently configured response timeout period. The fix for this issue is to increase the response timeout value in the load balancer health check configuration so that the health check has enough time to complete.
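Before changing the timeout, it helps to know how long the backend actually takes to answer. The sketch below times a single request to the health check page and compares it with a timeout value you supply. The names `time_health_check` and `fits_response_timeout` are hypothetical, `max_wait` is just an arbitrary upper bound for the measurement, and connection errors are deliberately left to propagate.

```python
import http.client
import time


def time_health_check(host: str, port: int = 80, path: str = "/index.html",
                      max_wait: float = 60.0) -> float:
    """Return the seconds the backend took to answer one GET request.

    Hypothetical helper; any status code counts, only the timing matters.
    """
    conn = http.client.HTTPConnection(host, port, timeout=max_wait)
    started = time.monotonic()
    try:
        conn.request("GET", path)
        conn.getresponse().read()
    finally:
        conn.close()
    return time.monotonic() - started


def fits_response_timeout(elapsed: float, response_timeout: float) -> bool:
    """True if the measured time is within the configured response timeout."""
    return elapsed < response_timeout
```

For example, if `time_health_check("<IP address>")` returns roughly 7 seconds and the health check response timeout is set to 5 seconds, `fits_response_timeout(7.0, 5.0)` is `False`, which matches the timeout failures described above. A single measurement is only a hint; repeating it a few times gives a more reliable picture.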

Yet another reason for an HTTP 503 error is that the backend instances are under significant load or have reached their connection limit. You can check the CPU utilization of your backend instances by going to

CloudWatch > Metrics > EC2 > Per-instance metrics > CPU utilization of a particular instance

If you see consistently high CPU utilization for an instance, the backend instance is at capacity and its connections have reached their maximum, which leads to failed requests. This also causes the surge queue length to build up, which means the instances are unable to process incoming requests as fast as they are being received.
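What counts as "consistently high" is a judgment call. As a rough illustration, the sketch below flags an instance when most of its CPU samples sit at or above a threshold. The function name `is_consistently_high`, the 80% threshold, and the 90% fraction are all assumptions for the example, not AWS recommendations; tune them to your workload.

```python
from typing import Sequence


def is_consistently_high(cpu_percent: Sequence[float],
                         threshold: float = 80.0,
                         min_fraction: float = 0.9) -> bool:
    """True if at least min_fraction of the samples are at or above threshold.

    Hypothetical helper; threshold and min_fraction are arbitrary defaults.
    """
    if not cpu_percent:
        # No samples means there is nothing to conclude.
        return False
    high = sum(1 for sample in cpu_percent if sample >= threshold)
    return high / len(cpu_percent) >= min_fraction


print(is_consistently_high([95, 92, 97, 90]))  # sustained load
print(is_consistently_high([95, 20, 30, 25]))  # a single spike
```

Using a fraction of samples instead of a simple average keeps one short spike from being mistaken for sustained load.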

Besides CPU utilization, there may also be memory or network bottlenecks if the backend instances are too small to handle the incoming traffic. You can resolve this by scaling out (adding more instances) or by scaling up (moving the instances to a larger instance type that can handle your workload).
