Survive backend redeployment with a DNS failover OkHttp client

The what

Recently, I struggled with a problem that shows up whenever our back-end team redeploys their web service. Once the new stack is ready, the old stack gets deleted, and my Android client application, which had been working perfectly for the previous 2 hours (it is an internal app, so it runs 15 hours a day without being closed), starts getting a 503 response code on every call to that service. I didn’t know why: the endpoint hadn’t changed, the new stack was up, and the same request returned 200 from curl or Postman. The interesting part is that the Android client goes back to normal only after about 5 minutes of 503 panic… But for us, unfortunately, 5 minutes of outage is a total disaster.

The why

Luckily, our back-end guys are awesome and could tell me what happened. The one to blame here is the keep-alive connection.

A brief explanation of keep-alive connections

Basically, when a client sends a request with a Connection: keep-alive header and the server replies with a response that also carries Connection: keep-alive, the underlying connection is kept open and is reused if the client makes another request to the same server. Of course, such a connection has a lifetime, for example 10 seconds after the response is received.

How is the keep-alive connection related to the problem?

As mentioned above, the connection is reused. Consequently, the client does not do a DNS lookup; it keeps using the same host IP address as in previous requests. However, the new stack has a different IP address. Now you probably see the problem. After the old stack is deleted, the previous IP address is no longer valid, but the client keeps ‘harassing’ it until the connection is closed by the timeout. After that, a new connection is established and everything is back to normal. By default, an OkHttp client has a ConnectionPool with a keep-alive timeout of 5 minutes, which explains why my app got 503 for exactly that long.

The how

In this section, I will only discuss the client's point of view. Maybe there is a way to fix this on the server side or by editing the AWS configuration, but let's not go into it here (because I am not an expert in AWS, sadly 😢).

OK, the quickest and easiest fix is not to use keep-alive connections at all. Problem solved. As they say, you don’t have to fix it if you don’t use it 😆 😆 😆
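For illustration, one way to turn off connection reuse in OkHttp is to give the client a pool that keeps no idle connections. This is a minimal sketch, assuming OkHttp 4.x; the function name buildClientWithoutKeepAlive is made up for this example and is not from our project.

```kotlin
import java.util.concurrent.TimeUnit
import okhttp3.ConnectionPool
import okhttp3.OkHttpClient

// Hypothetical helper: builds a client that never reuses a connection.
fun buildClientWithoutKeepAlive(): OkHttpClient =
    OkHttpClient.Builder()
        // maxIdleConnections = 0 means a connection is closed as soon as it
        // becomes idle, so every request opens a fresh connection.
        // keepAliveDuration must be positive, so 1 second is only a placeholder.
        .connectionPool(ConnectionPool(0, 1, TimeUnit.SECONDS))
        .build()
```

The price is a new TCP (and TLS) handshake for every request, which is exactly the latency cost that keep-alive normally saves.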

But what if the app is required to get responses fast? As I said, it is an internal app used by workers and other employees, so the faster the responses, the faster they work, and the company benefits. My second attempt at solving the problem is very straightforward. If the client ever gets a 503, it closes all idle connections (keep-alive connections that are not currently in use) and retries once (only once, because I do not want the app to get stuck in an endless loop; a 503 may occur for other reasons too).

Enough theory, let's dive into the code

The code is fairly self-explanatory. I created an Interceptor that takes a ConnectionPool as a constructor property. All connectionPool.evictAll() does is close every idle connection.
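An interceptor along these lines could look like the sketch below. It is a reconstruction from the description in this post rather than the exact code from our project, and it assumes OkHttp 4.x (where the status code is the response.code property) and that the interceptor is registered as an application interceptor via addInterceptor.

```kotlin
import okhttp3.ConnectionPool
import okhttp3.Interceptor
import okhttp3.Response

class DNSFailOverInterceptor(
    private val connectionPool: ConnectionPool
) : Interceptor {

    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        val response = chain.proceed(request)

        // Anything other than 503 is passed through untouched.
        if (response.code != 503) return response

        // Release the failed response first: a connection that is still
        // carrying a response body is not idle, and evictAll() only
        // closes idle connections.
        response.close()
        connectionPool.evictAll()

        // Retry exactly once. The retry needs a new connection, which
        // triggers a fresh DNS lookup. Whatever comes back, even another
        // 503, is returned to the caller as is.
        return chain.proceed(request)
    }
}
```

Calling chain.proceed() twice is only allowed in application interceptors; a network interceptor has to call it exactly once, so this class should not be added with addNetworkInterceptor.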

Where can I get the ConnectionPool to inject inside this fancy Interceptor?

Probably when I create the OkHttp client.

My colleague wrote the awesome function that builds our OkHttp client. What I changed is that I define my own ConnectionPool and pass it to both the OkHttpClient.Builder and the DNSFailOverInterceptor, which is added after all the other interceptors (yes, it must be the last one). In this piece of code, I used the default ConnectionPool, which has a keep-alive timeout of 5 minutes, because I think that would not break anything in our project (obviously) 😃. But you can configure it however you want (e.g. extend the timeout).
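The wiring described here can be sketched roughly as follows. This is not my colleague's actual function: the name buildOkHttpClient and both parameters are assumptions made for this example, and the failover interceptor is passed in as a factory so that the snippet compiles on its own.

```kotlin
import okhttp3.ConnectionPool
import okhttp3.Interceptor
import okhttp3.OkHttpClient

// Hypothetical builder function; names and parameters are illustrative.
fun buildOkHttpClient(
    otherInterceptors: List<Interceptor>,
    failoverInterceptor: (ConnectionPool) -> Interceptor
): OkHttpClient {
    // Default pool: up to 5 idle connections, 5-minute keep-alive.
    val connectionPool = ConnectionPool()

    // The client must use the very same pool instance that the
    // interceptor evicts, otherwise evictAll() would have no effect.
    val builder = OkHttpClient.Builder().connectionPool(connectionPool)

    otherInterceptors.forEach { builder.addInterceptor(it) }

    // Added last, after all other application interceptors.
    builder.addInterceptor(failoverInterceptor(connectionPool))

    return builder.build()
}
```

With an interceptor class whose constructor takes the pool, the call would be something like buildOkHttpClient(listOf(loggingInterceptor), ::DNSFailOverInterceptor), where loggingInterceptor stands for whatever other interceptors the project already has. To extend the timeout, swap ConnectionPool() for something like ConnectionPool(5, 10, TimeUnit.MINUTES), which also needs an import of java.util.concurrent.TimeUnit.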

I also wrote unit tests for the interceptor to verify its logic, but I am not going to post them here. Let's say that is your task if you find this piece of writing useful 😃. Enjoy!

