How we debugged: UnresolvedAddressException issue faced in DR env.
Aug 23, 2022
Background
At Myntra, we are working on our BCP/DR plan. More details can be found in this blog.
- BCP(Business Continuity Plan): A plan to continue business operations.
- DR(Disaster Recovery): A plan for accessing required technology and infrastructure after a disaster.
As part of our DR plan, we were setting up our Order Taking Path services in a new data center.
When we deployed one of our CRM services in the DR environment, we ran into an odd recurring issue:
- Health check API was working fine.
- Since the health check API was working fine, the deployment was marked successful and the service was brought into the load balancer (LB).
- But on hitting any other service endpoint, we got an error response (UnresolvedAddressException), and nothing in any of the logs helped identify the issue.
Debugging steps
Below are the steps we took to debug the issue and identify its root cause:
Step 1
- Since no logs were available about the exception, we restarted the service in debug mode and also enabled debug logging for the org.apache and java.nio packages.
- Even then, the logs contained nothing that pointed to the root cause.
Step 2
- We have a log statement on the first line of the method executed by the API. Since this log was not getting printed, we concluded that the request itself was not reaching the application code.
- In our service, we use commons-logging/ELK, which provides logging-related in/out interceptors.
- These interceptors format the request/response payloads and log them.
- Since the LCS service was not present in the DR env, we suspected that it might be the cause of the UnresolvedAddressException.
- We tried pointing the LCS URL (http://xxx.myntra.com) to localhost and redeployed the service.
- We still faced the same issue, which ruled out LCS as the cause.
Step 3
- Next, we tried to understand what UnresolvedAddressException actually means.
- We found that UnresolvedAddressException is an unchecked exception (a subclass of RuntimeException), which means it is unlikely to be logged unless some code explicitly catches it.
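To make the point about unchecked exceptions concrete, here is a minimal sketch. The class name and the `send` method are hypothetical stand-ins, not code from our service; only `UnresolvedAddressException` itself comes from the JDK. It shows that the compiler does not force callers to declare or catch the exception, and that the exception carries no detail message.

```java
import java.nio.channels.UnresolvedAddressException;

class UncheckedDemo {
    // Hypothetical stand-in for a library call that fails on an unresolvable host.
    // No "throws" clause is required because the exception is unchecked.
    static void send() {
        throw new UnresolvedAddressException();
    }

    // UnresolvedAddressException -> IllegalArgumentException -> RuntimeException
    static boolean isUnchecked() {
        return RuntimeException.class.isAssignableFrom(UnresolvedAddressException.class);
    }

    public static void main(String[] args) {
        System.out.println(isUnchecked()); // prints "true"
        try {
            send();
        } catch (RuntimeException e) {
            // Without an explicit catch like this one, nothing in the calling
            // code would log the failure.
            System.out.println(e.getClass().getName() + " message=" + e.getMessage());
        }
    }
}
```

Because the exception has no message, even code that logs only `e.getMessage()` would print nothing useful, so logging the exception object or its class name is the safer habit.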
Step 4
- Since the health check API was working fine, we compared its JAX-RS server definition with that of the other service APIs to look for anomalies.
- Most of the configuration was the same, apart from CustomJAXRSInvoker, which is used in our other JAX-RS server definitions.
- CustomJAXRSInvoker takes care of profiling API requests.
- On carefully checking the code, we saw that the code around Profiler is not inside the try/catch block, so any exception thrown by the Profiler library would not be caught.
- Any other exception, checked or unchecked, thrown by the super.invoke method would be caught in the catch block and logged.
- We then looked into how the Profiler.setContext and Profiler.increment methods work:
- Profiler.setContext is a plain method that just sets the context.
- Profiler.increment increments the given metric and pushes it to the Statsd server.
- The Statsd library uses DatagramChannelImpl to send data over a datagram channel.
- The Net.checkAddress(var2) method used by DatagramChannelImpl.send throws UnresolvedAddressException if the provided host address is not resolvable.
- Then we checked the host that we were passing in statsd configuration.
- We were passing http://abc1.myntra.com as the host. This DNS entry was not present in the DR env.
- A while ago, Sysops had asked all services in the DC1 region to move to http://abc2.myntra.com
- It looks like this change was missed in our service.
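The failure described in this step can be reproduced in isolation. The sketch below is not the Statsd library's code: the class name, the host name `statsd.invalid`, port 8125, and the metric name are all assumptions chosen for illustration. It uses only the public `DatagramChannel` API, whose `send` method is documented to throw `UnresolvedAddressException` when the target address is not resolved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.UnresolvedAddressException;
import java.nio.charset.StandardCharsets;

class StatsdSendDemo {
    // Returns true if sending a datagram to the target fails with
    // UnresolvedAddressException, false if the send goes through.
    static boolean failsAsUnresolved(InetSocketAddress target) {
        try (DatagramChannel channel = DatagramChannel.open()) {
            // A StatsD-style counter line; the metric name is made up.
            ByteBuffer payload =
                    ByteBuffer.wrap("orders.count:1|c".getBytes(StandardCharsets.UTF_8));
            channel.send(payload, target);
            return false;
        } catch (UnresolvedAddressException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // createUnresolved skips the DNS lookup, which mimics a host name
        // that has no DNS entry in the current environment.
        InetSocketAddress missing = InetSocketAddress.createUnresolved("statsd.invalid", 8125);
        System.out.println(failsAsUnresolved(missing));
    }
}
```

The address check happens before any packet is sent, which is why the request fails immediately instead of timing out.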
Fix
- We pushed a change to move the statsd host to http://abc2.myntra.com and redeployed the service. This fixed the issue.
- We have also made a fix in CustomJAXRSInvoker to handle such unchecked exceptions.
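One possible shape for the invoker fix is sketched below. This is not the actual CustomJAXRSInvoker code: the `Profiler` interface, the `invoke` signature, and the failure counter are hypothetical and exist only to keep the example self-contained. The idea is that a profiling failure is caught and logged, and the request handler still runs.

```java
import java.util.function.Supplier;

class GuardedInvokerSketch {
    /** Hypothetical stand-in for the profiling library. */
    interface Profiler {
        void increment(String metric);
    }

    private final Profiler profiler;
    int profilingFailures = 0; // exposed only so the sketch is easy to check

    GuardedInvokerSketch(Profiler profiler) {
        this.profiler = profiler;
    }

    // Runs the handler and records a metric; a profiling failure never
    // prevents the handler from executing.
    <T> T invoke(String metric, Supplier<T> handler) {
        try {
            profiler.increment(metric);
        } catch (RuntimeException e) {
            // Log the exception object itself: some unchecked exceptions,
            // such as UnresolvedAddressException, have no message.
            profilingFailures++;
            System.err.println("Profiling failed for " + metric + ": " + e);
        }
        return handler.get();
    }

    public static void main(String[] args) {
        GuardedInvokerSketch invoker = new GuardedInvokerSketch(metric -> {
            throw new java.nio.channels.UnresolvedAddressException();
        });
        System.out.println(invoker.invoke("orders.get", () -> "ok"));
    }
}
```

Treating metrics as best effort is a design choice: losing a counter increment is usually cheaper than failing a customer request.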
Key Learnings
- Debugging a generic issue step by step works better than looking at multiple things at once. Solve one step, then move to the next.
- Knowing the request execution flow helps in correlating the issue better. In our service, various custom interceptors (for request logging or for profiling) run before a request lands in the application code. They are very easy to miss while debugging.
- When using any library, it helps to understand its implementation, especially the exception/failure flow. Once we know that flow, we can handle it in our code and document it properly.
- During this debugging process, we also found that we were pushing metrics synchronously. It is advisable to use an asynchronous metric publisher so that metric pushes do not block request execution threads. We have made this change as well.
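The last learning can be illustrated with a minimal sketch of an asynchronous publisher. This is an assumption about what such a publisher could look like, not the implementation we shipped: the class name, the `Consumer<String>` sender, and the drop-on-overflow policy are all illustrative choices. Request threads only enqueue work; a single background thread does the actual send.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class AsyncMetricPublisher implements AutoCloseable {
    private final ExecutorService worker;
    private final Consumer<String> sender; // hypothetical hook that performs the network send

    AsyncMetricPublisher(Consumer<String> sender, int queueCapacity) {
        this.sender = sender;
        // One background thread and a bounded queue; when the queue is full,
        // new metrics are silently dropped instead of blocking the caller.
        this.worker = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.DiscardPolicy());
    }

    // Called on the request thread: enqueues the send and returns immediately.
    void increment(String metric) {
        worker.execute(() -> {
            try {
                sender.accept(metric + ":1|c"); // StatsD counter line format
            } catch (RuntimeException e) {
                System.err.println("Metric publish failed for " + metric + ": " + e);
            }
        });
    }

    // Stops accepting new metrics and waits briefly for queued ones to be sent.
    @Override
    public void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape, an unresolvable statsd host would surface as a logged error on the background thread rather than as a failed API response.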