Debugging a Microservice for Slow API Responses

Jatin Kheradiya
MiQ Tech and Analytics
5 min read · Jul 6, 2020

In today’s world of microservice architectures, just keeping the services up is not sufficient. There needs to be proof that these services are up and responding within the agreed SLA. We faced exactly this problem with one of our microservices at MiQ. A common monitoring platform connects to all microservices and keeps track of the health of the overall system by hitting a basic health API on each service. There is also a time limit attached: if a service does not respond within this limit, it is assumed to be down. So even when the service was up, the dashboard showed false negatives of it being down. This article covers how we proved that the service was up and reduced the delay in its API responses.
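To make this concrete, here is a minimal sketch of what such a health check could look like, assuming a plain HTTP /health endpoint and the 2-second limit mentioned later in this article; the URL is hypothetical, and the real checks run from a central monitoring platform rather than a standalone class:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HealthCheck {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        // Hypothetical URL; in our setup the call goes through a shared ingress controller.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://ingestion-service/health"))
                .timeout(Duration.ofSeconds(2))   // respond within the limit or be marked DOWN
                .GET()
                .build();

        try {
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            System.out.println(response.statusCode() == 200 ? "UP" : "DOWN");
        } catch (Exception slowOrUnreachable) {
            // A timeout or connection failure counts as DOWN until the next check, a minute later.
            System.out.println("DOWN");
        }
    }
}
```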

Ingestion Service

Let us first get an overview of the architecture of the ingestion service at MiQ. It uses an external open-source tool, StreamSets Data Collector (SDC), to run the pipelines and transfer the data. Users can create pipelines in the ingestion service either directly via REST APIs or via our in-house UI tool, Analytics Platform, which communicates with the ingestion service via Kafka-based eventing. There are two more services involved: the dataset service and the schema service. The dataset service interacts with a variety of external connectors to read and write data. All our microservices are deployed in Kubernetes.

Some stats about the ingestion service:

  • Runs 250+ pipelines at a variety of frequencies, ranging from every 5 minutes to weekly.
  • Transfers roughly 4 TB of data per day, with file sizes ranging from a few KB to 20 GB.
  • Calls the dataset service via REST API 30,000+ times per day.
  • Receives pipeline updates from SDC via REST API around 700 times per day, and around 6,000 events per day via the messaging service.

Monitoring Components

The service is a Vert.x-based application exposing REST APIs. Our health API reports the service’s UP status to Prometheus every minute, and this is displayed on a Grafana dashboard. Our initial view of the service availability on the Grafana board looked somewhat like the graph below:

As you can see from the graph above, the service was shown as down many times a day. Making it worse, if the service appeared down for a particular health API call, it stayed marked as down for the whole minute until the next call was made. This does not look good for a service that is supposed to be highly available. Don’t worry, we figured it out, and that’s what this article is about.
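For context, the health API itself is intentionally trivial. A minimal Vert.x sketch of such an endpoint might look like the following; the port, route path, and response body are illustrative assumptions, not our exact implementation:

```java
import io.vertx.core.Vertx;

public class HealthEndpoint {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        // Lightweight HTTP server: /health answers immediately, everything else is a 404.
        vertx.createHttpServer()
                .requestHandler(req -> {
                    if ("/health".equals(req.path())) {
                        req.response()
                                .putHeader("content-type", "application/json")
                                .end("{\"status\":\"UP\"}");
                    } else {
                        req.response().setStatusCode(404).end();
                    }
                })
                .listen(8080);
    }
}
```

The point is that this endpoint does no real work, so any multi-second response time has to come from somewhere else in the process or on the path to it.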

Is the request reaching the service?

Since the service is called by multiple users, the primary question was whether the request was even reaching the service. A common ingress controller routes the API calls to the microservices, and the /health API call from the central monitoring platform passes through this ingress controller first, so we suspected that the controller might be taking time to connect to the service.

We monitored a couple of hundred requests continuously and checked the ingress controller logs, which showed the overall time taken by our service’s health API.

We also tried placing an API gateway between the ingress and the service, and observed that the gateway’s response times were similar. So the ingress controller was not the culprit; the delay was mostly within our own service.

Verifying the service fault

Just to be fully sure that the service was actually taking a long time to respond, we ran a very basic script to hit the service’s health API a hundred times, once every few seconds. Since our service is deployed on Kubernetes, we ran the same script both from within the pod and from outside it. Both times we got similar results: our basic health API calls were taking up to 25 seconds. This step also helped us rule out the Nginx sitting at the service level.
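We used a quick shell script for this, but the same probe is easy to sketch in Java; the URL, interval, and call count below are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical in-pod URL; from outside the pod this would go through the ingress instead.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/health"))
                .GET()
                .build();

        for (int i = 1; i <= 100; i++) {
            long start = System.nanoTime();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("call %3d -> status %d in %d ms%n", i, response.statusCode(), millis);
            Thread.sleep(3_000);  // hit the health API every few seconds
        }
    }
}
```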

CPU & Memory Check

On Grafana, we have dashboards showing CPU usage, heap memory usage, threads used, etc. So as the next step, we looked into the CPU usage and heap memory consumption. Neither was increasing significantly when our service was shown as pseudo-down (due to the delay in the health API response). We also checked Java Flight Recorder by generating a JFR file inside the pod, copying it out, and analysing it with the JMC tool. Nothing significant showed up in JMC either.
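We read these numbers off the Grafana dashboards, but the same heap and thread figures can also be sanity-checked from inside the JVM; the snippet below is only a sketch of that idea, not something from our setup:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmStats {
    public static void main(String[] args) {
        // Heap usage as seen by the JVM itself.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long usedMb = heap.getUsed() / (1024 * 1024);
        long maxMb = heap.getMax() / (1024 * 1024);

        // Live thread count, including daemon threads.
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();

        System.out.printf("heap: %d/%d MB, threads: %d%n", usedMb, maxMb, threads);
    }
}
```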

Service Logs

After wandering around a lot of external factors, we decided to cross-check the service itself by going through its logs again, focusing specifically on the times when the service took longer to respond. Our service calls other microservices over HTTP, and on closer observation we identified that we were logging huge JSON requests and responses. Once we removed the request/response logging, the delay in the health API improved slightly. This was our first major victory.
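The fix itself is small. Here is a sketch of the before and after, assuming SLF4J-style logging; the class and field names are hypothetical:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DatasetClient {
    private static final Logger log = LoggerFactory.getLogger(DatasetClient.class);

    void handleResponse(String requestId, int statusCode, String responseBody) {
        // Before: the entire JSON payload was written to the logs on every call.
        // log.info("dataset response for {}: {}", requestId, responseBody);

        // After: log only a small summary -- enough to debug, cheap to write.
        log.info("dataset response for {}: status={}, bytes={}",
                requestId, statusCode, responseBody.length());
    }
}
```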

Service Logs Deep-dive — more to verify

After the previous step, the delay was somewhat under control, but it was still going well above the 2-second threshold set by the common monitoring tool. Our ingestion service logs were now much cleaner after removing the request and response payloads.

With cleaner logs, we noticed that the ingestion service was making a lot of repeated requests to one specific external microservice, though the delay was not caused by that external service itself. We were firing around 60 requests/sec for a span of a few minutes, and this burst repeated every half an hour. We then analysed the network IO and found it was high during the periods of delayed responses: the average network IO was roughly under 200 KBps, but whenever there was a huge burst of external API calls it rose to 500+ KBps.

So we identified how we could refactor our primary service so that it does not need to call the external service’s APIs this frequently.
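One common way to cut down such repeated calls is to cache the external responses for a short while instead of re-fetching them every time. The sketch below illustrates the idea with a plain in-memory cache and a hypothetical client; it is not our exact refactor:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Serves recently fetched responses from memory instead of hitting the external API on every call. */
public class CachedExternalClient {

    private static final class Entry {
        final String value;
        final Instant fetchedAt;
        Entry(String value, Instant fetchedAt) { this.value = value; this.fetchedAt = fetchedAt; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl = Duration.ofMinutes(5);  // tune to how fresh the data needs to be

    public String get(String key, Supplier<String> remoteCall) {
        Entry entry = cache.get(key);
        if (entry != null && entry.fetchedAt.plus(ttl).isAfter(Instant.now())) {
            return entry.value;               // fresh enough: no network call at all
        }
        String fresh = remoteCall.get();      // only now do we hit the external service
        cache.put(key, new Entry(fresh, Instant.now()));
        return fresh;
    }
}
```

Whether caching, batching, or a purpose-built API is the right answer depends on the data; the goal is simply to stop paying the network cost 60 times a second.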

After following all these steps, our service was finally responding well within the time threshold. Below is the same Grafana dashboard after applying the changes:

Lessons Learnt

Microservices may still go down even when deployed in Kubernetes. What we need is a robust alerting module, support for high availability and disaster recovery, and, most importantly, a zeal to make the services more stable and reliable.

We also took away a few more learnings:

  • Do not log your requests and responses, even if the payloads are small. If needed, log only a summary of the request/response, such as its size, a count, or an identifier.
  • Avoid unnecessary network/REST API calls. For example, if you only need a count, do not call a getAllEntities API and then extract the count from the result; it is better to add a new API that returns just the count (see the sketch after this list).
  • Make your microservices, and the microservices they depend on, highly available.
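To illustrate the count example in the list above, a minimal Vert.x sketch of such an endpoint might look like this; the route path and the repository are hypothetical:

```java
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;

public class EntityCountApi {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        HypotheticalEntityRepository repository = new HypotheticalEntityRepository();

        vertx.createHttpServer()
                .requestHandler(req -> {
                    if ("/entities/count".equals(req.path())) {
                        // Returns a single number instead of serialising every entity.
                        req.response()
                                .putHeader("content-type", "application/json")
                                .end(new JsonObject().put("count", repository.count()).encode());
                    } else {
                        req.response().setStatusCode(404).end();
                    }
                })
                .listen(8080);
    }

    /** Stand-in for a real data store; only here to keep the sketch self-contained. */
    static class HypotheticalEntityRepository {
        long count() { return 42L; }
    }
}
```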
