Our CEO often quotes the saying “What gets measured gets done.” The advice applies in every field: business, finance, technology and beyond. You need to measure a problem before you can solve it.
For products that power businesses and consumer journeys, how do we measure the efficiency of business operations and the quality of end-user experiences? For example, how long does it take to process inventory updates across multiple services, or how many end-user journeys are affected by system errors?
Business operations and end-user journeys are powered by the software services and systems we build. In a service-oriented architecture, services communicate with each other via APIs. Measuring the performance of those APIs, meaning their availability and latency, is essential to measuring operational efficiency and end-user experience.
Higher-level business and system metrics are tied to low-level telemetry. We need to measure the right API performance metrics in order to measure the efficiency of business operations and user journeys.
In this post I will talk about how we built a dashboard that shows the real-time performance of our APIs.
About the Dashboard
It shows the following performance metrics for each of our APIs over a selected period:
- Error Percentage
- Requests per Second (RPS)
- Number of 4xx and 5xx API calls
- Percentile latencies (90th, 95th, 99th)
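To make these metrics concrete, here is a minimal Python sketch of how they could be computed from raw call records for one API over one time window. The `ApiCall` record, the `summarize` function and the choice to count both 4xx and 5xx responses as errors are assumptions made for illustration; this is not the dashboard's actual code.

```python
import math
from dataclasses import dataclass


@dataclass
class ApiCall:
    """One logged API call (hypothetical record shape)."""
    status: int        # HTTP status code of the response
    latency_ms: float  # total request latency in milliseconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(calls, window_seconds):
    """Compute the dashboard metrics for one API over one time window."""
    total = len(calls)
    if total == 0:
        return {"rps": 0.0, "error_pct": 0.0,
                "count_4xx": 0, "count_5xx": 0, "latency_ms": {}}
    count_4xx = sum(1 for c in calls if 400 <= c.status < 500)
    count_5xx = sum(1 for c in calls if 500 <= c.status < 600)
    latencies = sorted(c.latency_ms for c in calls)
    return {
        "rps": total / window_seconds,
        # Assumption: both 4xx and 5xx responses count as errors.
        "error_pct": 100.0 * (count_4xx + count_5xx) / total,
        "count_4xx": count_4xx,
        "count_5xx": count_5xx,
        "latency_ms": {p: percentile(latencies, p) for p in (90, 95, 99)},
    }
```

Percentiles are used rather than averages because a handful of very slow calls can hide behind a healthy-looking mean.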
How is it powered?
All our inter-service communication happens via APIs, and every inter-service API call goes through an API gateway called Kong. That is, when service A calls an API of service B, the call is routed through the gateway.
Kong is built on top of NGINX. It processes each API request and proxies it to the upstream service API. A detailed explanation of Kong is out of scope here, but you might want to read this blog post to learn more.
The most powerful capability of Kong is that it can execute plugins as it proxies requests to the upstream service. A Kong plugin is a piece of code (typically written in Lua) that hooks into the request and response lifecycle, for example running before the upstream API is called or after the response has been sent.
Several Kong plugins are available, for example plugins for traffic control, security, monitoring and logging. Browse the complete list of Kong plugins here.
We used the TCP Logging plugin, which pushes logs of API requests and responses to a TCP server. We created the TCP server using Logstash’s tcp input plugin. Logstash pipes this data to Kafka using its kafka output plugin. We then use the Elasticsearch connector for Kafka Connect to move the data from Kafka to Elasticsearch.
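To give a feel for the data that flows through this pipeline, here is a small Python sketch that flattens one JSON log line into the fields a dashboard would need. The nested field names (`service.name`, `request.uri`, `response.status`, `latencies.request`, `latencies.proxy`, `latencies.kong`, `started_at`) follow Kong's log format as I understand it and should be checked against your Kong version; the flat output names are hypothetical. In the pipeline described above Logstash receives these lines, so the Python is only an illustration of the data's shape.

```python
import json


def flatten_log_entry(line):
    """Turn one JSON log line into a flat record (assumed field names)."""
    entry = json.loads(line)
    request = entry.get("request", {})
    latencies = entry.get("latencies", {})
    return {
        "service": entry.get("service", {}).get("name"),
        "method": request.get("method"),
        "path": request.get("uri"),
        "status": entry.get("response", {}).get("status"),
        # Total time for the call as seen by the gateway.
        "latency_total_ms": latencies.get("request"),
        # Time spent waiting on the upstream service.
        "latency_upstream_ms": latencies.get("proxy"),
        # Time spent inside the gateway itself (plugins, routing).
        "latency_gateway_ms": latencies.get("kong"),
        "started_at_ms": entry.get("started_at"),
    }
```

Missing fields come back as `None` instead of raising, so one malformed log line does not stop the rest from being processed.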
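The entire pipeline looks like this: Kong (TCP Logging plugin) → Logstash (tcp input, kafka output) → Kafka → Elasticsearch connector → Elasticsearch.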
With this data pipeline, the request and response information, status and latencies of every API call arrive in Elasticsearch in near real time. This is powerful.
One might ask why Kafka is in this pipeline at all. Logstash already has an output plugin for Elasticsearch, so we could have sent the data straight from Logstash to Elasticsearch.
We route it through Kafka to make sure we write into Elasticsearch at a fixed rate. Without Kafka, Logstash would write directly into Elasticsearch, and during traffic bursts the incoming volume could exceed Elasticsearch’s write throughput. To avoid that, we either provision more powerful machines for Elasticsearch or write into it at a fixed rate. Kafka acts as a buffer in the middle and holds the data while it is written into Elasticsearch at that rate. The obvious trade-off is that the dashboard becomes near real time rather than real time, because data queues up in Kafka topics.
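The buffering effect can be illustrated with a toy simulation. This Python sketch has nothing Kafka-specific in it: it only models a queue sitting between a bursty producer and a writer that drains at most a fixed number of records per tick. The function name and the numbers in the usage example are made up for illustration.

```python
from collections import deque


def simulate_buffer(arrivals_per_tick, drain_per_tick):
    """Simulate a queue between a bursty producer and a fixed-rate writer.

    Returns (writes_per_tick, backlog_per_tick).
    """
    queue = deque()
    writes, backlog = [], []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # The producer enqueues everything that arrived during this tick.
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # The writer drains at most `drain_per_tick` records per tick.
        written = 0
        while queue and written < drain_per_tick:
            queue.popleft()
            written += 1
        writes.append(written)
        backlog.append(len(queue))
    return writes, backlog


if __name__ == "__main__":
    # A burst of 10 records in the second tick, drained at 3 per tick.
    print(simulate_buffer([2, 10, 1, 0, 0], drain_per_tick=3))
```

The writer never exceeds its cap, and the backlog that builds up during the burst is exactly the delay that makes the dashboard near real time.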
We built a small service that exposes APIs to fetch this data from Elasticsearch and power the dashboard.
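As a rough idea of what such a service might send to Elasticsearch, the sketch below builds a search body that groups calls by API path and asks for latency percentiles and 4xx/5xx counts. The function name and the index field names (`service`, `path`, `status`, `latency_total_ms`, `started_at_ms`) are hypothetical, and the sketch assumes `service` and `path` are mapped as keyword fields. It only constructs the request body; sending it to a cluster is left out.

```python
import json


def build_metrics_query(service, start_ms, end_ms):
    """Build an Elasticsearch search body for per-API metrics (assumed fields)."""
    return {
        "size": 0,  # only aggregations are needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"started_at_ms": {"gte": start_ms, "lt": end_ms}}},
                ]
            }
        },
        "aggs": {
            "per_api": {
                "terms": {"field": "path", "size": 100},
                "aggs": {
                    "latency": {
                        "percentiles": {
                            "field": "latency_total_ms",
                            "percents": [90, 95, 99],
                        }
                    },
                    "status_4xx": {
                        "filter": {"range": {"status": {"gte": 400, "lt": 500}}}
                    },
                    "status_5xx": {
                        "filter": {"range": {"status": {"gte": 500, "lt": 600}}}
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_metrics_query("inventory", 0, 60_000), indent=2))
```

Requests per second and error percentage can then be derived from each bucket's document count, the 4xx/5xx counts and the length of the time window.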
Beyond API performance, this data gives us many more insights and enables more use cases:
- Inter-service communication graph: which services communicate with each other, and which APIs of a service are used by other services. This is very handy when changing or deprecating APIs.
- Alerts on API performance: we built alerts that fire when API response times are high or API calls are failing.
- Kubernetes pod identification: we also get the IP address of the pod that served a request. With this we can trace a request to the exact pod and check that pod’s health and logs. This is very useful for developers.
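As an illustration of the first use case, here is a minimal Python sketch that builds a caller-to-API graph from log records and looks up who would be affected by changing a given API. The record fields `caller`, `service` and `path` are hypothetical; how the calling service is identified in the logs depends on how the gateway is configured.

```python
from collections import defaultdict


def build_call_graph(records):
    """Map each caller to the upstream APIs it uses, with call counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for record in records:
        api = (record["service"], record["path"])
        graph[record["caller"]][api] += 1
    # Convert to plain dicts so lookups do not create empty entries.
    return {caller: dict(apis) for caller, apis in graph.items()}


def callers_of(graph, service, path):
    """List the callers that would be affected by changing one API."""
    return sorted(c for c, apis in graph.items() if (service, path) in apis)


if __name__ == "__main__":
    logs = [
        {"caller": "orders", "service": "inventory", "path": "/items"},
        {"caller": "orders", "service": "inventory", "path": "/items"},
        {"caller": "search", "service": "inventory", "path": "/items"},
        {"caller": "orders", "service": "payments", "path": "/charge"},
    ]
    print(callers_of(build_call_graph(logs), "inventory", "/items"))
```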
This dashboard gives a single view of all APIs and is often the starting point for engineers and the support team when tracing production issues. Erroneous requests are then traced further in monitoring and APM tools like Grafana and Pinpoint to find the root causes.
So remember: “What gets measured gets done.”
Thank you for reading.