Way to Go for OneShop, Graphically — Observing REST API Latencies and Success/Failure Metrics through Grafana and Prometheus

Harpreet Singh Sohal
Deutsche Telekom Digital Labs
6 min read · Dec 23, 2020

Since the advent of OneShop, we have been trying to make things easier for internal management. On our way to taking OneShop to greater heights, we have been clearing hurdles one by one. One such problem we faced was analysing the performance of our REST APIs. Before we start on that, let's take a brief look at what OneShop actually is.

OneShop :

It is a centrally built e-commerce platform that complies with the requirements of all 12 National Companies (NatCos) in Europe, each with a different business portfolio and set of capabilities. It is currently operational in the Czech Republic, Poland, Hungary, Slovakia, and Macedonia.

As it is a central application for multiple NatCos, it keeps growing with new features, and each new feature brings a new set of challenges to deal with.

Here at DTDL, we brainstorm on every challenge we face and build strategies to mitigate it so that we can deliver a high-quality product. When the need to analyse REST APIs graphically came up, we did our part in researching the best way to achieve it and then implemented it as well as we could.

Need of the hour :

At OneShop we were already capturing track-logs for REST APIs, which are of course helpful for debugging an issue. The problem is that we have 18+ micro-services deployed on AWS EKS (Kubernetes). Whenever a business flow fails, finding the culprit API through track-logs becomes a cumbersome task, because we are aggressively developing new features in parallel, so debugging requires a developer's support. We needed to analyse :

  1. How an API is performing in terms of latency.
  2. What is an API’s SUCCESS or FAILURE percentage.
  3. Number of times an API is called in, say, a minute.

Apart from these problems, developers faced another set of issues :

Need to understand the code flow, with new features being developed every fortnight.

Debugging through logs required rigorous analysis.

Visualisation of a business flow was also not easily achievable because of the sheer size of the system.

Finding the culprit API in a business flow was only possible after thorough analysis.

Finding the Best Solution :

First, a summary of the problem we were facing.

Problem :
Let's take a simple example: the "Add To Cart" journey. A user comes to OneShop and adds an item to the cart. At the backend, there are many APIs playing their roles. For example :

  1. An API call from the UI hits the BFF (Backend-for-Frontend) Spring Boot service, which in turn calls our Shopping Cart micro-service and persists the newly created cart in the DB.
  2. A second API call hits the Sales Catalog micro-service to fetch offerings (details of the cart items) from HAL (the NatCo backend) and/or the local storage maintained at DTDL.
  3. A third API call from the Shopping Cart micro-service hits the Sales Catalog micro-service, which in turn calls the NatCo backend to validate the cart.

A few more APIs come into the picture, but we will keep it simple for better comprehension.
Now, this journey might break somewhere within this flow of subsequent API calls, or some API might take more time than expected, which contributes to a bad user experience. We were interested in identifying the culprit.

Exploring Solutions :
We started researching how to address the problems we were facing and found that generating "Metrics for APIs" would be the best fit for our use case. Plenty of tools are available online, both paid and free, that can be integrated to generate metrics, but we were only interested in resolving the problems listed above and nothing more (more features invite more complexity in the system).

We explored two famous solutions available :

  1. ElasticSearch + MetricBeat + Kibana (ELK)
  2. Prometheus + Grafana

How Prometheus and Grafana won :

Since we were interested in identifying the culprit API, we generated metrics for all the APIs involved in the user journey in question. Prometheus & Grafana emerged as the winner because Prometheus is built on a time-series database that stores API metrics with millisecond accuracy, which can then be queried and visualised in the Grafana UI. We can now aggregate these metrics to get the average failure or success rate and the average latency of the APIs involved in a user journey with better precision, and visualise all of it through graphs in the Grafana UI, which solved our problem.
Other factors contributing to Prometheus & Grafana's victory are :

  1. Documentation :
    When you start on something new, good documentation helps you understand and visualise things better. Prometheus & Grafana have better documentation than ELK, and the ELK documentation was missing the things we were specifically interested in, so for our use case we started leaning towards Prometheus & Grafana.
  2. Setup :
    MetricBeat takes much more time to set up than Prometheus. In MetricBeat there are different modules, and each module has to be enabled separately. Moreover, the MetricBeat version has to be compatible with Elastic, otherwise it won't work.
  3. Ease of Use :
    Setting up Prometheus & Grafana is far easier than setting up ELK. Prometheus's query language (PromQL) has good documentation for monitoring and querying the generated metrics. We can add our own custom metrics using "Micrometer", which did not seem feasible with ELK.
  4. Visualisation :
    Visualisation in Kibana and Grafana is comparable; Kibana provides a set of existing queries to visualise data, while Grafana gives us far more flexibility in terms of queries because it is backed by PromQL.
  5. Alerting :
    Prometheus & Grafana provide better options for alerting. We can customise alerts and send them to many channels; the Grafana documentation lists the supported notification channels.
  6. Metrics :
    MetricBeat only gives a certain set of metrics out of the box. There might be ways to achieve what we wanted through MetricBeat, but when we compared MetricBeat with Prometheus & Grafana, Prometheus & Grafana were the clear winner.

Solution with Prometheus and Grafana :

Metrics : A metric is a quantifiable measure that is used to track and assess the status of a specific process (an API in our case).

  • We need to move to a strategy where we talk about and analyse the following only in terms of numbers :
    1. API performance
    2. API success/failure
    3. Business flow state
  • Alerts will be generated on :
    1. failure percentage above a threshold
    2. latency above a threshold
  • The threshold percentage or latency can vary from one business flow to another.
  • An individual failure will no longer be treated as an alert on its own.

Implementation of Prometheus and Grafana :

Architecture of implementation :

For implementation, we need three things in place :

  1. Micrometer
  2. Prometheus (data exposed at actuator endpoint)
  3. Grafana

We added the micrometer-core module and the Prometheus registry as dependencies in pom.xml :

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
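
Besides the two Micrometer dependencies, Spring Boot Actuator needs to be on the classpath and the Prometheus endpoint has to be exposed, otherwise nothing is served at /actuator/prometheus. A minimal sketch of what that typically looks like (standard Spring Boot dependency and property; the exact setup in our services may differ) :

<!-- Spring Boot Actuator serves the /actuator/prometheus endpoint -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

# application.properties : expose the prometheus (and health) endpoints over HTTP
management.endpoints.web.exposure.include=health,prometheus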

We created a custom annotation to put on Controller methods in each of the micro-services participating in a business flow we wanted to monitor.

Example :

@LogMe(enableMetrics = true, metricName = "GET_CART")

"enableMetrics = true" → records metrics for the method it is applied on.

"metricName = GET_CART" → name of the metric generated.

At the code level, this custom annotation uses the Timer interface of the Micrometer library to actually record the metrics.
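
The @LogMe annotation itself is internal to our codebase, but the idea behind it can be sketched with a Spring AOP aspect that wraps the annotated method in a Micrometer Timer. The names and structure below are illustrative assumptions, not our actual implementation :

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Custom annotation placed on controller methods that should emit metrics.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@interface LogMe {
    boolean enableMetrics() default false;
    String metricName() default "";
}

// Aspect that times every @LogMe-annotated method and tags the outcome.
@Aspect
@Component
class LogMeAspect {

    private final MeterRegistry meterRegistry;

    LogMeAspect(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Around("@annotation(logMe)")
    public Object recordMetrics(ProceedingJoinPoint joinPoint, LogMe logMe) throws Throwable {
        if (!logMe.enableMetrics()) {
            return joinPoint.proceed();
        }
        Timer.Sample sample = Timer.start(meterRegistry);
        String errorType = "none";
        try {
            return joinPoint.proceed();
        } catch (Exception e) {
            errorType = e.getClass().getSimpleName();
            throw e;
        } finally {
            // publishPercentileHistogram() emits the *_seconds_bucket series that
            // histogram_quantile() in PromQL works on; errorType separates success from failure.
            // Common tags such as env, owner, serviceId and tenant would be set on the registry.
            sample.stop(Timer.builder(logMe.metricName())
                    .tag("errorType", errorType)
                    .tag("operationName", joinPoint.getSignature().toShortString())
                    .publishPercentileHistogram()
                    .register(meterRegistry));
        }
    }
}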

The metrics generated by Micrometer are exposed at the "/actuator/prometheus" endpoint. This endpoint is scraped by the Prometheus server at a predefined interval (15 seconds in our case), and the scraped data is saved in Prometheus's time-series database. After learning a bit of PromQL, querying the time-series DB was easy.
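
On the Prometheus side, the scrape is configured in prometheus.yml. A minimal sketch matching the paragraph above; the job name and targets are placeholders, and on Kubernetes the targets would typically come from kubernetes_sd_configs rather than a static list :

global:
  scrape_interval: 15s                        # how often Prometheus pulls metrics

scrape_configs:
  - job_name: 'oneshop-services'              # placeholder job name
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['shopping-cart:8080', 'sales-catalog:8080']   # illustrative targets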

Example of data exposed at “/actuator/prometheus” endpoint :

GET_CART_seconds_bucket{env="dev",errorType="none",operationName="CartReadController-getShoppingCart-API",owner="ONESHOP",serviceId="eshop",tenant="in",le="0.001048576",} 0.0

After setting up Prometheus as a data source, creating dashboards/panels, and applying PromQL in the Grafana UI, we are good to go for monitoring the graphs generated from our APIs.
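
One way to register Prometheus as a data source is through Grafana's provisioning files (it can just as well be added by hand in the Grafana UI). A sketch with a placeholder URL :

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090               # placeholder; points at the Prometheus server
    isDefault: true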

Example of PromQL used in Grafana UI :

(histogram_quantile(0.99, sum(rate(GET_CART_seconds_bucket{tenant="in"}[2m])) by (le, operationName, owner))) * 1000

Below is how the latency graph actually looks, based on multiple queries like the one listed above :

Visualising APIs in Grafana
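
Latency is only half of the picture. For the SUCCESS/FAILURE percentage panels, a query along the following lines can be used. It is a sketch built on the errorType label and the _count series exposed by the Timer, not our exact dashboard query :

(
  sum(rate(GET_CART_seconds_count{tenant="in", errorType!="none"}[2m]))
  /
  sum(rate(GET_CART_seconds_count{tenant="in"}[2m]))
) * 100

The same expression, compared against a per-flow threshold, is what the alerting strategy described earlier keys on.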

For further information on Prometheus and Grafana for instrumentation and monitoring, explore the official documentation and these blogs :

Prometheus : https://prometheus.io/docs/introduction/overview/
Grafana : https://grafana.com/docs/grafana/latest/
Installation : https://www.callicoder.com/spring-boot-actuator-metrics-monitoring-dashboard-prometheus-grafana/
Alerts Set Up : https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/
Good Practices : https://tech.willhaben.at/monitoring-metrics-using-prometheus-a6d498dfcfba
Micrometer & Prometheus : https://micrometer.io/docs/registry/prometheus
Prometheus and Grafana with Spring Boot : https://stackabuse.com/monitoring-spring-boot-apps-with-micrometer-prometheus-and-grafana/
Histogram Queries : https://kausal.co/blog/prometheus-histograms-multiple-queries/
Rate Function : https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/
Tracking Request Durations : https://povilasv.me/prometheus-tracking-request-duration/
