Using service graphs to reduce MTTR in an HTTP-based architecture


Microservices are a trend in software development that many organizations around the world are adopting. From e-commerce websites to social network platforms, everybody is talking about this new way to design software architectures. If you have requirements like scalability and high availability, or if you want to reorganize your teams around business capabilities, this approach can help you. A lot.

The growth of the "Microservices" term in Google Search in the last 5 years (source: Google Trends)

At B2W, we adopted this new paradigm in mid-2014, and since then we have had enormous gains on our projects, but at a price. Sometimes a huge price. Microservices bring new challenges, and in my opinion, dealing with many dependencies is the biggest one. As your business grows, you'll probably add more services over time. Consequently, the problems related to monitoring and operating the platform will grow just as fast.

If you didn’t architect your platform considering basic aspects such as resiliency, it’s not rare for a single component (Microservice) to break the whole system, and the consequence can be thousands, or even millions, of dollars in losses.

The problem

A situation that we must avoid at B2W is to stop selling. When something goes wrong, we need to know what happened as soon as possible. But since we have a lot of Microservices, sometimes finding the problem is like finding a needle in a haystack… a real nightmare!

We have been using a lot of monitoring and alerting tools to help us in day-to-day operations. We also use logging extensively, recording every relevant piece of information in our ELK stack. But as I said before, in a Microservice architecture the whole matters. Even if you designed your architecture to be resilient, there are moments when a unified view of the whole system is what matters most, especially when you have a symptom like alarms firing in many applications/services at the same time.

The motivation

InVisualize is a visual tool intended to provide insights into LinkedIn’s platform health and, inspired by it, I started to think about how we could implement something similar at B2W.

B2W is an e-commerce platform and our revenue depends a lot on our availability. We must be online 24x7. But reality is tough, and as Werner Vogels says:

“Everything fails all the time”

There is a specific measure that is very common when we're talking about maintainability. It is called MTTR (mean time to repair).

It measures how long it takes to solve a problem, that is, how long we take to identify an unexpected situation and fix it.

Guided by this mindset from Mr. Vogels, and considering that our MTTR must be as low as possible, I got back to the question: how can we monitor hundreds of Microservices in an efficient manner?

The idea

When our platform is degrading, we usually consider three aspects:

  • Latency: response time between two services;
  • Error rate: HTTP 5xx family errors;
  • Throughput: the number of requests from one service to another in a given period.

The throughput metric alone doesn't mean much in isolation, but it can be the cause of a high-latency scenario. In the same way, a high-latency scenario can explain a high error rate.

But how do we know that? Using levels. Each indicator can be at one of three levels (one at a time): normal, warning, and critical. They are self-explanatory.
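To make this concrete, here is a minimal Go sketch of how an indicator value could be mapped to a level; the thresholds shown are hypothetical, not the ones we use in production:

```go
package main

import "fmt"

type Level string

const (
	Normal   Level = "normal"
	Warning  Level = "warning"
	Critical Level = "critical"
)

// classify maps an indicator value to one of the three levels,
// given the warning and critical thresholds for that indicator.
func classify(value, warning, critical float64) Level {
	switch {
	case value >= critical:
		return Critical
	case value >= warning:
		return Warning
	default:
		return Normal
	}
}

func main() {
	// Hypothetical error-rate thresholds: 1% = warning, 5% = critical.
	fmt.Println(classify(0.2, 1.0, 5.0)) // normal
	fmt.Println(classify(2.5, 1.0, 5.0)) // warning
	fmt.Println(classify(7.0, 1.0, 5.0)) // critical
}
```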

To represent a relationship between two services, we use graphs. A graph is a very appropriate data structure to represent relationships, and in our case, what we want is to represent the relationships among services (N:N).

A simple graph relationship (source: Wikipedia)

And what about the visual representation of each indicator combined with its levels? That's a good question.

For latency, we're using the line style. If the line is solid, the situation is considered normal. If the line is dotted with small gaps, we consider it a warning. And if the line is dotted with long gaps, we're probably in a critical situation. For error rate, we're using the line color: blue represents normal, yellow represents warning, and red represents a critical situation. Finally, for the throughput indicator, we're using the line thickness: the thicker the line, the higher the throughput.


We must avoid scenarios like this:

When an e-commerce portal stops selling, that's the feeling (source: ripe)

During one of our Hackathons, I teamed up with four brilliant guys to develop InnerSection in 24 hours. The outcome was really incredible.

Our mission was to create a holistic view of our platform, represented as a graph containing all Microservices running in production. To validate my theory, I started to think about all the limitations we had without InnerSection, and this is what I got:

  • Alarms work very well, but they need to exist;
  • APMs are also good, but we must already know where the problem is;
  • Logs are necessary; however, all applications must log in a specific pattern, otherwise they will not be useful for correlating problems;
  • The more services we have, the more difficult it is to detect problems, so a visual representation looks interesting.

InnerSection's Features

  • The visual representation must be a service graph;
  • Three indicators that we use to infer a problem: error rate, latency, and throughput;
  • Three levels for each indicator: normal, warning, and critical;
  • Well-defined symbology per indicator and level.

InnerSection’s Premises

  • Easy to plug in: a minimum of configuration;
  • Agnostic: it does not depend on a specific technology or infrastructure design;
  • Not intrusive: it cannot impact the application;
  • No customization: you don't need to change anything in the application.

Technology stack and components

This approach allows us to keep using the best technology to solve each specific problem. Technology agnosticism is one of our premises at B2W.

InnerSection’s Stack — Big Picture

Below, the four main components of InnerSection's architecture.


The agent, developed in Go, monitors all TCP packets on a machine and identifies which ones are HTTP. We do this using libpcap, the same library that tcpdump uses.
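As a rough illustration of the classification step (the capture itself goes through libpcap), a TCP payload can be tested for an HTTP request line. This heuristic is a simplified sketch, not the agent's actual code:

```go
package main

import (
	"bytes"
	"fmt"
)

// httpMethods are the request-line prefixes we look for in a TCP payload.
var httpMethods = [][]byte{
	[]byte("GET "), []byte("POST "), []byte("PUT "), []byte("DELETE "),
	[]byte("HEAD "), []byte("OPTIONS "), []byte("PATCH "),
}

// looksLikeHTTPRequest reports whether a TCP payload starts with an
// HTTP request line.
func looksLikeHTTPRequest(payload []byte) bool {
	for _, m := range httpMethods {
		if bytes.HasPrefix(payload, m) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeHTTPRequest([]byte("GET /cart HTTP/1.1\r\n"))) // true
	fmt.Println(looksLikeHTTPRequest([]byte{0x00, 0x01, 0x02}))         // false
}
```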

The agent stores all packets/events in a local in-memory buffer, and when a size threshold (in bytes) or a time period is reached, it flushes them.

A flush sends a set of events, in a specific pre-formatted JSON, to an HTTP API that we call the Ingest API.

Using an agent helped us keep our premises of agnosticism, isolation, being easy to plug in, and not being intrusive. The agent uses the API's hostname to identify the target, and an OS environment variable to identify which API is the source.
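A minimal Go sketch of the buffer-and-flush behavior could look like this; the `Event` fields and the environment variable name are assumptions, not the agent's real schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Event is a simplified version of what the agent ships;
// the field names here are illustrative.
type Event struct {
	Source string `json:"source"` // calling API
	Target string `json:"target"` // provider API (from hostname)
	Status int    `json:"status"` // HTTP status code
	Millis int64  `json:"millis"` // response time
}

// Agent buffers events in memory and flushes when the buffer is full.
// (The real agent also flushes on a timer, e.g. with time.Ticker.)
type Agent struct {
	buf     []Event
	maxSize int
	flush   func([]Event) // e.g. POST the JSON batch to the Ingest API
}

func (a *Agent) Add(e Event) {
	a.buf = append(a.buf, e)
	if len(a.buf) >= a.maxSize {
		batch := a.buf
		a.buf = nil
		a.flush(batch)
	}
}

func main() {
	// Hypothetical variable identifying the source API on this host.
	source := os.Getenv("INNERSECTION_SOURCE")
	a := &Agent{maxSize: 2, flush: func(batch []Event) {
		body, _ := json.Marshal(batch)
		fmt.Println(string(body)) // the real flush POSTs this to the Ingest API
	}}
	a.Add(Event{Source: source, Target: "cart-api", Status: 200, Millis: 12})
	a.Add(Event{Source: source, Target: "cart-api", Status: 500, Millis: 230})
}
```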

Ingest API

We're using MongoDB's async Java driver. It works with callbacks to ensure that all database operations are non-blocking. In the same way, the Ingest API is async too: every call returns a 202 HTTP status code, and another thread does the hard work of persisting the data.

We have plans to use a message broker, like Kafka or Kinesis, to decouple the production of data from its consumption and scale each side independently.

Graph API

We decided to calculate each score in this API rather than in the Ingest API because, if we change a business rule associated with the score calculation, we won't need to reprocess the entire dataset.


The frontend constantly polls fresh data from the Graph API; it is like a living organism.

I know that at this point you may be thinking about a better and more efficient way to retrieve data. We thought the same. Something bidirectional, like WebSocket, makes sense for this use case.

How do we calculate the scores for each level?

There are two important things here. The first is that all relationships are 1:1, that is, caller (API A) -> provider (API B). All indicators (error rate, latency, and throughput) and levels are also defined per relationship. This is how the data is persisted in MongoDB.

The second is that some ranges are fixed, based on our experience and empirical observation.

Considering that, we have:

Error Rate (fixed range)

Latency (fixed range)

Throughput (dynamic range)

We consider the throughput (service calls) of all the services in the graph, and from this we define three levels based on the median.

To widen the range of the "warning" group, whose lower bound is the median, we apply a factor of 1.01. The same goes for the "critical" group, where we apply a factor of 1.05.

To clarify what I'm saying, please consider this specific set (RPM K): {190, 100, 80, 55, 45, 10, 5}.

The median of this set is 55. For the critical threshold, to avoid outliers, we drop the maximum (190) and take the average of the remaining numbers above the median, in this case (100, 80). The average is 90.

Applying the factors to the rule above, the three levels look like this: normal < warning (55,550 RPM) < critical (94,500 RPM)
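The worked example above can be reproduced with a short Go function; it assumes the rule described here (median for warning, outlier-trimmed average of the upper half for critical) and an odd-sized input of at least four values:

```go
package main

import (
	"fmt"
	"sort"
)

// throughputThresholds derives the warning and critical thresholds (in K RPM)
// from the throughput of every edge in the graph.
func throughputThresholds(rpm []float64) (warning, critical float64) {
	s := append([]float64(nil), rpm...)
	sort.Float64s(s)
	median := s[len(s)/2] // odd-sized set, as in the example

	// Critical group: values above the median, dropping the maximum
	// to avoid outliers, then averaged.
	upper := s[len(s)/2+1 : len(s)-1]
	sum := 0.0
	for _, v := range upper {
		sum += v
	}
	avg := sum / float64(len(upper))

	return median * 1.01, avg * 1.05
}

func main() {
	w, c := throughputThresholds([]float64{190, 100, 80, 55, 45, 10, 5})
	fmt.Printf("warning: %.2f K RPM, critical: %.2f K RPM\n", w, c)
	// warning: 55.55 K RPM, critical: 94.50 K RPM
}
```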

Show me!

Each node in the graph is a different Microservice.

InnerSection Working! (data is not from production environments)

Why does InnerSection support only HTTP services and not other components, like databases?

Furthermore, it would be insane to understand and monitor every protocol of every dependency (filesystems, message brokers, etc.) of a Microservice.


Future and Limitations

Another important thing is to support containers running on a host machine. Since a container normally runs only one process, a good feature would be to run the agent from the host machine.

We're also working on a better way to deal with applications that don't have a semantic hostname; in that case, it's difficult to identify the application in the service graph. Currently, you can configure this through a properties file, but ideally it should be automatic.

Finally, the main idea is to open InnerSection to the community, but some fine-tuning is necessary before releasing a public version.

Many thanks to Thiago Rodrigues, Rogerio Lacerda, João Paulo Faria and Guilherme Paixão for all the partnership and for believing in the idea of InnerSection.

In the next article, I'll discuss how to design an authorization and authentication layer for your Microservices.

Currently @OLX. Previously @Amazon Web Services and @B2W Digital.