Microservices Observability: How, when, and what to measure?

Filipe Tavares
Published in Pipedrive R&D Blog
Nov 2, 2022 · 9 min read

Introduction

Unlike many companies that organize their employees into teams, at Pipedrive we divide them into tribes.

A tribe’s purpose is to fully own the software development lifecycle for a specific business domain, from strategy to maintenance. Each tribe is made up of engineers, product managers, and product designers and is supported by other multi-disciplinary teams, including data analysts, researchers, product marketing, etc.

Tribes have full autonomy over their area, meaning they decide how to distribute the load.

If you want to learn more about how we work at Pipedrive, I recommend reading this article by our former CTO, Sergei Anikin.

Pipedrive is served by a microservices architecture comprising more than 700 microservices and over 28k running instances of those services.

We continuously develop our product, introducing new features, improving existing ones, and fixing vulnerabilities and bugs, which leads to an average of 150 deployments per day.

On top of that, to serve our customers across the globe, we're currently deploying to 6 locations, including a management region and excluding staging and testing environments.

Given the scale of deployments and to minimize issues, we deploy gradually: first to a subset of users and, once no issues are detected, to all other users.

While this measure is crucial to mitigate issues, it doesn't prevent them altogether. And this is where observability comes into play.

In this article, we'll introduce observability, explain why we use it, and review some patterns to take into account when building an observability stack.

Why observability

Let’s start by explaining why observability is important through an incident example.

Sometimes, users report product issues outside of our working hours. When this happens, the support team logs the issues and escalates them to our support engineering team. If the support engineers can’t identify the root cause, they escalate the issue to an incident and notify our engineering team, which always has someone on call.

The on-call engineer then starts investigating the issue, using whatever applications and tools help them identify its cause and fix the bug. Once that's done, everything is back to normal.

A few important considerations:

  • If our product is unreliable, users will stop using it. We should not be made aware of problems by our users
  • Bugs happen, and fixing them may require calling the on-call engineer outside of working hours
  • The core challenge is not resolving the issue but detecting it as early as possible

Observability helps address these concerns by allowing us to better understand our systems' behavior. Let’s look at what observability is and how it contributes to each of them.

What’s observability?

“Observability is a superset of monitoring.”

Monitoring doesn’t exist without observability. The better your observability is, the better monitoring you can set up.

Monitoring notifies us when something goes amiss and is, therefore, considered reactive. Observability, on the other hand, is considered proactive because it allows us to act before an issue occurs and, thereby, prevent it.

Observability will:

  • Provide granular insights into our system
  • Provide a high-level overview of our systems
  • Provide ample context on our product's inner workings, enabling us to discover deeper and often hidden issues

The three pillars of observability

  • Metrics — Help us detect issues in real time and provide a historical view of our system's behavior
  • Logs — Allow us to understand why the issue has occurred, especially if it has to do with the code
  • Traces — Help us pinpoint the issue in a granular, complex system

Observability challenges

Microservices Architecture

In a microservices architecture, you need to know your dependencies, since each service normally communicates with other services.

Multiple services mean maintenance work will also take longer, especially when you need to update libraries used for observability, like tracing and logging.

Multi-language

To provide a great experience for developers, observability should be independent of the service language.

Supporting multiple languages might be a burden on your team, so it's important to keep standardization in mind.

Volume

Always remember the more you want to observe, the higher the cost will be.

Software-as-a-Service

SaaS products usually entail certain challenges when it comes to multi-tenancy and multi-region setups, not only because customers must be served in different locations, but also because of the metrics cardinality needed to understand the performance of individual customers.

The observing experience should be as integrated as possible and independent of the number of locations and customers you are managing. This means you’d probably want to limit the number of tools you use.

Observability benefits

Now that we reviewed some of the challenges, it’s time to review the benefits of observability and their impact on your business.

Observability lets you understand your system and product and predict how they behave.

By understanding your system's behavior, you’ll be able to identify the root cause of issues, contributing to faster incident resolution by decreasing MTTD.

Mean time to detect (MTTD) is one of the main key performance indicators in incident management. It measures the average time between the start of an incident and the moment the organization identifies the issue.

Incident cycle

While observability is important, try to keep your alerts to a minimum and separate them into levels to avoid alert saturation and fatigue. Once you receive a warning, look into its trigger and try to address it to prevent potential malfunctions. Being proactive will help you tackle issues before they arise, leading to a better experience for both your developers and customers:

  • Developers will be able to deploy confidently knowing tracking is enabled, allowing them to roll back changes quickly if an issue arises
  • Customers will experience fewer incidents and a shorter resolution time thanks to fast detection

For example, we track message brokers’ lag and have different alert thresholds depending on lag time. Once the first threshold is reached, although it’s still not noticeable to our customers, we start investigating. Proactively investigating and fixing such issues drastically reduces the number of incidents our customers perceive. This process can be applied to several other use cases.
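As a rough illustration of how tiered thresholds like this can work, here's a minimal TypeScript sketch; the threshold values, names, and lag source are hypothetical assumptions for the example, not our production configuration.

```typescript
// Hypothetical tiered alert levels for message-broker lag (values are illustrative).
type AlertLevel = "ok" | "investigate" | "critical";

const LAG_THRESHOLDS_SECONDS = {
  investigate: 60, // lag we start looking into while it's still invisible to customers
  critical: 300,   // lag at which customers are likely to be affected
};

function classifyBrokerLag(lagSeconds: number): AlertLevel {
  if (lagSeconds >= LAG_THRESHOLDS_SECONDS.critical) return "critical";
  if (lagSeconds >= LAG_THRESHOLDS_SECONDS.investigate) return "investigate";
  return "ok";
}

// A 90-second lag would trigger an "investigate" alert before customers notice anything.
console.log(classifyBrokerLag(90)); // "investigate"
```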

Observability patterns

Now that we've covered the challenges and benefits of an observed system, let’s break down its patterns.

Healthcheck pattern

In a microservices architecture, it’s normal to have multiple instances of your service running simultaneously, both for redundancy and scaling purposes.

When a service is first deployed, it won’t be able to receive requests from any other party, external or internal, right away.

The healthcheck pattern allows your service to report its readiness to receive requests when prompted by your service discovery tool. Within a healthcheck validation, your service might perform checks against its different dependencies:

  • Databases
  • Message brokers
  • Storage
  • Other services
  • Etc.

If your service has no instances ready, this should trigger a critical alert. You might consider setting up an automation solution to try to remediate the failing service.

Healthcheck pattern
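To make the pattern concrete, here is a minimal sketch of a healthcheck endpoint in a Node.js/TypeScript service using Express; the route name, port, and dependency checks (checkDatabase, checkMessageBroker) are illustrative assumptions, not our actual implementation.

```typescript
import express from "express";

const app = express();

// Hypothetical dependency checks; in practice these would run a cheap query
// or verify an open connection against each dependency.
async function checkDatabase(): Promise<boolean> {
  return true;
}

async function checkMessageBroker(): Promise<boolean> {
  return true;
}

app.get("/healthz", async (_req, res) => {
  const [dbOk, brokerOk] = await Promise.all([checkDatabase(), checkMessageBroker()]);
  const healthy = dbOk && brokerOk;

  // 200 tells the service discovery tool this instance can receive traffic;
  // 503 takes it out of rotation until it recovers.
  res.status(healthy ? 200 : 503).json({ database: dbOk, messageBroker: brokerOk });
});

app.listen(3000);
```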

Logging pattern

Logging is a great indicator of your service’s stability. Logs give you detailed information about what went wrong, especially if they include stack traces, making it easy to look into the code and understand why it returned an error.

Since logs and services in a microservices architecture are scattered across your infrastructure, you should aggregate them in a central logging server. Consider a log processor if you want to standardize and/or filter them, for example, for GDPR compliance purposes.

Logs should be searchable and analyzable to enable quick discovery of existing issues. Include your infrastructure logs along with your service logs, so you can refer to them if anything happens to the infrastructure.

Logging pattern
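As an example of what structured, aggregation-friendly logs can look like, here is a small sketch using the pino library in TypeScript; the service name, region, and field names are made up for illustration.

```typescript
import pino from "pino";

const logger = pino({ level: "info" });

// A child logger attaches service-wide context to every log line,
// which makes logs easy to filter once they are aggregated centrally.
const serviceLogger = logger.child({ service: "deals-api", region: "eu" });

try {
  throw new Error("connection refused");
} catch (err) {
  // Logging the error object under "err" includes the stack trace,
  // so the failing code path is easy to find in the central logging server.
  serviceLogger.error({ err, requestId: "abc-123" }, "failed to reach the database");
}
```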

Application Metrics pattern

Although the name says “application,” this pattern covers both system/infrastructure metrics and the metrics your service itself exposes to describe its state.

Some examples of system metrics you might want to collect include CPU, memory, and network usage. You can be more creative with your application metrics depending on your needs. Some examples of information your application metrics should provide include the following:

  • Requests throughput per second per endpoint
  • Request duration per endpoint
  • Error rate

Remember: Retention comes at a cost. Use your judgment when determining the metrics for collection and retention time.

Application Metrics pattern
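Here is a minimal sketch, using the prom-client library in TypeScript, of how the metrics listed above could be exposed; metric names, labels, and buckets are illustrative assumptions.

```typescript
import { Counter, Histogram, collectDefaultMetrics, register } from "prom-client";

// System-level metrics (CPU, memory, event loop) exposed out of the box.
collectDefaultMetrics();

// Request duration per endpoint; throughput per second can be derived from the histogram's count.
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.3, 1, 3],
});

// Error rate is computed at query time as errors divided by total requests.
const httpRequestErrors = new Counter({
  name: "http_request_errors_total",
  help: "Total number of failed HTTP requests",
  labelNames: ["method", "route"],
});

// Example usage inside a request handler:
httpRequestDuration.labels("GET", "/deals", "200").observe(0.123);
httpRequestErrors.labels("GET", "/deals").inc();

// A GET /metrics handler would return this payload for the scraper to collect.
async function metricsPayload(): Promise<string> {
  return register.metrics();
}
```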

Distributed Tracing pattern

As multiple services can be involved in a single API call, we must propagate a unique identifier throughout the entire flow.

To understand your application in detail, you can enrich the collected traces with custom data.

Distributed tracing pattern

A trace is a collection of spans, which will give you detailed information about your system behavior.

Here’s an example of a call to Service A:

  • Service A calls services B and E sequentially
  • Service B calls services C and D sequentially
  • Every call duration within this trace would have an impact on Service A’s response time. This means you can easily see which spans are impacting your transaction times the most
  • The level of detail you can retrieve is up to you. Usually, calls to external services and infrastructure are instrumented by APM libraries out of the box, but you can create your custom instrumentation depending on your specific needs
An example of a distributed trace involving multiple services
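To show how a custom span fits into such a trace, here is a small sketch using the OpenTelemetry API in TypeScript, assuming a tracing SDK is already configured in the service; the tracer name, span name, and attributes are illustrative.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("service-a");

async function handleRequest(): Promise<void> {
  // startActiveSpan makes this span the parent of any spans created further down
  // the call chain, so the calls to services B and E show up nested under it.
  await tracer.startActiveSpan("handle-request", async (span) => {
    try {
      // Custom attributes enrich the trace with business context.
      span.setAttribute("customer.region", "eu");
      // ... call services B and E here ...
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```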

Retention

As mentioned above, retention comes at a cost. To keep your costs at bay, define metrics’ retention according to their long-term value. For example, endpoint response times are sometimes used for years, while CPU usage and logs are often used for 30 days or less, and traces can be retained for a week. Remember: The more granular the metric is, the more storage space it will occupy.

Developer Experience

Last but not least: Always take developer experience into account.

The tools you use should keep UX in mind. For example, information spread across multiple tools can result in a negative experience, especially given multi-region challenges.

Drive adoption of the tools and provide appropriate guidance for each. You’re not making the most of your observable system if the tools aren't used.

Lessons learned

A compilation of phrases to keep in mind.

“Never hear that your service is broken from your customers”

“You can only track what you have visibility over”

“You need to know your limits”

“Automate whenever you can”
