Implementing Observability for Microservices
How to use open source tools to implement observability for microservices
Introduction
One of the biggest challenges in maintaining a distributed system is troubleshooting when the system fails. The microservices architecture complicates this further, since it increases the distributed nature of the solution by design. Hence, it is of utmost importance that we build observability into the system from the beginning. The term “observability” describes how well the internal states of a system can be inferred from knowledge of its external outputs. We use various monitoring tools to capture the outputs that the system produces during execution, and then use those outputs to infer its internal state and troubleshoot failure scenarios.
Types of data to observe
When applications are developed and deployed to a computing infrastructure such as an on-premises data center or a cloud environment (e.g. AWS, Azure), we need to monitor the application from different perspectives to make sure failures are properly handled. At a high level, we can identify 4 types of data that we monitor in a distributed application.
- Server (host) monitoring: This is about monitoring the status of the server to make sure it performs well, without being underutilized or overutilized. We can monitor the CPU usage, memory usage, and load average on the server against a certain threshold value. Typically, we consider a server overutilized if the CPU usage goes over 60–70%. Monitoring tools watch for abnormal activity such as high CPU usage, high memory usage, and frequent error logs, and generate alerts and notifications so that the operations teams can act on them to fix the system and avoid future failures.
- Application monitoring (metrics): This is where we monitor the application-level statistics (metrics), such as latency, error rates, and request rates, so that we can make decisions on when to scale out the services based on the demand from the consumers.
- Log monitoring: This is where most of the observability-related information is monitored and accessed. Microservices will utilize logging to output the details of the system state via different log entries using different categories, such as INFO, ERROR, and DEBUG, which will then be used to infer the state of the system and identify the root causes of failures.
- Health monitoring: This is where the application’s health is monitored continuously through heartbeats, alerts, and notifications. The heartbeat check is the monitoring endpoint used by front-facing components such as load balancers to verify the availability of the server or the service itself. In most cases, this is a separate endpoint per server and/or per service that responds immediately with a simple response so that the client (for example, a load balancer) can verify the status. These health checks are run periodically at the load balancer level so that it routes the requests only to the available services.
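To make the health monitoring idea concrete, here is a minimal sketch of what a load balancer's periodic check might look like: probe each backend's health endpoint and route only to the ones that answer. The function names, the `/healthz` path default, the timeout, and the round-robin selection are all assumptions made for illustration, not the behavior of any particular load balancer.

```python
import urllib.error
import urllib.request
from itertools import cycle


def is_healthy(base_url: str, path: str = "/healthz", timeout: float = 1.0) -> bool:
    """Return True if the backend answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as unavailable.
        return False


def healthy_backends(backends, probe=is_healthy):
    """Keep only the backends that pass the probe (hypothetical helper)."""
    return [b for b in backends if probe(b)]


def pick_round_robin(backends, count):
    """Pick `count` routing targets from the healthy backends in round-robin order."""
    if not backends:
        return []
    rotation = cycle(backends)
    return [next(rotation) for _ in range(count)]
```

A load balancer would call something like `healthy_backends` on a timer and feed the result into its routing decision; passing a custom `probe` makes the filtering logic easy to exercise without a network.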
Let us discuss how we can design a microservices architecture with observability that monitors the above-mentioned types of data.
Microservices Observability Architecture
Let us take a simple example where we have an application with 2 microservices that communicate with each other using a messaging platform such as NATS. This architecture can be scaled to hundreds or thousands of microservices if required. For the sake of this article, we’ll take a solution with 2 microservices. The following diagram depicts how we can implement observability for such a solution.
The preceding figure consists of the following components.
- Microservices (A and B): The actual business logic is implemented here
- NATS server: Used for inter-service communication
- Promtail: Open source log collector tool
- Loki: Open source log aggregation tool
- Grafana: Open source data visualization tool
Let us discuss the components in detail below.
Each microservice has 2 APIs exposed for monitoring purposes.
- /healthz: This API provides information regarding the health and availability of the microservice. Load balancers and/or proxies use it to decide whether to route traffic to the service.
- /metrics: This API provides the metrics related to the microservice, such as latency, transactions per second (TPS), and CPU/memory usage.
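As a rough illustration of these two endpoints, the sketch below serves `/healthz` and `/metrics` using only the Python standard library. The metric names (`app_requests_total`, `app_errors_total`, `app_request_seconds_total`), the `Metrics` class, and `start_monitoring_server` are hypothetical names chosen for this example; a real service would more likely use a Prometheus client library than format the text by hand.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Metrics:
    """Thread-safe counters rendered in the Prometheus text exposition format."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests_total = 0
        self.errors_total = 0
        self.request_seconds_total = 0.0

    def observe(self, seconds: float, error: bool = False) -> None:
        """Record one handled request and the time it took."""
        with self._lock:
            self.requests_total += 1
            self.request_seconds_total += seconds
            if error:
                self.errors_total += 1

    def render(self) -> str:
        with self._lock:
            return (
                "# TYPE app_requests_total counter\n"
                f"app_requests_total {self.requests_total}\n"
                "# TYPE app_errors_total counter\n"
                f"app_errors_total {self.errors_total}\n"
                "# TYPE app_request_seconds_total counter\n"
                f"app_request_seconds_total {self.request_seconds_total}\n"
            )


def make_handler(metrics: Metrics):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # Respond immediately with a small body so callers can check availability.
                self._send(200, "application/json", json.dumps({"status": "ok"}))
            elif self.path == "/metrics":
                self._send(200, "text/plain; version=0.0.4", metrics.render())
            else:
                self._send(404, "text/plain", "not found\n")

        def _send(self, status, content_type, body):
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet; a real service would log requests

    return Handler


def start_monitoring_server(metrics: Metrics, port: int = 0):
    """Serve both endpoints in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(metrics))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The business logic would call `metrics.observe(...)` after each handled request, and `server.server_address[1]` gives the port that was actually bound.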
In addition to these APIs, each microservice outputs critical observability data to its own log file (A.log and B.log).
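A minimal sketch of such per-service logging, using Python's standard `logging` module, could look like the following. The `key=value` line format and the `build_logger` helper are assumptions made for this example; the only details taken from the article are the log levels and the per-service log file.

```python
import logging


def build_logger(service: str, log_path: str) -> logging.Logger:
    """Create a logger that appends timestamped, levelled entries to log_path."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.DEBUG)  # emit DEBUG, INFO, ERROR, and everything in between
    logger.propagate = False        # avoid duplicate entries via the root logger

    # Replace any handlers left over from an earlier call for the same service.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter(
        "%(asctime)s level=%(levelname)s service=%(name)s msg=%(message)s"
    ))
    logger.addHandler(handler)
    return logger


# Example usage (writes to A.log in the working directory):
#   log = build_logger("A", "A.log")
#   log.info("order received")
#   log.error("payment failed")
```

Keeping the level and service name as explicit fields on every line makes it easier for a log collector to filter entries later.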
The NATS server exposes its own health and metrics APIs via the /varz endpoint and other related endpoints. It also writes observability data about the server to a separate log file.
The Prometheus NATS exporter tool converts the data from the NATS server's monitoring endpoints and exposes it through a /metrics API so that Prometheus can collect it.
Prometheus collects all the metrics data via the metrics endpoints exposed by the microservices and the NATS server, and makes it available for analysis that supports critical business decisions. It integrates with Grafana for creating visualizations for further analysis.
Promtail collects all the log files, and Loki aggregates and indexes them for further analysis. The indexed log data is made available to Grafana for visualization.
Grafana provides the option to create custom visualizations based on the log and metrics data collected through Loki and Prometheus.
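To show the kind of conversion the exporter performs, here is a simplified sketch that turns a `/varz` JSON document into Prometheus text format. This is not the exporter's actual code: the selected fields, the `nats_varz_` prefix, and the function name are assumptions for illustration, and the real exporter's metric names and coverage may differ.

```python
import json

# Numeric /varz fields to expose; this selection is illustrative only.
VARZ_FIELDS = (
    "connections",
    "in_msgs",
    "out_msgs",
    "in_bytes",
    "out_bytes",
    "slow_consumers",
    "mem",
)


def varz_to_prometheus(varz_json: str, prefix: str = "nats_varz_") -> str:
    """Convert selected numeric /varz fields into 'name value' sample lines."""
    varz = json.loads(varz_json)
    lines = []
    for field in VARZ_FIELDS:
        value = varz.get(field)
        # Skip fields that are missing or not plain numbers (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}{field} {value}")
    return "\n".join(lines) + "\n"
```

Serving the returned text from a `/metrics` endpoint is what allows Prometheus to scrape the NATS server the same way it scrapes the microservices.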
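The collection step can be pictured as repeatedly reading each `/metrics` endpoint and comparing samples over time. The sketch below parses label-free samples from the text format and computes an average per-second increase between two scrapes. It is a deliberately simplified stand-in: the function names are hypothetical, labels and timestamps are not handled, and it does not reproduce how Prometheus itself stores or queries data.

```python
def parse_samples(text: str) -> dict:
    """Parse label-free 'name value' samples from Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        samples[parts[0]] = float(parts[1])
    return samples


def per_second_rate(earlier: dict, later: dict, name: str, interval_seconds: float) -> float:
    """Average per-second increase of a counter between two scrapes."""
    return (later[name] - earlier[name]) / interval_seconds
```

For example, if a request counter goes from 100 to 160 across a 15-second scrape interval, `per_second_rate` reports 4 requests per second, which is the sort of derived value a dashboard would plot.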
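For intuition about what a log collector sends, the sketch below builds a request body in the shape accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and timestamped lines. Promtail does much more than this (tailing files, tracking positions, batching), and the label names and helper functions here are assumptions chosen for the example.

```python
import json
import time
import urllib.request
from typing import Optional


def build_push_payload(labels: dict, lines: list, now_ns: Optional[int] = None) -> dict:
    """Wrap log lines in the streams/values structure used by Loki's push API."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    # Offsetting by the index keeps the original line order within the batch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


def push_to_loki(base_url: str, payload: dict) -> int:
    """POST the payload to a Loki instance and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

The labels (for instance a hypothetical `service` label set to `A` or `B`) are what Loki indexes, so choosing them well determines how easily the logs can be filtered in Grafana.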
Implementation
You can find a reference implementation of the above-mentioned architecture in the following GitHub repository.
Learn more
You can learn more about this by reading the book “Designing Microservices Platforms with NATS” from the link below.