Making a service observable requires appropriate instrumentation of the code, covering the important areas of the service's internal workings so that issues can be identified and resolved quickly, using the right mix of monitoring mechanisms: APM, Logging, Metrics Reporting and Distributed Tracing. While I will cover the various monitoring mechanisms at a high level, I am going to focus specifically on metrics measurement to gain visibility into specific areas of the service, taking guesswork out of the equation and relying on data points (metrics) as much as possible.
An observable service can help in the following areas:
- Reduce Mean Time To Detect issues: with many Microservices and dependencies across multiple components, things will fail, and we need to be able to identify issues quickly. The goal here should be to be proactive in finding issues.
- Once you have appropriate instrumentation in place, the next step is to define what normal looks like, come up with warn and error thresholds and the corresponding alerts, and integrate them with your escalation system/process so that you get notified and can take corrective action (automatically or manually).
- Reduce Mean Time To Repair issues: with all the data points available to look at, you can pinpoint the problem, and those data points become important evidence to build and/or prove a hypothesis about the root cause.
- Build dashboards: expose important metrics around usage, throughput, errors and latency, and group them together logically depending on the service's functionality and related components.
As you start the journey of making a service (and the overall infrastructure and software system) observable, your initial goal will likely be to improve Mean Time To Repair. Soon after you have a good handle on that, the goal should shift to improving Mean Time To Detect, and to thinking about what instrumentation is necessary to achieve those targets.
A Typical Service Has the Following Modules
I am familiar with Java and have experience with the Spring framework for web application and service development. The depiction below is for a Java/Spring Framework (Spring Boot, recently) based service and does not take a non-blocking implementation model into consideration, so you will need to adjust the points outlined here accordingly for your case.
First, let's cover the usage of each instrumentation mechanism. I have mentioned APM (Application Performance Management), Metrics and Logging as the three mechanisms that make a service observable (with Distributed Tracing covered under Logging below).
APM (Application Performance Management)
APM is generally an agent-based tool which you attach to the host/server and to the Java process, and it gives you a good starting point for monitoring. By default it provides host- and JVM-level metrics, which is a great head start. But APM alone will usually not be sufficient for the level of visibility we need to make a service observable. As the diagram above shows, getting details of the "outer shell" of the JVM is good, and nowadays APM tools also give you visibility into request-level metrics such as Throughput, Errors and Latency (RED - Rates, Errors, Duration). In this post, my focus is on adding instrumentation using custom metrics.
Logging
System Logs: these are the logs produced by the server and the other software you run to support your service, and can include Linux system logs (syslog, auth/secure, cron), container orchestration system logs (Kubernetes, AWS ECS, AWS EKS), and database server logs. Make sure that the right log level is enabled, generally Info level, and at the very least Warn level.
Access Logs: assuming we are talking about a service exposed over HTTP, a REST API or a web application, the service will produce request-level access logs. Embedded or standalone Java servers like Tomcat and Jetty produce access logs; make sure these logs are enabled and give out enough detail for you to troubleshoot issues. They also provide a good mechanism for understanding load/request patterns and, in turn, service usage.
Application Logs: application logs are produced by the service itself, by the libraries you are using, and, more importantly, by the code you have written. Make sure the right log level is used when you generate logs (trace, debug, info, warn, error, fatal, as appropriate and as supported by the logging framework you use). In Java, the common options are Java Util Logging, Log4J and Logback, and slf4j (a logging facade/abstraction which allows transparently switching the underlying logging framework) is a good choice. Also, choose a logging framework with support for Mapped Diagnostic Context (MDC), as it becomes very useful to attach a correlation-id or trace-id to every log message (automatically, via the log pattern configuration) as request processing goes through the various tiers inside your service implementation (and, when done correctly, across multiple services), giving you the needed visibility into a distributed Microservices system, for both sync requests and async workflows. For JVM-based applications, Garbage Collection logs should also be considered.
Distributed Tracing: distributed tracing is the industry-standardized way of tracing a (user) request as it flows through multiple Microservices to fulfill a single user request. Distributed tracing libraries help pass the tracing context (Trace Id and Span Ids) within a service and across services for commonly used frameworks, taking away the burden of passing that context explicitly. Tools built on top of the standardized tracing mechanism further provide a view of the world across many Microservices and help with observability. When distributed tracing is enabled, we can take the Trace Id and Span Id from the tracing context and add them to the MDC of the logging framework, so that they are available in every log message; this helps connect the dots, because the same Trace Id is used for distributed tracing as well as in your application logs.
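To make the MDC idea concrete, here is a minimal sketch of a servlet filter, assuming a blocking Servlet stack, the slf4j API, and a hypothetical X-Correlation-Id header and correlationId MDC key (tracing libraries typically do the equivalent for Trace Id and Span Id). With %X{correlationId} in the log pattern, every log line written while handling the request carries the id.

```java
import java.io.IOException;
import java.util.UUID;

import org.slf4j.MDC;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;

// Hypothetical filter: reads an incoming correlation id header (or generates one)
// and places it in the logging MDC so every log line for this request carries it.
public class CorrelationIdFilter implements Filter {

    private static final String HEADER = "X-Correlation-Id"; // assumed header name
    private static final String MDC_KEY = "correlationId";   // referenced as %X{correlationId} in the log pattern

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String correlationId = httpRequest.getHeader(HEADER);
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put(MDC_KEY, correlationId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove(MDC_KEY); // always clean up, request threads are reused by the server
        }
    }
}
```

The same correlation id should also be forwarded as a header on outgoing calls to dependencies so the logs line up across services.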
Custom Metrics with Code Instrumentation
On top of adding the right level of logging in your application, capturing or generating a correlation-id or trace-id for each incoming request and passing it to the next tier within the service and across service dependencies is a must for effectively troubleshooting issues in a distributed setup.
While application logs help with connecting the dots and following the request execution flow within and across services, they should not be used to aggregate and produce statistics from log messages (given the sheer volume of logs we end up producing). I would even argue that using application logs as a replacement for custom metrics to produce usage, throughput, error and latency statistics is an anti-pattern which must be avoided; generate logs at the appropriate level (info, warn, error) and keep log volume manageable. This is where custom metrics, created by instrumenting code explicitly, come in. Given the importance of these metrics, mature libraries now produce many of them automatically, but we still need to capture them and pass them on to a system which gathers metrics from across the whole infrastructure (Prometheus, Graphite, Collectd, AWS CloudWatch), so that we can query and visualize the relevant metrics in a dashboard, making that dashboard the place to monitor and troubleshoot service health.
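To illustrate what explicit code instrumentation looks like, here is a minimal Micrometer sketch; the metric names and the OrderService class are made up for illustration, and the registry you configure decides whether the measurements end up in Prometheus, Graphite, CloudWatch or another backend.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class OrderService {

    private final Counter ordersCreated;
    private final Timer orderLookupTimer;

    public OrderService(MeterRegistry registry) {
        // Counter for business-level throughput
        this.ordersCreated = Counter.builder("orders.created")
                .description("Number of orders created")
                .register(registry);
        // Timer for latency; averages/percentiles are derived by the registry/backend
        this.orderLookupTimer = Timer.builder("orders.lookup.latency")
                .description("Time spent looking up an order")
                .register(registry);
    }

    public void createOrder() {
        // ... business logic ...
        ordersCreated.increment();
    }

    public Order findOrder(String id) {
        // Times the lookup and returns its result
        return orderLookupTimer.record(() -> loadFromDatabase(id));
    }

    private Order loadFromDatabase(String id) { /* ... */ return new Order(); }

    static class Order { }

    public static void main(String[] args) {
        // In a Spring Boot app the MeterRegistry is injected;
        // SimpleMeterRegistry keeps this sketch self-contained.
        OrderService service = new OrderService(new SimpleMeterRegistry());
        service.createOrder();
    }
}
```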
Custom Metrics per Service Module
In this section, I will share my perspective on the important metrics an observable service should capture, starting with the Server and JVM and then walking through the important modules that a service is typically made up of.
Server
Capture server/host metrics and stats to see whether the server is healthy. Beyond standard uptime, CPU, memory, disk and network stats, make sure to capture packet drops, TCP connection open/reset counts, process counts and open file counts, so that spikes in any of these can help identify issues outside the service running on that server. Some or most of these metrics should be available by default if you have APM installed and configured properly; a small code sketch of what is reachable from inside the process itself follows the list below.
- Uptime
- CPU: Load Average, Utilization (User, System)
- Memory: Used, Available, Swap Used
- Disk Stats: I/O Requests, Latency, Disk Space Usage
- Network Stats: TCP/UDP Connection Open and Reset Counts, Packet Drop Count, Bandwidth Usage
- Process: Count
- Open Files: Count
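If you are not getting these from an APM agent, a minimal sketch of the host-level metrics reachable from inside the process, using Micrometer's built-in system binders (Spring Boot Actuator registers most of these automatically); the remaining items on the list, such as packet drops, TCP connection counts and disk I/O, are usually better captured by a host-level agent.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.system.FileDescriptorMetrics;
import io.micrometer.core.instrument.binder.system.ProcessorMetrics;
import io.micrometer.core.instrument.binder.system.UptimeMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class SystemMetricsConfig {

    public static MeterRegistry bindSystemMetrics(MeterRegistry registry) {
        new UptimeMetrics().bindTo(registry);          // process uptime
        new ProcessorMetrics().bindTo(registry);       // system load average, CPU usage
        new FileDescriptorMetrics().bindTo(registry);  // open file descriptor counts
        // Packet drops, TCP connection counts and disk I/O are better captured by a
        // host-level agent (APM agent, node_exporter, CloudWatch agent) than from inside the JVM.
        return registry;
    }

    public static void main(String[] args) {
        bindSystemMetrics(new SimpleMeterRegistry());
    }
}
```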
JVM
JVM-level metrics (substitute your own runtime, such as .NET) give visibility into a similar set of metrics as the server level, but from the point of view of the single process running your application/service. On top of standard uptime, CPU, memory and threads (not processes), make sure to capture details of the individual JVM memory pools (at least split into heap and non-heap memory) and Garbage Collection stats. You can extend JVM metrics by looking into the properties of certain JMX MBeans as well (the mechanism used in the old days to expose important metrics for monitoring with JConsole or JVisualVM). A small code sketch follows the list below.
- Uptime
- CPU: Utilization (User, System)
- Memory: by Pools (Heap, Non-Heap, Other) - Used, Available
- Garbage Collection Stats: Full GC, Parallel GC - Count and Time
- Threads: Count of Total, Live, Daemon
- Other: JMX Console MBeans
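A minimal sketch of the JVM-level equivalent, again assuming Micrometer; Spring Boot Actuator registers these binders out of the box, so in that case you get the same meters without writing any code.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class JvmMetricsConfig {

    public static MeterRegistry bindJvmMetrics(MeterRegistry registry) {
        new JvmMemoryMetrics().bindTo(registry);  // heap and non-heap usage, per memory pool
        new JvmGcMetrics().bindTo(registry);      // GC pause counts and times, per collector
        new JvmThreadMetrics().bindTo(registry);  // live, daemon and peak thread counts
        return registry;
    }

    public static void main(String[] args) {
        bindJvmMetrics(new SimpleMeterRegistry());
    }
}
```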
HTTP/Request
There are mainly two (or three) flavors of services: RESTful (HTTP) services, async/event consumers, and, third, batch (frequency-based) processing. Here we are referring to a service which exposes resources/endpoints for consumption by web/mobile/device applications or by other Microservices in the ecosystem. The focus in this section is on capturing high-level details of incoming requests and the latency served by the service, and on looking at the error breakdown.
- Min, Idle and Max Request Threads: Count
- In-Flight Requests: Busy Request Thread Counts - Request Threads Usage %
- Queued Requests: When Request Thread Pool gets exhausted, some requests could get queued (depending on Web Server config)
- Throughput: Number of requests received per minute or per second
- Latency: Avg, Percentiles, StdDev
- Errors: Counts and Rate. More generally, Response Counts by Response Status (2xx, 3xx, 4xx, 5xx) - Divide into Success, Client Errors and Server Errors
Additional consideration: assuming you are using the right HTTP status codes, you may want to keep 401s and 403s separate from 404s, as well as 502s and 504s (dependent systems' errors) separate from the server's own internal 500 errors. Depending on your specific scenario and maturity, this level of detail (separate counts for individual HTTP status codes) may not be necessary.
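As a sketch of what request-level RED instrumentation boils down to, assuming a Servlet stack and Micrometer with hypothetical metric and tag names; Spring Boot with Micrometer already publishes an equivalent http.server.requests timer, so you would normally not hand-roll this.

```java
import java.io.IOException;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;

// Hypothetical filter recording throughput, latency and status-class counts for every request.
public class RequestMetricsFilter implements Filter {

    private final MeterRegistry registry;

    public RequestMetricsFilter(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        Timer.Sample sample = Timer.start(registry);
        try {
            chain.doFilter(request, response);
        } finally {
            int status = ((HttpServletResponse) response).getStatus();
            String statusClass = (status / 100) + "xx";      // 2xx, 4xx, 5xx ...
            sample.stop(Timer.builder("http.requests")        // hypothetical metric name
                    .tag("status", String.valueOf(status))
                    .tag("class", statusClass)
                    .register(registry));                     // count gives throughput, recorded time gives latency
        }
    }
}
```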
Server, JVM and HTTP/Request level metrics should be available out of the box if you are using an Application Performance Monitoring (APM) tool/agent.
REST Resource
Capture the following three metrics per "REST resource" supported by the service, for example per first-level API route or context path in the URL. You can further divide throughput, latency and errors by HTTP request method (GET, PUT, POST, PATCH, DELETE), which roughly aligns with get/search, create, update and delete of the resource.
- Throughput: Requests Per Minute or Per Second
- Latency: Avg, Percentiles, StdDev
- Errors: Counts and Rate - By Response Code, Success, Client Error and Server Error
Additional consideration: if you have batch API calls, you will want to capture throughput, latency and partial success/error for batch requests separately, and also capture additional stats/metrics to understand the size of the batches (sent or received). This is important for understanding batch API usage better, and for correlating sudden changes in batch size with latency impact. For batch size, you may want to use meaningful histogram buckets or average and percentile values to have proper visibility, as in the sketch below.
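For the batch-size point, a minimal Micrometer sketch using a DistributionSummary; the metric name and BatchApiMetrics class are hypothetical.

```java
import java.util.List;

import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class BatchApiMetrics {

    private final DistributionSummary batchSize;

    public BatchApiMetrics(MeterRegistry registry) {
        // Records the distribution of batch sizes; percentiles make sudden shifts visible on a dashboard.
        this.batchSize = DistributionSummary.builder("api.batch.size")  // hypothetical metric name
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
    }

    public void handleBatch(List<?> items) {
        batchSize.record(items.size());
        // ... process the batch, counting partial successes/failures separately ...
    }

    public static void main(String[] args) {
        new BatchApiMetrics(new SimpleMeterRegistry()).handleBatch(List.of(1, 2, 3));
    }
}
```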
Health Check
Container orchestration systems expect two different types of probes: one to check availability (liveness) and a second to check readiness of the service. The liveness probe indicates whether the service is up and running at all, and the readiness probe indicates whether the service (and its required dependencies) is in a healthy enough state to serve traffic. Implementing these probes appropriately is key to avoiding containers being marked unhealthy and restarted unnecessarily when issues with required downstream dependencies are observed.
- Liveness Probe: Latency, Success and Error for this probe
- Readiness Probe: Latency, Success and Error for this probe
For failed cases, capture additional metrics: for example, the number of required dependencies having issues and the number of optional dependencies having issues.
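A minimal sketch of a readiness-style check, assuming Spring Boot Actuator and a hypothetical DatabaseClient for a required dependency; in Spring Boot you would then include this indicator in the readiness health group via configuration, keeping the liveness probe independent of dependency health.

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical readiness-style indicator: reports DOWN only when a *required* dependency
// is unavailable, so optional dependency failures do not take the node out of rotation.
@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    private final DatabaseClient databaseClient; // hypothetical client for a required dependency

    public DatabaseHealthIndicator(DatabaseClient databaseClient) {
        this.databaseClient = databaseClient;
    }

    @Override
    public Health health() {
        if (databaseClient.ping()) {
            return Health.up().build();
        }
        return Health.down().withDetail("requiredDependency", "database unreachable").build();
    }

    interface DatabaseClient {
        boolean ping();
    }
}
```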
Dynamic Config
By and large, Dynamic Configuration is just like any other dependency, but additionally capture how many configs have changed. That way you have a data point available to put on your dashboard, which comes in handy when you are trying to correlate events.
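A minimal sketch of counting config changes with Micrometer; the metric name and the listener hook are hypothetical, and you would call it from wherever your dynamic-config library notifies you of a change.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical listener that counts dynamic configuration changes, so config pushes can be
// overlaid on dashboards and correlated with latency or error spikes.
public class ConfigChangeMetrics {

    private final Counter configChanges;

    public ConfigChangeMetrics(MeterRegistry registry) {
        this.configChanges = Counter.builder("config.changes")  // hypothetical metric name
                .description("Number of dynamic configuration values changed")
                .register(registry);
    }

    // Invoke from your dynamic-config change callback.
    public void onConfigChanged(String key) {
        configChanges.increment();
    }
}
```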
Resiliency
Using a resiliency framework to wrap calls to downstream dependent systems is very important for dealing with failures and with occasional slower response times from a dependency. Capture the following stats per dependency or per logical group of dependencies.
- Throughput: Number of requests per minute or per second
- Latency: Avg, Percentiles, StdDev
- Errors: Count and Rate, Client or Server errors, retry-able errors or not.
Timeouts and Retries
- Timeouts: Count and Rate/Percentage of requests having timeouts.
- Retries: 1 retry, 2 retries, 3 retries - Count and Percentage of requests going through various numbers of retries (you must have some maximum attempt/retry limit set, so capture this accordingly)
Additional Consideration: further break down the number of retries and also capture the "reason for retry" in case of a retry-able error (depending on the response code from the dependency and your own error scenarios), as in the sketch below.
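A minimal sketch of capturing timeout and retry counts with Micrometer, tagged by dependency, attempt number and reason (all names are hypothetical); resiliency libraries such as Resilience4j also ship their own Micrometer integrations that publish similar metrics.

```java
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical helper: counts timeouts and retries per dependency, tagged by attempt number,
// so you can see what percentage of calls needs one, two or three attempts.
public class RetryMetrics {

    private final MeterRegistry registry;

    public RetryMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordTimeout(String dependency) {
        registry.counter("dependency.timeouts", "dependency", dependency).increment();
    }

    public void recordRetry(String dependency, int attempt, String reason) {
        registry.counter("dependency.retries",
                "dependency", dependency,
                "attempt", String.valueOf(attempt),
                "reason", reason).increment();
    }
}
```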
Circuit Breaker
- Circuit Open: Count of Circuit open events per minute
- Circuit Close: Count of Circuit close events per minute
Additional consideration: you want to capture the time elapsed between a circuit opening and closing. Circuit breaker status should also be reflected in your health check (the readiness probe, specifically): if a required dependency is unhealthy and its circuit has tripped, you do not want to recycle containers/servers, since restarting your service nodes won't help with recovery. If service nodes do get restarted because of a misconfigured probe in such a scenario, the churn of nodes restarting can actually hurt the overall recovery of the downstream dependency.
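A minimal sketch of counting circuit state transitions, assuming Resilience4j and Micrometer (the metric name and dependency name are hypothetical); Resilience4j's Micrometer module can also publish circuit breaker metrics for you.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CircuitBreakerMetrics {

    public static CircuitBreaker instrumented(String dependency, MeterRegistry registry) {
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults(dependency);
        // Count every state transition, tagged by the state the circuit moved into
        // (OPEN, HALF_OPEN, CLOSED), per dependency.
        circuitBreaker.getEventPublisher().onStateTransition(event ->
                registry.counter("circuitbreaker.transitions",
                        "dependency", dependency,
                        "toState", event.getStateTransition().getToState().name())
                        .increment());
        return circuitBreaker;
    }

    public static void main(String[] args) {
        instrumented("payment-service", new SimpleMeterRegistry());
    }
}
```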
Object Pool - Thread, Http/Database Connection or Object
Thread, connection and object pools are an optimization technique for reusing objects/resources that are costly to create. Instead of creating new threads, connections and objects every time, we create a pool up front, borrow from it, and return the resource when done. Keeping a close eye on how those pools are being used, whether they are maxing out their capacity or are unnecessarily set to high values, provides opportunities for optimization.
- Pool Size: Count - Min, Max, Idle, Current
- Pool Usage: % of the pool in use, i.e., what percentage of objects are currently checked out of the pool
- Usage Time: how long an object/connection stays checked out of the pool - capture Avg, Percentiles, StdDev
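A minimal sketch of gauging a plain thread pool with Micrometer (metric and pool names are hypothetical); connection pool libraries such as HikariCP typically come with their own Micrometer integration, which should be preferred where available.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ThreadPoolMetrics {

    public static ThreadPoolExecutor monitored(String poolName, MeterRegistry registry) {
        ThreadPoolExecutor pool =
                new ThreadPoolExecutor(4, 16, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // Gauges sample the pool state every time the registry is scraped/published.
        Gauge.builder("pool.size.max", pool, ThreadPoolExecutor::getMaximumPoolSize)
                .tag("pool", poolName).register(registry);
        Gauge.builder("pool.size.current", pool, ThreadPoolExecutor::getPoolSize)
                .tag("pool", poolName).register(registry);
        Gauge.builder("pool.active", pool, ThreadPoolExecutor::getActiveCount)
                .tag("pool", poolName).register(registry);
        Gauge.builder("pool.queued", pool, p -> p.getQueue().size())
                .tag("pool", poolName).register(registry);
        return pool;
    }

    public static void main(String[] args) {
        monitored("worker-pool", new SimpleMeterRegistry());
    }
}
```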
Cache
Adding a caching layer to a service to reduce load on the database and improve performance is common practice. It's key to understand, and design for, whether your application can work without the cache or whether it is required, and to make sure its availability (circuit breaker status, for example) is correctly reflected in the service's readiness check.
- Cache Usage: Count of Hits and Misses
- Cache Hit-Miss Ratio: % derived from the hit and miss counts above
- Latency: Avg, Percentile, StdDev
- Errors: Count and Rate
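A minimal sketch of hit/miss and latency instrumentation around a cache lookup, assuming Micrometer and hypothetical metric names; Micrometer also ships ready-made binders for common cache libraries, which should be preferred when they fit your setup.

```java
import java.util.concurrent.Callable;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical wrapper around a cache lookup: counts hits and misses and times the call.
// The hit-miss ratio is computed at query time (hits / (hits + misses)) on the dashboard.
public class CacheMetrics {

    private final Counter hits;
    private final Counter misses;
    private final Timer latency;

    public CacheMetrics(String cacheName, MeterRegistry registry) {
        this.hits = registry.counter("cache.hits", "cache", cacheName);
        this.misses = registry.counter("cache.misses", "cache", cacheName);
        this.latency = registry.timer("cache.latency", "cache", cacheName);
    }

    public <T> T get(Callable<T> cacheLookup) throws Exception {
        T value = latency.recordCallable(cacheLookup);
        if (value != null) {
            hits.increment();
        } else {
            misses.increment();
        }
        return value;
    }
}
```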
Async/Event Consumption
Asynchronous interaction between Microservices (using a queueing system like SQS, or a messaging system like Kafka) is a best practice in distributed systems: it decouples components from each other while still achieving the desired eventual consistency, and it handles long-running or batch processes. The following metrics should be captured to gain visibility into the status of the work that needs to be completed asynchronously.
- Backlog Size: Size of Backlog in Queue or Kafka Topic
- Age of message being consumed: Always keep event timestamp in message, and calculate Avg, Percentiles, StdDev of message age (current/consumption time minus event time)
- Throughput: Rate at which messages are being consumed per minute
- Latency: Time required for processing message after it has been received by consuming system. Avg, Percentile, StdDev. Not to be confused with Age of message.
- Error: Counts and Rate of processing messages
If you have a Dead Letter Queue configured, you should extend visibility into it for completeness by capturing the above metrics for the DLQ. Some of the above metrics should be available by default in the queueing or messaging system you are using, but capturing the age, throughput, error rate and latency of message processing is key to gaining visibility into the health of the consuming system.
For latency, age and other similar metrics, you may want to capture min and max, along with percentiles like p50 (median), p80, p95 and p99, for additional visibility into performance characteristics.
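A minimal sketch of consumer-side instrumentation with Micrometer, assuming the message carries its event timestamp; all metric names are hypothetical, and backlog size would typically come from the queueing/messaging system itself.

```java
import java.time.Duration;
import java.time.Instant;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical consumer instrumentation: records message age (now minus the event timestamp
// carried in the message) separately from processing latency, plus success/error counts.
public class ConsumerMetrics {

    private final Timer messageAge;
    private final Timer processingLatency;
    private final MeterRegistry registry;

    public ConsumerMetrics(MeterRegistry registry) {
        this.registry = registry;
        this.messageAge = Timer.builder("consumer.message.age")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
        this.processingLatency = Timer.builder("consumer.processing.latency")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
    }

    public void onMessage(Instant eventTime, Runnable handler) {
        messageAge.record(Duration.between(eventTime, Instant.now()));
        try {
            processingLatency.record(handler);
            registry.counter("consumer.messages", "outcome", "success").increment();
        } catch (RuntimeException e) {
            registry.counter("consumer.messages", "outcome", "error").increment();
            throw e;
        }
    }
}
```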
Instrument, Instrument, Instrument
In the absence of good code instrumentation to measure important aspects of an application's internal workings and health as metrics, I have seen teams trying to leverage application/access logs to figure out details which could have been made available much more easily (and quickly) as metrics.
Balance the use of metrics and logging; both serve a specific purpose, and the boundary can be blurry in some cases. If you are capturing errors and dividing them into logical groups (e.g., client or server errors), you need not add a metric for every exception you may be throwing; that would be overkill from a metrics standpoint, and I would lean on application error logs for that level of detail. On the other hand, dumping log entries with, for example, requests, latency and errors (and similar stats/data points for other areas) per minute is not a good use of logging either. The goal is to have "enough detail" available in a dashboard for you to figure out how healthy your service is, and, if it's not healthy, which area or dependency is causing the problem.
Metrics from across all modules of a service may seem overwhelming at first, but instrumenting your code to produce them should not be too difficult or time consuming when done correctly in a central place (at the framework level). Gathering these metrics will not add noticeable processing or memory overhead, as most of them are counters and average/percentile calculations which are done efficiently by application metrics libraries (e.g., Micrometer). Once you have the instrumentation and third-party library integrations in place to capture these metrics, you may ultimately decide to drop specific measurements like p50 and p80 if your Service Level Objectives and Agreements are only based on p95 or p99. It's important to be a good citizen of the monitoring infrastructure and make sure you report only what's absolutely necessary in each of your environments, to help with the sustainability and cost of that infrastructure. Detailed metrics might be useful in a performance test environment where you are tuning specific aspects of the service and trying to evaluate (and prove) the impact, whereas for production you may decide to drop certain measurements. If you have the ability to turn detailed metrics reporting on and off dynamically, that may be sufficient for the cases where you need more detail to troubleshoot issues.
The overall goal is to make sure that a building block (one service) of your software system has the right level of observability, covering metrics for the Server, JVM, incoming requests, the service's own internals, and its dependencies, which supports the goal of reducing Mean Time To Detect and Mean Time To Resolve issues for the service itself. Having good visibility into the health and internals of one service or component is good, but by no means sufficient to make your entire software system more observable. Repeat this for the other components of the software system and the related infrastructure (cache servers, database servers, load balancers, network, security, storage layers) to make the entire system observable, improving Mean Time To Detect and Repair for the whole infrastructure and system; the improved stability results in overall higher availability/uptime, which makes your customers (and hence everyone in the company) happy.
Depending on your specific architecture (monolith, Microservices or event-driven), your deployment setup (server-based, containers, serverless/functions), and whether you serve traffic in a blocking or non-blocking way, the specifics of how and what you measure will differ. The point is to think through each area carefully, get the right level of visibility by exposing metrics, have the right metrics on the right dashboards, set up alerts matching your SLO commitments, and integrate them with your escalation process to take it from there.
Please let me know your thoughts in comments. I will be happy to learn from others’ perspective and experiences in this area.