The Significance of Effective Observability

Richa
Deloitte UK Engineering Blog

For software developers involved in everyday software delivery, the situation below may sound all too familiar:

Developer: “I have fixed the issue now and it’s good to go.”

Tester: “I have added relevant test cases, and all seem to pass, so it’s ready to deploy to Prod.”

Unfortunately, following the deployment of the functionality to Production, issues began to surface within a few hours.

Developer: “Argh! I expected to see this log line, but I cannot understand why the code is not reaching that point!”

We are now in the midst of a typical development crisis — should we roll back the change, or add more log lines and attempt to debug in Production?

No matter which programming language is used, or whether we are building cloud-native or traditional applications, logging and monitoring are the “eyes and ears” once the application is live. Taking an investigative approach and closely analysing these distributed interactions can reveal various underlying issues (e.g. scalability problems, poorly written code, application security gaps or software configuration issues).

Effective logging and monitoring are skills that one acquires with experience (and sometimes from experiences such as the one above); some software engineers call it a “developer instinct”. Logging and monitoring are interconnected — while server statistics (e.g. CPU and memory usage) are valuable, it soon becomes essential to set up application monitoring, built on application logging, to gain deeper insights.

Logging

We have been logging without noticing — the first time we wrote “Hello, World!” to a terminal, we were logging. We were logging even before computer programs were a thing, recording transactions in a physical register. The slight difference is that programmatically generated logs need to be readable by humans and parseable by computers, and they form an append-only list (i.e. we can only ever add new entries; we cannot go back and change existing ones). For this article, “logging” refers to application logging.

Strange as it may sound, most of us will have come across applications that have minimal or no pre-existing logs. Also, depending on the programming language and logging library being used, logging itself can sometimes trigger an exception, e.g. a log line referencing a null/undefined object that is not wrapped in an Optional, or a circular reference causing an infinite loop.

Here are some of the recommendations around logging based on my experience:

  1. Adopt defensive logging — As we set traps for the inevitable and unsuspected bugs, let us not forget to include a meaningful and informative log message. Don’t swallow exceptions; instead, let them make appropriate noise so they can help in debugging. Try to create/use exception/error types so that, when they are thrown, the type explains the kind of problem (e.g. 408 indicates a request timeout, 504 a gateway timeout), the log line explains the context (e.g. calling Z’s API to get the share price prediction for X) and the impact (e.g. unable to calculate the share price of A when B happens).
  2. Choose the right logging format — The JSON (JavaScript Object Notation) format is highly effective for log messages and is widely compatible with common parsing tools, making it an ideal choice for efficient searching. When designing this format, think about which common attributes should be included in every log statement so that each statement can stand on its own and does not depend on context only available in earlier or later log statements (see the first sketch after this list).
  3. Use dynamic logging — Variables and objects can be added as structured fields in a logging message. Log lines with variables enable metric filters to be built using regexes (regular expressions). If log lines are used in regexes for metric filters/alarms, ensure the message always matches by making assertions part of unit testing.
  4. Use child loggers to add context — Using child loggers when threading or executing work in parallel can provide valuable information when debugging such applications under high load. In these cases, it is also useful if the underlying logging framework can emit an ascending sequence number, because log delivery (from source to log aggregation tool) may be asynchronous and subject to clock skew, so ascending timestamps are not necessarily the order in which the entries were generated.
  5. Be careful when logging sensitive data — Include only the information needed, or transform it, so that log lines do not contain sensitive information. A shared logging framework can help to obfuscate such fields in a common, reusable way.
  6. Add a unique correlation ID — To trace a journey or track the flow of events across microservices or other distributed systems, it is always beneficial to add a unique identifier to the incoming request or message, which can then be appended to all subsequent log messages, making it easier to debug in live environments.
  7. Consider costs associated with logging — Sometimes the costliest service in an application turns out to be the one ingesting all the logs, so use appropriate log levels and log only where necessary.
  8. Add test assertions for logs — The presence of certain logs can be verified as part of unit and integration test assertions, as they serve as effective checkpoints in the code flow (see the second sketch after this list).
  9. Choose a useful tool/library for logging — There are always a few common ones recommended depending on the programming language; I have come across Log4j, Winston, Logger — Powertools for AWS Lambda (TypeScript), SLF4J and Pino. One could also use these to build a custom logging framework for an application. It is crucial to keep these logging libraries up to date regularly, e.g. the severe Log4Shell vulnerability found in Log4j highlighted the importance of doing so.
  10. Ensure log immutability — Logs need to be stored in a system that ensures that, once written, they cannot be modified. This can be done by restricting permissions to the storage system and even raising alerts if a tampering attempt is detected.
  11. Centralise logs — To derive maximum benefit from logs, it is advisable to use architecture and services that can collect relevant logs across distributed systems and send them to a central system. For example, this method to get AWS data into Splunk: https://docs.splunk.com/Documentation/SVA/current/Architectures/AWSGDI#Push_method. Apart from giving a centralised view of requests across multiple systems, this approach also helps with access control, especially for the live services team.
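
To make recommendations 2, 4 and 6 concrete, here is a minimal sketch using Pino (one of the libraries mentioned in recommendation 9). The service name, field names and correlation ID value are illustrative assumptions rather than part of any real system:

```typescript
import pino from 'pino';

// Base logger: these attributes appear in every JSON log line,
// so each line can be understood on its own.
const logger = pino({
  base: { service: 'share-price-service', version: '1.4.2' },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Child logger: carries the correlation ID for one request so that
// every subsequent log line in that journey can be tied together.
const requestLogger = logger.child({ correlationId: 'req-a1b2c3d4' });

requestLogger.info({ symbol: 'ABC', upstream: 'Z-api' }, 'Requesting share price prediction');
requestLogger.error({ symbol: 'ABC', statusCode: 504 }, 'Unable to calculate share price');
```

Each call above produces a single JSON object containing the base attributes, the correlation ID and the message, which parsing tools can index and filter on.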
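And a minimal sketch of recommendation 8, assuming Jest as the test runner: the test routes Pino’s output to an in-memory stream and asserts on the resulting JSON. In a real test the log line would be produced by the code under test rather than written directly.

```typescript
import { Writable } from 'node:stream';
import pino from 'pino';

test('logs a share price calculation failure', () => {
  // Collect log lines in memory so the test can assert on them.
  const lines: string[] = [];
  const sink = new Writable({
    write(chunk, _encoding, callback) {
      lines.push(chunk.toString());
      callback();
    },
  });

  const logger = pino({}, sink);
  logger.error({ symbol: 'ABC' }, 'Unable to calculate share price');

  const entry = JSON.parse(lines[0]);
  expect(entry.msg).toBe('Unable to calculate share price');
  expect(entry.symbol).toBe('ABC');
});
```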

Monitoring

“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.”

- Brian Redman

“Ways in which things go right are special cases of the ways in which things go wrong.”

- John Allspaw

Surfacing Log information

Logging, being the foundation of observability, can be used to monitor key points in the application or user journey. To simplify observability, metrics/counts can either be derived by scanning logs via string/regex filtering or emitted directly by the application, and then displayed on a dashboard — “visually observable”. For activity monitoring, emitting metrics directly from code avoids the logging -> filtering -> metric pipeline. For example, if your application runs on AWS, this can be done either synchronously via a call to the cloudwatch:PutMetricData API or asynchronously via AWS CloudWatch Logs Embedded Metric Format (EMF) using Metrics — Powertools for AWS Lambda (TypeScript). In some situations there is a particular advantage to emitting metrics directly: the metrics are only stored in aggregate and the underlying log data is never present, which can help preserve anonymity. For instance, if a metric counts fraud events, the count is useful while the link back to any individual transaction counted as fraudulent is never logged.
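
As a minimal sketch of the synchronous route, assuming the AWS SDK for JavaScript v3 — the namespace, metric name and dimension below are illustrative, not part of any real system:

```typescript
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

// Record one fraud event as an aggregate count only —
// nothing about the underlying transaction is logged.
export async function recordFraudEvent(): Promise<void> {
  await cloudwatch.send(
    new PutMetricDataCommand({
      Namespace: 'SharePriceApp',
      MetricData: [
        {
          MetricName: 'FraudEventsDetected',
          Unit: 'Count',
          Value: 1,
          Dimensions: [{ Name: 'Environment', Value: 'prod' }],
        },
      ],
    }),
  );
}
```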

Whitebox vs Blackbox monitoring

The SRE book states:

Your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.

Whitebox monitoring could give us information on imminent problems — e.g. a drop in the number of users able to log in, a drop in the number of users completing specific journeys, or an increase in error rates. This could further have an automated, human or AI-based action linked to it to prevent the issues.

Blackbox monitoring, on the other hand, alerts on active problems — critical issues happening right now that need to be investigated.

Apart from monitoring dashboards for internal use, externally facing status pages, e.g. https://www.githubstatus.com/, showing the health of individual microservices are now the norm for any publicly available service.
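
A blackbox check is, at its simplest, a scripted version of what a user would do. Here is a minimal sketch of such a probe against a hypothetical health endpoint; the URL, thresholds and scheduling are all illustrative assumptions:

```typescript
// Blackbox probe: call the public endpoint exactly as a user would,
// knowing nothing about the system's internals, and flag failures.
const HEALTH_URL = 'https://example.com/health'; // hypothetical endpoint

async function probe(): Promise<void> {
  const started = Date.now();
  try {
    const response = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5_000) });
    const latencyMs = Date.now() - started;
    if (!response.ok || latencyMs > 2_000) {
      console.error(JSON.stringify({ msg: 'health check degraded', status: response.status, latencyMs }));
    }
  } catch (error) {
    console.error(JSON.stringify({ msg: 'health check failed', error: String(error) }));
  }
}

// Run every minute, ideally from infrastructure outside the system being monitored.
setInterval(probe, 60_000);
```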

Although the terms monitoring and observability are used interchangeably in DevOps, as both relate to telemetry data, it is worth noting a key difference — monitoring tells us the when and what of a system error, while observability addresses the why and how.

Alerting

Gone are the days when alerting meant a production incident, or overly conscientious developers searching for random log lines every day hoping their anticipated error scenario never transpired. A basic level of monitoring needs to be in place to understand the health of the system, but one cannot be expected to stare at a monitoring dashboard 24x7 (except maybe when one is bored 😄). This is where alerting comes into play — we configure the monitoring system to inform us (by email, text, or call) when something has gone wrong, or set thresholds so it can notify us when something is about to go wrong. Combining metrics with alarm thresholds and alerts helps to achieve “highlighted observable” (notifications) or “resolve observable” (acting before issues materialise).
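
Continuing the AWS example from earlier, here is a minimal sketch of wiring a threshold to a metric using the AWS SDK for JavaScript v3. The metric, threshold values and SNS topic are illustrative assumptions, and in practice this would usually be defined as infrastructure-as-code rather than an ad hoc script:

```typescript
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

// Alert when successful logins drop below 10 for three consecutive 5-minute periods.
export async function createLowLoginAlarm(): Promise<void> {
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: 'LowSuccessfulLogins',
      Namespace: 'SharePriceApp',        // must match the namespace the metric is emitted under
      MetricName: 'SuccessfulLogins',
      Statistic: 'Sum',
      Period: 300,
      EvaluationPeriods: 3,
      Threshold: 10,
      ComparisonOperator: 'LessThanThreshold',
      TreatMissingData: 'breaching',     // no login metrics at all should also alert
      AlarmActions: ['arn:aws:sns:eu-west-2:123456789012:oncall-alerts'], // illustrative SNS topic
    }),
  );
}
```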

Tracing

Distributed tracing allows a request to be tracked from its inception at the frontend, through multiple microservices, all the way to any database queries made. This is achieved through a unique value that is emitted by each service and collated to form a holistic view of the interactions. It provides teams with an end-to-end view of where the application is spending the most time, any errors that may be occurring, and any bottlenecks that may be forming. Some common tracing tool examples include OpenTelemetry, AWS X-Ray and Kibana. A dashboard that visually shows which other services my service is linked to and how much time each request takes as it flows through them, and that lets you drill down into each of them when problems occur, is a lifesaver for operations teams. It is worth noting, though, that tracing of requests usually happens on a sampling basis, i.e. not every transaction is traced.
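
As a minimal sketch, here is what instrumenting a single operation can look like with the OpenTelemetry API for TypeScript. It assumes an OpenTelemetry SDK and exporter have been configured elsewhere in the application; the service name, span name and downstream call are illustrative assumptions:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('share-price-service');

// Hypothetical downstream call, stubbed for this sketch.
async function fetchPredictionFromUpstream(symbol: string): Promise<number> {
  return 42; // would call Z's API with `symbol` in reality
}

export async function getSharePrice(symbol: string): Promise<number> {
  // startActiveSpan makes this span the parent of any spans created by
  // downstream calls, which is how the end-to-end view is stitched together.
  return tracer.startActiveSpan('getSharePrice', async (span) => {
    span.setAttribute('share.symbol', symbol);
    try {
      return await fetchPredictionFromUpstream(symbol);
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```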

Conclusion

Historically, observability was frequently overlooked and not given priority in software development. In recent years, however, there has been a significant shift. Only a few years back, at my previous workplace, we felt the importance of observability and began exploring the ELK stack to gain valuable insights into our complex product. Today, the landscape has evolved, and a plethora of software products are available for observability, including Splunk, Dynatrace, Datadog and various cloud-native services, e.g. AWS CloudWatch. Most of them support, or are built around, the same concepts that OpenTelemetry standardises.

With the advent of AI, it is interesting to see how AI services can help us in the observability space, often termed AIOps. What if an AI service could ingest all these logs and metrics from multiple sources and provide intelligent anomaly detection, monitoring information and alerting? In fact, Amazon Lookout for Metrics seems to do exactly that. It will be worth keeping an eye on how AI-based observability services can augment teams’ efforts in preventing incidents and improving the customer experience.

Observability should be regarded as a first-class citizen in the realm of software development to avoid the need for unexpected redesigns later. Just like a stethoscope, observability tools need to be able to reach the areas where potential problems may arise. Prioritising observability empowers developers to proactively identify and address issues, resulting in more efficient and effective software development processes.

Observability is a continuous process, evolving as teams mature in their understanding of the system — what its likes and dislikes are, what makes it stressed, and so what needs to be done to keep it happy and healthy (almost like a living being 😄).

Note: This article speaks only to my personal views / experiences and is not published on behalf of Deloitte LLP and associated firms, and does not constitute professional or legal advice. All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.

References/Further reading

1. Apache Commons (2024) Apache Commons Logging — User Guide. Available at: https://commons.apache.org/proper/commons-logging/guide.html#JCL_Best_Practices

2. Datadog (2020) Best Practices for Monitoring Authentication Logs | Datadog. Available at: https://www.datadoghq.com/blog/how-to-monitor-authentication-logs/

3. Elastic. Get started with Elastic Observability | Elastic Observability [8.14] | Elastic. Available at: https://www.elastic.co/guide/en/observability/current/observability-get-started.html

4. OpenTelemetry (2024) What is OpenTelemetry? | OpenTelemetry. Available at: https://opentelemetry.io/docs/what-is-opentelemetry/

5. Splunk Blogs (2023) MELT Explained: Metrics, Events, Logs & Traces | Splunk. Available at: https://www.splunk.com/en_us/blog/learn/melt-metrics-events-logs-traces.html

6. Google. Site Reliability Engineering (SRE book). Available at: https://sre.google/sre-book/table-of-contents/
