Note — Huge thanks to Jamie Wilkinson and Julius Volz, Google SREs present and past, for reading a draft of this and giving me invaluable suggestions. All mistakes and opinions however, are solely mine. This was, in essence, my Velocity 2017 talk.
The infrastructure space is in the midst of a paradigm-shifting change. The way organizations — from the smallest of startups to established companies — build and operate systems has evolved.
Containers, Kubernetes, microservices, service meshes, immutable infrastructure and serverless are all incredibly promising ideas which fundamentally change the way we run software. As more and more organizations move toward these paradigms, the systems we build have become more distributed and in the case of containerization, more ephemeral.
While we’re still at the stage of early adoption, with the failure modes of these new paradigms still being very nebulous and not widely advertised, these tools are only going to get better with time. Soon enough, if not already, we’ll be at the point where network and underlying hardware failures have been robustly abstracted away from us, leaving us with the sole responsibility of ensuring our application is good enough to piggyback on top of the latest and greatest in networking and scheduling abstractions.
No amount of GIFEE (Google Infrastructure for Everyone Else) or industrial-grade service mesh is going to fix the software we write. Better resilience and failure-tolerant paradigms from off-the-shelf components now means that — assuming said off-the-shelf components have been configured correctly — most failures will arise from the application layer or from the complex interactions between different applications. Trusting Kubernetes and friends to do their job makes it more important than ever for us to focus on the vagaries of the performance characteristics of our application and business logic. We’re at a time when it has never been easier for application developers to focus on just making their service more robust and trust that if they do so, the open source software they are building on top of will pay the concomitant dividends.
To navigate this brave new world successfully, gaining visibility into our services and infrastructure becomes more important than ever if we are to understand, operate, maintain and evolve them.
Fortunately for us, a new crop of tools has emerged to help us rise to this challenge. While one might argue that these tools suffer from the selfsame problem they assist us in solving (viz., the tools themselves are every bit as nascent and emergent as the infrastructural paradigms they help us gain visibility into), strong community interest, community-driven development and an open governance model do a lot to promote the sustainability and development of these tools.
In addition to a surge in open source tooling, commercial tooling modeled along the lines of Google, Facebook and Twitter’s internal tools has emerged to address the real need felt by the early adopters of cloud native paradigms. Given how far both categories of tools have evolved in recent years, we now have a veritable smorgasbord of choices.
Decision Making in the Time of Cloud Native
A plethora of tools at our disposal to adopt or buy, however, presents an entirely different problem — one of decision making.
How do we choose the best tool for our needs? How do we even begin to tell the difference between these tools when several of them more or less do the same thing? We might’ve heard that monitoring is dead and observability is all the rage now. Does that mean we stop “monitoring”? We hear a lot about the “three pillars of observability”, but what even is observability and why should we care? What really is the difference between logs and metrics, except where we send them to? We might’ve heard a lot about tracing, but how useful can tracing really be if it’s only just a log with some context? Is a metric just a log or trace that occurs too frequently for a backend system to store? Do we really need all three of them?
I’ve said this before in one of my posts, but it bears reiterating. It’s tempting, especially when enamored by a new piece of technology that promises the moon, to retrofit our problem space with the solution space of said technology, however minimal the intersection. Before buying or building a tool, it becomes important to evaluate the maximum utility it can provide for the unique set of engineering challenges specific teams face. In particular when it comes down to choosing something as critical as a monitoring stack, in order to be able to make better technological choices, it becomes absolutely necessary for us to first fully understand:
- the strengths and weaknesses of each category of tools
- the problems they solve
- the tradeoffs they make
- their ease of adoption/integration into an existing infrastructure
Most importantly, we need to make sure that we are solving the problems at hand, not merely using the solutions this new crop of tools provides. Starting over from scratch isn’t a luxury most of us enjoy and the most challenging part about modernizing one’s monitoring stack is iteratively evolving it. Iterative evolution (or refactoring, if you will) of one’s monitoring stack in turn presents a large number of challenges from both a technical as well as an organizational standpoint.
The goal of this post is to shed light on these technologies and primarily frame the discussion in the context of the problems that we will be solving and the tradeoffs we might be making. It’s important for me to state upfront that the main purpose of this post isn’t to provide catch-all answers or solutions or demos of specific tools. What I hope to achieve with this post is leave you with some ideas and hopefully some questions that you can try to answer as you design systems with the goal of bringing better visibility to them.
- What even is observability and how is it different from Monitoring?
- An overview of the “three pillars of modern observability”: logging, metrics collection, and request tracing
- The pros and cons of each in terms of resource utilization, ease of use, ease of operation, and cost effectiveness
- An honest look at the challenges involved in scaling all the three when used in conjunction
- What to monitor and how in a modern cloud native environment; what is better-suited to be aggregated as metrics versus being logged; how and when to use the data from all the three sources to derive actionable alerts and insightful analysis
- When it makes sense to augment the three aforementioned tools with additional tools
What to “monitor” and how in a modern cloud native environment?
This post is titled Monitoring in the time of Cloud Native. I’ve been asked why I chose to call it monitoring and not observability. I was expecting more snark about the buzzword that’s actually in the title — Cloud Native — than the one conspicuous by its absence. I chose not to call it observability for this very same reason — two buzzwords was one too many for my liking.
In all seriousness, I do believe there’s a difference between the two. The reason I believe so is because the nature of failure is changing, the way our systems behave (or misbehave) as a whole is changing, the requirements these systems need to meet are changing, the guarantees these systems need to provide are changing. In order to rise to these challenges successfully, it becomes necessary to not just change the way we build and operate software, but also gain better visibility into our services, which in turn gives us a shorter feedback loop about the performance of our services in production, which in turn enables us to build better services. In order to craft this virtuous cycle, it becomes important to understand what’s observability and how it differs from Monitoring.
When I type “monitoring” into a search engine, the first two results that come up are the following:
— observe and check the progress or quality of (something) over a period of time; keep under systematic review.
— maintain regular surveillance over.
Monitoring, to me, connotes something that is inherently both failure as well as human centric. Let’s talk a bit more about this because this forms the bedrock of this post.
In the past, we might’ve first tested our application. This might’ve been followed by a QA cycle. Then we might’ve released our code, followed by “monitoring” it. Followed by leaving a lot to chance.
To be fair, I don’t believe this was how everyone has been managing software lifecycle, but it makes for a good caricature of what’s often considered “the old way”.
We “monitored” something because we expected something to behave a certain way. What’s worse, we expected something to fail in a very specific manner and wanted to keep tabs on this specific failure. An “explicit, predictable failure” centric approach to monitoring becomes a problem when the number of failure modes increases and failure itself becomes more implicit.
As we adopt increasingly complex architectures, the number of “things that can go wrong” exponentially increases. We often hear that we live in an era when failure is the norm. The SRE book states that:
It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.
Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.
Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure. In other words, this means turning hard, explicit failure modes into partial, implicit and soft failure modes. Failure modes that could be papered over with graceful degradation mechanisms like retries, timeouts, circuit breaking and rate limiting. Failure modes that can be tolerated owing to relaxed consistency guarantees with mechanisms like eventual consistency or aggressive multi-tiered caching. Failure modes that can be even triggered deliberately with load shedding in the event of increased load that has the potential to take down our service entirely, thereby operating in a degraded state.
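To make one of these mechanisms concrete, here’s a minimal sketch of a circuit breaker. The class name, the thresholds and the injectable `clock` are all assumptions made for illustration, not taken from any particular library. What it shows is how a hard failure (every call to a dead dependency erroring out slowly) gets converted into a soft one (calls failing fast for a while), which is exactly the kind of implicit behavior that becomes harder to reason about from the outside.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    """Fail fast after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through ("half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # (re)open the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Note that while the circuit is open, the dependency sees no traffic and the caller sees fast errors, so neither side exhibits the explicit failure a traditional check would have been watching for.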
But all of this comes at the cost of increased overall complexity and the buyer’s remorse often acutely experienced is the loss of ability to easily reason about systems.
Which brings me to the second characteristic of “monitoring” — in that it’s human centric. The reason we chose to “monitor” something was because we knew or suspected something could go wrong, and that when it did go wrong there were consequences. Real consequences. High severity consequences that needed to be remedied as soon as possible. Consequences that needed human intervention.
I’m not someone who believes that automating everything is a panacea, but the advent of platforms like Kubernetes means that several of the problems that human and failure centric monitoring tools of yore helped “monitor” are already solved. Health-checking, load balancing and taking failed services out of rotation and so forth are features these platforms provide for free. That’s their primary value prop.
With more of the traditional monitoring responsibilities being automated away, “monitoring” has become — or will soon be — less human centric. While none of these platforms will truly make a service impregnable to failure, if used correctly, they can help reduce the number of hard failures, leaving us as engineers to contend with the subtle, nebulous, unpredictable behaviors our system can exhibit. The sort of failures that are far less catastrophic but ever more numerous than before.
Which then raises the question: how do we design monitoring for such systems?
It really isn’t even so much about how to design monitoring for these systems as how to design the systems themselves.
I’d argue that “monitoring” should still be both hard failure as well as human centric, even in this brave new world. The goal of “monitoring” hasn’t changed, even if the scope has shrunk drastically, and the challenge that now lies ahead of us is identifying and minimizing the bits of “monitoring” that still remain human centric. We need to design our systems such that only a small sliver of the overall failure domain is now of the hard, urgently human actionable sort.
But there’s a paradox. Minimizing the number of “hard, predictable” failure modes doesn’t in any way mean that the system itself as a whole is any simpler. In other words, even as infrastructure management becomes more automated and requires less human elbow grease, application lifecycle management is becoming harder. As the number of hard failure modes shrinks at the expense of a drastic rise in implicit failure modes and overall complexity, “monitoring” every failure explicitly becomes infeasible, not to mention quite unnecessary.
Observability, in my opinion, is really about being able to understand how a system is behaving in production. If “monitoring” is best suited to report the overall health of systems, “observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for providing visibility into implicit failure modes and on the fly generation of information required for debugging. Monitoring is being on the lookout for failures, which in turn requires us to be able to predict these failures proactively. An observable system is one that exposes enough data about itself so that generating information (finding answers to questions yet to be formulated) and easily accessing this information becomes simple.
An interlude — Blackbox Monitoring
For the uninitiated, blackbox monitoring refers to the category of monitoring derived by treating the system as a blackbox and examining it from the outside. While some believe that with more sophisticated tooling at our disposal blackbox monitoring is a thing of the past, I’d argue that blackbox monitoring still has its place, what with large parts of core business and infrastructural components being outsourced to third-party vendors. While the amount of control we might have over the performance of the vendor might be limited, having visibility into how services we own are impacted by the vagaries of outsourced components becomes exceedingly crucial insofar as it affects our system’s performance as a whole.
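As a rough illustration of what “examining it from the outside” can look like, here’s a small sketch of a blackbox probe. The names `probe`, `ProbeResult` and `http_check` are hypothetical, and any URL passed to `http_check` would be a placeholder of your own choosing. The probe records only what an external caller could observe: whether the call succeeded and how long it took.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    latency_seconds: float
    error: str = ""


def probe(check: Callable[[], None], clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run `check` from the outside and record only what a caller could see."""
    started = clock()
    try:
        check()
    except Exception as exc:  # from the outside, any failure simply counts as "down"
        return ProbeResult(False, clock() - started, repr(exc))
    return ProbeResult(True, clock() - started)


def http_check(url: str, timeout: float = 2.0) -> Callable[[], None]:
    """Build a check that fetches `url`; urlopen raises on timeouts and HTTP 4xx/5xx."""
    def check() -> None:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)
    return check
```

A third-party dependency could then be probed with something like `probe(http_check(vendor_health_url))` on a schedule, with the results fed into whatever metrics system is already in place.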
Even outside of third-party integrations, treating our own systems as blackboxes might still have some value, especially in a microservices environment where different services owned by different teams might be involved in servicing a request. In such cases, being able to communicate quantitatively about systems paves the way toward establishing SLOs for different services.
It seems pragmatic for individual teams to treat services owned by other teams as blackboxes. This enables individual teams to design better integrations with other systems owned by different teams based on the contracts they expose and guarantees they offer.
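To give a sense of what communicating quantitatively might mean in practice, here’s a small sketch that turns request counts into the share of an error budget still left under an availability SLO. The function name and the example numbers are made up for illustration; real SLOs are usually evaluated over a rolling window and may be defined on latency as well as errors.

```python
def error_budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is blown).

    `slo_target` is the promised success ratio, e.g. 0.999 for "three nines".
    """
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures;
# 250 failures therefore leave about three quarters of the budget.
remaining = error_budget_remaining(1_000_000, 250, 0.999)
print(f"{remaining:.2%} of the error budget remains")
```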
Whitebox Monitoring versus Observability
“Whitebox monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. Whitebox monitoring isn’t really a revolutionary idea anymore. Time series, logs and traces are all more in vogue than ever these days and have been for a few years.
So then. Is observability just whitebox monitoring by another name?
Well, not quite.
Data and Information
The difference between whitebox monitoring and observability really is the difference between data and information. The formal definition of information is:
Data are simply facts or figures — bits of information, but not information itself. When data are processed, interpreted, organized, structured or presented so as to make them meaningful or useful, they are called information. Information provides context for data.
The distinction between monitoring and observability isn’t just about whether the data is being reported from the bowels of the system or collected by treating the system as a blackbox. The distinction, in my opinion, is more purpose-driven than origin-driven. It’s not so much about where this data comes from as what we plan to do with it and how easily we can achieve this.
Whitebox monitoring is a fantastic source of data. Observability is our ability to easily and quickly find germane information from this data when we need it. Observability is more about what information we might require about the behavior of our system in production and whether we will be able to have access to this information. It doesn’t matter so much if this information is pre-processed or if it’s derived from the data on the fly. It’s also not about how we plan to process and use the raw data. This raw data we’re collecting could have various uses —
- We could use the data we’re gathering to be on the lookout for explicit failure modes that have high severity consequences — an imminent outage, in other words — that we’re trying to stave off or firefight, in which case we’re using the data to alert based on symptoms.
- We could use this data to know the overall health of a service, in which case we’re thinking in terms of overviews.
Both of these cases, I’d argue, fall under “monitoring”.
- We could use this data to debug rare and/or implicit failure modes that we couldn’t have predicted beforehand. In which case, we’re using the data to debug our systems.
- We could also use the data for purposes like profiling to derive better understanding about the behavior of our system in production even during the normal, steady state, in which case, we’re using the data to understand our system as it exists today.
- We might also want to understand how our service depends on other services currently, so as to enable us to understand if our service is being impacted by another service, or worse, if we are contributing to the poor performance of another service, in which case we’re using this data for dependency analysis.
- We could also aim to be more ambitious and make sure our system is functional not just right now but also ensure we have the data to understand the behavior of our system so that we can work on evolving it and maintaining it, not just tomorrow but for the duration of its entire lifecycle. While it’s true that solving tomorrow’s problems should not be our goal for today, it’s still important to be cognizant of them. There’s nothing worse than being blindsided by a problem only to realize we could’ve done better had we had better visibility into it sooner. We can anticipate the known, hard failure modes of today and “monitor” for them, but the known, hard failure modes of tomorrow most often don’t exhibit themselves in a very explicit manner today. They need to be teased from subtle behaviors only exhibited by our system during certain anomalous situations or under certain traffic patterns that might be rare or not a cause of concern or immediately actionable today.
These are all different goals. We could optimize for some of these or maybe even all of these. Whitebox monitoring is an important component (possibly the most important component) that helps us achieve all of these aforementioned goals, but whitebox monitoring, per se, isn’t observability.
Different organizations might have different requirements for what falls under “monitoring”. For some, dependency analysis might be an active part of their “monitoring”. For others, security auditing might be an indispensable part of their Monitoring goals. As such, I see observability as a spectrum and something constantly evolving as a service evolves.
Another way of looking at what falls under “monitoring” as opposed to what’s “observability” is by differentiating what we do reactively as opposed to what we do proactively.
Again, this might be different for different organizations, but I think it’s important to differentiate between the two purposes. Proactively generating information from data because we feel we might need it at all times is different from generating information on the fly, at the time of debugging or analysis, from data proactively collected.
Yet another way of looking at this spectrum is to perhaps distinguish based on the Ops versus Dev responsibilities. I see “monitoring” as something that requires being on-call. Observability I see as something that mandatorily requires developer/software engineer participation.
Furthermore, it’s worth noting that there’s a cause/effect relationship at play here in the spectrum. A lot of times, what people “monitor” at one layer (metrics like error rate and duration of request) are the “symptoms”, with the cause being several layers down the spectrum.
Being able to troubleshoot a problem often involves starting with a symptom reported by a coarse-grained metric (increased error rate or response time) or a trace (service B is slow for certain types of requests from downstream service C) that provides a bird’s eye view of the problem, and then iteratively drilling down to corroborate or invalidate our theories, thereby reducing the search space at every iteration until we finally arrive at the root cause.
Which brings me to my next point about —
Observability isn’t just about data collection
While having access to data becomes a requirement if we wish to derive information from it, observability isn’t just about data collection alone. Once we have the data, it becomes important to be able to get answers/information from this data easily.
While it’s true that raw data is more malleable than pre-processed information, deferring processing of information until we actually need it incurs other overheads, namely that of collection, storage and on-the-fly processing. While it might sound all very well in theory to state that implementation details don’t matter so long as we can get to our observability goals, how the data that is being gathered can be best processed and stored becomes a key practical consideration if we wish to achieve the dream of establishing and sustaining the virtuous cycle. Usability of data becomes a key concern as well, as does the barrier to data collection.
And lastly, I’d argue that the most overarching aspect of observability isn’t data collection or processing. Having data at our disposal alone doesn’t solve problems. Problem solving also involves the right amount of engineering intuition and domain experience to ask the right questions of the data to be able to get to the bottom of it. In fact, good observability isn’t possible without having good engineering intuition and domain knowledge, even if one had all the tools at one’s disposal. And that really is what the rest of this post aims to address, by hopefully giving you some food for thought in terms of how to build systems to make it possible to gain better insights from them.
The Three Pillars of Observability
A more concrete example would help us understand logs, metrics and traces better. Let us assume the architecture of our system or sub-system looks like the following:
A log is an immutable record of discrete events that happened over time. Some people take the view that events are distinct compared to logs, but I’d argue that for all intents and purposes they can be used interchangeably.
Event logs in general come in three forms:
1. Plaintext: A log record might take the form of free-form text. This is also the most common format of logs.
2. Structured: Much evangelized and advocated for in recent days. Typically these are logs emitted in the JSON format.
3. Binary: think logs in the Protobuf format, MySQL binlogs used for replication and point-in-time recovery, systemd journal logs, the pflog format used by the BSD firewall pf which often serves as a frontend to
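For the structured form, here’s a minimal sketch using Python’s standard `logging` module. The `JsonFormatter` class, the field names and the `context` attribute are assumptions chosen for this example, not a standard schema; the point is only that each event becomes one machine-parseable JSON object instead of free-form text.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute the caller supplies via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A plaintext logger would emit "payment authorized for user 42 in 87ms";
# here the same facts travel as separate, queryable fields.
logger.info("payment authorized", extra={"context": {"user_id": 42, "latency_ms": 87}})
```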
Logs, in particular, shine when it comes to providing valuable insight along with ample context into the long tail that averages and percentiles don’t surface. Coming back to the example we saw above, let us assume that all of these various services also emit logs at varying degrees of granularity. Some services might emit more log information per request than others. Looking at logs alone, our data landscape might look like the following:
The very first thing that jumps out to me when I look at the above diagram is the abundance of data points. Recording anything and everything that might be of interest to us becomes incredibly useful when we are searching at a very fine level of granularity, but simply looking at this mass of data, it’s impossible to infer at a glance what the request lifecycle was or even which systems the request traversed through or even the overall health of any particular system. Sure, the data might be rich but without further processing, it’s pretty impenetrable.
What we require, in short, is information. The interesting question in the context of this discussion is what information we’re looking for. Do we want information about the lifecycle of a request? Or do we want information about the resource utilization of a specific service? Or do we want information about the health of a specific host? Or do we want information about why a specific service crashed? Or do we want information about the replication lag in a distributed key value store? Or are we looking for information about how long it took an eventually consistent system to converge? Or are we looking for information about GC pauses? Or are we trying to glean information about the symptoms or are we trying to find the root cause? There is, quite frankly, an endless number of data points we can collect and an endless number of questions we can answer, from the most trivial to the most difficult.
Two very important pieces of information, however, pertain to the fate of requests throughout their lifecycle (which is usually short lived) and the fate of a system as a whole (measured over a duration that is orders of magnitude longer than request lifecycles). I see both traces and metrics as abstractions built on top of logs that pre-process and encode information along two orthogonal axes, one being request centric, the other being system centric.
A trace is a representation of a series of causally-related distributed events that encode the end-to-end request flow through a distributed system. A single trace can provide visibility into both the path traversed by a request as well as the structure of a request. The path of a request allows us to understand the services involved in the servicing of a request, and the structure of a trace helps one understand the junctures and effects of asynchrony in the execution of a request.
Although discussions around tracing tend to pivot around its utility in a microservices environment, I think it’s fair to suggest that any sufficiently complex application that interacts with (or rather, contends for) resources such as the network or disk in a non-trivial manner can benefit from the advantages tracing provides.
The basic idea behind tracing is straightforward — identify specific points in an application, proxy, framework, library, middleware and anything else that might lie in the path of execution of a request, instrument these points and have these coordinate with each other. These points are of particular interest since they represent forks in execution flow (OS thread or a green thread) or a hop or a fan out across network or process boundaries that a request might encounter in the course of its lifecycle.
Traces are usually represented as a directed acyclic graph and are used to identify the amount of work done at each layer while preserving causality using happens-before semantics. The way this is achieved is by adding instrumentation to specific points in code. When a request begins, it’s assigned a globally unique ID, which is then propagated throughout the request path, so that each point of instrumentation is able to insert or enrich metadata before passing the ID around to the next hop in the meandering flow of a request. When the execution flow reaches the instrumented point at one of these services, a record is emitted along with metadata. These records are usually asynchronously logged to disk before being submitted out of band to a collector, which then can reconstruct the flow of execution based on different records emitted by different parts of the system.
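The mechanics described above can be sketched in a few lines. This is a toy, in-process model and not the API of any real tracing system: `Span`, `Collector` and `record_span` are names invented for this example, the service names are made up, and a real implementation would carry the IDs across the network in request headers and ship the records out of band. It only shows a trace ID assigned once, a parent ID handed to each next hop, and a collector rebuilding the causal structure from the emitted records.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every record belonging to one request
    span_id: str              # identifies this unit of work
    parent_id: Optional[str]  # the span that caused this one; None for the root
    service: str
    operation: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Collector:
    """Stand-in for an out-of-band collector: here, just an in-memory list."""

    def __init__(self) -> None:
        self.spans: List[Span] = []

    def submit(self, span: Span) -> None:
        self.spans.append(span)

    def render(self, trace_id: str, parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
        """Rebuild the causal tree for one trace as indented lines."""
        lines: List[str] = []
        for span in self.spans:
            if span.trace_id == trace_id and span.parent_id == parent_id:
                lines.append("  " * depth + f"{span.service}: {span.operation}")
                lines.extend(self.render(trace_id, span.span_id, depth + 1))
        return lines


def record_span(collector, trace_id, parent_id, service, operation, **metadata) -> str:
    """Emit one record at an instrumentation point; return its ID for the next hop."""
    span_id = uuid.uuid4().hex[:16]
    collector.submit(Span(trace_id, span_id, parent_id, service, operation, metadata))
    return span_id


collector = Collector()
trace_id = uuid.uuid4().hex  # assigned once, when the request enters the system
edge = record_span(collector, trace_id, None, "edge-proxy", "GET /cart")
cart = record_span(collector, trace_id, edge, "cart", "load_cart", user_id="42")
record_span(collector, trace_id, cart, "inventory", "check_stock")
record_span(collector, trace_id, edge, "auth", "verify_token")
print("\n".join(collector.render(trace_id)))
```

The rendered output nests `inventory` under `cart` and places `cart` and `auth` side by side under `edge-proxy`, which is the “path” and “structure” of the request recovered purely from the parent links.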
Collecting this information and reconstructing the flow of execution while preserving causality for retrospective analysis and troubleshooting enables one to understand the lifecycle of a request better. Most importantly, having an understanding of the entire request lifecycle makes it possible to debug requests spanning multiple services to pinpoint the source of increased response time or resource utilization. As such, traces largely help one understand the which and sometimes even the why — like which component of a system is even touched during the lifecycle of a request and is slowing the response?
The official definition of metrics is:
a set of numbers that give information about a particular process or activity.
Metrics are a numeric representation of our data and as such can fully harness the power of mathematical modeling and prediction to derive knowledge of the behavior of our system over intervals of time in the present and future; in other words, a time series. The official definition of time series is:
a list of numbers relating to a particular activity, which is recorded at regular periods of time and then studied. Time series are typically used to study, for example, sales, orders, income, etc.
Metrics are just numbers measured over intervals of time, and numbers are optimized for storage, processing, compression and retrieval. As such, metrics enable longer retention of data as well as easier querying, which can in turn be used to build dashboards to reflect historical trends. Additionally, metrics better allow for gradual reduction of data resolution over time, so that after a certain period of time data can be aggregated into daily or weekly frequency.
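As a small illustration of that reduction in resolution, here’s a sketch that collapses raw samples into one value per day. The function name is hypothetical and averaging is just one possible choice of aggregate (a max, sum or percentile might suit a given metric better); the input is assumed to be `(unix_seconds, value)` pairs.

```python
from collections import defaultdict
from datetime import datetime, timezone


def downsample_daily(samples):
    """Collapse (unix_seconds, value) samples into one average per UTC day."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        buckets[day].append(value)
    # Many high-resolution points become a single number per day.
    return {
        day.isoformat(): sum(values) / len(values)
        for day, values in sorted(buckets.items())
    }


raw = [(0, 10.0), (3600, 20.0), (86400, 5.0)]
print(downsample_daily(raw))
```

Doing the same with log lines would require parsing and interpreting each record first, which is part of why metrics lend themselves to long retention.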
One of the biggest drawbacks of historical time series databases has been the identification of metrics, which didn’t lend itself very well toward exploratory analysis or filtering. The hierarchical metric model and the lack of tags or labels in systems like Graphite especially hurt in this regard. Modern monitoring systems like Prometheus represent every time series using a metric name as well as additional key-value pairs called labels.
This allows for a high degree of dimensionality in the data model. A metric is identified using both the metric name and the labels. Metrics in Prometheus are immutable; changing the name of the metric or adding or removing a label will result in a new time series. The actual data stored in the time-series is called a sample and it consists of two components — a float64 value and a millisecond precision timestamp.
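The identity rule described here can be modeled in a few lines. To be clear, this is a toy model of the data model only and not how Prometheus stores data or what its client libraries look like; `SeriesKey`, `series_key` and `TinyTSDB` are names invented for this sketch, and the metric and label names are just examples.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class SeriesKey:
    """A time series is identified by its metric name plus its full label set."""
    name: str
    labels: FrozenSet[Tuple[str, str]]


def series_key(name: str, **labels: str) -> SeriesKey:
    # A frozenset makes label order irrelevant and the key hashable.
    return SeriesKey(name, frozenset(labels.items()))


class TinyTSDB:
    def __init__(self) -> None:
        # Each sample is (millisecond timestamp, float value).
        self.series: Dict[SeriesKey, List[Tuple[int, float]]] = {}

    def append(self, key: SeriesKey, timestamp_ms: int, value: float) -> None:
        self.series.setdefault(key, []).append((timestamp_ms, float(value)))


db = TinyTSDB()
db.append(series_key("http_requests_total", method="GET", status="200"), 1000, 1)
db.append(series_key("http_requests_total", status="200", method="GET"), 2000, 2)
# Changing a label value yields a different key, hence a new series.
db.append(series_key("http_requests_total", method="GET", status="500"), 2000, 1)
print(len(db.series))
```

This also hints at why label values with many distinct possibilities are costly: every distinct combination becomes its own series.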
The pros and cons of each in terms of resource utilization, ease of use, ease of operation, and cost effectiveness
Let’s evaluate each of the three in terms of the following criteria before we see how we can leverage the strengths of each to craft a great observability experience:
- Ease of generation/instrumentation
- Ease of processing
- Ease of querying/searching
- Quality of information
- Cost effectiveness
Logs are, by far, the easiest to generate since there is no initial processing involved. The fact that a log is just a string or a blob of JSON makes it incredibly easy to represent any data we want to emit in the form of a log line. Most languages, application frameworks and libraries come with built-in support for logging. Logs are also easy to instrument since adding a log line is as trivial as adding a print statement. Logs also perform really well in terms of surfacing highly granular information pregnant with rich local context that can be great for drill down analysis, so long as our search space is localized to a single service.
The utility of logs, unfortunately, ends right there. Everything else I’m going to tell you about logs is only going to be painful. While log generation might be easy, the performance idiosyncrasies of various popular logging libraries leave a lot to be desired. The most performant logging libraries allocate very little memory, if any, and are extremely fast. However, the default logging libraries of many languages and frameworks are not the cream of the crop, which means the application as a whole becomes susceptible to suboptimal performance due to the overhead of logging. Additionally, log messages can also be lost unless one uses a protocol like RELP to guarantee reliable delivery of messages. This becomes especially important if one is using log data for billing or payment purposes. Lastly, unless the logging library can dynamically sample logs, logging can adversely affect application performance as a whole. As someone mentioned on a Slack:
A fun thing I had seen while at [redacted] was that turning off most logging almost doubled performance on the instances we were running on because logs ate through AWS’ EC2 classic’s packet allocations like mad. It was interesting for us to discover that more than 50% of our performance would be lost to trying to control and monitor performance.
On the processing side, raw logs are almost always normalized, filtered and processed by a tool like Logstash, fluentd, Scribe or Heka before they’re persisted in a data store like Elasticsearch or BigQuery. If an application generates a large volume of logs, then the logs might require further buffering in a broker like Kafka before they can be processed by Logstash. Hosted solutions like BigQuery have quotas you cannot exceed. On the storage side, while Elasticsearch might be a fantastic search engine, there’s a real operational cost involved in running it. Even if your organization is staffed with a team of Operations engineers who are experts in operating ELK, there might be other drawbacks. Case in point — one of my friends was telling me about how he would often see a sharp downward slope in the graphs in Kibana, not because traffic to the service was dropping but because ELK couldn’t keep up with the indexing of the sheer volume of data being thrown at it. Even if log ingestion processing isn’t an issue with ELK, no one I know of seems to have fully figured out how to use Kibana’s UI, let alone enjoy using it.
While there is no dearth of hosted commercial offerings for log management, they are probably better known for their obscene pricing. The fact that a large number of organizations choose to outsource log management despite the cost is a testament to how operationally hard, expensive and fragile running it in-house is.
An antidote often proposed to the problem of the cost overhead of logging is to sample or to only log actionable data. But even when sampled aggressively, it requires us to make decisions a priori as to what might be actionable. As such, our ability to log “actionable” data is entirely contingent on our ability to predict what will be actionable or what data might be needed in the future. While it’s true that a better understanding of a system might allow us to make an educated guess as to what data gathered now can prove to be a veritable source of information in the future, potentially every line of code is a point of failure and as such could become the source of a log line.
By and large, the biggest advantage of metrics-based monitoring over logs is the fact that unlike log generation and storage, metrics transfer and storage have a constant overhead. Unlike logs, the cost of metrics doesn’t increase in lockstep with user traffic or any other system activity that could result in a sharp uptick in data.
What this means is that with metrics, an increase in traffic to an application will not significantly increase disk utilization, processing complexity or operational costs, nor slow down visualization, the way it does with logs. Metrics storage does increase with the number of time series being captured (when more hosts/containers are spun up, when new services get added or when existing services are instrumented more). But statsd clients send a UDP packet to the statsd daemon every time a metric is recorded, so the number of metrics submitted to statsd grows in direct proportion to the traffic being reported on. Client libraries of systems like Prometheus, by contrast, aggregate time series samples in-process and hand them to the Prometheus server on each scrape (which happens once every few seconds and can be configured).
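The difference between the two submission models can be illustrated with a toy simulation in Python. Nothing here talks to a real statsd daemon or Prometheus server; `PerEventClient`, `AggregatingCounter` and `simulate` are hypothetical names, and all scrapes are run after the traffic purely to keep the sketch short.

```python
class PerEventClient:
    """statsd-style toy model: one packet leaves the process per recorded event."""

    def __init__(self):
        self.packets_sent = 0

    def incr(self, name):
        self.packets_sent += 1  # stands in for a UDP send


class AggregatingCounter:
    """Pull-style toy model: increments stay in memory until a scrape reads them."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def scrape(self):
        return self.value  # one transfer per scrape, regardless of traffic


def simulate(requests, scrapes):
    per_event = PerEventClient()
    counter = AggregatingCounter()
    for _ in range(requests):
        per_event.incr("http_requests_total")
        counter.inc()
    exported = [counter.scrape() for _ in range(scrapes)]
    return per_event.packets_sent, len(exported), exported[-1] if exported else None
```

With 10,000 requests and 4 scrapes, the per-event client performs 10,000 sends while the aggregating counter is transferred 4 times and still reports the full count.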
Metrics, once collected, are also more malleable to mathematical, probabilistic and statistical transformations such as sampling, aggregation, summarization and correlation, which makes them better suited to report the overall health of a system.
Metrics are also better suited to trigger alerts, since running queries against an in-memory time series database is far more efficient, not to mention more reliable, than running a query against a distributed system like ELK and then aggregating the results before deciding if an alert needs to be triggered. Of course, there are systems that strictly query only in-memory structured event data for alerting that might be a little less expensive than ELK, but the operational overhead of running large distributed in-memory databases, even if they were open source, isn’t something worth the trouble for most when there are far easier ways to derive equally actionable alerts. Metrics are akin to blackbox frontends of a system’s performance and as such are best suited to furnish this information.
The biggest drawback with both logs and metrics is that they are system scoped, making it hard to understand anything other than what’s happening inside a particular system. Sure, metrics can also be request scoped, but that entails a concomitant increase in label fanout which results in an increase in storage. While the new Prometheus storage engine has been optimized for high churn in time series, it’s also true that metrics aren’t the best suited for highly granular request scoped information. Without fancy joins, a single log line or metric doesn’t give much information about what happened to a request across all components of a system. Together and when used optimally, logs and metrics give us complete omniscience into a silo, but nothing more. While these might be sufficient for understanding the performance and behavior of individual systems (both stateful and stateless), they come a cropper when it comes to understanding the lifetime of a request that traverses multiple systems.
Tracing captures the lifetime of requests as they flow through the various components of a distributed system. The support for enriching the context that’s being propagated with additional key value pairs makes it possible to encode application specific metadata in the trace, which might give developers more debugging power.
The use cases of distributed tracing are myriad. While used primarily for inter service dependency analysis, distributed profiling and debugging steady-state problems, tracing can also help with chargeback and capacity planning.
Tracing is, by far, the hardest to retrofit into an existing infrastructure, because for tracing to be truly effective, every component in the path of a request needs to be modified to propagate tracing information. Depending on whom you ask, you’d either be told that gaps in the flow of a request don’t outweigh the benefits of tracing, or be told that these gaps are blind spots that make debugging harder.
We’ve been implementing a request tracing service for over a year and it’s not complete yet. The challenge with these type of tools is that, we need to add code around each span to truly understand what’s happening during the lifetime of our requests. The frustrating part is that if the code is not instrumented or header is not carrying the id, that code becomes a risky blind spot for operations.
The second problem with tracing instrumentation is that it’s not sufficient for developers to instrument their code. A large number of applications in the wild are built using open source frameworks or libraries which might require additional instrumentation. This becomes all the more challenging at places with polyglot architectures, since every language, framework and wire protocol, each with widely disparate concurrency patterns and guarantees, needs to cooperate. Indeed, tracing is most successfully deployed in organizations where a core set of languages and frameworks is used uniformly across the company.
The cost of tracing isn’t quite as catastrophic as that of logging, mainly owing to the fact that traces are almost always sampled heavily to reduce runtime overhead as well as storage costs. Sampling decisions can be made:
- at the start of a request, before any traces are generated
- at the end, once all participating systems have recorded the traces for the entire course of the request execution
- midway through the request flow, when only downstream services would then report the trace
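The first two decision points are often called head-based and tail-based sampling. A minimal Python sketch of each follows, under some loud assumptions: trace IDs are strings, spans are plain dicts with hypothetical `error` and `duration_ms` keys, and hashing the trace ID is just one way to make every service reach the same head decision without coordination.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the start of a request, using only the trace ID.

    Hashing makes the decision deterministic, so any service that sees the
    same trace ID and rate reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(spans, slow_ms: float) -> bool:
    """Decide at the end, once all spans of the trace have been recorded.

    Keeps traces that contain an error or a span slower than the threshold.
    """
    return any(span["error"] or span["duration_ms"] > slow_ms for span in spans)
```

Head sampling is cheap but blind to how the request turns out; tail sampling can keep the interesting traces, at the cost of buffering every span until the request has finished.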
Given the aforementioned characteristics of logs, any talk about best practices for logging inherently embodies a tradeoff. There are a couple of approaches that I think can help alleviate the problems of log generation, processing, storage and analysis.
We either log everything that might be of interest and pay a processing and storage penalty, or we log selectively, knowing that we are sacrificing fidelity but making it possible to still have access to important data. Most talk around logging revolves around log levels, but rarely have I seen quotas imposed on the amount of log data a service can generate. While Logstash and friends do have plugins for throttling log ingestion, most of these filters are based on keys and certain thresholds, with throttling happening after the event has been generated.
If logging is provided as an internal service — and there are many companies where this is the case — then establishing service tiers with quotas and priorities can be a first step. Any user facing request or service gets assigned the highest priority, while infrastructural tasks or background jobs or anything that can tolerate a bounded delay are lower on the priority list.
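One way such tiered quotas might look is a per-tier byte budget over a fixed time window, enforced before a line is emitted. The Python sketch below is an assumption about how this could be built, not a description of any existing logging service; `TieredLogQuota`, the tier names and the window length are all made up for illustration.

```python
class TieredLogQuota:
    """Hypothetical per-tier byte budget, reset every `window_seconds`."""

    def __init__(self, bytes_per_window, window_seconds=60.0):
        self.bytes_per_window = dict(bytes_per_window)  # tier -> budget in bytes
        self.window_seconds = window_seconds
        self._window_start = {}
        self._used = {}

    def allow(self, tier, line, now):
        """Return True if `line` fits in the tier's remaining budget at time `now`."""
        start = self._window_start.get(tier)
        if start is None or now - start >= self.window_seconds:
            # Start a fresh window for this tier.
            self._window_start[tier] = now
            self._used[tier] = 0
        size = len(line.encode("utf-8"))
        if self._used[tier] + size > self.bytes_per_window[tier]:
            return False  # over quota: drop (or defer) the line
        self._used[tier] += size
        return True
```

A user facing service would be given a generous budget, while a background job with a small budget simply starts dropping lines once it has used its share of the window.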
With or without quotas, it becomes important to be able to dynamically sample logs, so that the rate of log generation can be adjusted on the fly to ease the burden on the log forwarding, processing and storage systems. In the words of the aforementioned acquaintance who saw a 50% boost by turning off logging on EC2:
The only thing it kind of convinced me to is the need for the ability to dynamically increase or decrease logging on a per-need basis. But the caveat there is always that if you don’t always run the full blown logging, eventually the system can’t cope to run with it enabled.
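As a rough illustration of dynamic sampling, the following Python sketch keeps one in every N log lines per level and lets N be changed while the process is running. `DynamicLogSampler` is a hypothetical class rather than part of any logging library; a production version would also need to be thread-safe and would probably sample per message key rather than per level.

```python
class DynamicLogSampler:
    """Keep one in every N lines per level; N can be adjusted at runtime."""

    def __init__(self, keep_one_in=None):
        self.keep_one_in = dict(keep_one_in or {})  # level -> N
        self._seen = {}

    def set_rate(self, level, n):
        if n < 1:
            raise ValueError("n must be >= 1")
        self.keep_one_in[level] = n

    def should_log(self, level):
        n = self.keep_one_in.get(level, 1)  # unknown levels are never sampled out
        count = self._seen.get(level, 0)
        self._seen[level] = count + 1
        return count % n == 0
```

Under load an operator (or an automated controller) could call `set_rate("debug", 100)` to shed volume, then dial it back to `1` while chasing a specific problem.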
Logging is a Stream Processing Problem
Data isn’t only ever used for application performance and debugging use cases. It also forms the source of all analytics data. This data is often of tremendous utility from a business intelligence perspective, and usually businesses are willing to pay for both the technology and the personnel required to make sense of this data in order to make better product decisions.
The interesting aspect to me here is that there are striking similarities between questions a business might want answered and questions we might want answered during debugging. For example, a question that might be of business importance is the following:
Filter to outlier countries from where users viewed this article fewer than 100 times in total.
Whereas, from a debugging perspective, the question might look more like:
Filter to outlier page loads that performed more than 100 database queries.
Or, show me only page loads from Indonesia that took more than 10 seconds to load.
While these aren’t similar queries from a technical perspective, the infrastructure required to perform these sorts of analyses or answer these kinds of queries is largely the same.
What might be of interest to the business might be the fact that:
User A viewed Product X.
Augmenting this data with some extra information might make it ripe for observability purposes:
User A viewed Product X and the page took 0.5s to load.
User A viewed Product X whose image was not served from cache.
User A viewed Product X and read review Z for which the response time for the API call was 300ms.
Both these queries are made possible by events. Events are essentially structured (optionally typed) key value pairs. Marrying business information along with information about the lifetime of the request (timers, durations and so forth) makes it possible to repurpose analytics tooling for observability purposes.
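To show how one stream of enriched events can serve both kinds of question, here is a small Python sketch. Events are modeled as plain dicts, and the field names (`type`, `country`, `article`, `load_seconds`, `db_queries`) are assumptions chosen for the example rather than any standard schema.

```python
from collections import Counter


def low_view_countries(events, article, max_views):
    """Business question: countries that viewed `article` fewer than `max_views` times.

    Countries with no views at all never appear in the events, so they are
    not reported here.
    """
    views = Counter(
        e["country"] for e in events
        if e["type"] == "page_view" and e["article"] == article
    )
    return sorted(country for country, n in views.items() if n < max_views)


def query_heavy_page_loads(events, min_queries):
    """Debugging question: page loads that performed more than `min_queries` DB queries."""
    return [e for e in events
            if e["type"] == "page_view" and e["db_queries"] > min_queries]


def slow_page_loads(events, country, min_seconds):
    """Debugging question: page loads from `country` slower than `min_seconds`."""
    return [e for e in events
            if e["type"] == "page_view"
            and e["country"] == country
            and e["load_seconds"] > min_seconds]
```

All three functions are filters and group-bys over the same records; the only difference is whether the fields they read are business attributes or request timings.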
If you think about this, log processing neatly fits the bill of Online Analytical Processing (OLAP). Information derived from OLAP systems is not very different from information derived for debugging or performance analysis or anomaly detection at the edge of the system. Most analytics pipelines use Kafka as an event bus. Sending enriched event data to Kafka allows one to search in real time over streams with KSQL, a streaming SQL engine for Kafka from the fine folks at Confluent.
KSQL supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more. The Kafka log is the core storage abstraction for streaming data, allowing the same data that went into your offline data warehouse to now be available for stream processing. Everything else is a streaming materialized view over the log, be it various databases, search indexes, or other data serving systems in the company. All data enrichment and ETL needed to create these derived views can now be done in a streaming fashion using KSQL. Monitoring, security, anomaly and threat detection, analytics, and response to failures can be done in real-time versus when it is too late. All this is available for just about anyone to use through a simple and familiar SQL interface to all your Kafka data: KSQL.
Enriching business events that go into Kafka anyway with additional timing and other metadata required for observability use cases can be helpful when repurposing existing stream processing infrastructures. A further benefit this pattern provides is that this data can be expired from the Kafka log regularly. Most event data required for debugging purposes are only valuable for a relatively short period of time after the event has been generated, unlike any business centric information that normally would’ve been evaluated and persisted by an ETL job. Of course, this makes most sense when Kafka already is an integral part of an organization. Introducing Kafka into a stack purely for real time log analytics is a bit of an overkill, especially in non-JVM shops without any significant JVM operational expertise.
A new hope for the future
The fact that logging still remains an unsolved problem makes me wish for an OpenLogging spec, in the vein of OpenTracing which serves as a shining example and a testament to the power of community driven development. A spec designed ground up for the cloud-native era that introduces a universal exposition as well as a propagation format. A spec that enshrines that logs must be structured events and codifies rules around dynamic sampling for high volume, low fidelity events. A spec that can be implemented as libraries in all major languages and supported by all major application frameworks and middleware. A spec that allows us to make the most of advances in stream processing. A spec that becomes the lingua franca logging format of all CNCF projects, especially Kubernetes.
Prometheus is much more than just the server. I see Prometheus as a set of standards and projects, with the server being just one part of a much greater whole.
Prometheus does a great job of codifying the exposition format for metrics, and I’d love to see this become the standard. While Prometheus doesn’t offer long term storage, the remote write feature that was added to Prometheus about a year ago allowed one to write Prometheus metrics to a custom remote storage engine like OpenTSDB or Graphite, effectively turning Prometheus into a write-through cache. With the recent introduction of the generic write backend, one can transport time-series from Prometheus over HTTP and Protobuf to any storage system like Kafka or Cassandra.
Remote read, though, is slightly newer and I’ve only been seeing efforts coalesce into something meaningful in the last few months. InfluxDB now natively supports both Prometheus remote reads and writes. Remote read allows Prometheus to read raw samples from a remote backend at query execution time and compute the results in the Prometheus server.
Furthermore, improvements to the Prometheus storage engine in the upcoming 2.0 release make Prometheus all the more conducive to cloud-native workloads with vast churn in time-series names. The powerful query language of Prometheus, coupled with the ability to define alerts using the same query language and enrich the alerts with templated annotations, makes it perfect for all “monitoring purposes”.
With metrics, however, it’s important to be careful not to explode the label space. Labels should be chosen so that they remain limited to a small set of attributes that can remain somewhat uniform. It also becomes important to resist the temptation to alert on everything. For alerting to be effective, it becomes salient to be able to identify a small set of hard failure modes of a system. Some believe that the ideal number of signals to be “monitored” is anywhere between 3–5, and definitely no more than 7–10. One of the common pain points that keeps cropping up in my conversations with friends is how noisy their “monitoring” is. Noisy monitoring leads to either metric data that’s never looked at (which in other words is a waste of storage space on the metrics server) or worse, false alerts leading to a severe case of alert fatigue.
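The reason the label space explodes is multiplicative: in the worst case, one metric produces a time series for every combination of label values. A back-of-the-envelope Python sketch, with made-up label names and cardinalities:

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on the number of series one metric can produce.

    `label_cardinalities` maps each label name to how many distinct values
    it can take; the bound is the product of those counts.
    """
    return prod(label_cardinalities.values())


# Hypothetical numbers, purely for illustration.
bounded = {"method": 5, "status": 10, "endpoint": 50}
unbounded = {**bounded, "user_id": 100_000}
```

Here `max_series(bounded)` is 2,500, while adding a single `user_id` label with 100,000 values raises the bound to 250,000,000, which is why per-user or per-request identifiers rarely belong in labels.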
While historically tracing has been difficult to implement, the rise of service meshes makes integrating tracing functionality almost effortless. Lyft famously got tracing support for all of their applications without changing a single line of code by adopting the service mesh pattern. Service meshes help with the DRYing of observability by implementing tracing and stats collection at the mesh level, which allows one to treat individual services as blackboxes but still get incredible observability into the mesh as a whole. Even with the caveat that the applications forming the mesh need to be able to forward headers to the next hop in the mesh, this pattern is incredibly useful for retrofitting tracing into existing infrastructures with the least amount of code change.
When it makes sense to augment the three aforementioned tools with additional tools
Exception trackers (I think of these as logs++) have come a long way in the last few years and provide a far superior UI to a plaintext file or blobs of JSON for inspecting exceptions. Exception trackers also provide full tracebacks, local variables, inputs at every subroutine or method invocation call, frequency of occurrence of the error/exception and other metadata invaluable for debugging. Exception trackers aim to do one thing (track exceptions and application crashes) and they tend to do this really well. While they don’t eliminate the need for logs, exception trackers augment logs, if you’ll pardon the pun, exceptionally well.
Some new tools also help achieve visibility by treating the network packets as the source of truth and using packet capture to build the overall service topology. While this definitely has less overhead than instrumenting all application code throughout the stack, it’s primarily useful for analyzing network interactions between different components. While it cannot help with debugging issues with the asynchronous behavior of a multithreaded service or unexpected event loop stalls in a single threaded service, augmenting it with metrics or logs to better understand what’s happening inside a single service can help one gain enough visibility into the entire architecture.
Observability isn’t quite the same as monitoring. Observability connotes something more holistic and encompasses “monitoring”, application code instrumentation, proactive instrumentation for just-in-time debugging and a culture of more thorough understanding of various components of the system.
Observability means having the ability, and the confidence, to build systems knowing that these systems can turn into a frankensystem in production. It’s about understanding that the software we’re building can be (and almost always is) broken, or prone to break soon, to varying degrees despite our best efforts. A good analogy for the difference between getting code working on one’s laptop or in CI and having code running in production would be the difference between swimming in an indoor pool and swimming in choppy rivers full of piranhas. The feeling of being unable to fix one’s own service running in a foreign environment for want of being able to debug it isn’t acceptable, not if we want to pride ourselves on our uptime and quality of service.
I want to conclude this post with how I think software development and operation should happen in the time of cloud-native.
It’s important to understand that testing is a best effort verification of the correctness of a system as well as a best effort simulation of failure modes. Unit tests only ever test the behavior of a system against a specified set of inputs. Furthermore, tests are conducted in very controlled (often heavily mocked) environments. While the very few who do fuzz their code benefit from having their code tested against a set of randomly generated inputs, fuzzing can only comprehensively test against the set of inputs to one service. End-to-end testing might allow for some degree of holistic testing of the system and fault injection/chaos engineering might help us gain a reasonable degree of confidence about our system’s ability to withstand these failures, but complex systems fail in complex ways and there is no testing under the sun that enables one to predict every last vector that could contribute towards a failure.
Despite these shortcomings, testing is as important as ever. If nothing else, testing our code allows us to write better and more maintainable code. More importantly, research has proven that something as simple as “testing error handling code could have prevented 58% of catastrophic failures” in many distributed systems. The renaissance of tooling aimed to understand the behavior of our services in production does not obviate the need for testing.
Testing in Production
Testing in production isn’t really a very new idea. Methodologies such as A/B testing, canary deployments, dark traffic testing (some call this shadowing) have been around for a while.
Being able to test in production however absolutely requires that the release can be halted and rolled back if the need arises. This in turn means that one can only test in production if one has a quick feedback loop about the behavior of the system one’s testing in production. It also means being on the lookout for changes to key performance indicators of the service. For an HTTP service this could mean attributes like error rate and latencies of key endpoints. For a user facing service, this could additionally mean a change in user engagement. Testing in production essentially means proactively “monitoring” the change in production. Which brings me to my next point.
Monitoring isn’t dead. Monitoring, in fact, is so important that I’d argue it occupies the pride of place in your observability spectrum.
In order to test in production, one needs good, effective monitoring. Monitoring that is both failure centric (in that we proactively monitor for changes to KPIs) as well as human centric (we want the developer who pushed out the change to test in production to be alerted as soon as possible).
I chose to call this Tier I Monitoring, for want of a better word, since I believe these are table stakes. It’s the very minimum any service that’s going to be in production needs to have. It’s what alerts are derived from and I believe that time-series metrics are the best suited for this purpose.
However, there are several other bits and pieces of information we might capture but not use for alerting. There’s a school of thought that all such information isn’t of much value and needs to be discarded. I, however, believe that this is the sort of information I find myself requiring often enough that I want it presented to me in the form of a dashboard.
A good example of this sort of tier II monitoring would be this dashboard of GitLab’s which is aptly named fleet overview or this one which gives information about the running Go processes. I picked these examples because GitLab is famously known for its transparency and these are real, live production dashboards of a real company. I find analyzing these dashboards more interesting than cooking up toy examples for the purpose of a blog post.
While these metrics don’t particularly help with debugging of gremlins or problems we don’t even know exist, having such dashboards gives me a bird’s eye view of the system, which I find invaluable especially after a release, since it gives me extremely quick feedback about how known key metrics might have been impacted by the change in ways that weren’t severe enough to trigger an alert. Measuring heap usage for a potential memory leak would be a good example of such Tier II monitoring. I would really like to know if I pushed out code that’s leaking memory, but I don’t consider it something I necessarily want to be alerted on.
Then there’s exploration, which I find useful to answer questions one could not have proactively thought about. This often involves querying of raw events or log data rich in context and is extremely powerful for surfacing answers we couldn’t have predicted beforehand.
The problem with all three of the approaches seen until now is that they require that we record information about our systems a priori. What this means is that the data we need has to be generated before we can derive any useful information from it.
Dynamic instrumentation techniques aren’t new. However, implementations like DTrace were primarily machine centric and mostly correlate events that remain confined to an address space or specific machine. Recent academic research has married these ideas with some of the ideas pioneered by distributed tracing, allowing one “to obtain an arbitrary metric at one point of the system, while selecting, filtering, and grouping by events meaningful at other parts of the system, even when crossing component or machine boundaries”.
The primary breakthrough the Pivot Tracing paper proposed is the baggage abstraction.
Baggage is a per-request container for tuples that is propagated alongside a request as it traverses thread, application and machine boundaries. Tuples follow the request’s execution path and therefore explicitly capture the happened-before relationship. Using baggage, Pivot Tracing efficiently evaluates happened-before joins in situ during the execution of a request.
The idea of baggage propagation has been incorporated into the OpenTracing spec, which now means that “arbitrary application data from a mobile app can make it, transparently, all the way into the depths of a storage system”. While this still isn’t quite the same as what the whitepaper describes, it still gets us one step closer to true end-to-end tracing and the ability to dynamically enrich tracing data for better visibility. Facebook’s Canopy further takes ideas pioneered by the Pivot Tracing paper and marries them with an underlying event model pioneered by Scuba, making exploration of data more dynamic than ever.
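The core of the baggage idea can be shown with a toy, single-process Python sketch. To be clear about the assumptions: this is neither the Pivot Tracing implementation nor the OpenTracing API. It uses the standard library’s `contextvars` to stand in for propagation across real thread and machine boundaries, and every function name in it is hypothetical.

```python
import contextvars
from collections import defaultdict

# Per-request baggage; the default dict is never mutated, only replaced.
_baggage = contextvars.ContextVar("baggage", default={})


def set_baggage_item(key, value):
    _baggage.set({**_baggage.get(), key: value})


def get_baggage_item(key, default=None):
    return _baggage.get().get(key, default)


# A metric recorded deep in the "storage layer", grouped by a value that
# only the edge of the system knows about.
bytes_read_by_client = defaultdict(int)


def storage_read(nbytes):
    client = get_baggage_item("client_app", "unknown")
    bytes_read_by_client[client] += nbytes


def handle_request(client_app, nbytes):
    set_baggage_item("client_app", client_app)  # packed at the edge
    storage_read(nbytes)                        # read far downstream


def run_request(client_app, nbytes):
    # Each request runs in its own copy of the context, so baggage set for
    # one request never leaks into another.
    contextvars.copy_context().run(handle_request, client_app, nbytes)
```

The storage function never receives the client name as an argument, yet it can group its metric by it, which is the kind of cross-component grouping that baggage is meant to enable.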
And finally there are the unknowables.
Things we can’t know about or don’t need to know about. Even if we have complete omniscience into our application and hardware performance, it’s simply not feasible (or required) to have complete visibility into the various layers of abstractions underneath the application layer. Think ARP packet losses, BGP announcements or recursive BGP lookups, OSPF states and all manner of other implementation details of abstractions we rely on without a second thought. We simply have to get comfortable with the fact that there are things we possibly cannot know about, and that it’s OK.
Which brings me to my final point —
Choose your own Observability Adventure
Observability — in and of itself, and like most other things — isn’t particularly useful. The value derived from the observability of a system directly stems from the business value derived from that system.
For many, if not most, businesses, having a good alerting strategy and time-series based “monitoring” is probably all that’s required to be able to deliver on the business goals. For others, being able to debug needle-in-a-haystack type of problems might be what’s needed to generate the most business value.
Observability, as such, isn’t an absolute.
Pick your own observability target based on the requirements of your service.