Thoughts from the Front-line: Why Wavefront?
For correctness, I should title this with “Why Tanzu Observability by Wavefront?” =p
After almost a two-year hiatus (and as a result of me finally having some time to pause and reflect on the larger story of Wavefront), we return to the fundamental question of Why Wavefront? (shamelessly borrowed from our ex-colleague Matthew Zeier)
Why Wavefront?
Our ex-Director of Operations, Matthew Zeier (whom we all refer to as mrz), explains the platform through a series of answers to the question “Why Wavefront?”. To engineers who have operated any software at scale, measuring something and observing it over time shouldn’t be a new concept. The question one should ask is why one could build a company and a business around it.
If one is familiar with the plethora of “monitoring” solutions (the gamut spans from open-source tools like Graphite to commercial ones such as New Relic), another question one could ask is whether there is any white space left for building something new and having customers pay “additional” money for it (let’s face it, most shops already have some sort of monitoring).
When we started ideating about Wavefront, logs (free-form, unstructured text, mostly indexed) were the primary means of observing a production system (generated by logging frameworks or System.out.println()). There were some nascent efforts in the open-source world to build time-series databases (Graphite and OpenTSDB come to mind). Still, they were very much limited to infrastructure metrics (CPU, memory, disk, and network usage). The idea of using time-series to observe code was just starting to percolate in the developer world with things like Yammer’s Metrics library. This trend was partly driven by the log-less culture at Google (where “logs” refer to structured, binary blobs, while stdout logs are discarded rapidly after collection). As engineers moved between companies, they started to influence how their new employers operated their respective software stacks. It was the dawn of a metrics “decade,” and we are mostly still on this industry-wide journey of figuring out what should be collected, named, visualized, alerted on, and retained in the world of time-series data for operations.
Wavefront was a journey that four of us undertook in 2013 without knowing full well whether we could, in fact, build something fundamentally better than the open-source offerings, or whether there would be a market large enough to sustain a business. We did, however, know that the multi-billion-dollar IT monitoring market already existed, and hence it was primarily a question of “can we build it?” and not so much “if we build it, will they come?” (meaning that if we could build something substantially better, we believed we could, at the very least, displace existing players).
Why Metrics?
The word “metric” is heavily overloaded, as you’ll discover even within the Wavefront domain. Metrics in the general business sense are numerical (or quantitative) measurements that one can rely on to make decisions. They are different from qualitative (often subjective) opinions that rely on someone having the right “hunch” or a unique “eye” for things. As the movie Moneyball suggests, we are no longer living in a world where data-driven decisions are impossible, whether due to the sheer scale of the data volume or the lack of collection mechanisms (having the relevant probes). We are also able to crunch data at planet scale and derive insights that drive decisions such as, for instance, whether a movie script gets the green light for production or should be shelved.
Similar quantitative measurements also meant that in the computer software and services industry, operators of complex systems no longer have to guess at problems or look at second-order or third-order effects to judge whether a bug exists or to pinpoint a performance issue. Even with voluminous logging, one would be surprised at how bad humans are at computing rates of occurrence, or even simple ratios between two sets of events, by staring at log lines flying by. Worse yet, certain attempts to pre-compute rates, ratios, or aggregations (and then log them as text) mean that analyzing systems at scale is impossible at best and misleading at worst (the classic example of pre-computing a local p95 and then averaging it, or taking the max, across a population of systems comes to mind).
In a nutshell, metrics are simply timestamped measurements that we can name. They can certainly be logged as textual content, but, as anybody will discover, storing them in their “native” format (hence with restrictions on what a single “point” ought to look like) lends itself to blisteringly fast retrieval, vastly reduced storage requirements, and post hoc analysis that shoving a bunch of textual content around could not support.
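For instance, a single “point” in (roughly) the Wavefront line format is just a metric name, a value, an optional epoch-seconds timestamp, a source, and a handful of tags; the metric names, source, and tags below are made up for illustration:

```
checkout.requests.count 42 1599726000 source=app-server-01 env=prod region=us-west-2
checkout.request.latency.ms 230.5 1599726000 source=app-server-01 env=prod region=us-west-2
```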
Metrics are not restricted, of course, to just the bare observations that a probe might collect. In the examples above of local pre-computations, the derived data themselves are also timestamped (examples include the rate of occurrence of a particular event, as well as the p95 of sampled events that have numerical properties, say, the latency of a request). The trade-off is that the “closer” a metric is to its source (the less pre-computation applied to it), the more it lends itself to meaningful post hoc analysis.
An example of this is that while it is possible to measure CPU usage every second and send the information to a time-series database, it is, at best, a second-order approximation of potential performance issues on the machine. A much better way to isolate a problem is to measure, at a per-request level, the number of CPU seconds expended by all threads to service the request, the actual time spent in critical components, as well as the number of times those components were called. Errors can similarly be observed accurately by having components themselves emit error metrics (typically in the form of counters) instead of measuring HTTP error rates after the whole request chain has failed and the error has percolated upstream.
Why Metrics #1: Metrics allow surgical measurements of event occurrences in a system.
Such measurements are possible because of the proliferation of metrics libraries for different languages (e.g., Dropwizard Metrics and Micrometer for Java, or go-metrics for Go). They are typically collected in three flavors: gauges, counters, and histograms.
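To make the component-level measurement argument above concrete, here is a minimal sketch using the Dropwizard Metrics library; the PaymentClient class and the metric names are illustrative, not taken from any real codebase:

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// A hypothetical payment component instrumented at the source: errors and
// latency are measured where they happen, not inferred from HTTP status
// codes several layers upstream.
public class PaymentClient {
    private final Counter errors;
    private final Timer latency;

    public PaymentClient(MetricRegistry registry) {
        this.errors = registry.counter("payment.client.errors");
        this.latency = registry.timer("payment.client.latency");
    }

    public void charge(String accountId, long cents) {
        try (Timer.Context ignored = latency.time()) {
            // ... call the payment provider ...
        } catch (RuntimeException e) {
            errors.inc();   // the component itself reports the error
            throw e;
        }
    }
}
```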
Gauges
Perhaps the easiest to understand, a gauge is like the speedometer of your car. We could, in fact, sample the speed of your car every second, or even every millisecond, and there would always be a valid, timestamped value. Hence, metrics systems have to decide how often to collect and report these values, with the idea that the more frequently we do, the more insight we may be able to glean from the data (for instance, sub-minute microbursts of traffic can only be detected with per-second metrics).
Examples of gauges include things like CPU utilization (sometimes broken down by usage type or further by core), the number of elements in a cache, or the number of active users in a system (as determined by unexpired sessions, for instance). One property of gauges is that their values can be integral or floating-point, and they can go up or down.
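A minimal sketch of a gauge with Dropwizard Metrics (the cache here is just a stand-in for whatever you want to sample):

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.Map;

public class CacheMetrics {
    public static void register(MetricRegistry registry, Map<String, byte[]> cache) {
        // The gauge is sampled by the reporter on its own schedule (every
        // second or every minute, say); each sample becomes one timestamped
        // value that can go up or down.
        registry.register("cache.size", (Gauge<Integer>) cache::size);
    }
}
```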
Counters
Counters are perhaps a special form of gauge that never goes down. Before we explore this particular metric type, it is helpful to understand that time-series analysis often involves computing rates, for instance, the rate of requests against a system or the rate of errors of a particular method call against a component in Java.
Traditionally, systems have computed rates by intercepting occurrences of an event and emitting a rate (/s) over different time windows (such as the rate of occurrence of an event per second over the last minute, the last 5 minutes, or the last 15 minutes). Such computations usually involve a timer of sorts on the process itself so that a rolling-window calculation can be done, with the relevant bins discarded over time. As systems matured and the need arose to compute rates of occurrence beyond fixed time windows, the concept of cumulative counters came into the picture. Instead of locally computing rates, occurrences of events are simply added together and the aggregate value emitted (for instance, the total lifetime calls to a component, or the total lifetime CPU-seconds spent executing a portion of a request tree).
An added benefit of storing monotonically increasing counters is the ability to detect “resets” and infer that a process has died, and that the data (for instance, when it takes part in an aggregation) is incorrect until another collection interval has elapsed after the reset is detected. It also provides an easy way to know the lifetime occurrences of events in a single process (the integral of all the rates emitted, if you will).
To compute rates, a time-series database has to be combined with an analytics engine that can take pairwise values and divide their difference by the time elapsed. The simplest analogy is a Microsoft Excel spreadsheet with timestamps and counter values as two columns, and rates as a new column whose formula subtracts successive counter values and divides by the elapsed time.
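A minimal sketch of what that analytics side does, assuming two successive samples of the same cumulative counter series (the reset handling mirrors the point above about detecting restarts):

```java
// Computes a per-second rate from two successive samples of a monotonically
// increasing counter, e.g. (12:30:00, 1042) and (12:31:00, 1102)
// => (1102 - 1042) / 60s = 1.0/s.
public final class CounterRate {
    public static Double ratePerSecond(long prevTimestampSec, double prevValue,
                                       long currTimestampSec, double currValue) {
        if (currValue < prevValue) {
            // The counter went "backwards": the process restarted, so the
            // pairwise difference is meaningless until the next sample.
            return null;
        }
        long elapsed = currTimestampSec - prevTimestampSec;
        return elapsed > 0 ? (currValue - prevValue) / elapsed : null;
    }
}
```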
Why Metrics #2: Metrics can be analyzed as a stream of numbers whereby local pre-computations can be eliminated in favor of post hoc analysis
At this point, I should mention that there is a special atom within Wavefront known as the Delta Counter (emitted with the metric name prefixed with the delta symbol: Δ). It was introduced for large-scale aggregations of “deltas,” where the aforementioned cumulative counters might not make sense.
For instance, suppose we were to compute the rate of requests across all mobile phones on a platform with millions of users. While we could model each of them as a singular monotonically increasing series (with many of them being ephemeral, as the app might be active for only minutes every day), it is better for each session to send timestamped “deltas” (e.g., +5 at 12:30:00 PM, another +2 at 12:30:05 PM) and allow the time-series database to “bin” them appropriately and produce global rates accurately and quickly. Oftentimes, this is used to reduce the “individuality” of a time-series. However, it is indeed possible to send both, retaining the ability to look at an individual user session’s request patterns over time while having access to pre-computed delta counters to look at rates over large populations (for instance, for a particular version and platform).
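A minimal sketch of the binning described above, assuming each client session just sends timestamped “+N” deltas and the backend folds them into one-minute bins (the class and the numbers are made up for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Aggregates timestamped deltas (e.g. "+5 at 12:30:00", "+2 at 12:30:05")
// into per-minute bins; summing deltas is order-independent, so millions of
// ephemeral sessions can report without anyone tracking per-session state.
public final class DeltaCounterBins {
    private final Map<Long, Double> perMinute = new TreeMap<>();

    public void record(long epochSeconds, double delta) {
        long minuteBin = epochSeconds / 60 * 60;   // start of the minute
        perMinute.merge(minuteBin, delta, Double::sum);
    }

    public double ratePerSecond(long minuteBin) {
        return perMinute.getOrDefault(minuteBin, 0.0) / 60.0;
    }
}
```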
Why Metrics #3: Metrics can be pre-aggregated to produce lightning-fast aggregations at scale as long as the data type and operation form a monoid
In the statement above, we introduce the concept of monoids, a fancy way of saying that the axiom of associativity, (a • b) • c = a • (b • c), holds and that an identity element e with e • a = a • e = a exists. In practice for metrics, we find that the implementations of delta counters, histograms, and HLL (HyperLogLog) benefit from monoidal operations because large-scale aggregations become possible without having to deal with the ordering of data or keep any state along the way, and operations can be replicated across the wire without any knowledge of existing values.
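For a concrete (if toy) instance of the idea, a {sum, count} pair merges associatively and has an obvious identity, which is exactly what lets partial aggregates be combined in any grouping:

```java
// A toy monoid: {sum, count}. merge() is associative and EMPTY is the
// identity, so partial aggregates computed on different machines can be
// combined in any grouping and still yield the same result.
public record SumCount(double sum, long count) {
    public static final SumCount EMPTY = new SumCount(0.0, 0);

    public SumCount merge(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }

    public double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
// (a.merge(b)).merge(c) equals a.merge(b.merge(c)), and x.merge(EMPTY)
// equals x -- the two monoid laws stated above.
```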
Histograms
With the concept of monoids out of the way, we introduce the concept of histograms in the context of metrics. Histograms are ways to analyze distributions. They are oftentimes visualized with bins and counts (think of the population of a US state broken down by age group). To compute a histogram, one would need a numerical property that exists for all members of a group (for the population, that would be their age; for a computer system, that could be the latency of a request, the size of a payload, the number of database calls needed to service a request, etc.).
Collecting this data turns out to be rather easy (oftentimes just “I saw something with a value of X”). However, the reporting of such data isn’t as straightforward. Revisiting the topic of local aggregations, metrics libraries will often produce statistical summaries of the distribution over time. These summaries include the mean, median, min, max, p25, p90, p95, p99, p999, etc., as well as rates in all their pre-computed flavors (after all, we are observing events that happen to have a property we would like to compute a distribution over).
One classic problem that is perhaps immediately obvious to the reader is that the distribution itself can change rapidly over time, and such summaries might vary wildly depending on how we draw the “boundaries” of what we need to store in memory before we compute the results. As such, it is common practice to use decaying samples (i.e., older samples contribute less to the statistical summaries) or simply outright discard the data and emit these statistical summaries at some predefined intervals (every minute for instance).
Collection and reporting of these statistical summaries aside, local computations suffer from issues similar to counters (albeit more pronounced), since the only summaries that are easily combinable across distributions are {sum, count}, which do, in fact, allow one to compute the global mean, total counts, total sum, and rates. Any other statistical summary, however, cannot be combined across processes without first storing all observed measurements. For example, to compute the p99 latency of a system that consists of two machines, one needs to collect the latencies of all requests, sort them by duration, and pick the p99 element. The maximum or the average of the two local p99s is not at all the true p99.
Because of this limitation, Wavefront (as well as some competitors, although this is not yet a common feature of observability platforms) allows customers to ingest histograms in the form of centroids and counts. Simply put, this is a data structure that can, in some cases, store every observation (say you have only 30 distinct values for latencies as measured in milliseconds; you can represent that distribution with perfect accuracy by sending us those 30 centroids and their counts over the course of a minute, an hour, or even a day). Those time bins are predefined by us, but because of the mergeability of the data type, we can compute 5-minute histograms in real time. And, because we have the raw observations, we can also compute the true p99 in those 5 minutes. This also means we can merge independently collected histograms across machines and produce the true p99 of a population of systems all observing the same events passing through them.
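A minimal sketch of why centroid-plus-count histograms merge cleanly across machines (simplified here to exact centroids with no compression; the t-digest discussed next is what makes this practical when the number of distinct values explodes):

```java
import java.util.Map;
import java.util.TreeMap;

// An exact "centroid + count" histogram: each distinct observed value is a
// centroid. Merging two histograms is just adding counts, so per-machine
// histograms can be combined into a true population distribution.
public final class CentroidHistogram {
    private final TreeMap<Double, Long> centroids = new TreeMap<>();

    public void add(double value) {
        centroids.merge(value, 1L, Long::sum);
    }

    public CentroidHistogram merge(CentroidHistogram other) {
        other.centroids.forEach((v, c) -> centroids.merge(v, c, Long::sum));
        return this;
    }

    // True quantile over everything observed, e.g. quantile(0.99) for p99.
    public double quantile(double q) {
        long total = centroids.values().stream().mapToLong(Long::longValue).sum();
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (Map.Entry<Double, Long> e : centroids.entrySet()) {
            seen += e.getValue();
            if (seen >= target) return e.getKey();
        }
        return Double.NaN;   // empty histogram
    }
}
```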
However, as the astute reader will note, the number of centroids might be much larger than 30 (the default accuracy is 32 centroids in the system today). This is where the t-digest comes in. In his paper on the t-digest, Dunning describes the need for a histogram compression algorithm that maintains tighter error bounds towards the extreme percentiles of a distribution (with the error bound proportional to q(1 − q), which peaks at 0.25 at the median). By prioritizing the extreme percentiles, the argument is that the ability to compute p99, p999, and even p9999 is more valuable than the median (a fact that’s largely true, as analysis of running systems often involves looking at those extremes and trying to understand what’s going on, rather than understanding the behavior of the system between p50 and p60).
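A minimal sketch of the same idea with compression, assuming Dunning’s open-source t-digest Java library (the com.tdunning:t-digest artifact); the compression value of 100 and the toy latency distribution are illustrative choices, not recommendations:

```java
import com.tdunning.math.stats.TDigest;

public class LatencyDigestExample {
    public static void main(String[] args) {
        // Higher compression => more centroids retained => tighter bounds.
        // Error is smallest at the extreme quantiles (q near 0 or 1) and
        // largest near the median, per the q(1 - q) bound described above.
        TDigest digest = TDigest.createMergingDigest(100);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(simulatedLatencyMillis());
        }
        System.out.printf("p50=%.1fms p99=%.1fms p999=%.1fms%n",
                digest.quantile(0.50), digest.quantile(0.99), digest.quantile(0.999));
    }

    private static double simulatedLatencyMillis() {
        // Long-tailed toy distribution purely for illustration.
        return 20 + Math.pow(Math.random(), 4) * 500;
    }
}
```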
As the histogram data structure and the merging of histograms are monoidal, one can emit histograms from a large variety of sources, and a metrics platform like Wavefront can aggregate them upon ingestion (this is hidden from the user), producing combined distributions that, while continuing to sacrifice accuracy in the middle of the distribution, capture accurate measurements at the extreme percentiles.
Why Metrics #4: Metrics can capture high-fidelity distribution statistics that are time-series in nature but allow for accurate post hoc population analysis and statistical summaries
The statement above is important because, whereas metrics are by definition “detached” from individual events (they are summaries of occurrences), their strength is that we can very quickly look at trends, understand the distributions that contribute to those trends, alert on them, and visualize them; this gives the operator of complex systems a way to see what’s going on under the hood without having to peer into every event itself. It gives you the forest so you can spot the trees, so to speak.
Why Metrics #5: Metrics are summarizations of events that allow an operator to quickly isolate a problem or make informed inferences about what’s going on
An example of the above is the “there exists” question. It is often important to know whether something definitively happened in the system or not. Many bugs and code pushes happen because developers think that some code is not being executed and/or that an error condition has occurred somewhere. Traditional logging, or even distributed tracing, would require full retention of all requests before one could definitively prove that. With metrics, as long as the code in question emits a single metric, it is relatively easy to prove that a code path has never been executed, or that there was no change in the way it is called relative to the rest of the population, say, before and after a code push.
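A sketch of this “there exists” pattern (the class and metric names are illustrative): guard the code path you suspect is dead with a single counter, and the question “did this ever run?” becomes “does this series exist, and is its rate non-zero?”

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public class LegacyPathProbe {
    private final Counter hits;

    public LegacyPathProbe(MetricRegistry registry) {
        this.hits = registry.counter("legacy.deprecated_field.hits");
    }

    public void handle(boolean usesDeprecatedField) {
        if (usesDeprecatedField) {
            // If this series never shows up (or its rate stays at zero after
            // a code push), the suspect path definitively never executed.
            hits.inc();
            // ... old behavior ...
        }
        // ... normal behavior ...
    }
}
```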
Distributed Tracing
Having just pointed out some inadequacies of distributed tracing, there are nonetheless obvious reasons why one would need it: the need to look into individual requests (including every downstream call) and isolate what’s going on between components in a complex system. It is an offering that was added to the platform shortly after histograms were introduced (hence the third atom of the platform, so to speak), and it represents yet another angle on observing a running system.
Distributed tracing had been around for more than a decade by the time we decided to embrace it (fun story: we originally thought we could build Google’s Dapper as a SaaS service, only to realize that without frameworks that came later, such as OpenTelemetry, we could not infer causal relationships and could only compute probabilities across spans, something we thought wasn’t useful enough to warrant building a platform for in 2013). There were libraries and tooling such as Zipkin and Jaeger that allowed users to annotate their code to emit spans, store them for future analysis, and visualize them in ways that allow one to look at flame graphs.
The reason for us entering the fray was two-fold: a) we did not want to fight the logging battle yet (something we eventually did in 2020), and b) having built a planet-scale ingestion engine, a query language for metrics, and enterprise features around policies, security, HA, DR, backup, etc., we thought we would have a shot at claiming territory by taking the core engine and allowing the ingestion of spans (which, if one looks at them as metrics, are similar: the values are durations and the timestamps are the start times of operations).
Another strength of the platform, as we ventured into distributed tracing, is the fact that we also happen to have first-class support for metrics, counters (introduced in late 2019), and histograms. Anecdotal evidence suggests that while teams think they need tracing, utilization of tracing-only platforms is abysmally low. One possible issue is that alerting on traces largely involves derived metrics (if one only has traces), and a platform with only traces is unlikely to see the other data points needed to determine whether there is a real issue (you’d also need to build all the analytical features of Wavefront in order to support various use cases). This means that derived metrics end up duplicated in a customer’s metrics system (visualized, alerted on, and analyzed, no doubt), while traces are consulted only when necessary.
For Wavefront, we have long-term retention of data (contractually 18 months), and our charging model gives away spans for free while charging only for the derived metrics (with the user controlling the dimensions we collect). The alerting problem is therefore solved (one alerts on the metrics). The analytics problem is solved (we can convert spans back into metrics, and one can slice and dice the data to one’s heart’s content; the derived metrics are also stored for 18 months, so there is a “gentler” cliff when the span data is purged). The ROI problem is solved (spans are free, and one would need to pay for the metrics anyway if running anything of substance). And the usage problem is solved (we link the metrics you’re looking at to traces that might be of interest, increasing the chances that, during an outage, an engineer is prompted to “jump” to the relevant traces that show the root cause).
Intermission
As I begin my fourth year at VMware, having recently moved back to Hong Kong temporarily, I hope to be writing more, mentoring more, and yes, building more. 2020 is a seminal year for many; I consider myself very blessed that I still get to do the things I love, with the people I love.