Thoughts from the Front-lines: What You’ve Always Wanted in a Time-Series Analytics Engine for Observability
Thoughts from the front-line as a founder, engineer, and operator of one of the largest SaaS observability platform on the planet. Wavefront started in 2013 and was acquired by VMware in 2017. Opinions are mine.
At the heart of a modern observability stack is inevitably a time-series analytics engine. Of course, it’s not the only thing you’d have, many DevOps shops have logging stacks (ELK, Splunk or Sumologic), distributed tracing stores (Zipkin, Jaeger, etc.), inventory systems (might very well be API calls to AWS/GCP/Azure and managed with things like Terraform/Infra-as-Code) and event/incident management systems (Bigpanda, Moogsoft, etc.).
The focus of this story is how we (Wavefront), operating Wavefront, built, what believe, the best-of-breed time-series analytics engine, operated it at scale and how it still puts a smile to our face (and our customers) when we use it every day to serve our customers' 70 trillion data points (and counting) they have entrusted to us.
Criteria for Success
There are many time-series database out there: open-source, commercial, commercial open-source, etc. and it’s a crowded space to be sure so what would a modern DevOps team consider to be a good time-series analytics engine?
You need something that can take data points from your applications, your infrastructure, your cloud fabric, even your user interactions. You want all of that because when you have an issue or wanted to explore correlations, you don’t want to plop them into Google Sheets or fling timestamps around just so that you can see how two things interact.
Agents are only part of the story here, we thought long-and-hard, and concluded that there are probably too many agents out there already (it’s not surprising to see a handful these days on production machines, we use Scalyr for logging and Lacework for security for instance). Nowadays, we recommend using Telegraf (from the TICK stack). We are also active contributors to the project itself.
Most of our metrics, however, come from language SDKs, Kubernetes control planes, cloud APIs, etc. Someone, somewhere, will be sending you data from curl or from a program written in an obscure language — as long as it can talk to a socket.
You want something that scales up and down, and fast. Scale is an overloaded term — many benchmarks of time-series databases do bulk loads which are inherently not how real data comes in. When you spin up a thousand containers you want your system to work just like you’d expect it to. Not a minute later, not 15 minutes later, not when you found resources to throw at it. Scale is also the speed at which you can pull data back out of the system and perform computations on top of it.
We observe that data is queried at 10x the ingestion rate typically but unlike most datastores, the scans are concentrated on 2% of the ingested data (hence the datastore needs to be able to scale for lobsided reads, Cassandra, for instance, without any awareness of read/write bandwidth and shard rebalancing would just have to be inefficiently sized for the workload). We use FoundationDB (have been since the early days) and full-disclosure, I am a member of the Program Committee for the upcoming FDB Summit on December 10, which runs concurrently with Kubecon and CloudNativeCon in Seattle. I might be a little biased as to what Key-Value store is the best out there. :)
You can probably get scale and flexibility even if you picked a traditional RDBMS (although running one at 1M writes per second sustained in HA/DR configuration is probably not a walk in the park). That, and you will probably quickly run into the need to perform sliding window moving averages, aggregating streams that are not aligned, wants predictable autocomplete with trigram searching, etc., and SQL is probably not the language you want to use. It’s just not designed for time-series analytics where you want composable functions, first-class concept of time, and data as streams and not rows.
At Wavefront, we built composable functions with common sense math operators that operate on “key-ed streams” (simply put, think of it like passing variables of the type Map<Key, Stream> but not having to deal with types at all). Our users seem to love it and, again, full disclosure, it is heavily inspired by Viz @ Twitter and Borgmon @ Google (although even though I had Borgmon “readability” back in the day, it was considered an oxymoron, i.e. “to have readability in Borgmon… it wasn’t readable, to begin with…”; that story is an important one on how to design a language, but I digress).
All systems exhibit some latency (speed of light problems) but the reason why we don’t just dump all the data into an OLAP database with all the materialized views of how we want to dice the data is because you can’t collect a day’s worth of metrics and bulk insert them (OLAP systems also preclude data mutability, something that our customers actually want, as we found out). We find that average insertion latency needs to be <2s for the system to “feel” real-time and that queries themselves need to be faster than 100ms on average in order for users to feel like they are “conversing” with metrics (or as we call it, surfing metrics). This is largely because systems with high query latencies “penalizes” users for not finding the right thing or for not being able to incant the exact expressions for their intent (we do this all the time with query refining in Google searches BTW, if Google searches take 2 seconds to complete, we’d think a lot harder before we hit “enter”).
We happen to have a streaming architecture (with patents on composing un-aligned streams, continuous vs. non-continuous data, and synthetic time series, etc. to name a couple) that drives a single point from disk through stream operators all the way to the browser (through Server-Sent Events).
“Everything should really be in the form of iterators. I like the idea of an iterator next() call reading data off the disk and streaming it straight to a browser as fast as possible.”
We sometimes see people refer to Apache Spark or any streaming platform as the “proper way” to do time-series analytics but at the end of the day, every system needs to have a persistent storage story, an event bus can only give you “from now on” semantics, Gmail wouldn’t be very useful if you need to land on a browser and it just shows you your emails from that point onwards. Arguably, new data can only be as voluminous as what is the ingestion rate of the system; analytics are usually done over much larger data-sets.
The ability to ingest, store, visualize, and analyze time-series with data would be complete if you just have humans looking at charts 24/7. Many open-source offerings separate alerting with the engine which means you need to stand-up yet another system that may or may not understand the specifics of what it means to be monitoring a plurality of time-series, with differing thresholds triggering different remediations/notifications, and the nuances of maintenance windows (how they play with your tagging strategy for instance). Just a simple alert of “please email me if any system’s CPU goes over 80% and page if it goes over 90% for 5 minutes” would require an alerting system to know how to properly remember where each time-series was in the last cycle (not to mention series can disappear altogether, escalations can happen for a single series, which means a single alert is effectively N alerts where N is the number of series it monitors multiplied to M notification strategies based on severity). Our largest customers monitor millions of time-series with tens of thousands of alerts (many requiring historical data from weeks or even months of data).
Again, we have heard that if one stood up AWS Kinesis streams it would solve everything but alas I don’t want to sound like a broken record since a) historical data, and b) backfills are a fact of life (ts_now > ts_before is highly undesirable since we don’t have accurate GPS clocks on every machine and you don’t want to sort time-series before you ingest them =p). At Wavefront, our customers expect to be able to ingest data from a month ago, 6 months ago, and overwriting them again and again (scripts are mostly what’s doing that since some observability data are aggregated from other systems for various reasons). We also have the ability to ingest derived metrics from the result of queries (think “INSERT INTO … SELECT …”, a feature we borrowed from Kapacitor in the TICK stack) and it basically mandates that the system allow overwriting of existing data points.
We have also collected a whole set of like-to-haves for a time-series analytics engine and observability platform in general. Some might think that they are need-to-haves for sure but the six points above are table stakes in my opinion.
Hope to share more in the next story.