Thoughts from the Front-line: Operating at 99.95% Uptime
After writing my previous post on what are table stakes for a Time-Series Analytics Engine for modern stacks, my readers alerted me that I had left out something that’s clearly a requirement (if not the requirement) for our customers. So before we go and talk about the nice-to-haves, let’s take a trip down memory lane and get some definitions straight.
How do you define “up”?
This is a tough question since, with modern microservice architectures, you can almost always serve a page (that’s different from serving all requests properly, of course). For Wavefront, we have a mix of flight.js and React.js code (transitioning to the latter), so even if AWS is down, a browser with everything loaded will continue to “work” until you refresh the page.
Arguably, you can look at our SLA documentation to see how we define “up” (something our customers tell us isn’t that easy to get from our competitors, which is odd). At 99.95% monthly availability you have roughly 20 minutes of downtime to work with, and the SLA defines an outage as no points ingested for 3 minutes or more, no queries executed for 3 minutes or more, no alerts checked for 3 minutes or more (and things of that sort). Having an SLA matters mainly because it gives you a system by which you can credit customers (Algolia, which is both a Wavefront customer and a vendor of ours, offers 100x rebates; we offer 10% of the bill for each incident, but I am not a lawyer, so please read the fine print =p). But in the final analysis, the fine print doesn’t really define what we mean by the system being up.
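For the curious, the "roughly 20 minutes" figure is just arithmetic. Here is a minimal sketch, assuming a 30-day month (the function name and the 30-day default are mine for illustration, not anything from the SLA itself):

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed in one month at the given availability."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes")  # 21.6 minutes for a 30-day month
```

With outages counted in blocks of 3 minutes or more, that budget covers only a handful of incidents a month.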
The reason is that if the system is even slightly degraded, many of our customers immediately go into code/deployment freeze. It might not even be us that’s having the issue (our MillionEyes project has probes around the world to make sure we aren’t having an upstream issue). It doesn’t help that we can credit customers: if the system is “out” for 5 minutes and all that happened is that queries took longer to load, we will still be getting a call and it’s all hands on deck to figure out what the issue is. Engineers don’t care about credits.
For that reason, operating Wavefront meant tons of engineering hours trying to hit our SLIs and SLOs.
SLIs and SLOs
SLIs are indicators (metrics) that we measure to gauge the general happiness of the platform. Since we offer high-resolution histograms (basically TDigests as a service, but that’s really downplaying what it is to be honest), tracking medians, p95 and p99 of all kinds of operations and computing the true measurement of those across a cluster is quite easy. Query latency is obviously a critical metric. So are ingestion push backs (which are quite rare these days, but it could be that a rate limiter tripped) and, finally, the alerting period (alerts are run every minute; it doesn’t matter if you have tens of thousands of alerts or just a handful, our system will automatically scale up to handle them, but it’s always possible that you added a thousand more and we need to scale up).
Since we operate in HA configurations (more about this later), we also track three more key SLIs: 1) the time that we are truly in HA (meaning we truly have two share-nothing, in-sync mirrors and can switch serving in an instant with automated logic), 2) the time that we are in HA but one mirror is behind, and 3) the time that both mirrors are behind (possibly because the customer lost connectivity to us and is now sending a barrage of points to backfill, also more about this later).
SLOs are monthly aggregates of these SLIs so that we can score and track our progress towards perfection. There are good months and there may be a month when AWS knocks out an AZ because of a nasty EBS bug (just 2 months ago), but having the numbers helps keep us honest.
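To make "aggregate of an SLI" a bit more concrete, here is one common way to roll an SLI up into a single score: the fraction of samples that met a target. This is only a sketch under my own assumptions. The sample values, the 500 ms target, and the `slo_attainment` name are all hypothetical, and it is not a description of how our internal scoring actually works.

```python
from typing import Sequence


def slo_attainment(sli_samples: Sequence[float], threshold: float) -> float:
    """Fraction of SLI samples that were at or below the threshold."""
    if not sli_samples:
        raise ValueError("need at least one SLI sample")
    good = sum(1 for sample in sli_samples if sample <= threshold)
    return good / len(sli_samples)


# Hypothetical per-minute p95 query latency readings, in milliseconds.
samples = [120, 135, 110, 900, 125, 130, 140, 115, 2000, 128]
attainment = slo_attainment(samples, threshold=500)
print(f"{attainment:.1%}")  # 80.0%: 8 of the 10 minutes met the target
```

Over a real month you would feed in every minute's reading rather than ten, but the shape of the calculation is the same.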
So what’s the deal?
So the question is whether SLAs are important. I suppose for a commercial Observability Platform, they matter in the sense that we don’t like buying things with no guarantees. At the same time, they don’t matter much because an SLA doesn’t mean that you’ll get a service that works 100% of the time, without fail. Slowdowns are a fact of life, and even Chrome releases have caused issues with rendering. Only a fanatical focus on reliability and quality can separate the best from the rest.
Does it matter?
We have heard some say that reliability of your observability platform doesn’t matter since, well, “it’s just monitoring”. Since we use Wavefront to monitor Wavefront, we understand the feeling when you’re flying blind — a very uneasy and panicky feeling (imagine if you’re flying a plane at night and the instrument panel just went dark). As I said above, our customers literally go into code freeze if we are out and we go into release freeze for them if they are in a sensitive moment (note this is only if you go with the dedicated cluster option). The greatest feeling for any Wavefront engineer is perhaps to walk into a customer NOC and see that the software is what’s driving all the monitors. It’s literally how they know the business is humming along. It’s the instrument panel of the car.
So I firmly believe that it matters whether your monitoring setup is up, and not just the “hitting the SLA” kind of up but the “cruising at 65 mph with much horsepower to spare” kind.
High Availability comes Standard
With Wavefront, even if you’re in a trial cluster, your data is kept in HA configurations. That means that we keep a total of 4 copies of your data across 2 availability zones. It also means we route traffic through 3 separate witness nodes that allow us to steer traffic at a moment’s notice if there is anything odd with an AZ. We had to run without ELBs because we felt they were just too opaque as to what’s happening. We run fleets of nginx machines for traffic routing with custom code strapped on them to rewrite routing rules on the fly. For dedicated customers, we also offer a direct VPN drop or IP whitelisting.
Data replication is designed to tolerate complete AZ failures and queue data up for weeks. Within an AZ, we can tolerate complete storage failure of EC2 instances (happens when status check fails), EBS failures (fact of life but not as often as machine failures from our experience) and network partitions (FoundationDB has its limits when it comes to partial partitioning, however).
DR options are also available, which take HA to a whole other level with triple redundancy for operational data (dashboards, alerts, users, etc.), a witness region to make decisions with the remaining region if one region is out, and global locality-aware DNS traffic routing.
The Wavefront Proxy
I guess I did say previously that we are going with Telegraf as the agent of choice for us but I omitted one piece of (open-source) software that we do ask customers to run and that’s the Wavefront Proxy (or Secure Gateway I guess, because that’s what other companies call things like that to make them sound sexier =p).
It’s a piece of Java code that relays data from the customer’s datacenter to our service in the cloud. Now, it’s possible to just send data directly to us via HTTPS, but with proxies you can configure per-proxy limits, scrub sensitive data, do all kinds of transformations you can dream of and, most importantly, survive a connectivity outage.
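The "survive a connectivity outage" part is the classic store-and-forward pattern. The real proxy is Java and does far more, so treat the following as a toy sketch of the idea only: `BufferingRelay`, `fake_send`, and the point strings are all made up for illustration, and the backlog here lives in memory purely to keep the example short.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferingRelay:
    """Toy store-and-forward relay: hold points until the uplink accepts them."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send  # returns True if the point was accepted upstream
        self._backlog: Deque[str] = deque()

    def submit(self, point: str) -> None:
        """Queue a point, then try to drain whatever is queued."""
        self._backlog.append(point)
        self.flush()

    def flush(self) -> int:
        """Send queued points oldest-first; stop at the first failure."""
        sent = 0
        while self._backlog and self._send(self._backlog[0]):
            self._backlog.popleft()  # drop a point only after it was accepted
            sent += 1
        return sent

    @property
    def backlog_size(self) -> int:
        return len(self._backlog)


# Simulated uplink: starts down, then comes back.
link = {"up": False}
delivered: List[str] = []


def fake_send(point: str) -> bool:
    if link["up"]:
        delivered.append(point)
    return link["up"]


relay = BufferingRelay(fake_send)
relay.submit("point-1")
relay.submit("point-2")
queued_during_outage = relay.backlog_size  # both points are held, none lost

link["up"] = True
relay.submit("point-3")  # the backlog drains in order ahead of the new point
print(queued_during_outage, delivered, relay.backlog_size)
```

The one design choice worth noting is that a point leaves the queue only after the send succeeds, so a failure mid-drain never drops anything.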
The last point brings me to perhaps the most interesting challenge any time-series analytics engine faces: traffic doesn’t just go away and come back, it comes back with a vengeance.
Any outage on our end or on the customers' end will inevitably lead to a reservoir of points on the proxies, and you have to include that capacity in traffic sizing in order to keep up.
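A back-of-the-envelope way to see why the backfill matters for sizing: while the backlog drains, new points keep arriving at the normal rate, so only the headroom above steady state goes toward catching up. The sketch below assumes constant rates, and the numbers and the `catch_up_seconds` name are hypothetical rather than real Wavefront figures.

```python
def catch_up_seconds(steady_rate: float, outage_seconds: float, ingest_capacity: float) -> float:
    """Seconds needed to drain an outage backlog while live traffic keeps flowing.

    steady_rate and ingest_capacity are in points per second.
    """
    if ingest_capacity <= steady_rate:
        raise ValueError("no headroom above steady state: the backlog never drains")
    backlog = steady_rate * outage_seconds  # points that piled up on the proxies
    headroom = ingest_capacity - steady_rate  # spare points/sec available for backfill
    return backlog / headroom


# Hypothetical: 100k points/sec steady state, a 30-minute outage,
# and a cluster sized for 150k points/sec.
seconds = catch_up_seconds(100_000, 30 * 60, 150_000)
print(f"{seconds / 60:.0f} minutes to catch up")  # 60 minutes to catch up
```

In this example 50% headroom turns a 30-minute outage into an hour of catch-up, and with no headroom at all you never catch up.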
This capability also means that our customers never lose data because the Internet is out or in the unlikely event that Wavefront is down. This is something that’s often overlooked with homegrown solutions since you can always tip your hat, say sorry, and move on with the data after the outage. With Wavefront, we really take care of your every data point.
The more I think about it, the more dimensions of a time-series analytics engine come to mind which are more need-to-have than nice-to-have. We’ll see what happens when I start thinking more about the second category I guess.