Argus: Time-Series Monitoring and Alerting
At Salesforce, we run a platform for mission critical, 24x7 business apps. This means our engineering teams need ways to keep their eyes on the health of every production service — to understand performance, behavior, capacity, and more. Over the years, we’ve iterated through a number of tools — both proprietary and open source — to support this level of production visibility. We’ve learned a lot about what works and what doesn’t, and how to make observability run at scale.
Today, we’re proud to announce that we’ve open sourced our latest tool, in development for over a year — Argus: A Time-Series Platform.
What’s Argus? It’s a time-series monitoring system, named after the hundred-eyed giant of Greek mythology (who got turned into a peacock). Argus allows engineering teams to collect, store, annotate, and alert on massive amounts of time-series data, using a scalable, resource-protected architecture.
That’s a pretty dense description, so let’s unpack that a little bit.
What’s a Time-Series?
A time-series is a discrete view of some value, as it varies over time. Think of a quantity that changes from one moment to the next (like, say, the price of a stock, or the blood pressure of a hospital patient), which you can sample at regular intervals. That’s why we call it a series: it’s a list of “snapshots” of that value.
Time-series telemetry can be used for many things: monitoring an application, detecting anomalies, or even feeding data into other apps. You can really do anything with it! For our purposes, of course, what we care about is being able to observe the state of our services in near-real-time, to keep them healthy.
Argus collects time-series data from various data sources, stores it, and lets you view it via queries and dashboards. What’s even cooler, though, is that everything in Argus can be accessed via a REST API. That means you can use Argus as the back-end for other applications that need to collect or display time-series data. (In fact, the out-of-the-box Argus UI is really just a reference implementation on that same API.)
The model for data in Argus is pretty straightforward. Metrics have:
- A name, which describes whatever you’re collecting. (For example, “heap size” or “cpu % used”.)
- A scope, which is a categorization of where that data is coming from: which server, data center, etc.
- A namespace (optional), to ensure different teams don’t accidentally use conflicting names.
- A set of optional tags (key / value pairs), to code more attributes into the metrics.
- Some additional functional instructions, like an aggregation function (to combine multiple metrics) and an optional downsample function (to create lower-grain reductions of the source data).
And then, as you collect data, each sample for the metric has a value and a timestamp. These are what you push in via the REST interface from the metric collectors in your infrastructure.
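To make the model concrete, here is a minimal Python sketch of a metric and a downsample step. The names (`Metric`, `add_sample`, `downsample`) and the field layout are assumptions made for illustration; they are not the actual Argus classes or its REST schema.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, Optional


@dataclass
class Metric:
    """Hypothetical in-memory stand-in for a metric (not the Argus class)."""
    scope: str                                   # where the data comes from
    name: str                                    # what is being measured
    namespace: Optional[str] = None              # optional, avoids name clashes
    tags: Dict[str, str] = field(default_factory=dict)
    # Each sample is a (timestamp in ms -> value) pair.
    datapoints: Dict[int, float] = field(default_factory=dict)

    def add_sample(self, timestamp_ms: int, value: float) -> None:
        self.datapoints[timestamp_ms] = value


def downsample(datapoints: Dict[int, float], bucket_ms: int,
               reducer: Callable = mean) -> Dict[int, float]:
    """Group samples into fixed-width time buckets and reduce each bucket."""
    buckets: Dict[int, list] = {}
    for ts, value in sorted(datapoints.items()):
        bucket_start = ts - (ts % bucket_ms)
        buckets.setdefault(bucket_start, []).append(value)
    return {start: reducer(values) for start, values in buckets.items()}


heap = Metric(scope="app-server-01", name="heap.size",
              tags={"datacenter": "dc1"})
for ts, value in [(0, 100.0), (30_000, 200.0), (60_000, 400.0), (90_000, 600.0)]:
    heap.add_sample(ts, value)

# Reduce 30-second samples to one averaged value per minute.
per_minute = downsample(heap.datapoints, bucket_ms=60_000)
print(per_minute)  # {0: 150.0, 60000: 500.0}
```

Swapping `reducer` for `max`, `min`, or `sum` gives other lower-grain reductions of the same source data.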
Let’s be honest: there are plenty of time-series systems out there in the world today. Many of them are great! In fact, we’ve used many of them, including the open source Graphite and OpenTSDB (which, as you’ll see, forms one of the core layers in Argus today, running on top of Apache HBase).
Why did we create Argus, and why would you use it? What does it add, beyond what you can find in these other systems? Read on!
Annotations: Putting Data in Context
Time-series data is useful, but it doesn’t tell the whole story by itself. To be meaningful, it has to be understood by humans, who want to see things in context.
For example, take a look at this time-series graph of memory usage. It shows that we’re using a fixed amount of Java heap memory until just before 6pm… at which point we restart, and are suddenly able to use a lot more!
But, uh… what actually happened to cause that? Well, that’s what annotations are for; they let you overlay a time-series with significant events that happen in the real world:
In this case, we had deployed a new release that changed a low-level JVM setting, initialCodeCacheSize, which had been inadvertently limiting our memory usage. (This release also happened to improve the performance of Salesforce across the board by a massive factor, the kind of find that happens once in a blue moon.)
Annotations are great, because they help engineering teams make sense of time-series metrics.
Unfortunately, they’re not used very widely, because they can be hard to use. In some systems, annotations are pinned to a single time-series. This is fine if you discover something you want to annotate about that specific graph, but it misses the point that most significant events affect many time-series.
The thing that makes Argus annotations super useful is that they are first-class citizens. Unlike in other systems (like OpenTSDB), where annotations are tightly coupled with an individual time-series, Argus annotations stand on their own and can be overlaid on multiple time-series. In the example above, the event of that release going out wasn’t unique to how much heap memory was being used; it could be relevant to any metric, and you might want to overlay it on any time-series you’re looking at. (Argus also supports multiple annotations with the same timestamp, which other systems sometimes can’t handle.)
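As a rough illustration of what "standing on their own" means, the sketch below keeps annotations in a separate list and overlays them on whichever series happens to cover the same time range. The `Annotation`, `Series`, and `overlay` names are hypothetical and only model the idea; they do not reflect how Argus stores or queries annotations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-alone event, not tied to any one series."""
    timestamp_ms: int
    source: str
    text: str


@dataclass
class Series:
    name: str
    datapoints: Dict[int, float]


def overlay(series: Series, annotations: List[Annotation]) -> List[Annotation]:
    """Return the annotations that fall inside the series' time range."""
    if not series.datapoints:
        return []
    start, end = min(series.datapoints), max(series.datapoints)
    return [a for a in annotations if start <= a.timestamp_ms <= end]


annotations = [
    Annotation(1_000, "release", "deployed new release"),
    Annotation(1_000, "config", "changed JVM setting"),  # same timestamp is fine
    Annotation(9_000, "release", "later release"),
]
heap = Series("heap.size", {0: 1.0, 2_000: 4.0})
latency = Series("request.latency", {500: 80.0, 1_500: 20.0})

# The same two events show up on both graphs.
for s in (heap, latency):
    print(s.name, [a.text for a in overlay(s, annotations)])
```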
Alerting: When the Metrics Hit the Fan
Most people (well, other than mythical Greek giants) have a pretty limited number of eyes, not generally exceeding two. Thus, the way to keep a service healthy is not to sit around all day and watch graphs. Instead, you want to teach your monitoring systems to do the watching for you — you want to build alerts: conditions that can automatically send a notification when something isn’t right.
In Argus, you can set up as many alerts as you want, and they’re very flexible. Each alert can have multiple triggers, and can cause multiple notifications when fired. Notification destinations currently include email, Salesforce Chatter, and a database-backed audit log. (We’ve also built an integration into our own Global Operations Console, for our Site Reliability Engineering team.) Argus can integrate with many other things (like PagerDuty) via the email mechanism. But, if you have ideas for more integrations (Slack, HipChat, etc.), it’s pretty simple to contribute them via a pull request!
Now, lots of systems have alerts. Argus goes above and beyond the standard features in a few ways. First, alerts in Argus are stateful; meaning, when an alert fires, Argus remembers that it’s currently firing, and when it subsequently continues to evaluate that trigger, it modifies the notifications accordingly. It can also send a notification when the alert clears, and enforce a “cool down” period that must pass before the alert is allowed to fire again. (This helps tamp down “flappy” alerts.)
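Here is one way the stateful behavior could be sketched in Python. The `StatefulAlert` class, its threshold condition, and the choice to start the cool-down when the alert clears are all assumptions for illustration, not the Argus implementation.

```python
from typing import Optional


class StatefulAlert:
    """Hypothetical alert that remembers whether it is currently firing."""

    def __init__(self, threshold: float, cooldown_s: int):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.firing = False
        self.cooldown_until: Optional[int] = None  # no cool-down active yet

    def evaluate(self, now_s: int, value: float) -> Optional[str]:
        """Return a notification to send, or None if nothing should be sent."""
        breached = value > self.threshold
        if breached and not self.firing:
            if self.cooldown_until is not None and now_s < self.cooldown_until:
                return None  # still cooling down: suppress the re-fire
            self.firing = True
            return "FIRED"
        if not breached and self.firing:
            self.firing = False
            self.cooldown_until = now_s + self.cooldown_s
            return "CLEARED"
        return None  # state unchanged, so no repeat notification


alert = StatefulAlert(threshold=90.0, cooldown_s=300)
samples = [(0, 95.0), (60, 97.0), (120, 50.0), (180, 99.0), (240, 40.0), (480, 99.0)]
events = [(t, alert.evaluate(t, v)) for t, v in samples]
notifications = [(t, e) for t, e in events if e is not None]
print(notifications)  # [(0, 'FIRED'), (120, 'CLEARED'), (480, 'FIRED')]
```

Note how the spike at 180 seconds falls inside the cool-down window and produces no notification, which is the "flappy" case this state is meant to absorb.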
Second, alert trigger definitions in Argus can include the concept of “inertia”. This is like saying, “I only want to fire this alert if this condition persists for a certain length of time”. That way, if you have transient spikes in some metric, you can avoid sending lots of false positive notifications. (For those of you into signal processing, think of this as a low-pass filter.)
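A minimal sketch of inertia, assuming a simple "value above threshold" condition and samples sorted by time. The function name `breaches_with_inertia` and its parameters are made up for this example.

```python
from typing import List, Tuple


def breaches_with_inertia(samples: List[Tuple[int, float]],
                          threshold: float, inertia_s: int) -> List[int]:
    """Return timestamps at which the trigger fires.

    The trigger fires once per breach, and only after the condition has
    held continuously for at least inertia_s seconds.
    """
    fired_at = []
    breach_start = None
    already_fired = False
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts          # condition just became true
            if not already_fired and ts - breach_start >= inertia_s:
                fired_at.append(ts)
                already_fired = True
        else:
            breach_start = None            # condition broke: start over
            already_fired = False
    return fired_at


samples = [(0, 50.0), (60, 95.0), (120, 50.0), (180, 95.0),
           (240, 96.0), (300, 97.0), (360, 98.0), (420, 40.0)]

print(breaches_with_inertia(samples, threshold=90.0, inertia_s=0))    # [60, 180]
print(breaches_with_inertia(samples, threshold=90.0, inertia_s=120))  # [300]
```

With no inertia, the one-sample spike at 60 seconds fires a notification; with two minutes of inertia, only the sustained breach does.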
Third, Argus can actually alert not just based on the value of metric data, but also on its presence (or lack of presence). That means that missing data can also trigger an alert, which is helpful to ensure that a quiet day in the SRE command center doesn’t just mean that all the sensors stopped working!
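The absence check can be sketched in a few lines. The function `is_data_missing` and the `max_silence_s` parameter are hypothetical names; the idea is simply to alert when no sample has arrived within a recent window.

```python
from typing import Iterable


def is_data_missing(sample_timestamps_s: Iterable[int],
                    now_s: int, max_silence_s: int) -> bool:
    """True when no sample arrived within the last max_silence_s seconds."""
    window_start = now_s - max_silence_s
    return not any(window_start <= ts <= now_s for ts in sample_timestamps_s)


timestamps = [0, 60, 120]
print(is_data_missing(timestamps, now_s=150, max_silence_s=60))  # False
print(is_data_missing(timestamps, now_s=600, max_silence_s=60))  # True
```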
Argus has a very scalable capability for evaluating alerts: it can check 40,000 triggers per minute, per host! (We’ll talk more below about how this is powered by asynchronous scheduling using Apache Kafka.)
Resource Protection: Call the Warden
Argus is a multi-tenant system: one set of computer resources is shared by multiple independent parties. (This is how Salesforce works in general, so it’s something we’re pretty used to engineering for.) In the case of Argus, those parties are the many teams who are each supporting their own services in production.
One of the key challenges with multi-tenancy, of course, is: how do you keep people from stepping on each other’s toes? Specifically, how do you keep any one party from using too many resources, like sending too much data, or running too many queries? This isn’t a trivial problem to solve when you are creating a platform, a system where people can develop their own applications on top of an API. You have to both detect this behavior, and act on it.
Argus’s resource protection service is called the Warden. Individual user accounts that push data to and query data from Argus are authenticated, and their usage is subject to configurable limits, per subsystem (and globally). For example, you can say “Bart can only push in 100K data points per day”, or “Homer can only run 100 queries per hour”. The alerts and metric content can also be limited: you could say “Lisa can only set up 300 alerts per hour”, or “Maggie can only push in data with a minimum resolution of 200 milliseconds”. When any of these limits are reached, the user is temporarily suspended from that subsystem.
The policy is actually even more configurable than that: you can set up “levels” of suspension, so for a first infraction, the penalty is light, and for subsequent breaches, it goes up.
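To illustrate limits with escalating suspension levels, here is a small Python sketch. The `Warden` class below is a toy stand-in: its method names, the single per-window counter, and the way the suspension length is picked from a list are all assumptions, not the real Warden policy engine.

```python
from collections import defaultdict
from typing import Dict, List


class Warden:
    """Toy usage limiter with escalating suspensions (illustrative only)."""

    def __init__(self, limit_per_window: int, suspension_levels_s: List[int]):
        self.limit = limit_per_window
        self.levels = suspension_levels_s        # e.g. [60, 600]: light, then heavier
        self.usage: Dict[str, int] = defaultdict(int)
        self.infractions: Dict[str, int] = defaultdict(int)
        self.suspended_until: Dict[str, int] = {}

    def start_new_window(self) -> None:
        """Reset usage counters, e.g. once per hour or per day."""
        self.usage.clear()

    def allow(self, user: str, now_s: int, amount: int = 1) -> bool:
        if now_s < self.suspended_until.get(user, 0):
            return False                         # still suspended
        self.usage[user] += amount
        if self.usage[user] > self.limit:
            # Repeat offenders move up a level, capped at the last one.
            level = min(self.infractions[user], len(self.levels) - 1)
            self.suspended_until[user] = now_s + self.levels[level]
            self.infractions[user] += 1
            self.usage[user] = 0
            return False
        return True


warden = Warden(limit_per_window=100, suspension_levels_s=[60, 600])
print(warden.allow("homer", now_s=0, amount=100))   # True: exactly at the limit
print(warden.allow("homer", now_s=1))               # False: first infraction, 60s
print(warden.allow("homer", now_s=30))              # False: still suspended
print(warden.allow("homer", now_s=61))              # True: suspension is over
print(warden.allow("homer", now_s=62, amount=100))  # False: second infraction, 600s
print(warden.allow("bart", now_s=62))               # True: other users unaffected
```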
What’s neat about the Warden architecture in Argus is that it’s built using… Argus! It uses the exact same mechanisms for both alerting notifications and for Warden actions. Our team is also hoping to break out Warden as a general-purpose resource protection framework in the future (feel free to chip in if you want to see that happen!).
Warden has one more benefit, which is really important to Salesforce: Trust. Because all users of Argus are authenticated, not only can we control their access to data namespaces, but we can also audit all engineers’ usage of the system. This is important for maintaining the high standards we have for regulatory compliance: it’s not just about controlling who can do what, it’s also about reviewing and auditing who actually did what!
Scalable, All the Way Down
And finally, all of this goodness is scalable: massively scalable, in fact.
For starters, Argus is built on top of a stack of pluggable, scalable, open source components. The underlying data store, out of the box, is Apache HBase, a horizontally scalable database that’s part of the Hadoop ecosystem. (We use HBase for lots of other things at Salesforce, too! We’ll talk about that more in a future post.) On top of HBase, Argus uses OpenTSDB as a layer to store and retrieve the time-series data. Intra-application messaging runs in Apache Kafka, and scheduling for alerts and service protection is done in Quartz. All of these components are designed to scale by simply adding more machines to your cluster.
Argus takes scalability even further, though: the system is also deeply asynchronous. With the exception of running user-facing queries, everything in Argus is asynchronous. This means that when you push in a new metric via the REST interface, for instance, it’s actually queued into Kafka, which acts as a “shock absorber”. Thus, even if the other backend services are down, the client services are not affected. Once data is available on this bus, it is consumed by an army of asynchronous clients that are constantly reading data from the bus and performing their own tasks: storing data in HBase, storing annotation data, evaluating alerts, etc. Once the data has been committed to HBase, it can be immediately viewed in queries and dashboards.
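The "shock absorber" idea can be sketched with Python's standard library alone. In this toy version, `queue.Queue` stands in for the Kafka bus and a plain dictionary stands in for HBase; the function names and the `backend_ready` event (used to simulate a backend outage) are assumptions made for the example.

```python
import queue
import threading

bus = queue.Queue()                 # stand-in for the message bus
store = {}                          # stand-in for the data store
backend_ready = threading.Event()   # unset = backend is "down"


def push_metric(name, timestamp, value):
    """Request path: enqueue the sample and return immediately."""
    bus.put((name, timestamp, value))
    return "accepted"


def storage_consumer():
    """Asynchronous consumer: drains the bus once the backend is available."""
    backend_ready.wait()
    while True:
        item = bus.get()
        if item is None:            # sentinel: stop consuming
            bus.task_done()
            break
        name, ts, value = item
        store.setdefault(name, {})[ts] = value
        bus.task_done()


worker = threading.Thread(target=storage_consumer, daemon=True)
worker.start()

# The backend is down, yet clients can still push without being affected.
results = [push_metric("heap.size", t, float(t)) for t in range(3)]
queued_while_down = bus.qsize()
stored_while_down = len(store)

# Backend recovers: the queued samples are committed to the store.
backend_ready.set()
bus.put(None)
bus.join()
worker.join()
print(results, queued_while_down, stored_while_down, store)
```

A real deployment would have many consumers reading from the bus in parallel (storage, annotations, alert evaluation); this sketch uses one to keep the flow easy to follow.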
Why is it so important for a system like this to be scalable? Well, for one thing, web-scale systems just plain get big. Salesforce runs gobs of servers, in data centers all over the world, and all of them emit metrics we need to watch, with low latency. That can really add up. As a general rule, we prefer building systems like this: those that can scale horizontally, and where we don’t much care if we lose any individual server.
Since Salesforce is a multi-tenant environment, our needs for scalability actually go one step further: not just by server, but also by tenant! Argus’s scalability allows us to break out metrics (like for example, the average latency for web requests) on a per-tenant basis. Thus, if one of our enterprise customers wants insight on their app performance or usage, our metrics infrastructure can collect that, across many, many tenants at the same time. This is a major driver for scalability.
Just how scalable is Argus? By the numbers: on a small standard deployment (1 application server, a 4-node Kafka bus, and a 12-node HBase cluster), we can sustain:
- 25 million data points a minute write throughput
- 1-second-resolution time-series
- p95 read latency of ~10s for trailing 1 month (dependent on tag cardinality)
- p95 write latency of ~100ms
- Write settling time of ~30s
- 40K alerts per minute
And, as you add nodes, these numbers scale nicely: you can support 250 million data points per minute write throughput, with the same latency of access, by using a cluster of ~25 HBase machines, combined with 10 application servers and a 20-node Kafka cluster.
How big can it go? We’re not sure; certainly, Kafka and HBase are known to work into the range of hundreds or thousands of nodes, though we’re likely to run into other scalability bottlenecks (like network) first. If you’ve got data needs in that range, talk to us; we’d love to work with you on getting to a billion points per minute!
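For a feel of what these per-minute figures mean, the short calculation below converts them to per-second and per-day rates. It is plain unit conversion on the numbers quoted above; the helper names are made up for the example.

```python
MINUTES_PER_DAY = 24 * 60


def points_per_second(points_per_minute: int) -> float:
    return points_per_minute / 60


def points_per_day(points_per_minute: int) -> int:
    return points_per_minute * MINUTES_PER_DAY


rates = [
    ("small deployment", 25_000_000),
    ("larger cluster", 250_000_000),
    ("billion-per-minute goal", 1_000_000_000),
]
for label, rate in rates:
    print(f"{label}: {points_per_second(rate):,.0f} points/s, "
          f"{points_per_day(rate):,} points/day")
```

Even the small deployment works out to roughly 417 thousand points per second, or 36 billion points per day if sustained.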
As you can see, we’re pretty excited about Argus, and hope you’ll join us in directing its future. Some of the things we hope to build and collaborate on include:
- Plugging in new alert notification interfaces
- Providing implementations of more schedulers, data stores, and pub/sub platforms
- Supporting asynchronous queries, for very large reads
- Spiffing up the UI, and integrating with more collaboration features
We hope you give Argus a try, and send us your ideas and feedback.
Thanks to the whole development team, who created Argus and continue to evolve it: Anand Subramanian, Bhinav Sura, Dilip Deveraj, Jigna Bhatt, Kirankumar Gowdru, Rajavardhan Sarkapally, and Ruofan Zhang.