The Mon-ifesto Part 1: Metrics

A 3-Part Guide to Better Application Monitoring

We live in an interesting time when it comes to technology. As more and more applications change to become API-focused and less stateful, so too does our mindset need to change. Right now, we’re seeing a great battle for data and metrics being waged in the halls and conference rooms of companies around the world. Everything is becoming increasingly metrics-driven, from hiring and retention, to business acquisitions, to application frameworks. But if everything is metrics-focused, how do we determine which metrics to pay attention to and which metrics to ignore? This question is where the frontline of this war is being waged.

But Wait…

First, we should talk about where metrics come from. Application metrics can be generated from application performance monitoring (APM) agents and from application logs. Application performance monitoring agents integrate with your application’s internals and track metrics like response time, system resource usage, uptime, and much more. Typically, APM agents will only track a pre-defined set of metrics, but if you want custom metrics then app logs are a good place to start. Because you control what gets logged and how logs are aggregated and parsed, using logs as a way to track custom metrics is usually trivial compared to setting up a third metric-gathering method.

(Side note about application logs: logging from your application is important, but only if you are logging correctly. At the very minimum, application logs should give an engineer an inside look at what exactly the application was doing at the moment the log was written. Provided you have the infrastructure in place to handle the volume, you should be logging everything and using logs to correlate events and generate metrics in addition to application performance metrics.)

Secondly, it’s important to make a distinction between operational metrics and business metrics. An application will generate both and separating the two is your first step to achieving metrics zen.

  • Business metrics are metrics generated by the business logic in your application, things like signups, buy-flow step hits, page hits by content, etc.
  • Operational metrics are things like client connections, CPU load, error rates, etc. These metrics don’t inform the actual logic of the application per se, but they do speak to the continued function of the application.

Finally, when referring to an application, I am really referring to an application service. Gone are the days of a single instance serving all of an application’s traffic. An application service is a group of application instances that collectively perform the function of a whole application. Whether this is accomplished by a distributed microservice architecture or multiple copies of a monolithic app sharing traffic (or some hybrid somewhere in-between) is irrelevant for our context.

With that in mind, let’s dive into our first metric: calls per minute.

1. Calls Per Minute

When talking about an application’s size, Calls Per Minute (CPM) is the yardstick by which all apps are measured. Why? Because things like server count or packet I/O can vary from app to app depending on code efficiency and the like. CPM shows exactly how much traffic a single application can take, a metric that is often used for capacity planning.

From an operational point of view, it’s important to monitor CPM so we have a real-time snapshot of what is going on with our app, whether customers are able to get to our app, how many customers we’re serving, and if there are any problems that need our attention. Large unexpected spikes in CPM are the kinds of things that keep engineers awake at night, the freak events that can lead to a catastrophic meltdown of services and servers. But it’s not just increases in CPM we need to worry about; sudden decreases in the number of API calls should worry us too.

But how do you monitor something like CPM that doesn’t have static success/fail values? What is typically done here is to monitor the deviation from a given baseline, where the baseline is established over an arbitrary length of time (e.g. 14 days). Using deviation from a baseline allows for changes in traffic patterns that occur organically — either from a change in consumer behavior or changes in application architecture — while still alerting on spikes and dropouts. Keep in mind there is no magic number for how much deviation is acceptable; it varies from app to app. App A might have a very steady CPM rate and can safely alert on a deviation of just 10% from baseline, while App B’s traffic may be spikier, and such a low percentage would throw unnecessary alerts. The best way to explore this is through testing and trial and error. Adjust your deviation percentage until you find one that works for you and your application.
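As a rough illustration, the baseline-deviation check described above can be sketched in a few lines of Python. The `BaselineMonitor` class, the 14-day window, and the 10% tolerance are all hypothetical values chosen for this sketch, not recommendations:

```python
from collections import deque

class BaselineMonitor:
    """Track a rolling CPM baseline and flag deviations beyond a tolerance.

    Window length and tolerance are illustrative; tune both per app.
    """

    def __init__(self, window_minutes=14 * 24 * 60, tolerance=0.10):
        self.samples = deque(maxlen=window_minutes)  # one CPM sample per minute
        self.tolerance = tolerance                   # 0.10 = alert at 10% deviation

    def record(self, cpm):
        self.samples.append(cpm)

    def baseline(self):
        return sum(self.samples) / len(self.samples)

    def should_alert(self, cpm):
        """True if cpm deviates from the rolling baseline by more than tolerance."""
        base = self.baseline()
        if base == 0:
            return False
        return abs(cpm - base) / base > self.tolerance

monitor = BaselineMonitor(tolerance=0.10)
for minute_cpm in [1000, 1020, 980, 1010]:
    monitor.record(minute_cpm)

print(monitor.should_alert(1005))  # within 10% of baseline -> False
print(monitor.should_alert(1500))  # roughly 50% above baseline -> True
```

Note that the check fires on deviation in either direction, which matches the point above: a sudden dropout is just as alert-worthy as a spike.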

2. Error Rate (EPM)

Similar to CPM, Errors Per Minute (EPM, or error rate) measures the number of errors your application is throwing in a given minute. Error rate differs from CPM in a few very important ways.

  1. You never want to open alerts on decreases below baseline; you want to close them. A drop in error rate is a good thing.
  2. Reporting EPM as an absolute value is pretty much useless. A much more useful metric is the percentage of incoming app calls that result in errors.

Let me give an example of this. App A has an error rate of 50 errors per minute. By itself this number means nothing because it lacks context. If App A has an overall CPM rate of 25,000, then an EPM of 50 is not a huge deal, typically under the alerting threshold. But even expressing these as absolute values is a little unwieldy, especially if you’re working with numbers that aren’t nice and round like our example.

Percentages give a much quicker picture of the overall app performance. In our example, an EPM of 50 is just a 0.2% error rate. Just as with CPM, there’s no magic number for what your EPM threshold should be, but a good starting place is generally 10%.
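The percentage math is trivial, but worth making concrete. This sketch (the function name and threshold are illustrative, not from any particular monitoring tool) reproduces the App A numbers above:

```python
def error_rate_percent(epm, cpm):
    """Express errors per minute as a percentage of calls per minute."""
    if cpm == 0:
        return 0.0
    return 100.0 * epm / cpm

# The App A example from the text: 50 EPM against 25,000 CPM.
rate = error_rate_percent(epm=50, cpm=25_000)
print(f"{rate}% error rate")  # 0.2% error rate

ALERT_THRESHOLD = 10.0  # the suggested starting point; tune per app
print(rate > ALERT_THRESHOLD)  # False: well under threshold
```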

3. Response Time (RT)

It could be argued that response time is the most important metric that you can collect from an application. Not only does response time give indications about overall application health, but it also gives insight into customer experience and can have an impact on not just your app, but also your business or company as a whole. For companies that rely heavily on NPS scores, like telecommunications companies, slow response times could mean the difference between a good quarter and a bad one.

With response times, it’s important to capture a good baseline for your application. As with CPM, you’ll want to monitor deviations from the baseline instead of just a “dumb” threshold, because a sudden drop or spike in response time indicates an issue with your application, either internally or externally in its environment. You could, however, combine a threshold monitor with your baseline deviation monitor (in fact, if your application monitoring suite allows for it, I would recommend it). Real-world response times are rarely in the single digits. If your application starts throwing times in the 1 to 10 millisecond range, either your APIs aren’t actually doing any work (like a simple “ping” check, in which case you’d want to create special rules for that API) or they’re immediately returning errors.
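Combining a baseline-deviation monitor with a hard low threshold might look something like the following sketch. The function name, the 25% deviation, and the 10 ms floor are all assumed values for illustration:

```python
def response_time_alerts(rt_ms, baseline_ms, deviation=0.25, floor_ms=10):
    """Combine a baseline-deviation check with a suspicion floor.

    The deviation and floor values are illustrative, not recommendations.
    Returns a list of alert reasons (empty means healthy).
    """
    alerts = []
    if baseline_ms > 0 and abs(rt_ms - baseline_ms) / baseline_ms > deviation:
        alerts.append("deviation from baseline")
    if rt_ms < floor_ms:
        # Suspiciously fast: likely a no-op endpoint or immediate errors.
        alerts.append("below plausible floor")
    return alerts

print(response_time_alerts(210, baseline_ms=200))  # [] -- healthy
print(response_time_alerts(5, baseline_ms=200))    # both alerts fire
```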

One other way that response time can indicate app issues is if you see a correlation between CPM and response time. For most APIs, there will be a natural ebb and flow of API calls resulting in a graph similar to a sine wave, with peaks typically during the day and valleys at night. (Of course, not all apps are the same, and for some companies I’m sure this doesn’t hold up.) You don’t want your response time graph to track with your CPM graph; if it does, that’s usually a good indication that your app is either poorly written or overloaded, which brings us to the final metric…
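One simple way to quantify whether response time is tracking traffic is a correlation coefficient between the two per-minute series. This is a sketch with made-up sample data; a healthy app should show weak correlation, while an overloaded one shows a strong positive correlation:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up per-minute samples: CPM rises through the day, then falls off.
cpm = [5000, 12000, 20000, 26000, 18000, 8000]
# A healthy app's response time (ms) stays roughly flat regardless of load...
rt_flat = [201, 199, 203, 200, 198, 202]
# ...while an overloaded app's response time tracks the traffic curve.
rt_tracking = [120, 180, 260, 310, 240, 150]

print(round(pearson(cpm, rt_flat), 2))      # weak correlation: healthy
print(round(pearson(cpm, rt_tracking), 2))  # strong positive: investigate
```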

4. Bandwidth Saturation (BS)

In this context “bandwidth” refers to the maximum load your application can take, not just network bandwidth (although that certainly plays a part in the overall app capacity). Bandwidth saturation (or just saturation) is a little different from the first three metrics as it requires you to perform load testing on your application in order to establish some “up front” data. When load testing your app, there are a few things you want to do:

  • Test a microservice or monolith instance separately to establish its maximum individual load (for microservices you would want to do this individual test with each microservice, if possible).
  • Test the smallest unit of your application in its environment. For a microservices app this would mean a single copy of each microservice with all of its routing and discovery and load balancing in place; for a monolith, this is a single instance of the app with its load balancing and everything else in place. This test will inform you if there are any bottlenecks in the environment outside of the application by comparing the performance data between your first test and this one.
  • Review your data and figure out what your realistic maximum CPM is; this will be your baseline for extrapolating and testing your ability to scale.

Once established, if you scale past this baseline to multiple instances or varying numbers of microservices and start seeing performance hits, then something is wrong and you should go fix it. In an ideal world, your scaling is based on this baseline multiplied by the number of instances or microservices. Assuming everything checks out, you can use your new baseline to calculate your maximum desired application bandwidth. Let’s look at how this works.

In our example, App A can take 10,000 CPM per application instance, but at 9,500 it starts to destabilize a little bit, so for safety we say that 9,000 CPM is our baseline CPM. If we have 5 copies of App A running, then our max bandwidth is 45,000 CPM with about 2,500 CPM “headroom” in case we hit some spikes and can’t scale fast enough. If App A has a real-world CPM of 30,000 CPM, then we can say that we have used about 67% of our total bandwidth.
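The App A arithmetic above can be captured in a small helper (the function name is illustrative):

```python
def saturation_percent(current_cpm, baseline_per_instance, instances):
    """Percent of safe bandwidth in use, per the App A example above."""
    max_bandwidth = baseline_per_instance * instances
    return 100.0 * current_cpm / max_bandwidth

# App A: 9,000 CPM safe baseline per instance, 5 instances, 30,000 real-world CPM.
pct = saturation_percent(30_000, baseline_per_instance=9_000, instances=5)
print(f"{pct:.0f}% of bandwidth used")  # 67% of bandwidth used
```

The 2,500 CPM of headroom in the example comes from the gap between the safe baseline (9,000 CPM per instance) and the point of destabilization (9,500 CPM per instance), multiplied across the 5 instances.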

Again, there’s no magic number that says, “If you hit X% bandwidth you must scale.” Every application is different and takes vastly different traffic patterns, so experiment and figure out what your magic number is. Once you do, you’ll be able to set alerts to that threshold and simplify your scaling policy.

But Wait, There’s More

“But wait!” I hear you say, “What about all of my system metrics like CPU and disk I/O? Why am I wasting resources collecting these if we’re not going to use them?”

Surprise! You are going to use them, just not how you’re used to. Gone are the days when engineers wake up in the middle of the night to respond to Nagios alerts on CPU or memory. Because the base hardware, and even the operating system, has become so abstracted from the application, we have to start thinking about our underlying systems differently. Here’s how we’re going to do it: collect everything, alert on four things, correlate everything.

Collect Everything, Correlate Everything

No metric is unimportant. Even CPU. Why? Because of metric correlation. When there is an alarm, we want to have the entire scope of the environment, not just a few key metrics. We want to know, down to the second, what each machine, each process, each container, was doing and what resources it was using so we can drill down and find the root cause. This is where system statistics are used.

We can take this “shortcut” because if there’s a problem — a real problem — on one of our app hosts, the symptoms will bubble up into one of our four key metrics. Heavy CPU load will show up as slower response times, dropped network packets will show as errors in our EPM, TCP connections not being closed will show up in just about everything. That doesn’t mean you don’t want to pay attention to them; by all means, create CPU metrics dashboards or disk I/O counters and whatnot. Just don’t you dare create alarms for them. Why?

The Annoying Killer: Pager Fatigue

Because alerts kill.

Think of alerts like caffeine: a little bit here and there will give you a jolt and send you running; too much and it will literally kill you. “Pager fatigue” is a term coined to describe the phenomenon where too many urgent alerts (or pages) stop being alerts and just become noise, emails for you to filter out of your inbox or Slack channels to mute. When there’s too much work and there are too many alarms to answer, you might as well just ignore everything — and that’s a real problem for production apps. We’ll go over alerting and responses more in depth in later parts of this series, but for now, just know that too many alerts can be a Very Bad Thing.

By keying in on these four simple metrics, you will not only reduce the number of alarms, but you will ensure that every alarm that does get triggered is important. This way, your engineers will spend more time running down actual issues instead of phantom problems.



DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.