Published in Capital One Tech

The Mon-ifesto Part 1: Metrics

A 3-Part Guide to Better Application Monitoring

But Wait…

First, we should talk about where metrics come from. Application metrics can be generated by application performance monitoring (APM) agents and from application logs. APM agents integrate with your application’s internals and track metrics like response time, system resource usage, uptime, and much more. Typically, APM agents only track a pre-defined set of metrics, so if you want custom metrics, app logs are a good place to start. Because you control what gets logged and how logs are aggregated and parsed, tracking custom metrics through logs is usually trivial compared to setting up a third metric-gathering method.
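As a rough illustration of pulling a custom metric out of logs, here is a minimal sketch. The log format, the `event=` field, and the event names are all hypothetical, invented for this example; your own log layout and aggregation pipeline will differ.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
#   "<timestamp> <LEVEL> event=<name> <other fields>"
LOG_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) event=(?P<event>\w+)")


def count_events(lines):
    """Count how often each event name appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines that do not fit the format are skipped
            counts[match.group("event")] += 1
    return counts


# Made-up sample data, not real application logs.
sample_logs = [
    "2024-01-01T12:00:01Z INFO event=checkout_completed order=1001",
    "2024-01-01T12:00:02Z INFO event=checkout_completed order=1002",
    "2024-01-01T12:00:03Z WARN event=coupon_rejected code=SAVE10",
    "malformed line that the parser skips",
]
custom_metrics = count_events(sample_logs)
```

Because you own the log format, adding a new custom metric can be as small as logging one more `event=` name.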

  • Operations metrics are things like client connections, CPU load, error rates, etc. These metrics don’t inform the actual logic of the application per se, but they do speak to the continued function of the application.

1. Calls Per Minute

When talking about an application’s size, Calls Per Minute (CPM) is the yardstick by which all apps are measured. Why? Because things like server count or packet I/O can vary from app to app depending on code efficiency and the like. CPM shows exactly how much traffic a single application can take, a metric that is often used for capacity planning.
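If your tooling does not already report CPM, one way to derive it is to bucket request timestamps by minute. The sketch below assumes you can get a list of request timestamps (from logs, for example); the function name and sample data are illustrative only.

```python
from collections import Counter
from datetime import datetime


def calls_per_minute(timestamps):
    """Bucket request timestamps by minute and count the calls in each bucket."""
    buckets = Counter()
    for ts in timestamps:
        minute = ts.replace(second=0, microsecond=0)  # truncate to the minute
        buckets[minute] += 1
    return dict(buckets)


# Made-up request times: two calls in 12:00, one call in 12:01.
requests = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
]
cpm = calls_per_minute(requests)
peak_cpm = max(cpm.values())  # the busiest minute, useful for capacity planning
```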

2. Error Rate (EPM)

Similar to CPM, Errors Per Minute (EPM, or error rate) measures the number of errors your application throws in a given minute. Error rate differs from CPM in a few very important ways.

  1. Reporting EPM in an absolute value is pretty much useless. A much more useful metric is the percentage of incoming app calls that are errors.
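To see why the percentage is more telling than the raw count, consider this small sketch. The function name and the traffic numbers are made up for illustration.

```python
def error_percentage(errors, calls):
    """Return errors as a percentage of total calls in the same time window."""
    if calls == 0:
        return 0.0  # no traffic means there is nothing to report
    return 100.0 * errors / calls


# The same absolute EPM (50) tells two very different stories.
quiet_app = error_percentage(errors=50, calls=100)      # half of all calls fail
busy_app = error_percentage(errors=50, calls=100_000)   # a tiny fraction fails
```

Fifty errors per minute is a crisis for the first app and background noise for the second, which is exactly what the absolute number hides.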

3. Response Time (RT)

It could be argued that response time is the most important metric that you can collect from an application. Not only does response time indicate overall application health, but it also gives insight into customer experience and can have an impact on not just your app, but also your business or company as a whole. For companies that rely heavily on NPS, like telecommunications companies, slow response times could mean the difference between a good quarter and a bad one.

4. Bandwidth Saturation (BS)

In this context “bandwidth” refers to the maximum load your application can take, not just network bandwidth (although that certainly plays a part in the overall app capacity). Bandwidth saturation (or just saturation) is a little different from the first three metrics as it requires you to perform load testing on your application in order to establish some “up front” data. When load testing your app, there are a few things you want to do:

  • Test the smallest unit of your application in its environment. For a microservices app this would mean a single copy of each microservice with all of its routing and discovery and load balancing in place; for a monolith, this is a single instance of the app with its load balancing and everything else in place. This test will inform you if there are any bottlenecks in the environment outside of the application by comparing the performance data between your first test and this one.
  • Review your data and figure out what your realistic maximum CPM is; this will be your baseline for extrapolating and testing your ability to scale.
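Once load testing has given you a realistic maximum CPM, saturation is simply current traffic as a fraction of that maximum. The sketch below assumes illustrative numbers and a naive linear extrapolation; real systems rarely scale perfectly linearly, so treat the projection as a starting point to verify with further testing.

```python
def saturation(current_cpm, max_cpm):
    """Return current load as a fraction of the load-tested maximum CPM."""
    if max_cpm <= 0:
        raise ValueError("max_cpm must be positive")
    return current_cpm / max_cpm


def projected_max_cpm(per_unit_max_cpm, units):
    """Naive linear extrapolation (an assumption): capacity grows with unit count."""
    return per_unit_max_cpm * units


# Hypothetical numbers: load testing showed one unit handles 6,000 CPM.
single_unit_max = 6000
current_load = saturation(current_cpm=4500, max_cpm=single_unit_max)
four_unit_estimate = projected_max_cpm(single_unit_max, units=4)
```

With these made-up numbers the app is running at 75% of its tested capacity, and four units would be expected to handle roughly 24,000 CPM if scaling were linear.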

But Wait, There’s More

“But wait!” I hear you say, “What about all of my system metrics like CPU and disk I/O? Why am I wasting resources collecting these if we’re not going to use them?”

Collect Everything, Correlate Everything

No metric is unimportant. Even CPU. Why? Because of metric correlation. When there is an alarm, we want to have the entire scope of the environment, not just a few key metrics. We want to know, down to the second, what each machine, each process, each container, was doing and what resources it was using so we can drill down and find the root cause. This is where system statistics are used.
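As one way to picture metric correlation, the sketch below gathers every metric's samples within a time window around an alarm so they can be viewed side by side. The metric names, sample values, and in-memory data layout are assumptions for illustration; in practice this query would run against your metrics store.

```python
from datetime import datetime, timedelta


def samples_near_alarm(metrics, alarm_time, window_seconds=30):
    """Return, per metric, the (timestamp, value) samples close to the alarm."""
    window = timedelta(seconds=window_seconds)
    snapshot = {}
    for name, samples in metrics.items():
        snapshot[name] = [
            (ts, value)
            for ts, value in samples
            if abs(ts - alarm_time) <= window
        ]
    return snapshot


# Made-up samples: metric name -> list of (timestamp, value).
metrics = {
    "cpu_percent": [
        (datetime(2024, 1, 1, 11, 59, 0), 35.0),
        (datetime(2024, 1, 1, 12, 0, 25), 97.0),
        (datetime(2024, 1, 1, 12, 2, 0), 40.0),
    ],
    "response_time_ms": [
        (datetime(2024, 1, 1, 12, 0, 28), 2300.0),
    ],
}
alarm = datetime(2024, 1, 1, 12, 0, 30)
snapshot = samples_near_alarm(metrics, alarm)
```

Here the CPU spike and the slow response land in the same window, which is the kind of side-by-side view that points toward a root cause.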

The Annoying Killer: Pager Fatigue

Because alerts kill.

The low down on our high tech from the engineering experts at Capital One. Learn about the solutions, ideas and stories driving our tech transformation.

Peter Christian Fraedrich

Entrepreneur, software developer, writer, musician, amateur luthier, husband, dad. All opinions are my own.