A Million Metrics per Second

Drew Boswell
Swissquote Tech Blog
May 30, 2017 · 4 min read

Introduction

Less than two years ago, the number of deliverables at Swissquote started exploding, gifting IT Operations with the headache of monitoring an ever-growing fleet of hardware, virtual machines and applications.

Traditional tooling like Nagios, PNP4Nagios and application-generated email reporting started to become unwieldy. Aside from the performance bottlenecks of scaling monitoring for all applications and servers, in hindsight we just didn't have enough data.

Today we count 1000+ webapps, microservices, batches and daemons; what follows is part one of how we survived.

Never Bring a Water Gun to a Firefight

Enter the inevitable: there's an outage.

The board is red. Hundreds of incoming alerts, all status:critical. Java Log4j FATAL emails have DDoSed and taken out our Exchange Server. Seemingly unrelated applications are down. Databases are drowning in connections.

Those who have experienced the stress and adrenaline of multi-hour outages, where the CTO and CFO have stopped calling because they are standing behind you, know what I'm talking about. There is a special sinking feeling in your gut when you realize:

We have no idea where it’s coming from

Until our first major outages of this type, we monitored with the usual suspects:

  • Process-up checks
  • TCP/network checks
  • Basic HTTP checks with pattern matching
  • Log pattern matching (the classics: ERROR|FATAL|WARN etc.)
  • Applications monitored with a healthy mix of default and on-demand combinations of the above

In a complex ecosystem, having a single dimension of checks just isn’t enough. How does one application impact the other? Was there a trend we missed? What is normal? We were completely unequipped to handle the situation gracefully.

Establish a Baseline

Our first step in the right direction was realizing that we needed to collect as much data as possible on running processes over time. We are a 99% Java company, so we determined that the quickest way to collect was to scrape MBean metrics over JMX and dump them into a time-series database: Graphite.
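To give an idea of how little is needed to get started, here is a minimal, hand-rolled sketch in Java: it reads a couple of platform MBeans and pushes them to Carbon, Graphite's ingestion daemon, over its plaintext protocol. The graphite.example.com host, port 2003 and the apps.myapp.jvm prefix are illustrative assumptions, not a description of our production pipeline.

    import java.io.PrintWriter;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;
    import java.net.Socket;
    import java.time.Instant;

    // Minimal sketch: read two platform MBeans and send them to Carbon
    // using Graphite's plaintext protocol: "<metric.path> <value> <timestamp>\n".
    // Host, port and metric prefix below are hypothetical.
    public class JvmToGraphite {
        public static void main(String[] args) throws Exception {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            long now = Instant.now().getEpochSecond();
            String prefix = "apps.myapp.jvm";

            try (Socket socket = new Socket("graphite.example.com", 2003);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.printf("%s.memory.heap.used %d %d\n",
                        prefix, memory.getHeapMemoryUsage().getUsed(), now);
                out.printf("%s.threads.count %d %d\n",
                        prefix, threads.getThreadCount(), now);
            }
        }
    }

In practice collection runs on a schedule against each application's JMX endpoint rather than one-shot like this, but the wire format really is that simple.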

10,000 metrics/second
At first we collected the built-in Java metrics: memory, threading, garbage collection.

30,000 metrics/second
Then we added the Apache Tomcat and Caucho Resin MBeans of interest.

50,000 metrics/second
Realizing that we had no reason to hold back, we activated scraping of all JVM metrics for tuning and trend analysis.

100,000 metrics/second
Collection started on Netflix Hystrix, the heavily used latency and fault-tolerance library for our microservices.

During outages we are now able to reach conclusions based on hard data!

Visualizing the Data

Incident response efficiency comes down to using knowledge to eliminate possibilities from most to least likely. Saving that diagnostic knowledge of what tends to break, and how it shows up, as metrics dashboards lets you identify and discard suspects more quickly under fire. We started with the following:

  1. List the most common offenders
  • Java JVM memory and garbage collection
  • Database Connection Pool Exhaustion
  • Microservice dependency timeout snowball
  • Thread Leaks

  2. Profile each offender and identify the metrics useful for diagnosing it.

  3. Create a Grafana dashboard for each offender from the identified metrics (see the sketch below).
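As an illustration of what ends up on such a board, a connection-pool panel might graph the active count against the configured maximum with Graphite targets along these lines (the metric paths are hypothetical; your MBean naming dictates the real ones):

    alias(apps.myapp.datasource.pool.active, "active connections")
    alias(apps.myapp.datasource.pool.max, "pool max")

When the two lines converge, you are looking at pool exhaustion long before the stack traces arrive.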

Data Overload

We have metrics, we have dashboards, in Data we trust; collect it all!

500,000 metrics/second
MBean collection expanded to all applications: Java, Spring, database, caching, Hystrix and Swissquote-specific metrics.

When you have so much data, it becomes tricky to keep visualization simple. It is easy to fall into the trap of making one dashboard that rules them all. Don't. In the end there is just too much noise and your browser will explode. Our advice is to keep your dashboards targeted: if you have to scroll, you should probably refactor!

Grafana offers a lot of abstractions, like templating and variables, which can help you refactor your boards, analyse trends and correlate completely different metrics. Seriously, read the manual.
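For example, once you define a Grafana template variable such as $application, a single templated panel query can serve every deployable (the metric path here is again an assumption about naming):

    aliasByNode(apps.$application.jvm.memory.heap.used, 1)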

Eliminating Noise

With all the functions and data available to you, dashboards are not the only benefit: you can (and should) set alerts on your trend metrics for a more proactive approach. Chances are you'll be able to cut back on noisy mark-as-read-style alerts and fight alert fatigue.

Graphite itself ships with some very interesting statistical functions for eliminating noise and catching problems that develop over time, such as:

Linear Regression
Classic example: calculate time until disk full
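With a recent graphite-web this is close to a one-liner; the collectd-style disk path below is an assumption about naming, not our real hierarchy:

    linearRegression(collectd.web-01.df-root.df_complex-free, "-7d")

Plotted next to the raw series, the fitted line makes the number of days left before zero obvious at a glance.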

Exponential Moving Average
A moving average weighted toward recent points, so it reacts faster to recent change
example: RAM usage trend
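Assuming the same hypothetical naming, smoothing that RAM series looks like:

    exponentialMovingAverage(collectd.web-01.memory.memory-used, 50)

Here 50 is the window size in datapoints; newer graphite-web releases also accept a time string such as "10min".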

Holt-Winters Aberration
Detects abnormal values by learning the “known” patterns over time and filtering them out
example: remove periodic/expected network spikes from alerts
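The aberration function follows the same pattern (path again hypothetical); alerting only when its result strays from zero is usually enough:

    holtWintersAberration(collectd.web-01.interface-eth0.if_octets.rx)

It returns how far the series has wandered outside the predicted confidence bands, so the expected periodic spikes stop paging anyone.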

Taming the Beast

Today we spend more time tackling potential problems than firefighting active outages, and our platform is significantly more stable than it was two years ago.

But when the inevitable does happen, and some new horror emerges from the complexity, we have the tools ready to zero in, analyze and destroy it.

If data has been one of the largest contributors to our situational awareness, then why on earth would we stop at a cool million? We're coming for you, data…

This is only the beginning.

Peaks of 1,100,000 metrics/second
Activated collectd on httpd, nginx, system, disks, networks, Redis, etc.

Swissquote Bank SA, IT.Operations, Graphite Carbon-Relay Ingestion rate, 2017–05–25
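Rolling collectd out to those hosts is mostly a matter of configuration; a trimmed sketch of the relevant collectd.conf section is below (the relay hostname is illustrative, not our real endpoint):

    LoadPlugin cpu
    LoadPlugin df
    LoadPlugin interface
    LoadPlugin memory
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "graphite">
        Host "graphite.example.com"
        Port "2003"
        Protocol "tcp"
        Prefix "collectd."
      </Node>
    </Plugin>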

Header Image: https://pixabay.com
