Metrics, Monitoring and Alarming at Soundwave
Soundwave is a music discovery app with around 1.5 million installs spread across most countries, and therefore most time zones. At almost any point in the day, somebody somewhere is travelling to or from work with Soundwave in their pocket. For a consumer application, a single bad experience can be a shortcut to the trash can, so at Soundwave we invest heavily in ensuring our users get the experience they deserve. This post talks about metrics from a systems and operational perspective; we’ll keep the KPIs, day-14 retention and vanity metrics for another post. Articles about monitoring are typically mind-numbingly boring, so there are some tunes to ease the pain while you read. This article is certified free from statements like ‘correlation is not causation’.
If it can be measured it should be captured. Without this data we have no perspective outside our goldfish bowl. Metrics are surprisingly cheap to capture, inexpensive to gather and organise, and vital to success. They are the oft-overlooked foundation of building and running the product your users want. We use the excellent DataDog, complemented by Cloudwatch, Crashlytics and some internal tools, to ensure first and foremost that we know as much as possible about all of our applications and services. Capturing a data-point can be as simple as firing a UDP packet at a server from a bash script. For more complex things such as histograms, there is a StatsD agent within touching distance of every Soundwave component, always available to aggregate, filter and push. Every service, from our bash deployment scripts to our offline processing workers, contributes metrics of various kinds.
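For the simple counter case, that UDP fire really is a couple of lines. Here is a minimal sketch in Python; the metric names, host and port are illustrative (8125 is the conventional StatsD port), and a real deployment would use a proper StatsD client library.

```python
import socket

STATSD_HOST, STATSD_PORT = "localhost", 8125  # conventional StatsD address; adjust for your agent

def statsd_payload(name: str, value: int, metric_type: str = "c") -> bytes:
    """Format a metric in the StatsD wire protocol, e.g. b'api.signup:1|c'."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(name: str, value: int = 1, metric_type: str = "c") -> None:
    """Fire-and-forget: one UDP datagram per data-point, no ack, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_payload(name, value, metric_type), (STATSD_HOST, STATSD_PORT))

# Hypothetical usage:
# send_metric("deploy.finished")            # counter increment from a deploy script
# send_metric("api.latency_ms", 87, "ms")   # a timing sample
```

Because it is UDP, a dead or absent agent costs the sender nothing, which is what makes it safe to sprinkle data-points everywhere from bash scripts to workers.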
USER EXPERIENCE METRICS
In building anything, start from the user and work backwards. The single most valuable category of metrics is the one that describes the experience your users are having right now. If the landing page of your app takes 5 seconds to respond, your app is dead. Nothing else matters. Moreover, having an accurate aggregate view of user experience can inform decisions. Is nobody using a feature because it’s not responsive, or because it’s not the right feature? What’s the most frequently used feature? Is there a pattern of feature usage that leads to a dead end?
User experience metrics shade interpretations of other types of metrics. If I’ve maxed out CPU on 3 workers in EU-WEST-1c, the first action should always be to assess the user impact. Is some acceptable majority of our customers using Soundwave as normal? If so, go back to bed. Is the 95th percentile of your customers experiencing 10 second latencies on their profiles? Let’s page in some help. User experience metrics are your window to the eyes of your users.
Given any pool of metrics, start from your user and wade-in backwards.
Soundwave is a composition of many services that combine to help users discover what’s worth listening to. If a service function is down or working sub-optimally, it may not manifest itself directly in a user experience metric. This can be because measuring user impact is difficult: for example, how do we measure that we’ve served the wrong song clip to a user? Another danger is that an experience metric might be broadcasting a confusing signal. For example, latency on the user profile is sub 100ms, but is this only because we’re serving 10% of the data we usually serve? Service metrics add context to experience metrics. A service must be up to be functioning. Every instance of every Soundwave service sends a heartbeat to Cloudwatch. If Cloudwatch hasn’t heard from a service in some time, it triggers a no-data alarm and people get paged. Thereafter, every service must deliver some value in exchange for some cost. Service metrics paint a picture of this value vs cost trade-off.
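As a sketch of what such a heartbeat might look like (the metric name, namespace and dimensions here are hypothetical, not Soundwave’s actual scheme), a datum pushed to CloudWatch once a minute is enough to drive a no-data alarm:

```python
from datetime import datetime, timezone

def heartbeat_datum(service: str, instance_id: str) -> dict:
    """Build one CloudWatch metric datum. A value of 1 emitted every minute
    means 'this instance is alive'; a no-data alarm fires when it stops."""
    return {
        "MetricName": "Heartbeat",  # assumed metric name
        "Dimensions": [
            {"Name": "Service", "Value": service},
            {"Name": "InstanceId", "Value": instance_id},
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": 1.0,
        "Unit": "Count",
    }

# Shipping it (requires boto3 and AWS credentials; namespace is assumed):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="Soundwave/Services",
#     MetricData=[heartbeat_datum("youtube-sync", "i-0abc123")],
# )
```

The useful property is the inversion: you alarm on the *absence* of data rather than on any value, so a box that dies silently still pages someone.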
Soundwave can synchronise your viewing history from YouTube. The service that accomplishes this is ugly in its nature. It polls the YouTube APIs for a subset of our users, in batches of around 700 users every 5 minutes. About 1% of each 5 minute poll iteration delivers data that hasn’t been seen before. This is a wasteful service that mostly does nothing, which can partly be put down to the polling model. Its implementation keeps two small AWS instances fairly well loaded, 24×7. YouTube plays comprise less than 1% of the plays tracked by Soundwave. We could optimise the YouTube implementation along a number of axes, but this is likely to cost development time on top of the operational cost of a feature that our qualitative research shows our users are largely indifferent to.
In two graphs and a few metrics I’ve stated a serious case for the deprecation of the YouTube service. Service metrics allow us to paint this picture and drive broader conversations about features, development time and operational cost.
Know your service is alive. Deeply understand cost and value of all services.
Cloudwatch, by default, delivers 10 metrics per instance, sampled every 5 minutes. StatsD system monitors give me another 30 per box, some at different rates, with different types of aggregates. It’s easy to get paralysed by this noise. On a box hosting an I/O bound service I don’t really care about CPU utilisation, but the boxes that host the workers that drain my queues need to be permanently loaded. It’s tempting to ignore this level of metric completely and rely solely on service metrics. That might be a mistake.
Infrastructure metrics at their simplest can warn of impending doom. Some time ago, in an office in Rathmines, an engineer who should have known better forgot to pay attention to remaining disk space on the primary of a MongoDB replica set. This spurred a 12 hour journey into the side effects of MongoDB failure semantics. It was unpleasant. Mongo died. We were completely down and needed to restore from backup quickly. Naturally, I provisioned the biggest-ass box I could find: 32 cores of bare metal. Boom! But no. It took 8 hours. Why? Metrics showed that only 1 of 32 cores was loaded. It turns out that building indexes on MongoDB is not parallelisable. Having this metric allowed me to take a screenshot and post a snarky comment on Twitter while I waited. So that was something, at least. These days at Soundwave we have picked a handful of really useful infrastructure metrics. There is no anomaly detection, and no aggregate metrics made from hundreds of disparate components. This handful is surprisingly effective at helping us keep the lights on.
Manually band-pass the gospel from noisy infra-metrics.
There’s a running joke at Soundwave that every time a start-up gets funded on the basis of ‘having data’ a family of parent(s) refuse to vaccinate their child. Of course, metrics are just data-points until you interpret them. Metric data is the ore of your company; it requires careful manipulation to extract real value. With regard to running systems, monitoring is the manual or automated interpretation of metric data-points. Alarming is the action triggered from some interpretation. In running production systems, they will save your soul.
In an exceptional TED talk that brings boring old data to life, Hans Rosling shows that a carefully crafted visual representation of any data can quickly convey thousands of words worth of signal to a reader. The Soundwave support dashboard gives a single-page bird’s-eye view over our services. In a 30 second scan, I can check that everything is operating satisfactorily, along with what events have occurred recently. This support dashboard serves up a multitude of information in a single page, from user experience metrics to Jenkins builds, from MongoDB heartbeats to disk space. It’s a comforting thing to return to a dashboard and see it all lit up in green, operating normally. Monitoring a useful dashboard after shipping a new feature is almost as comforting as running a good unit test suite.
The graphic above shows the Soundwave bird’s-eye view dashboard. In one glance we have a nice picture of user experience metrics, some comforting rate graphs that show various functions are working away, and some vanity metrics around key features, like the number of realtime messages sent and signups. Coupled with these we have heartbeats from different flavours of service and an event stream that pulls together everything from deployments to pages. There’s a huge amount going on here, and each component has its own detailed view, but from a function-at-a-glance perspective, there is comfort in knowing even just a little of what’s going on across Soundwave.
Graphing data is the most compelling and useful noise filter.
Of course, it’s impossible to monitor dashboards constantly, at least if you want to get some useful work done. This is where alarms come in. At its most basic, an alarm triggers some action when some metric has broken some bounds for some period of time. In reality, this can reflect something minor, like a temporary blip in network connectivity, or the death of some vital application function. Understanding the difference, and alarming accordingly, is an art in itself, and one that it’s important to master. If your system is too sensitive, alarming on everything, then consumers of those alarms quickly become desensitised to them, ignoring the truly valuable ones. If there are no alarms, or the wrong alarms, adventures like our MongoDB one become much more common than the black swan. Thinking of alarms as signals that represent app function is useful in defining their importance. For example, Play Processing Down is an alarm that directly aligns with app function. The key decision is whether it requires the immediate attention of an engineer and, if so, whether to page or to email.
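The ‘broken some bounds for some period of time’ part is essentially a debounce: fire only when every sample in a sliding window is out of bounds, so a single network blip pages nobody. A minimal Python illustration (threshold, window size and the latency values are made up):

```python
from collections import deque

class ThresholdAlarm:
    """Fire only when a metric stays out of bounds for `window` consecutive
    samples, so one blip in network connectivity doesn't page anyone."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent samples

    def observe(self, value: float) -> bool:
        """Record a sample; return True if the alarm should now trigger."""
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))

# Latency must exceed 1000 ms for 3 consecutive samples before we alarm.
alarm = ThresholdAlarm(threshold=1000, window=3)
results = [alarm.observe(v) for v in [1200, 300, 1100, 1300, 1500]]
# The lone 1200 ms blip never fires; only the sustained run at the end does.
```

Hosted systems like Cloudwatch express the same idea as ‘N datapoints above threshold within M evaluation periods’; the mechanics are the same.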
Alarming at just the right fidelity is doubly important when your app needs to be up 24/7. A page that wakes an engineer at 2am subtracts the whole of the following day’s productivity from that engineer. Enough of those in a short period and your engineering team turns into zombies that really do want to kill you. Consider the example alarm Timeline Feed is unacceptably slow. It tends to fire around once every two weeks, when the DB is busy processing overnight reports. Almost every time, it is explained by load on the database and is temporary; most of the time, the alarm had cleared by the time an engineer came to intervene. Having it page overnight was therefore pointless. We are prepared to accept the chance of a minor loss of user experience for the sake of our engineers’ productivity. That energy is better spent fixing the root cause of the problem at source.
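The page-or-email decision for a known-transient alarm like this can be written down as a tiny routing rule. A hypothetical sketch, where the alarm names, severity levels and quiet hours are all invented for illustration:

```python
from datetime import time

# Alarms we know resolve themselves (e.g. overnight report load on the DB).
KNOWN_TRANSIENT = {"timeline_feed_slow"}
NIGHT_START, NIGHT_END = time(22, 0), time(7, 0)  # assumed quiet hours

def is_night(now: time) -> bool:
    return now >= NIGHT_START or now < NIGHT_END

def route(alarm_name: str, severity: str, now: time) -> str:
    """Decide whether an alarm pages an engineer or just sends an email."""
    if severity == "critical" and alarm_name not in KNOWN_TRANSIENT:
        return "page"   # app function is down: always wake someone
    if alarm_name in KNOWN_TRANSIENT and is_night(now):
        return "email"  # let it self-resolve; review in the morning
    return "page" if severity == "critical" else "email"
```

Making the rule explicit also makes it reviewable: the known-transient list is a standing reminder of root causes that still need fixing.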
Proactively review the value, severity level and frequency of all alarms. In satisfying your users, don’t burn through your engineers.