Performance monitoring for a cloud backend

While Data Science is something we’ve covered here in the past, measuring software starts from much more basic pieces. At some point during the development of a cloud service, usually by the time you get ready for the first public deployment, you’ll need to decide how you will monitor the performance of the app and inspect it during troubleshooting. Some aspects of this will depend on the specifics of your software and business, but there are a lot of standard best practices too. Let’s start there.

First of all, you will obviously need to collect and monitor performance metrics from the servers the app is deployed on. Fortunately, this is fairly trivial in the age of cloud servers. Every major cloud platform provides virtual-machine-level metrics by default. On Amazon Web Services, this is CloudWatch, which at a basic level is a free tool, though you can pay for additional services for more detail.

Monitoring just the host level is the bare minimum. You will want both performance metrics and event log data from the app code as well, because that is the only way you can correlate usage with performance. You’ll probably also end up with more than one server instance, and you’ll want to be able to both aggregate and slice-and-dice that performance data across the whole infrastructure. For these purposes, or if you’re running on dedicated hardware rather than the cloud, read on.
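To make "aggregate and slice-and-dice" concrete, here is a minimal sketch in Python. The host names, endpoints and latency figures are made up for illustration, and a real setup would pull the samples from a metrics store rather than a list in memory.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (host, endpoint, latency in milliseconds).
samples = [
    ("web-1", "/login", 120.0),
    ("web-1", "/search", 340.0),
    ("web-2", "/login", 100.0),
    ("web-2", "/search", 500.0),
    ("web-2", "/search", 420.0),
]


def mean_latency_by(samples, key_index):
    """Group samples by one field (0 = host, 1 = endpoint) and average latency."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[2])
    return {key: mean(values) for key, values in groups.items()}


# The same raw data, viewed three ways.
overall = mean(sample[2] for sample in samples)  # whole infrastructure
by_host = mean_latency_by(samples, 0)            # sliced per server instance
by_endpoint = mean_latency_by(samples, 1)        # sliced per type of usage
```

Slicing by endpoint is only possible because the app itself reported which endpoint each sample belongs to, which is exactly what host-level metrics cannot tell you.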

Performance monitoring on the app level requires a solution in a few layers:

  1. Collecting the performance data, typically with a monitoring agent embedded into the components
  2. Aggregated storage of the metrics
  3. Visualization, e.g. dashboards and interactive reports
  4. Alerting for anomalies or thresholds being exceeded
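The four layers above can be sketched end to end in a few lines of Python. This is a toy under stated assumptions, not how any particular product works: `MetricStore`, `check_threshold`, the metric name and the threshold are all hypothetical, storage is a dictionary in memory, and "visualization" is reduced to the summary numbers a dashboard would plot.

```python
from collections import defaultdict
from statistics import mean


class MetricStore:
    """Layer 2: an in-memory stand-in for aggregated metric storage."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, name, value):
        # Layer 1: an agent embedded in a component would call this.
        self._series[name].append(value)

    def summary(self, name):
        # Layer 3: the numbers a dashboard panel would display.
        values = self._series[name]
        return {"count": len(values), "mean": mean(values), "max": max(values)}


def check_threshold(store, name, max_mean):
    """Layer 4: return an alert message if the mean exceeds the threshold."""
    stats = store.summary(name)
    if stats["mean"] > max_mean:
        return f"ALERT {name}: mean {stats['mean']:.1f} exceeds {max_mean:.1f}"
    return None


store = MetricStore()
for latency_ms in (90.0, 110.0, 400.0):
    store.record("checkout.latency_ms", latency_ms)

alert = check_threshold(store, "checkout.latency_ms", max_mean=150.0)
```

Keeping the layers behind separate functions like this is the point of the exercise: each one can be swapped for a real service without touching the others.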

Two great services that will take care of most things you’d need are New Relic and DataDog. New Relic is slightly more focused on tracing transactions deep in your app (awesome for debugging and profiling), while DataDog covers a wider range of integrations it can collect performance metrics from and offers more flexibility in the dashboards you can build from those metrics. Both are great, and you’re likely to be happy with either. These are comprehensive packages that provide tools for every one of the layers above. They can also be combined, and there are good reasons to use multiple tools, but try to keep things simple.

If you’re rolling your own, the two choices to look at for the middle layers are Grafana and Kibana. Both are very capable, but at the time I’m writing this (June 2016), Elasticsearch + Kibana has a slight edge over Grafana thanks to easier set-up for alerting. Whichever platform you go with, the host agent of choice in most circumstances will be Collectd, which not only collects host metrics itself but is also capable of aggregating app metrics. Pair it with Logstash, which does the work of collecting both metrics and log data from multiple sources and forwarding them to almost any target.

But wait, the story isn’t over! How do you instrument your code for the performance metrics? And how do you manage the alerts and their subsequent actions? Let’s tackle the latter first, because it’s really a no-brainer. PagerDuty is an awesome service, and you should not try to roll your own replacement, because there are a lot of ways it could go wrong, and you don’t want that “site is down, wake up the on-duty engineer” message to get lost. You could get away with direct alert delivery if the engineering team is just yourself, but as soon as you’re sharing that work with someone else, PagerDuty makes life much nicer. You can of course also just send your alerts straight to Slack, but you’re going to want a bit more structure sooner or later.

The code instrumentation takes a bit more to cover. First, there’s the question of what your language of choice is. I’m most familiar with the JVM-based frameworks, and for Java, Dropwizard Metrics is a pretty easy default. The same API has also been ported to Node.js. You can find similar packages for every major framework. The question of which metrics you should implement will take another post.
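To give a feel for what instrumented code looks like, here is a minimal sketch in Python using only the standard library. It is not the Dropwizard Metrics API: `SimpleRegistry`, `handle_request` and the metric names are hypothetical, chosen only to show the two most common building blocks, a counter and a timer.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SimpleRegistry:
    """Hypothetical in-process registry holding counters and timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record elapsed wall-clock seconds, even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


registry = SimpleRegistry()


def handle_request(payload):
    """Hypothetical request handler, instrumented with a counter and a timer."""
    registry.increment("requests.total")
    with registry.timer("requests.duration_s"):
        if not payload:
            registry.increment("requests.errors")
            raise ValueError("empty payload")
        return payload.upper()


result = handle_request("ping")
try:
    handle_request("")
except ValueError:
    pass  # the failure is still counted and timed
```

In a real deployment a reporter would periodically ship the contents of the registry to the collection agent, rather than leaving them in process memory.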