Application performance management with Stackdriver

Colt McAnlis
Oct 17, 2018 · 3 min read

Understanding the performance of production systems is notoriously difficult. Not only is it hard to replicate the exact conditions under which the product is used in the wild, but profiling itself places additional load on the production system, which can inadvertently hurt the very performance you’re trying to measure.

This is a catch-22: if you don’t analyze code execution in production, unexpectedly resource-intensive functions increase the latency and cost of your web services every day, without anyone knowing or being able to do anything about it.

This predicament is where many of us spend our mental energy: we need to constantly perf-test in prod, but doing so hurts our overall performance.

The trick to solving this is finding a system that can profile across all of your deployments while keeping overhead low.

This is where Stackdriver Profiler comes in.

Stackdriver Profiler

Stackdriver Profiler is a statistical, low-overhead, sampling-based profiler that continuously gathers CPU usage and memory allocation information from your production application instances.

It attributes that information to the source code that generated it, helping you identify the parts of the code that are consuming the most resources, and otherwise illuminating the performance characteristics of the code.

Plus, it’s got a few other features:

  1. It analyzes code execution across all environments.

Let’s take a closer look.

Using it

Just to show a few things off, I’ve deployed a small Go application in Google Cloud Shell, which imports the `cloud.google.com/go/profiler` package and starts the profiler at the beginning of the application:
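The post originally showed this as a screenshot; here’s a minimal sketch of what that setup looks like. The service name, version, and busy-loop workload below are placeholders of mine, not the original code:

```go
package main

import (
	"log"
	"math/rand"

	"cloud.google.com/go/profiler"
)

func main() {
	// Start the profiler before doing any real work. Service and
	// ServiceVersion are placeholder values -- substitute your own.
	if err := profiler.Start(profiler.Config{
		Service:        "hello-profiler",
		ServiceVersion: "1.0.0",
		// ProjectID: "your-project-id", // only needed if it can't be auto-detected
	}); err != nil {
		log.Fatalf("failed to start profiler: %v", err)
	}

	// A toy workload so the profiler has something to sample.
	for {
		busyWork()
	}
}

// busyWork burns CPU so it shows up on the flame chart.
func busyWork() {
	x := 0.0
	for i := 0; i < 1_000_000; i++ {
		x += rand.Float64()
	}
	_ = x
}
```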

Then we just let the program run; everything else is taken care of in the background.

Next, in the Google Cloud Platform Console dashboard, go to Profiler:

This brings you to the Profiler interface.

Profiler displays this data as a flame chart, with each function’s share of the selected metric (CPU time, wall time, RAM used, contention, etc.) on the horizontal axis and the function call hierarchy on the vertical axis.

You can look at CPU time, heap usage, mutex contention, thread data, and other fun things, and the UI lets you combine and scrub this data over a nifty time range.
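One note for Go apps: CPU and heap profiles come for free, but mutex contention profiling is opt-in. Assuming the same placeholder service name as the sketch above, you’d enable it in the config:

```go
// Sketch: opting in to mutex contention profiles in the Go agent.
err := profiler.Start(profiler.Config{
	Service:        "hello-profiler", // same placeholder as before
	MutexProfiling: true,             // disabled by default
})
if err != nil {
	log.Fatalf("failed to start profiler: %v", err)
}
```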

What’s the overhead?

Stackdriver Profiler collects profiling data for 10 seconds every 1 minute from a single instance of a configured service in a single Google Compute Engine zone. If, for example, your Kubernetes Engine service runs 10 replicas of a pod, then each pod will be profiled approximately once every 10 minutes.

The overhead of the CPU and heap allocation profiling at the time of the data collection is less than 5 percent. Amortized over the execution time and across multiple replicas of a service, the overhead is commonly less than 0.5 percent, making it an affordable option for always-on profiling in production systems.
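Back-of-the-envelope, using those numbers: with a single replica, profiling is active 10 seconds out of every 60, so a 5 percent collection-time cost averages out to roughly 0.8 percent; with 10 replicas each profiled once every 10 minutes, the per-instance average drops below 0.1 percent.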

More info

To get started with Stackdriver APM, simply link the appropriate instrumentation library for each tool to your app and start gathering telemetry for analysis. Stackdriver Debugger now also supports GitHub Enterprise and GitLab, adding to the existing code mirroring functionality for GitHub, Bitbucket, and Google Cloud Source Repositories, as well as locally stored source code.
