Stackdriver Debugger Safety Features
I’ve been playing around with Stackdriver Debugger lately. It’s a very useful tool for trying to understand those bugs that only seem to happen in production. Stackdriver Debugger can debug your application while it’s running on your own hardware, on Google Cloud Platform, or on another cloud provider. Today, it’s generally available for Java and Python. It’s in beta for Ruby, Node.js, and Go. It’s in alpha for PHP. .NET support is coming soon. If you’d like to join the Early Access Program for .NET, leave a comment.
Of course, my first fear when using a debugger in production is, “Will this slow my program down?” After talking to the product manager for Stackdriver Debugger, my fears have been allayed. He shared a lot of technical details that explain why Debugger’s performance and stability impact is absolutely minimal.
Stackdriver Debugger is a powerful tool that allows developers to effectively set breakpoints and add log statements to production applications, without impacting a service’s end user experience. Ensuring that Debugger does not affect a production service’s stability or latency is part of what makes Stackdriver Debugger unique, and we’ve made major investments to ensure that this remains true.
We also use Debugger inside of Google, meaning that it has to be able to perform at extreme scale. These same improvements that we’ve made for Debugger to work on Google services like DoubleClick are present in Stackdriver Debugger.
While this document focuses on the features present in the Java debugger agent, we have similar capabilities across most languages.
Guaranteeing Fast Performance
Firstly, it’s important to note that while snapshots created in Stackdriver Debugger are very similar to breakpoints, they do not halt application execution. Rather, the application state (variables and call stack) is captured and execution is allowed to proceed unimpeded.
However, an individual snapshot does impart a small (on the order of milliseconds) delay during capture. To ensure that this performance penalty is minimized across multiple instances of the same service, Debugger deletes a snapshot instruction across all instances of your application as soon as it has triggered on any of them. Additionally, the data collected from a snapshot is capped at 64KB, in order to reduce the impact of capturing a snapshot. If an important variable is not captured in the 64KB snapshot, you can prioritize it through the use of an expression.
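To make the 64KB cap and the role of expressions concrete, here is a minimal Python sketch of a size-capped capture. It is not the Debugger agent’s code: the function name, the dict-based inputs, and the idea of measuring size as the length of repr() are all assumptions made for illustration. The only thing it takes from the text above is the budget and the rule that expressions are captured before ordinary variables.

```python
SNAPSHOT_BUDGET_BYTES = 64 * 1024  # the cap described above


def capture_snapshot(expressions, local_vars, budget=SNAPSHOT_BUDGET_BYTES):
    """Capture expression results first, then locals, until the budget runs out.

    Both arguments are dicts of name -> value (a hypothetical input shape).
    Size is approximated as the UTF-8 length of the name plus repr(value);
    a real agent would measure its own serialized format instead.
    """
    captured = {}
    omitted = []
    remaining = budget
    # Expressions go first so the values the user asked for are not crowded out.
    for source in (expressions, local_vars):
        for name, value in source.items():
            if name in captured:
                continue
            text = repr(value)
            cost = len(name.encode("utf-8")) + len(text.encode("utf-8"))
            if cost <= remaining:
                captured[name] = text
                remaining -= cost
            else:
                omitted.append(name)
    return captured, omitted
```

With a 70KB string in a local called buffer, the sketch drops buffer itself but still keeps an expression such as len(buffer), which is the kind of prioritization the paragraph above describes.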
Snapshot conditions also have the potential to slow down your application’s execution, as they must be evaluated each time the snapshotted line of code is run. To combat this, the Debugger agents monitor their own CPU consumption, and will automatically remove snapshots if they ever use more than 1% of CPU time on a particular instance of your application. If this occurs, you’ll receive a message in the Debugger UI and the affected snapshots will be removed across all instances of your application.
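The self-monitoring idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the agent is implemented: the class name is made up, and it compares time spent evaluating conditions against elapsed time on an injectable clock, which is only a stand-in for the CPU-time accounting the agents actually do. The 1% threshold is the one quoted above.

```python
import time


class ConditionBudget:
    """Tracks time spent evaluating snapshot conditions against elapsed time."""

    def __init__(self, max_fraction=0.01, clock=time.perf_counter):
        self.max_fraction = max_fraction  # 1%, per the text above
        self.clock = clock                # injectable so the sketch is testable
        self.started = clock()
        self.spent = 0.0
        self.disabled = False

    def evaluate(self, condition):
        """Run a zero-argument condition unless the budget is already exceeded."""
        if self.disabled:
            return False
        begin = self.clock()
        try:
            result = bool(condition())
        finally:
            self.spent += self.clock() - begin
        elapsed = self.clock() - self.started
        if elapsed > 0 and self.spent / elapsed > self.max_fraction:
            # A real agent would also report this and remove the snapshot elsewhere.
            self.disabled = True
        return result
```

Once disabled is set, the sketch stops evaluating the condition entirely, so an expensive condition cannot keep consuming time on that instance.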
Logpoints also feature similar protections, though it’s important to keep in mind that logpoints don’t introduce any appreciable overhead versus a log statement included in an application’s original source code. However, as logpoints skip the normal test and release cycle, it’s important for us to guarantee that services won’t be negatively affected by adding them, which is why they’re capped at 50 log statements and 20KB of logged data per second per instance.
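A per-instance cap like this is easy to picture as a small rate limiter. The Python sketch below is one possible shape for it, not the agent’s actual logic: the class name, the fixed one-second window, and reading 20KB as 20 * 1024 bytes are all assumptions. Only the two limits come from the text above.

```python
import time


class LogpointLimiter:
    """Allows at most max_logs messages and max_bytes of text per one-second window."""

    def __init__(self, max_logs=50, max_bytes=20 * 1024, clock=time.monotonic):
        self.max_logs = max_logs
        self.max_bytes = max_bytes  # assumes 1KB = 1024 bytes
        self.clock = clock          # injectable so the sketch is testable
        self.window_start = clock()
        self.logs = 0
        self.bytes = 0

    def allow(self, message):
        """Return True if the message fits in the current window's quota."""
        now = self.clock()
        if now - self.window_start >= 1.0:
            # Start a fresh window and reset both counters.
            self.window_start = now
            self.logs = 0
            self.bytes = 0
        size = len(message.encode("utf-8"))
        if self.logs + 1 > self.max_logs or self.bytes + size > self.max_bytes:
            return False
        self.logs += 1
        self.bytes += size
        return True
```

A logpoint on a hot code path would call allow() before emitting anything, so the 51st message in a given second is simply dropped rather than flooding the logs.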
Preserving Application State
While not impacting application performance is an incredibly important goal, Debugger also guarantees that it won’t affect an application’s state during execution. If you use Java, this is particularly important, as we allow Java developers to use methods within conditions and expressions.
To guarantee this, we created a custom Java interpreter that executes and analyzes the functions included in a condition or expression and determines whether they have any potential side effects. For example, modifying a static member of a class has a side effect and would be rejected, but creating or modifying a temporary variable is side-effect free and can be used within a condition or expression. The application classes referenced in a condition or expression are not loaded into the JVM, but loaded into this interpreter instead, which further eliminates side effects related to class loading.
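The real check happens inside a Java interpreter, which I can’t reproduce here, but the underlying idea (accept writes to temporaries, reject anything that could touch shared state) can be shown with a toy Python analogy. Everything in this sketch is an assumption for illustration: it inspects Python syntax trees instead of Java bytecode, the allowlist of "pure" functions is made up, and unlike the interpreter described above it does not look inside the functions being called; it simply rejects any call it does not recognize.

```python
import ast

# Hypothetical allowlist of calls treated as side-effect free.
PURE_BUILTINS = {"len", "abs", "min", "max", "str", "int"}


def find_side_effects(source):
    """Return a list of reasons why `source` might change application state."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Attribute, ast.Subscript)) and isinstance(
            node.ctx, (ast.Store, ast.Del)
        ):
            # Writing to obj.attr or obj[key] changes state outside the expression.
            problems.append(f"line {node.lineno}: writes to shared object state")
        elif isinstance(node, (ast.Global, ast.Nonlocal)):
            problems.append(f"line {node.lineno}: rebinds a non-local name")
        elif isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in PURE_BUILTINS):
                # Unknown callee: reject conservatively instead of analyzing it.
                problems.append(f"line {node.lineno}: calls an unanalyzed function")
    return problems
```

In this analogy, a snippet like tmp = len(items) + 1 passes because it only touches a temporary, while Config.retries = 3 is flagged, which mirrors the static-member example above.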
Conclusion
At Google, we have to consider the performance and reliability impact of any instrumentation that we add to our code base. As Debugger is used daily on production Google services, we had to ensure that its impact is as minimal as possible.
Stackdriver Debugger users get to take advantage of these same innovations, regardless of whether their applications are deployed to Kubernetes, VMs, or Google App Engine. As with the rest of the Stackdriver APM suite, Stackdriver Debugger can target applications hosted anywhere. Give it a try, and let me know what you think!