Continuous Profiling and Go
There is also a version of the post in Russian.
pprof is the main tool for profiling Go applications. It’s included in the Go toolchain, and over the years many handy articles have been written about it.
Enabling the pprof profiler for an existing Go application is fairly simple. All we need to do is add a single import line:
import _ "net/http/pprof"
The profiler will register its handlers with the default HTTP request multiplexer (net/http.DefaultServeMux), serving the profiling results under the “/debug/pprof” route. That’s it. One curl command and we have, for example, the results of CPU profiling:
curl -o cpu.prof "http://<server-addr>/debug/pprof/profile"
Sure, enabling pprof seems trivial. But in practice, there are lots of hidden details we should take into consideration when profiling production code.
Let’s start with the fact that we absolutely don’t want to expose “/debug” routes to the internet. Profiling with pprof doesn’t add much overhead, but being cheap doesn’t mean it’s free. A malicious actor can start sending long-running profiling requests, affecting the overall application performance. Moreover, profiling results expose details about the application’s internals, which we never want to show to strangers. We must make sure that only authorised requests can reach the profiler. We can restrict access with a reverse proxy that runs in front of the application, or we can move the pprof server out of the main server to a dedicated TCP port with different access procedures. Either way, there are ways of doing it.
But what if the business logic of the application doesn’t imply any HTTP interactions? For example, what if we are building an offline worker?
Depending on the state of the infrastructure in the company, an “unexpected” HTTP server inside the application’s process can raise questions from your systems operations department ;) The server also limits how we can scale the service. That is, processes that we clone on the same host to scale the application up would conflict when trying to open the same TCP port for a pprof server.
This is another “easy to fix” issue. We can use different ports per instance, or wrap the application into a container. But nowadays, no one will be surprised by an application that runs over a fleet of servers spread across multiple data centres. And in a very dynamic infrastructure, application instances can come and go, reflecting the workload in real time. We still need to access the pprof server inside the application, which means such a dynamic system has to provide extra service discovery mechanisms to allow a developer to find a particular instance (and its pprof server) and get the profiling data.
At the same time, depending on the peculiar nature of a company, the very ability to access something inside a production application that doesn’t directly relate to the application’s business logic can raise questions from the security department ;)) I used to work for a company with very strict internal security regulations. The only people who had access to production instances were those from production systems operations. The only way a developer could get profiling data was to open a ticket in the Ops bug tracker, describing “what command and on which cluster should be run”, “what results should be expected”, and “what should be done with the results”. Needless to say, the motivation to do production performance analysis was pretty low.
There is another common situation that a developer can stumble over. Imagine this: you open Slack in the morning and find that last night “something bad happened” to an app in production: maybe a deadlock, a memory leak, or a runtime panic. Nobody on-call had time or energy to delve deep into the problem, so they restarted the application, or rolled back the last release, leaving the rest to the morning.
Investigating such cases is a tough task. It’s great if one can reproduce the issue in the testing environment, or in an isolated part of production, where we have all the tools to get all the data we need. We can take our time delving through the collected data, figuring out what component has been causing problems.
From personal experience, understanding and reproducing the problem in testing is usually a challenge in itself, as in practice, the only artefacts that we are left with are metrics and logs. Wouldn’t it be great if we could go back in time to the point when the issue happened in production and collect all the runtime profiles? Unfortunately, to my knowledge, we can’t do that.
But, because we know that profiling with pprof is computationally cheap, what if we, knowing all the possible pitfalls, periodically collected the profiling data and stored the results somewhere outside of production?
In 2010 Google published a paper, titled “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers”, which described their approach to continuous profiling. A few years later, Google also released Stackdriver Profiler, a continuous profiling service available to everyone.
The way Stackdriver works is fairly simple: instead of a pprof server, a developer includes a “stackdriver-agent” in the application, which, using the pprof API under the hood, periodically runs the profiler and sends the results to the cloud. All the developer needs to do is go to Stackdriver’s UI, choose the instance of the application and an availability zone, and then they can analyse the application’s performance at any point in the past.
There are other SaaS companies that provide similar functionality. But for various reasons, not everyone can or wants to send internal data to a cloud outside of the company’s own infrastructure ;)))
Nothing described above is new or specific to Go. Developers in most companies where I have worked faced similar obstacles in one form or another.
At some point, I became interested in building a system similar to Stackdriver Profiler that would work with Go applications, one that could help developers overcome all the difficulties described above. Since then, as a lazy side project, I have been working on profefe, an open system for continuous profiling. The project is still in an experimental state but is ready for early testing and feedback.
For profefe, I’ve settled on the following design decisions and goals:
- The service will be deployed to the company’s own infrastructure.
- The service will be used as a part of an internal toolset. All suppliers and consumers of data are trusted, meaning we can postpone the questions of read/write authorisation and skip protecting the service from malicious usage.
- The service must not have any special assumptions about the underlying infrastructure. That is, the company can live in a cloud or run its own data centres; everything can run in containers or on bare metal, etc.
- The service should be easy to deploy and operate (I feel that, to some degree, Prometheus is a good example here).
- There is no “one size fits all” solution, and with the previous point in mind, the fewer third-party services required, the better. For example, the service will need to find a target application to get the profiling data. But requiring external service discovery and coordination systems from the start seems unreasonable.
- The service will be used to store and catalogue pprof profiling data. The following numbers should be good as a starting point: a single pprof file consumes 100 KB to 2 MB of storage (heap profiles are usually much bigger than CPU profiles). There is no need to profile an application’s instance more than N times per minute (as an example, Stackdriver’s agent takes, on average, two profiles per minute). One profiled application can be scaled to hundreds of instances: 3–300 instances is a very realistic number.
- The service will be used to query profiling data. One should expect the following types of queries: list profiling data for a given type (CPU, heap, etc), for a particular instance of the application, for a particular period of time.
- The retrieved profiling data must be viewable with existing pprof tooling.
profefe consists of the following components:
- profefe-collector: the collector of profiling results, with a simple RESTful API. The collector’s job is to receive a pprof file together with some metadata and store them in pluggable persistent storage, where they are indexed. The API provides a way to query the profiles by metadata within a time frame and to read a profile from the storage.
- profefe-agent: an optional library that a developer integrates into an application, replacing the pprof server. The agent runs in a separate goroutine, periodically collects profiling data from the application (directly using runtime/pprof), and sends the data to the collector.
Above, metadata is simply an arbitrary set of key-value pairs that describes the running instance: service name, version, DC, AZ, the host, etc.
As I’ve pointed out, the agent is an optional component. If integrating the agent into the application is impossible or undesirable, but the application already exposes a pprof server, the profiling data can be scraped with any external tool and pushed to the collector over its HTTP API.
For example, a script running as a cron job can periodically request profiling data from the application’s pprof server and send the data to the collector.
To read more about profefe’s HTTP API, have a look at the documentation on GitHub.
Some future plans
As it stands now, the only way to interact with the collector is through its HTTP API. It would be nice to build an external UI service that helps visualise the stored data.
It’s never enough to just store the data. The data needs analysis. With the profiling data collected and stored over a long period of time, we can efficiently analyse how the performance of the application changed as a result of updates to the application’s dependencies, refactoring, or changes in the underlying infrastructure. It would be interesting to build such tooling as part (or on top) of profefe.
There are other ideas and open questions that are described in wiki notes and issues on the project’s GitHub.
With the increase in momentum around the term “observability” over the last few years, there is a common misconception amongst developers that observability is exclusively about metrics, logs and tracing (the “three pillars of observability”). But the observability of a system hangs on our ability to answer all kinds of questions about the system. With metrics and tracing, we can see the system on a macro level. Logs only cover the known parts of the system. Performance profiling is yet another signal, one that uncovers the micro level of a system, and continuous profiling allows us to observe how the components of the application, or the infrastructure it runs in, have influenced and continue to influence the overall system.
Since I started working on the project, several people have told me about similar services they already use inside their companies, but the overall topic of continuous profiling is still fairly underrepresented on the internet. I’ve been collecting references, ideas and related projects, so if you’re interested in the topic, have a look at “Further reading” in profefe’s README.