Tracing Google Cloud
We think it should be easier to debug our platform. We think latency is a critical signal of whether a service is running as expected. We think it should be easier to understand service topologies. We think it should be easy to follow all significant events and requests in the lifetime of a user request, end-to-end. Our users agree.
A common need on Google Cloud Platform is being able to debug production services more effectively. When there are many moving parts involved, it can be hard to pinpoint the root causes of outages.
A while ago, we introduced Stackdriver Trace. Stackdriver Trace lets us analyze a user request end-to-end and trace every internal request until the user gets a response. We have also recently open sourced part of our instrumentation stack at Google, called OpenCensus. OpenCensus is designed not to be tied to any Google technology: it can upload metrics and tracing data to any provider or open source tool. For us, portability of user instrumentation is a fundamental goal.
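As a sketch of what that portability looks like in Go (import paths may differ by OpenCensus version, and the project ID is a placeholder), registering an exporter is the only backend-specific step; instrumentation code does not change when you switch vendors:

import (
	"log"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/trace"
)

// Register a Stackdriver exporter; registering a different vendor's
// exporter here is the only change needed to switch backends.
exporter, err := stackdriver.NewExporter(stackdriver.Options{
	ProjectID: "your-project-id", // placeholder project ID
})
if err != nil {
	log.Fatal(err)
}
trace.RegisterExporter(exporter)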
Releasing these tools helps our ecosystem, but we think we can do more. We recently started a few initiatives to provide more compelling and useful data from our platform out of the box, so our users can see and benefit from tracing data without any extra work. The overall goal is to provide all the traces the platform can generate and to let users participate with custom instrumentation:
- Envoy/Istio generates HTTP and gRPC traces automatically.
- For users who prefer not to run a sidecar, OpenCensus framework integrations for HTTP and gRPC generate the same traces automatically (see the sketch after this list).
- Our upcoming serverless building blocks are instrumented with OpenCensus and allow users to participate in the trace using the same libraries.
- Google Cloud Libraries are instrumented with OpenCensus and can participate in your existing traces.
- Frameworks designed and built for portability, like Go Cloud, use our portable instrumentation stack to collect and report traces.
- We are looking for ways to trace our data pipeline products to give you more visibility into the internals of our data processing infrastructure.
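A minimal sketch of the HTTP server integration mentioned above, assuming the go.opencensus.io/plugin/ochttp package; wrapping the mux is all it takes to get a span per incoming request:

import (
	"log"
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

mux := http.NewServeMux()
// Handlers registered on mux receive a request context that already
// carries the incoming span, so downstream calls join the same trace.

// ochttp.Handler starts (or continues) a trace for every incoming request.
log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))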
Visibility at the platform level helps our users debug and diagnose unexpected events quickly. Another advantage is that it becomes easier to tell whether an outage is rooted in the platform or in user code. User traces can be traced internally, which gives our engineers an easy way to react to customer problems and to analyze their impact on our infrastructure.
Cloud Client Libraries do their best to surface the underlying details. In the handler below, the code applies a few insertions with the Spanner client. For most of our users, Apply is a black box until they see the traces and the entire dance required to commit a transaction.
import "cloud.google.com/go/spanner"

mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	_, err := spannerClient.Apply(r.Context(), []*spanner.Mutation{
		spanner.Insert("Users",
			[]string{"name", "email"},
			[]interface{}{"alice", "a@example.com"}),
	})
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Handler code.
})
The beauty of these integrations is that you only pass the current context around, and the libraries do the rest to keep the trace going.
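Outgoing requests follow the same pattern. As a rough sketch (assuming the ochttp plugin and a hypothetical backend URL), a handler can call another service and stay in the same trace just by passing its context along:

import (
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

// A client whose transport records each outgoing request as a span.
var client = &http.Client{Transport: &ochttp.Transport{}}

mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	// Attaching the incoming request's context makes the outgoing call a
	// child span of this handler's trace.
	req, _ := http.NewRequest("GET", "https://backend.example.com/render", nil) // hypothetical backend URL
	resp, err := client.Do(req.WithContext(r.Context()))
	if err != nil {
		// TODO: Handle error.
		return
	}
	defer resp.Body.Close()
	// TODO: Handler code.
})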
If you want to add custom instrumentation, that is fine too. Simply use the tracing library with the current context and you immediately participate in the current trace.
import "go.opencensus.io/trace"

mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	_, span := trace.StartSpan(r.Context(), "image_processing")
	// TODO: Do image processing and add annotations.
	span.End()
})
The image_processing span then appears in the handler's trace, along with the user's custom annotations.
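As a small sketch of what the TODO above could contain (the attribute and annotation values here are made up), annotations and attributes are added through the same span handle:

_, span := trace.StartSpan(r.Context(), "image_processing")
// Attributes and annotations appear on the span in the trace viewer.
span.AddAttributes(trace.StringAttribute("image/format", "jpeg")) // hypothetical attribute
span.Annotate(nil, "resize finished")                             // hypothetical annotation
span.End()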
We do our best to gather data from language runtimes and the underlying components we depend on to enrich our tracing data. For example, Go requests can be annotated automatically with networking events.
Consider a cold start of a server that makes a request as soon as it handles `/`. You get very precise data on every low-level networking event that occurred during the life of the outgoing request.
A second request to the same endpoint on the same server gives a slightly different picture: DNS resolution is entirely omitted because the DNS result is cached from the earlier request.
Data at this level of granularity helps us and our users debug unexpected networking events. We want to extend these capabilities as far as the layers we depend on allow.
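A rough sketch of how these network-level annotations can be enabled for outgoing requests, assuming the ochttp plugin's httptrace integration:

import (
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

// NewSpanAnnotatingClientTrace hooks net/http/httptrace events (DNS lookup,
// connection, TLS handshake, first byte) and records them on the span.
var client = &http.Client{
	Transport: &ochttp.Transport{
		NewClientTrace: ochttp.NewSpanAnnotatingClientTrace,
	},
}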
Next steps
We want to make this smooth experience the default: the platform provides traces, and users participate in them simply by using the current context.
We want to utilize more sources of information and work on better strategies to visualize the collected data.
We are significantly improving the Stackdriver Trace UX. Feel free to chime in here, or contact me at jbd@google.com if you prefer to do it privately.
We want to build better connections between monitoring data and traces, traces and profiles, traces and logs, and traces and errors. We want you to be able to analyze the topologies of your systems by looking at trace data. See a review of these features on the GCP blog.
We want better intermediate tooling to capture 95th- and 99th-percentile cases, making sure traces are always generated for them even if you prefer to downsample aggressively.
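For reference, a minimal sketch of how downsampling is typically configured today with OpenCensus (the 1% rate is arbitrary); tail-latency requests are only captured if they happen to be sampled, which is the gap we want to close:

import "go.opencensus.io/trace"

// Keep roughly 1% of traces. A slow 99th-percentile request is only
// recorded if it falls into that 1%.
trace.ApplyConfig(trace.Config{
	DefaultSampler: trace.ProbabilitySampler(0.01),
})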
We want our users to have the flexibility to upload their data to their choice of vendor or tool, not just to Stackdriver.
We want our users to have deep visibility into their own systems and into the systems they depend on, including ours.