Google’s Approach to Observability

Jaana Dogan
4 min read · Nov 12, 2017

The progression of microservices in the industry reminds me of the way microservices progressed at Google. First, a common container format. Then, a way to express complicated systems in terms of containers. Tools to deploy them and services to schedule them. Core networking services to support the complicated networking requirements of our very large systems with complex dependencies. Then, observability: collecting diagnostics data all across the stack to identify and debug production problems, and also to provide critical signals about usage to our highly adaptive and scalable environment.

A significant core component of Google’s story was the instrumentation of our services and the collection of diagnostics data. We have been instrumenting microservices for 10 years, and we learned a lot along the way in terms of best practices, good patterns, UX, performance gotchas, and security.

A cross-stack framework

At Google, an average service is likely to depend on tens of other services at any given time. This makes it challenging to figure out the originator of problems at the lower ends of the stack.

For example, there are layers and layers of software and services before a request hits the low-level storage service that every persistent service depends on. Instrumentation at the storage layer is not very valuable if it is not recorded with enough context to reveal the originator of the problem.

We have a concept called tags. Tags are arbitrary key-value pairs we propagate all across the stack. Tags are propagated from the top to the very bottom, and each layer can add more to the context. Tags often carry the originator library name, the originator RPC name, and so on. Once we retrieve instrumentation data from the low-end services, we can easily filter it and point out which specific services, libraries, or RPCs contributed to the state of things.

A high-level service, such as the Google Analytics frontend server, can tag its outgoing RPCs with originator:analytics. The tag is propagated all the way down to the very low-level blob storage service. This allows you to see the impact of your service on other services.
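To make this concrete, here is a minimal sketch of the idea in Go. The WithTag and Tags helpers are hypothetical stand-ins; a real implementation would also serialize the tags into RPC metadata so they survive process boundaries.

```go
package main

import (
	"context"
	"fmt"
)

// tagsKey is an unexported context key for the tag map. This is an
// in-process sketch only; the real system also serializes tags into
// RPC metadata so they cross process boundaries.
type tagsKey struct{}

// WithTag returns a context carrying the given key-value tag in
// addition to any tags already present.
func WithTag(ctx context.Context, key, value string) context.Context {
	tags := map[string]string{}
	for k, v := range Tags(ctx) {
		tags[k] = v
	}
	tags[key] = value
	return context.WithValue(ctx, tagsKey{}, tags)
}

// Tags returns the tag map carried by the context, if any.
func Tags(ctx context.Context) map[string]string {
	t, _ := ctx.Value(tagsKey{}).(map[string]string)
	return t
}

// blobStorageGet stands in for the low-level storage service; it can
// report which high-level service originated the request.
func blobStorageGet(ctx context.Context, key string) {
	fmt.Printf("GET %q requested by originator=%q\n", key, Tags(ctx)["originator"])
}

func main() {
	// The Analytics frontend tags its outgoing work once; every layer
	// below sees the same tag without knowing anything about Analytics.
	ctx := WithTag(context.Background(), "originator", "analytics")
	blobStorageGet(ctx, "user-report")
}
```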

You can choose any cardinality to look at in the final data collection, and auto-generate dashboards and alerts based on the bits you are interested in. As a user of the low-end storage service, you can easily filter the diagnostics signals down to those caused by your own services.

Components

Observability is a multi-dimensional problem. We currently provide stats collection and distributed tracing, but the framework can be extended with other components.

Context propagation sits at the foundational layer. We provide a common context propagation mechanism in each language, or utilize existing standards, such as context.Context in Go.

On top of the context propagation layer, we provide a library that lets users propagate and mutate the tags carried in the current context.

The larger components, such as tracing and stats, sit on top of these layers; they can utilize what’s inside the current context and use the propagated tags when recording instrumentation data.
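As a rough illustration of that layering (the API names here are made up, not our actual libraries), a stats call takes only the context and the measurement; the tag set comes implicitly from the context that the lower layers have been propagating.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type tagsKey struct{}

// withTags and tagsFrom represent the tag layer, built directly on
// top of context propagation. Illustrative only.
func withTags(ctx context.Context, tags map[string]string) context.Context {
	return context.WithValue(ctx, tagsKey{}, tags)
}

func tagsFrom(ctx context.Context) map[string]string {
	t, _ := ctx.Value(tagsKey{}).(map[string]string)
	return t
}

// recordLatency represents the stats layer: it never takes tags as an
// argument, it reads them from the current context when recording.
func recordLatency(ctx context.Context, d time.Duration) {
	fmt.Printf("storage/latency=%v tags=%v\n", d, tagsFrom(ctx))
}

func main() {
	ctx := withTags(context.Background(), map[string]string{
		"originator": "analytics",
		"rpc":        "ReportService.Get",
	})
	recordLatency(ctx, 42*time.Millisecond)
}
```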

Instrumentation is on by default

Our philosophy is to make instrumentation so cheap that users don’t need to think twice about whether recording is on or not. Library authors also don’t have to care, or provide configuration to turn it on or off. We provide a fast mechanism to record things and drop them immediately if you don’t need to export the data. Given that the instrumentation bits are always in the final binary, users can optionally turn things on dynamically in production when there is a problem and see additional diagnostics data coming from their services to understand the case.
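One way to picture this (an illustrative sketch, not our actual mechanism): the recording call is always compiled in, but it checks a cheap flag and returns immediately unless exporting has been enabled, which can happen dynamically at runtime.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// exportingEnabled is flipped at runtime (e.g. by a debug endpoint)
// when operators want additional diagnostics from a running job.
// Illustrative sketch only.
var exportingEnabled atomic.Bool

// RecordSpanEvent is always present in the binary. When exporting is
// off, the call is a cheap flag check followed by an immediate return,
// so callers never need to guard their instrumentation.
func RecordSpanEvent(name string) {
	if !exportingEnabled.Load() {
		return // dropped immediately; near-zero cost
	}
	fmt.Println("event:", name)
}

func main() {
	RecordSpanEvent("cache miss") // dropped: exporting is off

	// An operator turns exporting on while debugging a production issue.
	exportingEnabled.Store(true)
	RecordSpanEvent("cache miss") // now exported
}
```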

Aggregation of data

We make it cheap by aggregating diagnostics data at the node, which reduces the diagnostics data traffic. Many large-scale products saved a significant amount of resources once we started aggregating data.

We provide highly efficient instrumentation libraries to aggregate and summarize data while it is still at the node. Aggregated data is then pushed to the backend.
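A rough sketch of the idea: individual measurements are folded into an in-process summary (here, counts per latency bucket), and only the summary is periodically pushed to the backend. The bucket boundaries and the Flush call are illustrative, not our actual API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// latencyHistogram aggregates measurements in-process so that only a
// small summary, not every data point, is sent to the backend.
type latencyHistogram struct {
	mu     sync.Mutex
	bounds []time.Duration // upper bounds of each bucket
	counts []int64         // len(bounds)+1; last bucket is overflow
}

func newLatencyHistogram(bounds []time.Duration) *latencyHistogram {
	return &latencyHistogram{bounds: bounds, counts: make([]int64, len(bounds)+1)}
}

func (h *latencyHistogram) Record(d time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, b := range h.bounds {
		if d <= b {
			h.counts[i]++
			return
		}
	}
	h.counts[len(h.bounds)]++
}

// Flush returns and resets the aggregated counts; a background loop
// would push this summary to the metrics backend every few seconds.
func (h *latencyHistogram) Flush() []int64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := append([]int64(nil), h.counts...)
	for i := range h.counts {
		h.counts[i] = 0
	}
	return out
}

func main() {
	h := newLatencyHistogram([]time.Duration{10 * time.Millisecond, 100 * time.Millisecond})
	h.Record(3 * time.Millisecond)
	h.Record(42 * time.Millisecond)
	h.Record(2 * time.Second)
	fmt.Println("pushed to backend:", h.Flush()) // [1 1 1]
}
```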

White box and black box

The benefit of agreeing on a common framework is that white-box instrumentation is baked in everywhere in the same way, and it fosters an environment of integrations. Load balancers, RPC frameworks, networking services, and so on can easily auto-instrument. Given that the underlying instrumentation framework is the same, it is easy to start the instrumentation at an integration point (e.g. a load balancer) and keep using the same library to add precise additional data.

For example, a trace started at the load balancer can be annotated in the user code, because we share the same format and framework all across the stack.
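For instance, a handler can pick up the trace that an edge component already started, identified here by made-up trace headers, and add its own annotation to that same trace. The header names and the annotate helper are purely illustrative.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// spanContext is a minimal stand-in for the identifiers that tie a
// request back to the trace an upstream component already started.
// The header names are not a real wire format.
type spanContext struct {
	TraceID string
	SpanID  string
}

func fromRequest(r *http.Request) spanContext {
	return spanContext{
		TraceID: r.Header.Get("X-Example-Trace-Id"),
		SpanID:  r.Header.Get("X-Example-Span-Id"),
	}
}

// annotate attaches user-level detail to the trace the load balancer
// started; because every layer shares the same format, the annotation
// lands in the same trace instead of a disconnected one.
func annotate(sc spanContext, msg string) {
	log.Printf("trace=%s span=%s annotation=%q", sc.TraceID, sc.SpanID, msg)
}

func handler(w http.ResponseWriter, r *http.Request) {
	sc := fromRequest(r)
	annotate(sc, "cache miss; falling back to storage")
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/report", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```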

The future of observability data

Observability data provides clear and precise information about usage and utilization. One of the most significant contributions the collected data can make is to help adaptive systems utilize their resources better. What else would be possible if load balancers and schedulers knew more about these highly precise diagnostics signals from the services they are serving?

There is a clear gap in the world of microservices/distributed systems in terms of instrumentation and the utilization of instrumentation data. IMHO, this field has so much potential to have a significant impact on the evolution of distributed systems. At Google, we recently decided to rewrite our instrumentation infrastructure and do it in the open with the community. Expect to hear more from us in this domain pretty soon, and the code is coming soon too.
