How Google SRE Teams Monitor their Systems

Ash Powell
Glasswall Engineering
5 min read · Sep 16, 2020

First, we need a definition of monitoring and an understanding of the role it plays in keeping a system reliable.

Concepts of SRE Monitoring:

If we want to
· Keep our system healthy
· Maintain the inner workings of our system in an efficient and effective manner
· Achieve the overall reliability of our system

Then we have to analyse system performance, identify performance errors, and maintain service availability. In simple words, we need to see what’s going on in our system.

Commonly, the SRE team is responsible for implementing the monitoring solutions, so it must decide what to monitor and how to monitor it so that the results are accurate.

Google’s Definition:

According to the Google SRE book, monitoring is defined as “collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.”

Approaches of Monitoring According to Google’s SRE Teams

Different Google SRE teams take different approaches to monitoring, so not every discussion or decision about monitoring looks the same. The most commonly agreed terms, however, can be categorised as follows:

White Box Monitoring

In simple words, white box monitoring is the monitoring of the internals of the system, such as the applications running on a server.
It depends on metrics exposed by the internals of the system, including logs, interfaces such as a Java Virtual Machine profiling interface, or an HTTP handler that reports internal statistics.
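As a rough illustration of the last item, the sketch below shows how a service might expose its own internal counters over HTTP using only the Python standard library. The counter names, the /stats path, and the StatsHandler class are assumptions made for this example; they are not taken from the Google SRE book.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters that the application updates as it runs.
INTERNAL_STATS = {"requests_total": 0, "errors_total": 0, "queue_depth": 0}


def render_stats(stats):
    """Serialise the internal counters as JSON for a monitoring system to scrape."""
    return json.dumps(stats, sort_keys=True).encode("utf-8")


class StatsHandler(BaseHTTPRequestHandler):
    """Serves the counters on /stats (an assumed path) and returns 404 otherwise."""

    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = render_stats(INTERNAL_STATS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the stats endpoint; the port is arbitrary. Blocks until interrupted."""
    HTTPServer(("localhost", port), StatsHandler).serve_forever()
```

Calling serve() would let a monitoring system poll http://localhost:8000/stats. The point is that the numbers come from inside the process, which is what makes this white box monitoring.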

Black Box Monitoring

Black box monitoring observes the externally visible behaviour of the system, treating it as a box whose internals we neither control nor see. In practice this often means watching server-level signals such as disk space, CPU usage, memory usage, and load averages, or probing the service from outside as a user would. We can monitor the system only from the external view.
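One way to check externally visible behaviour is a probe that calls the service the way a client would and records what came back. The following is a minimal sketch under that assumption. The probe function, its return fields, and the rule that only HTTP 200 counts as "up" are choices made for this example.

```python
import time
import urllib.request


def probe(url, fetch=urllib.request.urlopen, timeout=5.0):
    """Request `url` from outside the service and report what a client would see.

    `fetch` is injectable so the probe can be exercised without a network.
    """
    started = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except Exception:
        # Connection failures, timeouts and HTTP error responses all count as "down".
        status = None
    elapsed = time.monotonic() - started
    return {"url": url, "up": status == 200, "status": status, "seconds": elapsed}
```

The probe knows nothing about the service's internals. It can only say whether the service answered and how long that took.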

Dashboard

A dashboard is an application (usually web-based) that tracks and analyses data from your system and displays metrics, key data points, and key performance indicators (KPIs) so that you can monitor the system’s health. The dashboard might also present team information such as ticket statistics, a list of high-priority bugs, or the engineer currently available for a specific area of responsibility. Commonly, it displays this data in tabular form, charts (line charts, bar charts), and gauges.

Alerts

An alert is a notification that is intended to be read by a human and that is delivered through a system such as a ticket queue, an email, or a text message.
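To make the idea of delivery channels concrete, here is a small sketch that picks a channel for each alert. The severity levels and the mapping from severity to channel are assumptions for illustration. The text above only names the channels themselves.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    summary: str
    severity: str  # assumed levels: "page", "ticket", "info"


# Assumed mapping from severity to delivery channel.
ROUTES = {"page": "text message", "ticket": "ticket queue", "info": "email"}


def route(alert):
    """Return (channel, human-readable message) for an alert."""
    # Unknown severities fall back to email rather than being dropped.
    channel = ROUTES.get(alert.severity, "email")
    message = f"[{alert.severity.upper()}] {alert.service}: {alert.summary}"
    return channel, message
```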

Root Cause Analysis

When an unexpected incident or defect occurs in our system, we have to examine which factors contributed to it; that process is root cause analysis. For example, a bug may have been caused by a combination of poor process automation, functionality that crashed, and poor configuration generation.

The Four Golden Signals of Google SRE Monitoring:

Effective implementation of the core parts of SRE requires visibility and transparency across all services and applications inside a system. But, measuring the service performance and accessibility of different services on one scale is not an easy thing. So, Google’s SRE teams developed the four golden signals as a simple way to track service health across all applications and frameworks regularly.

So, let’s take a look at SRE’s golden signals to know why the reliability of any system is dependent on monitoring them.

Latency

Latency is the time it takes to service a request, from the moment the request is sent to our system until its response is received. Latency is mostly measured on the server side, though in some scenarios it can be measured on the client side. It is important to track the latency of successful requests and failed requests separately. An example from the Google SRE book: an HTTP 500 error triggered by the loss of a connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500 errors into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
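The sketch below shows one way to keep success latency and error latency apart instead of averaging them together. It assumes each request is recorded as a (status code, seconds) pair and that any status of 500 or above is a failure. The nearest-rank percentile is one common convention, not the only one.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_outcome(requests, pct=99):
    """Group (status_code, seconds) samples by outcome and return the percentile of each group."""
    buckets = defaultdict(list)
    for status, seconds in requests:
        outcome = "error" if status >= 500 else "success"
        buckets[outcome].append(seconds)
    return {outcome: percentile(samples, pct) for outcome, samples in buckets.items()}
```

With samples such as a 0.01 second error next to a 2.5 second error, the separate "error" figure exposes the slow failure. A single blended number would hide it.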

Traffic

Traffic is a measure of the demand placed on our system. For example, if we are dealing with a web service, traffic will usually be the number of HTTP requests per second. If we have an audio streaming system, it is more likely to be the network I/O rate or the number of concurrent sessions.

Let’s take a look at some examples from IBM (www.ibm.com). Examples of traffic include the number of requests that an API handled, the number of connections to an application server, and the bandwidth that was consumed to stream an application.
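For the web-service case, requests per second can be approximated with a sliding window over request timestamps. This is a minimal sketch. The RequestRate class and the 60 second default window are assumptions made for the example.

```python
from collections import deque


class RequestRate:
    """Counts requests seen in the last `window` seconds and reports requests per second."""

    def __init__(self, window=60.0):
        self.window = window
        self.timestamps = deque()

    def record(self, now):
        """Note that a request arrived at time `now` (seconds)."""
        self.timestamps.append(now)

    def per_second(self, now):
        """Average request rate over the window ending at `now`."""
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window
```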

Errors

Monitoring errors is about monitoring the rate of failing requests, whether they fail explicitly or implicitly. An HTTP 500 is an explicit (straightforward) failure, while an HTTP 200 coupled with the wrong content is an implicit one. Every error should be tracked through the monitoring platform to keep the system healthy.

Errors can also be defined by policy, as in this example from the Google SRE book: “If you committed to one-second response times, any request over one second is an error”. Errors can tell you about misconfigurations in your framework, bugs in your code, or even broken dependencies. Some SRE teams now use incident management software to alert on vital errors, investigate why an error is occurring, and work toward a quick resolution.
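The three kinds of failure described here (explicit, implicit, and the one-second example, labelled "policy" below) can be folded into a single error rate. The sketch assumes each request is recorded as (status code, whether the body was correct, seconds). The body_ok flag is a stand-in for whatever content check a real service would run.

```python
def classify(status, body_ok, seconds, slo_seconds=1.0):
    """Return the kind of error a request represents, or None if it succeeded."""
    if status >= 500:
        return "explicit"  # the server itself reported a failure
    if not body_ok:
        return "implicit"  # e.g. HTTP 200 coupled with the wrong content
    if seconds > slo_seconds:
        return "policy"  # correct, but slower than the committed response time
    return None


def error_rate(requests, slo_seconds=1.0):
    """Fraction of (status, body_ok, seconds) requests that count as errors."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if classify(*r, slo_seconds=slo_seconds) is not None)
    return failed / len(requests)
```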

Saturation

Saturation indicates how loaded or busy your service is, with the focus on its most constrained resources (for example, network or server capacity). Every resource has a limit, beyond which performance degrades or the service becomes unavailable. Note that many systems degrade in performance before they reach 100% utilisation, so having a utilisation target is very important.
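A utilisation target can be checked against whichever resource is closest to its limit. In this sketch the resource names and the 80% target are assumptions; an appropriate target depends on how the particular system behaves under load.

```python
def most_constrained(usage, capacity):
    """Return (resource, utilisation) for the resource closest to its limit."""
    utilisation = {name: usage[name] / capacity[name] for name in capacity}
    resource = max(utilisation, key=utilisation.get)
    return resource, utilisation[resource]


def is_saturated(usage, capacity, target=0.8):
    """True when the most constrained resource exceeds the utilisation target."""
    _, level = most_constrained(usage, capacity)
    return level > target
```

For example, with 6 of 8 CPU cores, 28 of 32 GB of memory, and 200 of 1000 GB of disk in use, memory is the most constrained resource at 87.5%, which is over an 80% target even though nothing has reached 100%.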

If you measure all four golden signals and alert a member of your team when at least one signal is having issues, your service should be adequately covered by monitoring.
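Put together, that rule amounts to comparing each signal with a threshold and alerting if any one of them is exceeded. The thresholds below are placeholders chosen for this sketch; real values depend on the service and its objectives.

```python
# Assumed thresholds for the four golden signals (illustrative values only).
THRESHOLDS = {
    "latency_seconds": 1.0,
    "traffic_rps": 5000.0,
    "error_rate": 0.01,
    "saturation": 0.8,
}


def signals_in_trouble(measurements, thresholds=THRESHOLDS):
    """Return the names of the golden signals whose measured value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items() if measurements[name] > limit)


def should_alert(measurements, thresholds=THRESHOLDS):
    """Alert when at least one signal is over its threshold."""
    return bool(signals_in_trouble(measurements, thresholds))
```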
