Google Cloud DevOps Series: Observability with SRE principles

Google Cloud DevOps Series: Part-5

Pushkar Kothavade
Google Cloud - Community
Dec 1, 2021


Welcome to Part 5 of the Google Cloud DevOps series. You can find the complete series here.

As Samajik becomes more agile and increases its development velocity, the Samajik DevOps team wants to make sure that it responds to urgent issues that impact customers, rather than reacting to minor issues and being woken up in the middle of the night. To accomplish this objective, many organisations are adopting the SRE model defined by Google.

Reference: https://sre.google/books/

GKE Platform Observability: Logging and Monitoring (Demo)

In the earlier parts of this series, we created multiple GKE clusters. Let’s explore observability techniques using the Production cluster.

One of the key benefits of ‘Google Cloud Operations’ is the out-of-the-box collection of system metrics, logs, and rich context without any agent deployment.

Salient Features:

  1. Key metrics for GKE clusters, nodes, and namespaces without installing agents
  2. Default collection of application logs
  3. Metadata and relationships amongst GKE entities

Navigate to Operations-Monitoring and click on Dashboards. The page lists dashboards for all Google Cloud services.

Google Cloud Monitoring Dashboards-All Services

In the Dashboards section, click on GKE.

Navigate to the Services section. The ‘services overview’ page lists the services and summarises both the services with SLO alerts firing and the services that have run out of error budget. Observe that no SLOs are currently defined for the services.
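To make the idea of a service being "out of error budget" concrete, here is a minimal sketch in Python. It is not how Cloud Monitoring computes the value internally; the function name and the sample numbers are hypothetical and only illustrate the arithmetic: an SLO target of 99% over a period tolerates 1% bad events, and the error budget is the share of that tolerance still unspent.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exactly
    used up, and a negative value means the service is out of budget.
    (Hypothetical helper for illustration only.)
    """
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent

    allowed_bad = (1.0 - slo_target) * total  # bad events the SLO tolerates
    actual_bad = total - good                 # bad events actually observed

    if allowed_bad == 0:
        # A 100% target tolerates no failures at all.
        return 1.0 if actual_bad == 0 else float("-inf")

    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Assumed sample numbers: 99% target, 10,000 requests, 50 of them bad.
    remaining = error_budget_remaining(0.99, good=9_950, total=10_000)
    print(f"Error budget remaining: {remaining:.0%}")
```

With these assumed numbers, half of the allowed failures have been used, so roughly half of the budget is left. A service with 200 bad requests in the same period would show a negative value, which is the "out of error budget" state.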

Google Cloud Monitoring Services Dashboard

Let’s define the Service first. Go to the DEFINE SERVICE option and select a service, for example the ‘frontend’ service.

Before we can set an SLO, we need to identify a Service Level Indicator (SLI) on which to base it. Click on the ‘frontend’ service and select the ‘Create SLO’ option. Follow the steps given below.

Step-1: Set your Service Level Indicator by selecting either the Request-based or the Window-based option, and then click on Continue.

Step-2: Define the Service Level Indicator details by selecting the appropriate Performance metric, e.g. ‘kubernetes.io/container/restart_count’.

Step-3: Define your Service Level Objective by selecting a ‘Compliance period’ and a ‘Performance goal’.

Step-4: Review the settings and create the SLO.
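The difference between the two SLI options in Step-1, and how the performance goal from Step-3 is applied, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Cloud Monitoring implementation: the function names and the sample values are hypothetical. A request-based SLI is the ratio of good requests to all requests, while a window-based SLI is the ratio of good time windows to all time windows in the compliance period.

```python
from typing import Sequence


def request_based_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were good over the compliance period."""
    if total_requests == 0:
        return 1.0  # no requests, so nothing has failed
    return good_requests / total_requests


def window_based_sli(window_is_good: Sequence[bool]) -> float:
    """Fraction of time windows that met the goodness criterion."""
    if not window_is_good:
        return 1.0  # no windows measured yet
    return sum(window_is_good) / len(window_is_good)


def meets_slo(sli: float, performance_goal: float) -> bool:
    """An SLO is met when the measured SLI reaches the performance goal."""
    return sli >= performance_goal


if __name__ == "__main__":
    # Assumed sample data, not taken from the demo cluster.
    req_sli = request_based_sli(good_requests=999, total_requests=1000)
    win_sli = window_based_sli([True, True, False, True])

    print(f"request-based SLI: {req_sli:.3f}, meets 99.5% goal: {meets_slo(req_sli, 0.995)}")
    print(f"window-based SLI:  {win_sli:.3f}, meets 99.5% goal: {meets_slo(win_sli, 0.995)}")
```

The same underlying behaviour can pass one kind of SLO and fail the other: a short burst of failures costs few requests but can mark a whole window as bad, which is why the choice in Step-1 matters.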

Navigate to the Operations-Logging section. Logs for the desired microservice can be explored by entering a query in the search section.
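As a rough sketch of what such a query can look like, the Python helper below assembles a filter string for the logs of one container. The helper itself is hypothetical, and the cluster name `production`, the namespace `default`, and the container name `frontend` are assumptions to be replaced with the values from your own cluster. The field names (`resource.type`, `resource.labels.*`, `severity`) follow the Kubernetes container resource in Cloud Logging.

```python
from typing import Optional


def build_container_log_filter(
    cluster_name: str,
    container_name: str,
    namespace: str = "default",
    min_severity: Optional[str] = None,
) -> str:
    """Build a Cloud Logging filter for one container's logs.

    Hypothetical helper: it only concatenates strings, and the result
    is meant to be pasted into the query box of the Logs Explorer.
    """
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.namespace_name="{namespace}"',
        f'resource.labels.container_name="{container_name}"',
    ]
    if min_severity:
        # Optionally keep only entries at or above a severity such as ERROR.
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Assumed names; adjust them to match your cluster and workload.
    print(build_container_log_filter("production", "frontend", min_severity="ERROR"))
```

Narrowing by cluster, namespace, and container first, and only then by severity, keeps the result set small enough to read through during an incident.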

You can refer to the SRE implements DevOps Series on YouTube to understand more about SRE principles.

Reference: https://www.youtube.com/watch?v=uTEL8Ff1Zvk&list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj

Coming up…

In this blog, we learned about SRE principles and how to implement them using Cloud Operations tools. Guhan was impressed with the DevOps philosophy and Google SRE practice. Guhan and his team at Samajik implemented not just the CI/CD and developer workflow but Observability as well. Stay tuned for the next discussion, which will cover Agility with Cost-optimisation…

Contributors: Shijimol A K, Dhandus, Anchit Nishant, Tushar Gupta

Update: You can read Part-6 here.
