Observability-Driven Development: The WHAT, The WHERE, & The WHY!

Sumit M
mStakx

--

We have been exploring Observability as a concept and philosophy, its use-cases, and ultimately its application in a product’s software engineering life-cycle. One of our earlier stories gave an overview of Observability, while the other presented a use-case for the engineering team of a small startup. You can read Story#1 and Story#2 of our shared learnings so far.

The 3 building blocks of any Observability framework are:
1. Instrumentation (Edge Collection): Simply put, this block of the framework is responsible for collecting logs, metrics, and traces
2. Stack (Data Storage): The stack is where all the collected data (logs, metrics, and traces) is indexed and stored
3. Visualisation (Analysis): This block presents the collected data in a form that is useful for analysis. This is where the collected data is used to create dashboards that provide analytical insights into the system and the (business) application

Building Blocks of Observability

The visualisation layer queries its data from the stack layer, which in turn stores the data provided by the instrumentation layer. The consumers of the visualisation layer range from top management (business) and development teams to SRE (Site Reliability Engineering) teams. Hence, it becomes imperative to have a clear understanding of WHAT data needs to be collected by the instrumentation layer, WHERE such data should be collected, and WHY such data is important.
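To make this flow concrete, here is a deliberately simplified, self-contained Python sketch of the three layers talking to each other. The class and function names are our own illustrative choices for this story, not part of any particular Observability product.

# Toy illustration of the three building blocks: the instrumentation layer
# emits data points, the stack layer indexes and stores them, and the
# visualisation layer queries them for analysis. All names are assumptions.
import time
from collections import defaultdict

class Stack:
    """Data storage layer: indexes time-series points by metric name."""
    def __init__(self):
        self._series = defaultdict(list)

    def ingest(self, name, value, timestamp=None):
        self._series[name].append((timestamp or time.time(), value))

    def query(self, name):
        return self._series[name]

class Instrumentation:
    """Edge collection layer: records observations and ships them to the stack."""
    def __init__(self, stack):
        self.stack = stack

    def record(self, name, value):
        self.stack.ingest(name, value)

def visualise(stack, name):
    """Analysis layer: a stand-in for a dashboard query."""
    points = stack.query(name)
    return {"metric": name, "count": len(points), "latest": points[-1][1] if points else None}

stack = Stack()
edge = Instrumentation(stack)
edge.record("http_requests_total", 1)
print(visualise(stack, "http_requests_total"))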

As it turns out, these questions are not new. SRE teams have been dealing with them for years, and to our advantage, several SRE workbooks have addressed them in depth.

The WHAT!

Building on the experience of SRE teams, we have come to the conclusion that the USE and RED methods give a fair understanding of what types of events need to be collected for an effectively observable system. Let us understand both of these methods briefly.

USE stands for: Utilization | Saturation | Errors

The USE method applies to hardware resources (network interfaces, storage disks, CPUs, memory, etc.); a brief collection sketch follows the list below:
- Utilization: the average time that the resource was busy servicing work
- Saturation: the degree to which the resource has extra work which it can’t service, often queued
- Errors: the count of error events
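As an illustration, the snippet below sketches host-level USE collection in Python. The use of the psutil and prometheus_client libraries, the metric names, and the 15-second collection loop are all assumptions made for this example, not prescriptions.

# USE-method sketch: utilization, saturation and error signals for a host,
# exposed for the storage (stack) layer to scrape. Names are illustrative.
import time

import psutil
from prometheus_client import Gauge, start_http_server

# Utilization: how busy the resource is
CPU_UTIL = Gauge("cpu_utilization_percent", "Average CPU utilization")
MEM_UTIL = Gauge("memory_utilization_percent", "Memory in use")

# Saturation: extra work the resource cannot service immediately (queued)
LOAD_1M = Gauge("load_average_1m", "1-minute load average (queued/runnable tasks)")
SWAP_USED = Gauge("swap_used_bytes", "Swap in use, a common sign of memory saturation")

# Errors: count of error events reported by the resource
NET_ERRORS = Gauge("network_errors_total", "Cumulative NIC receive/send errors")

def collect():
    CPU_UTIL.set(psutil.cpu_percent(interval=None))
    MEM_UTIL.set(psutil.virtual_memory().percent)
    LOAD_1M.set(psutil.getloadavg()[0])
    SWAP_USED.set(psutil.swap_memory().used)
    nio = psutil.net_io_counters()
    NET_ERRORS.set(nio.errin + nio.errout)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the stack layer to scrape
    while True:
        collect()
        time.sleep(15)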

RED stands for: Rate | Errors | Duration

Since the USE method doesn’t really apply to services, the RED method addresses the monitoring of services (a brief instrumentation sketch follows the list below).
- Rate: The number of requests per second
- Errors: The number of those requests that are failing
- Duration: The amount of time those requests take
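Here is a minimal sketch of RED instrumentation for a request handler, assuming prometheus_client is an acceptable collection library; the decorator, the endpoint label, and the handler are hypothetical names chosen for illustration.

# RED-method sketch: Rate, Errors and Duration for a service endpoint.
# Metric and function names are illustrative assumptions.
import time
from functools import wraps

from prometheus_client import Counter, Histogram

# Rate: requests per second is derived from this ever-increasing counter
REQUESTS = Counter("requests_total", "Total requests handled", ["endpoint"])
# Errors: the number of those requests that fail
ERRORS = Counter("request_errors_total", "Failed requests", ["endpoint"])
# Duration: the amount of time those requests take
DURATION = Histogram("request_duration_seconds", "Request latency in seconds", ["endpoint"])

def red_instrumented(endpoint):
    """Decorator that records Rate, Errors and Duration for a handler."""
    def wrap(handler):
        @wraps(handler)
        def inner(*args, **kwargs):
            REQUESTS.labels(endpoint=endpoint).inc()
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                ERRORS.labels(endpoint=endpoint).inc()
                raise
            finally:
                DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
        return inner
    return wrap

@red_instrumented("/checkout")
def handle_checkout(order):
    ...  # the real handler logic would live here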

The Google SRE team has also defined ‘4 Golden Signals’ which prove to be insightful when collected; a short sketch of deriving them follows the list below.
Those 4 Golden Signals are: Latency | Traffic | Errors | Saturation

- Latency: The time it takes to service a request
- Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric
- Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”)
- Saturation: How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O)
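To show how the four signals can be derived from raw request data, here is a small Python sketch. The Request fields, the 5xx error rule, and the queue-based saturation measure are assumptions chosen for illustration, not a definitive implementation.

# Deriving the 4 Golden Signals from raw request records over a time window.
# Field names and the choice of p99 latency are illustrative assumptions.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    duration_ms: float  # how long the request took to service
    status: int         # HTTP status code of the response

def golden_signals(requests, window_seconds, queue_depth, queue_capacity):
    """Summarise latency, traffic, errors and saturation for one window."""
    durations = [r.duration_ms for r in requests]
    failed = [r for r in requests if r.status >= 500]
    return {
        # Latency: the time it takes to service a request (99th percentile here)
        "latency_p99_ms": quantiles(durations, n=100)[98] if len(durations) >= 2 else None,
        # Traffic: demand placed on the system, expressed as requests per second
        "traffic_rps": len(requests) / window_seconds,
        # Errors: the rate of requests that fail (explicit 5xx failures here)
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how "full" the most constrained resource is (a queue here)
        "saturation": queue_depth / queue_capacity,
    }

# Example: one minute of traffic with one slow request and one failure
window = [Request(120, 200), Request(95, 200), Request(1400, 200), Request(80, 500)]
print(golden_signals(window, window_seconds=60, queue_depth=3, queue_capacity=100))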

Now that we’ve roughly answered the WHAT, let’s take a shot at the WHERE & the WHY!

The WHERE!

Good places to add instrumentation (i.e., to collect data) in a system are its points of ingress and egress (a brief egress example follows the list). For instance:
1. Log requests and responses for specific webpage hits or API endpoints.
(If you’re instrumenting an existing application, make a priority-driven list of specific pages or endpoints and instrument them in order of importance.)
2. Measure and log all calls to external services and APIs, e.g., calls (or queries) to databases, caches, search services, etc.
3. Measure and log job scheduling and the corresponding execution, e.g., cron jobs.
4. Measure significant business and functional events, such as users being created or transactions like payments and sales.
5. Measure methods and functions that read from and write to databases and caches.
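As a concrete example of instrumenting an egress point, the context manager below times, counts, and logs every call to an external service. The service labels, metric names, and logger are illustrative assumptions rather than a prescribed pattern.

# Instrumenting egress: every call to an external service (database, cache,
# search, third-party API) is timed, counted and logged. Names are assumptions.
import logging
import time
from contextlib import contextmanager

from prometheus_client import Counter, Histogram

logger = logging.getLogger("egress")

EXTERNAL_CALLS = Counter(
    "external_calls_total", "Calls to external services", ["service", "outcome"]
)
EXTERNAL_LATENCY = Histogram(
    "external_call_duration_seconds", "External call latency", ["service"]
)

@contextmanager
def observed_call(service, operation):
    """Wrap any outbound call so its rate, errors and latency are recorded."""
    outcome = "error"  # assume failure until the wrapped block completes
    start = time.perf_counter()
    try:
        yield
        outcome = "success"
    finally:
        elapsed = time.perf_counter() - start
        EXTERNAL_CALLS.labels(service=service, outcome=outcome).inc()
        EXTERNAL_LATENCY.labels(service=service).observe(elapsed)
        logger.info("service=%s op=%s outcome=%s duration=%.3fs",
                    service, operation, outcome, elapsed)

# Usage (hypothetical database client):
# with observed_call("postgres", "fetch_orders"):
#     rows = db.execute("SELECT * FROM orders WHERE user_id = %s", (user_id,))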

The WHY!

A few reasons that make ‘observing the software system’ a good idea are:
1. It enables the engineering teams to identify and diagnose faults, failures, and crashes
2. It enables the engineering teams to measure and analyse the operational performance of the system
3. It enables the engineering teams to measure and analyse the business performance and success of the system or its component(s)
Together, these help the team avoid serious business and operational risks.

While writing this story, we referred to the following books:

1. ‘The Art of Monitoring’ by James Turnbull
2. ‘Seeking SRE’ by David N. Blank-Edelman

These reference books gave us detailed insights into how SRE teams operate, and helped us get closer to the answers for the WHAT, the WHERE, and the WHY in our quest to build effective Observability platforms for our client businesses.

Note: Read our story on how we built an open-source Observability boilerplate here!
