Observability: Deliver Reliable Software Faster

We still live in an era where most computer behaviour is dictated by human-generated code.

One of the biggest errors one can make is to assume that such code is bulletproof, based on myths people take for granted: “it works on my machine”, “it has 100% test coverage”, and so on.

Is it still working?

The consequence of assuming that code which ran once will work forever is unawareness of unexpected behaviour, which, to some extent, should already be covered by current testing methodologies, considering that code already doesn’t work 100% of the time.

A quick search on Google for “software bug” demonstrates how badly we fail at proving the quality of our code, from low-level bugs such as Meltdown up to self-driving car issues.

Writing code for professional use presents an overwhelming challenge:

How does one ensure one’s code works the way it was designed to?

Innumerable answers to that question have emerged over the course of software development history, from out-of-the-box tools to philosophies on how we should ensure the quality of our systems.

Most software teams today will say they ensure quality by having QA engineers, 100% coverage on their test suite, TDD, BDD, smoke tests… any of those silver-bullet buzzword methodologies.

While some of them do help improve quality to some extent, most are good for creating software but not great for ensuring quality over time. There are plenty of articles demystifying coverage and other testing practices.

Bottleneck

Writing integration tests across multiple distributed services, databases, third-party APIs, environments, browsers, operating systems, device versions, screen sizes… covering a huge number of scenarios creates a big operational bottleneck. And we haven’t even talked about performance yet (speed, scalability…).

Sure, one can spend a lot of human effort implementing test cases for all of that, but at what cost? Both creating and running them directly impacts time to market.

Besides, browsers change, devices change, servers change… there is just too much that can change; errors will happen.

In the end, our goal is to find ways to detect anomalies or unexpected behaviour in our code over time.

Perspective is everything

Just as we have started using software development practices to manage infrastructure resources (IaC), couldn’t we draw inspiration from other practices or fields to improve our awareness of unexpected behaviour in our systems?

Let us define what we need

Considering that one has run the code at least once, in a single environment, and it proved correct: a bug is nothing but an unexpected behaviour of one’s code that happens over time under a certain condition or group of conditions.

What we want is to detect when and under which conditions such a thing can happen. But a bug doesn’t always present itself in the company of a clear error or exception.

Since we don’t have a clear event for the bug, we need to observe the state of our system at all times, so that we have a baseline for when it operates under normal conditions. That way we can assume that anomalies in the external state are probably related to recently introduced bugs or unexpected behaviour.

If we think about it, we’ve seen this somewhere else… yes, that’s basically monitoring, but with a goal-oriented focus: not limited to monitoring the infrastructure, but extended to the behaviour of its internal components.

Happily, that is quite similar to a well-known concept from electrical systems called observability, which is part of control theory.

Control theory

Control theory in control systems engineering deals with the control of continuously operating dynamical systems in engineered processes and machines. The objective is to develop a control model for controlling such systems using a control action in an optimum manner without delay or overshoot and ensuring control stability. — Wikipedia

Note that, at this point, we only care about one specific part of control theory, observability:

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs — Wikipedia

It is quite self-explanatory. Both control theory and observability might seem complex, but they’re actually quite simple. Basically, it implies that a system should be aware of its external state in order to validate its integrity.

If we consider the external state to be the measure of success when accomplishing an interaction with your system, the internal state to be all the layers of your code and infrastructure the interaction touches, and integrity to be the analysis of all those states and the conditions that define normal operation, we can then define an observable system.


Application

In practical terms, that means you don’t need to rely only on tests to determine whether your code continues to do what it is supposed to do over time.

You can create checkpoints of success for your actions and push those to a controller: an isolated part of your infrastructure responsible for collecting and reasoning about interaction signals.

e.g.:

Imagine a real-world feature: “my users need to be able to search at any time”. In the success branch of your code, you write a small push to your collector, letting it know that one search has been performed at this time.
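As a rough sketch of what that push could look like, here is the idea using the Python statsd client; the metric name, the agent address, and the run_search_query helper are assumptions for illustration, not a prescribed setup:

```python
from statsd import StatsClient

# Assumes a statsd-compatible agent listening locally on the default UDP port.
metrics = StatsClient(host="localhost", port=8125, prefix="myapp")

def search(query):
    results = run_search_query(query)  # hypothetical search implementation
    # Success branch: let the collector know one search has been performed.
    metrics.incr("search.performed")
    return results
```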

After collecting data for a reasonable amount of time, you can start reasoning about it:

my users usually search 10 times per second from 8am to 10pm.

And you can start reacting to it, creating alarms and triggers for when it stops behaving as it is supposed to:

if my users search less than 8 times per second from 8am to 10pm something is wrong, trigger an alarm.
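Most monitoring tools let you express such a rule directly as an alert definition, but the underlying reasoning is simple enough to sketch; the window, the 8-per-second threshold, and the 8am–10pm hours below come straight from the example, while trigger_alarm is a hypothetical hook:

```python
from datetime import datetime

def search_rate_is_healthy(search_count: int, window_seconds: int, now: datetime) -> bool:
    """Compare the observed search rate against the baseline for business hours."""
    rate = search_count / window_seconds
    business_hours = 8 <= now.hour < 22     # 8am to 10pm, as in the example
    return not business_hours or rate >= 8  # baseline ~10/s, alarm below 8/s

# e.g. if not search_rate_is_healthy(collected_count, 60, datetime.now()): trigger_alarm()
```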


Reasoning

When we introduce a new change, we want to make sure of the continuous integrity of the system. Basically, if the change doesn’t open up a new horizon of functionality, you should be able to keep answering the questions:

“Can the users still search?”

“Can the users still ___?”

“How long does it take for the user to ____?”

We want to be able to collect metrics that are useful to identify which part of the system is being affected by an anomaly.

Therefore, here is a small template for applying this to other features (a code sketch follows the list):

What does this feature do? It ______.
Collect the number of times it _____ /s.
Collect how long it takes to _____ (if applicable, e.g. CPU/IO/network-intensive).
Monitor how many times it usually ____ /s during normal operation.
Trigger an alarm if it doesn’t ____ /s at least _% of the normal operation rate.
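To make the template concrete, here is a sketch of an assumed “export report” feature instrumented with the same statsd client as before, collecting both how often it runs and how long it takes; the metric names and the generate_report helper are made up for illustration:

```python
from statsd import StatsClient

metrics = StatsClient(host="localhost", port=8125, prefix="myapp")

def export_report(report_id):
    # "Collect how long it takes to export" (IO-intensive, so duration matters).
    with metrics.timer("report.export.duration"):
        report = generate_report(report_id)  # hypothetical, IO-intensive work
    # "Collect the number of times it exports /s" (the collector derives the rate).
    metrics.incr("report.export.performed")
    return report
```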

Implementation

The point of this article is much more the effects than the implementation itself, but it would be obnoxious to talk only about the theory and leave you hanging.

It is important to remember that we are trying to cover something somewhat complex in terms of infrastructure: the real-time collection and parsing of data. Yes, real-time, or as close as possible, because observability has to give you time to react, which will be covered in a future article.

For that, it is strongly recommended not to build your own solution; there are both enterprise and open-source options available.

Constraints

Self-hosting presents a considerable operational bottleneck; besides, as a monitoring principle, it is also advised not to host this infrastructure on the same servers or cluster groups as your production infrastructure.

Stacks

For a starter to intermediate level, Datadog is a very nice solution, both from the financial and the practical side.

They provide a huge number of integrations for collecting metrics, as well as rich dashboards for real-time analysis and, of course, alarms.

It is fairly easy to push metrics from your applications using statsd, with a very small footprint.
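For instance, with Datadog’s DogStatsD flavour (the datadog Python package) the push looks almost identical to plain statsd, with the addition of tags for slicing metrics in dashboards and alarms; the metric name and tags here are placeholders:

```python
from datadog import initialize, statsd

# Assumes the Datadog agent's DogStatsD listener is running locally on its default port.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Same idea as plain statsd, plus tags for filtering in dashboards and monitors.
statsd.increment("search.performed", tags=["env:production", "service:search-api"])
```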

You will probably reach a point where Datadog is too expensive, or it doesn’t provide the customisation your needs require. To scale to that next order of magnitude, there are very good open-source possibilities. Considering that they use standard protocols, such as statsd, migrating is not painful beyond the pain of self-hosting.

Prometheus is a very nice solution, but it will require a bit more thought and effort to build and integrate with dashboard solutions such as Grafana.
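Note that Prometheus works on a pull model rather than statsd’s push model: your application exposes an HTTP endpoint that the Prometheus server scrapes. Here is a minimal sketch with the official prometheus_client package; the metric names, the port, and the run_search_query helper are arbitrary choices for illustration:

```python
from prometheus_client import Counter, Histogram, start_http_server

SEARCHES = Counter("searches_total", "Number of searches performed")
SEARCH_LATENCY = Histogram("search_latency_seconds", "Time spent handling a search")

def search(query):
    with SEARCH_LATENCY.time():            # record how long the search takes
        results = run_search_query(query)  # hypothetical search implementation
    SEARCHES.inc()                         # count one successful search
    return results

# Expose /metrics for the Prometheus server to scrape.
start_http_server(8000)
```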

At the end of the day, it is a trade-off; the suggestion is to move on a best-effort basis and evolve when you face a strong limitation, be it financial or technical.


Should I stop testing?

After discussing this subject with a couple of colleagues, a lot of them wondered whether such an intensively observed system still needs tests, considering that every change introduced is going to be closely monitored.

Well, not necessarily; it depends a lot on what you’re building and how much you can afford to make mistakes.

Meaning that if you work on things strongly bound to extreme consistency or extreme availability, you probably can’t afford mistakes, which is not the case for most companies.

Your continuous delivery strategy will play a big role here; the way you roll out and validate your releases is still up to you. Observability will empower you to keep very tight control over whether things run as planned, and to take action when they don’t.

In the end, regardless of whether you decide to drop intensive testing because you can afford to make mistakes, you will always profit from having an observable system.

It’s always better than the alternative…

There are several touch points regarding the impact of such a change in mindset, and how to minimise the risks and maximise the gains of observability.

The next article(s) will be about controllability: reasoning about the metrics, mitigating the risks of rollouts, and reacting to changes that may occur in your system.