Observability and beyond — building resilient application infrastructure

Published in

cloudnativeinfra

11 min read · Oct 23, 2019

The journey from being reactive to being proactive

Things were quite simple in the old days. Proactively monitoring applications or infrastructure was not the norm. If there was a failure, a user would pick up the phone to tell the help desk that the app was broken.

Troubleshooting was entirely reactive, and the only path to resolution was for someone to roll up their sleeves, dig through log files and fix the errors by hand.

Luckily, those days are now over. We now live in the age of observable applications running on observable infrastructure. Failure costs time and money to fix and ruins brand value. Uptime, scalability and the ability to identify and fix errors are the factors that determine if you stay in business or lose to the competition.

Here is a whirlwind tour of observability concepts and commonly used tools to consider while building out your observability solution.

Monitoring Dashboards

Prometheus

There is nothing more ‘observable’ than a shiny dashboard that visually depicts the state of your applications and infrastructure. Prometheus and Grafana are among the most widely used tools in this space, and they make a great tag team: Prometheus ingests your metrics and Grafana visualises them.

You instrument your application with one of the Prometheus client libraries, which exposes its metrics on an HTTP endpoint. Prometheus itself is a single executable that can be deployed on a VM or as a sidecar container; it periodically scrapes your application’s HTTP endpoints for metrics and general health info.

Prometheus stores the data in its internal time-series data model. You can then use the PromQL query language to build complex expressions and to define alerts.
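To make the scrape model concrete, here is a minimal sketch of the plain-text payload a scrape endpoint returns for a counter. In practice the official client library generates this for you; the helper `render_counter` and the metric name `app_requests_total` are made up for illustration.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Omit the braces entirely when a sample has no labels.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

Serving the returned string from an HTTP path such as `/metrics` is what lets Prometheus pull the values on its own schedule, rather than the application pushing them.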

Grafana

Grafana, on the other hand, is a visually engaging dashboard that is also deeply functional. Grafana supports Prometheus as a data source (among several others), and the deep integration between the two makes it easy to build sophisticated visualisations from the metrics collected by Prometheus.

Dashboards are made up of individual tiles called Panels, which can be graphs, statistics, tables, lists, free text, heat-maps or alert lists. Grafana also has built-in alerting capability (and I prefer this over Prometheus alerts because you can have a better view of what triggered the error).

There is a large ecosystem of plugins, packaged dashboards and full-fledged applications on the Grafana website that can save you time before you go and build one from scratch.

Golden Signals

Prometheus + Grafana give you access to hundreds of metrics, but how do you decide what kind of primitive/derived metrics to monitor?

In the Site Reliability Engineering book, Google SREs talk about Golden Signals — the four most important types of signals to monitor:

Latency: How long requests and actions take to complete, with respect to a known good baseline so that you can identify when the application performance starts to degrade.

Traffic: Number of connections, requests per second, IOPS, network usage, etc., which give you an indication of how much volume is being processed by the system.

Errors: Number of HTTP 500s, non-zero return codes and any other indicators of a failure.

Saturation: An indicator of how ‘full’ the service is, e.g. 100% CPU, disk or network utilisation, or resource pressure of any kind that indicates an overload.
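As a rough illustration of how the four signals can be derived from raw data, the sketch below summarises one observation window of request records. The record shape, the choice of the 95th percentile for latency and CPU utilisation as the saturation measure are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile; returns 0.0 when there are no values."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(records, window_seconds, cpu_utilisation):
    """Summarise one observation window into the four golden signals."""
    total = len(records)
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in records], 95),
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "saturation": cpu_utilisation,  # fraction of CPU in use, 0.0 to 1.0
    }
```

Comparing each value against a known good baseline, rather than reading it in isolation, is what turns these numbers into a signal.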

To get the best out of your dashboard, make sure you set it up with the critical stats that give you an indication of what is out of order so that you can triage issues quickly. The dashboard should tell you what is wrong without you having to search inside logs.

Health checks

One of the best practices while developing a cloud native application is to make sure the app exposes its health through an HTTP endpoint. Any observer, such as your cloud load balancer or the Kubernetes platform, can then query the state of your application and decide, for example, to stop sending traffic to it or to replace it with a new healthy instance.

In the context of setting up your own observability framework, health checks play the important role of identifying failures so that you can take remedial action, such as initiating a scaling operation (when it is an event that can be handled automatically) or raising an alert to notify a human to address the failure.

Health checks are a proactive means of ensuring that the application continuously exposes its well-being to its observers (more reliable than the observers trying to guess the state) so that failures can be acted on automatically.
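A health endpoint can be very small. The sketch below uses only the Python standard library; the `/healthz` path, the port and the `check_dependencies` placeholder are assumptions for illustration, not a convention required by any particular platform.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    """Hypothetical placeholder: report whether each dependency is reachable."""
    return {"database": True, "queue": True}


def health_response(checks):
    """Map dependency checks to an HTTP status code and a JSON body."""
    healthy = all(checks.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": checks})
    return (200 if healthy else 503), body.encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocks forever, answering health probes on the given port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 rather than 200 when a dependency is down lets an observer act on the status code alone, without having to parse the body.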

There is also another type of health check, such as the one offered by OCI (Oracle Cloud Infrastructure), where you continuously check the availability of your application from various vantage points around the world. Failures are not always on the application side; they can also lie in the network connectivity and be localised to certain geographical regions.

For instance, the reason why your users in Johannesburg are not able to connect to the service running in London could be due to a localised network issue, and this ‘observing from various vantage points’ approach can help identify such localised issues easily. Such observations do not have to be restricted to “is it running” but can also be used to keep an eye on latency and quality of service.
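The reasoning behind vantage-point checks can be sketched as a small classifier: if only some locations fail, the problem is probably localised; if all of them fail, the application itself is the likelier culprit. The function names, the 50% threshold and the data shape are assumptions for this example and do not reflect any provider's API.

```python
def success_ratio(results):
    """Fraction of successful probes; no data counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def classify_outage(probe_results, min_success_ratio=0.5):
    """Classify probe results keyed by vantage point.

    `probe_results` maps a vantage point name to a list of booleans,
    where True means the probe reached the application.
    """
    failing = sorted(
        vantage
        for vantage, results in probe_results.items()
        if success_ratio(results) < min_success_ratio
    )
    if not failing:
        return "healthy", failing
    if len(failing) == len(probe_results):
        return "global outage", failing
    return "localised issue", failing
```

The same structure extends beyond availability: replacing the booleans with measured latencies would let you flag regions where quality of service has degraded.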

Events and alerts

An alert is more a feature of a monitoring tool or a cloud platform, used to inform a human of a situation needing attention, than a standalone product. What does need to be said is how important it is to integrate the alerting system with your incident response system, such as PagerDuty, and the ticketing system behind it, such as Jira.

Alerts are not just for infrastructure incidents like resource shortages and failures. They can also be used at the application level to notify application owners of exceptional events that the application cannot handle on its own.

For example, consider an e-commerce system that has taken payment for a product but had an internal failure before the order was fully processed. An alert that logs an incident with the ticketing system, so that a human team can address the failure (or initiate a refund), helps resolve the issue proactively without the end customer having to call the help desk.
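The e-commerce scenario might look like the sketch below. The `Order` fields and the incident payload are hypothetical, and `notify` stands in for whatever client your ticketing or paging system provides.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    payment_captured: bool
    fulfilled: bool


def build_incident(order):
    """Return an incident dict for a paid-but-unprocessed order, else None."""
    if order.payment_captured and not order.fulfilled:
        return {
            "summary": f"Order {order.order_id} was paid but not processed",
            "severity": "high",
            "suggested_action": "complete the order or initiate a refund",
        }
    return None


def raise_alerts(orders, notify):
    """Call `notify` once per incident and return how many were raised."""
    incidents = [i for i in map(build_incident, orders) if i is not None]
    for incident in incidents:
        notify(incident)
    return len(incidents)
```

Passing `notify` in as a parameter keeps the detection logic independent of the alerting tool, so the same check can feed a ticket queue, a pager or both.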

Distributed Tracing

One of the challenges with moving to microservice based architecture is that the call stack between microservices can grow tall and it can get difficult to know where the performance bottlenecks are, or to get a view of the dependencies between the various services. This is where distributed tracing tools like Jaeger and Zipkin become indispensable.

The visual nature of these tools makes them very useful in a crisis for understanding where the bottlenecks are.

The Jaeger screenshot of the traces for a demo app called HotROD makes the benefits obvious: a visual depiction of errors and long-running services that are affecting the health of your application.

Another useful feature is to compare two different traces to see where the performance has degraded, a nifty feature to help detect the bottlenecks.
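Conceptually, comparing two traces means lining up the same operations and looking at how their durations changed. The sketch below models a trace as a flat list of spans; the `Span` type and the service names are illustrative assumptions and not the data model of Jaeger or Zipkin.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def compare_traces(baseline, candidate):
    """Return {(service, operation): slowdown_ms} for spans that got slower."""
    base = {(s.service, s.operation): s.duration_ms for s in baseline}
    regressions = {}
    for span in candidate:
        key = (span.service, span.operation)
        if key in base and span.duration_ms > base[key]:
            regressions[key] = span.duration_ms - base[key]
    return regressions
```

A parent span includes the time spent in its children, so when both a caller and a callee show up as slower, the callee deepest in the call chain is usually the place to start looking.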

Log aggregation

Logs are still the most authoritative source of information for establishing the root cause of an issue. What is new is that, unlike in the old days, you do not want to start with the logs when there is an issue.

Ideally, all of the other observability features should tell you the issue, and you should scan through the logs only to confirm your theory of why the failure happened based on what you saw in other observability mechanisms.

The amount of information logged has increased manifold. Any system these days generates several thousand lines of logs per second in files spread across multiple locations, so we now need a system that makes it easy to browse and search log entries.

Fluentd

Fluentd provides a “unified logging layer” that captures logs from various apps in JSON format and forwards them to other systems.

Fluentd supports memory and file-based buffering to prevent inter-node data loss, offers robust failover and can be set up for high availability.

At the heart of Fluentd is an event-driven system supported by a plugin-based architecture with hundreds of plugins.

All events are processed against rules you define and are routed to the destination those rules specify. For processing and forwarding logs from a cloud native app, Fluent Bit is a lightweight tool from the Fluentd family.
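The rule-based routing idea can be illustrated with a few lines of Python. This is a simplified sketch and not Fluentd's implementation: it uses shell-style wildcards from `fnmatch`, where `*` also matches across dots, and the tags and destination names are made up.

```python
from fnmatch import fnmatchcase


def route(event_tag, rules):
    """Return the destination of the first rule whose pattern matches the tag.

    `rules` is an ordered list of (pattern, destination) pairs, so more
    specific patterns should come before broader ones.
    """
    for pattern, destination in rules:
        if fnmatchcase(event_tag, pattern):
            return destination
    return None


# Hypothetical rule set: payment logs are searchable, the rest is archived.
RULES = [
    ("app.payments.*", "elasticsearch"),
    ("app.*", "object-storage"),
    ("*", "stdout"),
]
```

Because the first matching rule wins, the order of the rules is part of the configuration: moving the catch-all `*` to the top would send everything to `stdout`.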

ELK Stack

When the ELK stack (the trio of Elasticsearch, Logstash and Kibana) became popular a few years ago, it was part of a major push in the industry towards enabling observability into what the application and infrastructure are doing at any given point in time. It is fair to say that the ELK stack has played a major role in making observability a thing.

Logstash collects and aggregates logs, Elasticsearch makes the logs searchable and Kibana visualises the results in the form of a visually appealing dashboard.

Though Prometheus, Grafana and Fluentd have become favourites in recent times with cloud native audiences, the ELK stack remains popular, especially among enterprises. The sidecar deployment pattern of Prometheus and Fluent Bit has also contributed to their popularity.

The ELK stack also has an equivalent of Fluent Bit called Beats, which provides similar lightweight, edge-based forwarding capability. Kibana also includes a dedicated Logs app that provides a console view of logs, an APM app for Application Performance Monitoring and a Metrics app to visualise infrastructure metrics.

It is not all or nothing with the ELK stack. Another popular combination (EFK) uses Fluentd in place of Logstash.

Sentry

Sentry is a more specialised tool with a very specific purpose: to surface the errors and exceptions in your application code. You include the client libraries in your code and Sentry intercepts the errors, pulling together the details that make it easy to spot where, when and why they happened.
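The general idea of intercepting errors and recording where, when and why they happened can be sketched with a decorator. This is a conceptual illustration only and does not use the Sentry SDK; `capture_errors` and the `report` callback are hypothetical names.

```python
import functools
import traceback
from datetime import datetime, timezone


def capture_errors(report):
    """Wrap a function so that any exception is reported, then re-raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                report({
                    "when": datetime.now(timezone.utc).isoformat(),
                    "where": func.__name__,
                    "why": f"{type(exc).__name__}: {exc}",
                    "stack": traceback.format_exc(),
                })
                raise  # the caller still sees the original error
        return wrapper
    return decorator
```

Re-raising after reporting matters: the tool observes the failure without changing how the application itself handles it.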

When an error occurs, Sentry can show a dialogue box to the end user of your application, asking for additional details about what they were doing that led to the error. This gives the developer additional context to help replicate the issue.

Sentry also integrates with Slack, Jira and other development tools to make it easier to catch bugs and throw them back into the development workflow.

Logs as event streams

Logging is so important that it is also one of the 12 factors that you need to consider while building a cloud native application. There are two key messages embedded in this principle.

The first is that a cloud native application should not attempt to create, manage or rotate log files. The log messages should go straight to stdout so that the runtime environment (Kubernetes, for example) can easily surface them to the observability frameworks mentioned here.
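In Python, writing structured logs to stdout needs nothing beyond the standard `logging` module. The sketch below emits one JSON object per line; the field names and the `build_logger` helper are assumptions chosen for this example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single line of JSON."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to stdout (or `stream`)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # no file handlers, no rotation
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

The application never opens a file: whatever collects stdout in the runtime environment decides where the lines end up.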

The second message is that you should start treating logs as event streams, so that they can be fed into big data and querying environments for further analysis of trends and for correlating incidents with other environmental factors. That helps answer questions such as how your ability to process a certain number of transactions in an hour changed after moving to a different framework or library.
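Once logs are structured events, a question like "how many transactions did we process per hour?" becomes a short aggregation. The sketch below assumes JSON log lines with hypothetical `time` and `event` fields and an event name of `transaction_completed`.

```python
import json
from collections import Counter
from datetime import datetime


def transactions_per_hour(log_lines):
    """Count completed transactions per hour from JSON log lines."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "transaction_completed":
            continue  # ignore unrelated events in the stream
        hour = datetime.fromisoformat(event["time"]).strftime("%Y-%m-%d %H:00")
        counts[hour] += 1
    return dict(counts)
```

Running the same aggregation over the periods before and after a framework or library change gives a direct comparison of throughput.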

It can also help address business challenges caused by quiet failures. For example, if a B2B partner phones up to ask about a transaction they sent to your system two weeks ago and never received a response to, having a log repository to reconstruct what happened is very helpful.

People and Processes

Just as important as the tools are the people who monitor and the processes they follow to act upon the events from your observability framework.

You built it, you run it

If the application architecture is based on microservices, you should have already organised the team structures around the microservice boundaries so that the people who know the service well are the ones who run it. This ensures that the people looking at the observability framework are the ones who know the application well and are best placed to refine the framework over time.

But unlike the developers, who have the luxury of working in isolation to build the service any which way they like, the ‘observers’ need to be aware of how the adjacent services work and what information the other service teams need, so that the teams can collaborate when issues arise that span services.

Site Reliability Engineering (SRE)

Google came up with the idea of Site Reliability Engineering as a pioneering approach to running planetary-scale systems. The SRE teams are made up of an equal number of developers and operators, who spend half of their time on operations and on-call work and the remaining half automating those activities.

The result of such high levels of automation is systems that are not just automated but fully automatic: they can react to and handle situations on their own.

The SRE books are a must read for anyone interested in building systems that use observability frameworks to build resilient and highly scalable applications.

Synopsis

The purpose of making your applications and infrastructure more observable is not merely to ‘know’ what is going on, but to take that to the next level: first, prevent them from failing in the first place; failing that, set up automated recovery mechanisms; or, at the very least, be notified of the failure so that you can act on it, in that order.

The worst failures are the ones that happen quietly, without any indication that something has gone wrong. The best failures are the ones where the system detects a failure before it happens and takes remedial action to prevent it. Most failures fall somewhere between the two ends of this spectrum.

One of the benefits of cloud native systems is that a well engineered one does not ‘fail’ completely. Most issues just result in degraded performance. The tools and practices mentioned here help pre-empt or manage those scenarios so that your applications and infrastructure can remain healthy.

As we cannot foresee all possible failure scenarios during the development cycle, observability provides an important feedback loop into the development cycle to make the application better. A well engineered observability framework will give you exactly that: a list of problems that you need to address in your application’s code and architecture to grow with your business.

Prometheus - Monitoring system & time series database

prometheus.io

Grafana: The open observability platform

Grafana is the open source analytics & monitoring solution for every database Get…

grafana.com

Jaeger: open source, end-to-end distributed tracing

Monitor and troubleshoot transactions in complex distributed systems

www.jaegertracing.io

Fluentd | Open Source Data Collector

Fluentd is an open source data collector for unified logging layer.

www.fluentd.org

ELK Stack: Elasticsearch, Logstash, Kibana | Elastic

What is the ELK Stack? The ELK Stack is an acronym for a combination of three widely used open source projects…

www.elastic.co

Application Monitoring and Error Tracking Software

Open-source and hosted error monitoring that helps software teams discover, triage, and prioritize errors in real-time.

sentry.io

Google - Site Reliability Engineering

Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne The Site Reliability…

landing.google.com