Resiliency and Chaos Engineering — Part 5

Pradip VS
5 min read · Mar 29, 2022


In this part, I will cover one of the key aspects of resiliency: observability, monitoring, health checks, and Azure's observability offering, Azure Monitor.

Observability at the heart of critical systems. Source

This part continues from parts 1, 2, 3, and 4. Kindly go through them to get a broader context.

Cindy Sridharan has written an insightful blog on observability and monitoring, and I recommend going through it.

One of the key aspects of preventing failures and monitoring the health, performance, and security of mission-critical systems is having an observability solution (though observability does not guarantee that every failure can be avoided).

In Azure, we advise customers to log metrics wherever possible: enable diagnostic settings in all Azure services, use Azure App Insights for application performance management, and use Azure Monitor, which gives full observability of the app, infrastructure, and network.

Monitoring features as seen in an Azure Cosmos DB
Diagnostic settings are available for Cosmos DB, where one can track data plane events, control plane events, and other metrics, and push them to a preferred destination or integrate with partner solutions.

We recommend that our clients push logs to a Log Analytics workspace, since it is easy to query and lets them dig deep while troubleshooting.

Pushing the logs to a Log Analytics Workspace to be queried easily

In Azure Cosmos DB, we advise customers to capture as many diagnostic logs as possible, because the logs make it easy to troubleshoot three recurring problems: hot partitions caused by partition keys that were not optimized or not thought through during the design phase, queries running without a partition key filter, and Request Units (RU) that are not allocated properly. We see these problems recur, and hence it is mandatory that customers enable diagnostic logs. Since ecommerce applications might encounter a sudden spike in requests, problems that are not caught in advance can result in 429s (throttling), dropped requests, and eventually disrupted services. So, to catch this early rather than let your application fail on a Black Friday, it is imperative to revisit the design, perform stress and resiliency tests to see how it performs, and capture the results in logs so that performance optimizations are backed by data.
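To make the hot-partition idea concrete, here is a minimal sketch of the kind of analysis diagnostic logs enable. It assumes the logs have already been exported into simple records; the field names (`partition_key`, `request_charge`, `status_code`), the sample values, and the 50% threshold are all hypothetical and do not reflect the actual Cosmos DB log schema.

```python
from collections import defaultdict

# Hypothetical, hand-written records. Real diagnostic logs carry a richer schema.
SAMPLE_LOGS = [
    {"partition_key": "store-1", "request_charge": 900.0, "status_code": 200},
    {"partition_key": "store-1", "request_charge": 850.0, "status_code": 429},
    {"partition_key": "store-2", "request_charge": 120.0, "status_code": 200},
    {"partition_key": "store-3", "request_charge": 130.0, "status_code": 200},
]


def ru_share_by_partition(records):
    """Return each partition key's share of the total RU consumed."""
    totals = defaultdict(float)
    for record in records:
        totals[record["partition_key"]] += record["request_charge"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: ru / grand_total for key, ru in totals.items()}


def hot_partitions(records, share_threshold=0.5):
    """Flag partition keys that consume at least share_threshold of all RU."""
    shares = ru_share_by_partition(records)
    return sorted(key for key, share in shares.items() if share >= share_threshold)


def throttled_count(records):
    """Count requests rejected with HTTP 429 (throttling)."""
    return sum(1 for record in records if record["status_code"] == 429)


print(hot_partitions(SAMPLE_LOGS))
print(throttled_count(SAMPLE_LOGS))
```

Running this kind of check against logs captured during a stress test is one way to spot a skewed partition key before a traffic spike exposes it.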

Azure App Insights constantly monitors apps to detect performance anomalies, diagnose issues, and improve performance and usability. Refer to: Application Insights overview - Azure Monitor | Microsoft Docs
Azure Monitor provides full observability into your applications, infrastructure, and network. Refer to: Azure Monitor | Microsoft Azure

Azure Monitor lets you observe at any level of the stack, whether app performance management (Application Insights), VMs (VM insights), containers (Container insights), or network (Network insights). It provides monitoring capabilities for deep insights and enables traceability and log analytics. Azure Monitor integrates with many platforms and is suitable for enterprise-grade, mission-critical workloads.

Azure Monitor is the key tool for observability, and we highly recommend that our customers use it for collecting, analyzing, and acting on telemetry data from cloud and on-premises environments.

Azure Monitor collects data that fits into one of two fundamental types: metrics and logs.

Metrics are numerical values that describe some aspect of a system at a particular point in time.

Metrics are lightweight and capable of supporting near real-time scenarios.

Logs contain different kinds of data organized into records with different sets of properties for each type.

Events, traces, and performance data are stored as logs so that they can all be combined for analysis.

This plays a vital role in the observability of a system, so that one can detect failures proactively.
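The difference between the two data types can be sketched in a few lines. This is only an illustration of the shapes described above, not Azure Monitor's data model: the class names, the `cpu_percent` metric, the sample values, and the 60-second correlation window are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class MetricPoint:
    """A metric: one numerical value describing the system at a point in time."""
    name: str
    value: float
    timestamp: datetime


@dataclass
class LogRecord:
    """A log: a record whose set of properties depends on its kind."""
    kind: str  # e.g. "event", "trace", "performance"
    timestamp: datetime
    properties: dict = field(default_factory=dict)


def logs_near_metric_breach(metrics, logs, name, threshold, window_seconds=60):
    """Return logs written within window_seconds of any metric point above threshold."""
    breaches = [m for m in metrics if m.name == name and m.value > threshold]
    related = []
    for log in logs:
        if any(
            abs((log.timestamp - m.timestamp).total_seconds()) <= window_seconds
            for m in breaches
        ):
            related.append(log)
    return related


t0 = datetime(2022, 3, 29, 12, 0, tzinfo=timezone.utc)
metrics = [
    MetricPoint("cpu_percent", 35.0, t0),
    MetricPoint("cpu_percent", 97.0, t0 + timedelta(minutes=5)),
]
logs = [
    LogRecord("event", t0 + timedelta(seconds=10), {"message": "deployment finished"}),
    LogRecord("trace", t0 + timedelta(minutes=5, seconds=20),
              {"operation": "checkout", "duration_ms": 4200}),
]

related = logs_near_metric_breach(metrics, logs, "cpu_percent", threshold=90.0)
print([log.kind for log in related])
```

The lightweight metric tells you that something crossed a threshold; the richer log records around that moment help explain why.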

Health Checks:

Health checks are an important aspect, and one of the key motivations for performing them is that no hardware (CPU, memory, hard drive, or network components) comes with a 100% guarantee. The same applies to software.

Bugs, memory and thread leaks, deadlocks, misconfigurations, and redundancy without coordination are some of the common causes of failures.

These failures are broadly classified into two types:

Local or host failures: hardware failures such as a disk crash or an OS crash.

Dependency failures: failures caused by external factors such as connectivity issues.

There are a few health check patterns, but I will mention the two common ones that are widely used:

  • A shallow health check gives superficial information about whether a server instance is reachable (e.g., ping).
Shallow Health Check. Source
  • A deep health check gives a better understanding of the health of your application, since it can catch issues with dependencies.
Deep Health Check. Source
Deep health checks also check the dependent systems

Deep health checks are both expensive and time consuming, while shallow health checks do not give good insight into health problems. So a better tradeoff is to combine them: use a shallow health check during startup and, once all the dependencies are green, turn that shallow health check into a deep health check.
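The combined approach can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not a production pattern: the `HealthChecker` class, the dependency names, and the probe functions are hypothetical, and the probes here simply read from an in-memory dictionary instead of calling real services.

```python
from typing import Callable, Dict


class HealthChecker:
    """Starts in shallow mode and switches to deep mode once every dependency is green."""

    def __init__(self, dependencies: Dict[str, Callable[[], bool]]):
        self._dependencies = dependencies
        self._deep_mode = False

    def _probe_dependencies(self) -> Dict[str, bool]:
        results = {}
        for name, probe in self._dependencies.items():
            try:
                results[name] = bool(probe())
            except Exception:
                # A probe that raises is treated as an unhealthy dependency.
                results[name] = False
        return results

    def check(self) -> dict:
        results = self._probe_dependencies()
        # Promote to deep mode the first time all dependencies report green.
        if not self._deep_mode and all(results.values()):
            self._deep_mode = True
        if not self._deep_mode:
            # Shallow: the process is reachable, dependencies are not yet judged.
            return {"mode": "shallow", "healthy": True, "failing": []}
        failing = sorted(name for name, ok in results.items() if not ok)
        return {"mode": "deep", "healthy": not failing, "failing": failing}


# Hypothetical dependency state, standing in for real connectivity checks.
state = {"db": False, "cache": True}
checker = HealthChecker({
    "db": lambda: state["db"],
    "cache": lambda: state["cache"],
})

during_startup = checker.check()   # db not ready yet, so still shallow
state["db"] = True
after_warmup = checker.check()     # everything green, so switches to deep
state["db"] = False
after_outage = checker.check()     # deep check now reports the failing dependency
print(during_startup, after_warmup, after_outage)
```

Once the checker has switched to deep mode it stays there, so a dependency outage after startup is reported as unhealthy instead of being masked by the shallow check.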

In hard times, shallow health checks should be prioritized over deep health checks, since deep checks are the more expensive of the two.

One of the key aspects is to perform chaos testing on the health-monitoring systems themselves. While continuous monitoring gives a system's health in real time, the monitoring and alerting systems should also be tested (using chaos testing) to ensure that they are working properly and showing the right metrics all the time.
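As a rough illustration of testing the alerting path itself, the sketch below injects a latency fault into a stream of samples and verifies that an alert rule notices it. Everything here is hypothetical: the `ThresholdAlert` rule, the latency numbers, and the "three consecutive breaches" policy are assumptions chosen for the example, and a real experiment would inject the fault into a running system and not into a list.

```python
class ThresholdAlert:
    """Fires once a metric stays above a threshold for min_consecutive samples."""

    def __init__(self, threshold, min_consecutive=3):
        self.threshold = threshold
        self.min_consecutive = min_consecutive
        self._breaches = 0
        self.fired = False

    def observe(self, value):
        # Count consecutive breaches; any healthy sample resets the streak.
        self._breaches = self._breaches + 1 if value > self.threshold else 0
        if self._breaches >= self.min_consecutive:
            self.fired = True
        return self.fired


def latency_samples(baseline_ms, fault_ms, fault_start, fault_length, total):
    """Build a latency series with an injected fault window."""
    return [
        fault_ms if fault_start <= i < fault_start + fault_length else baseline_ms
        for i in range(total)
    ]


def run_experiment(alert, samples):
    """Feed samples to the alert; return the index where it fired, or None."""
    for index, value in enumerate(samples):
        if alert.observe(value):
            return index
    return None


# Chaos run: inject four slow samples starting at index 5.
faulty = latency_samples(100, 1500, fault_start=5, fault_length=4, total=12)
fired_at = run_experiment(ThresholdAlert(threshold=1000), faulty)

# Control run: no fault injected, so the alert should stay silent.
control = latency_samples(100, 1500, fault_start=5, fault_length=0, total=12)
control_fired_at = run_experiment(ThresholdAlert(threshold=1000), control)

print(fired_at, control_fired_at)
```

The control run matters as much as the chaos run: an alert that fires without a fault is as misleading as one that stays silent during a fault.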

To conclude, we keep observability at the heart of whatever we do (even though design decisions and engineers' intuition play a vital role while designing systems). We strive to keep observability simple, so that only actionable metrics are captured and acted upon with proper data backing. Azure Monitor is the tool we recommend to our customers, and it is important that health checks are run constantly to make sure we are seeing the right health metrics all the time.

Please go through these two blogs, which give very good insight into health checks and observability.

Lessons from Building Observability Tools at Netflix | by Netflix Technology Blog | Netflix TechBlog

Patterns for Resilient Architecture — Part 3 | by Adrian Hornsby | The Cloud Architect | Medium

Thank you. In the next part, we will focus on Chaos Engineering, its phases, and how it strengthens resiliency engineering and exposes flaws in system and architectural designs. We will also cover the Chaos Engineer's role, and who should plan these efforts in an organization and how, with the right team or individuals to achieve these initiatives. Till then, stay tuned…

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)


Architect@Microsoft. I help & co-innovate with the customers in Generative AI, ML, Data Engineering, Analytics, Resiliency Engineering, Data Arch & Strategies.