Requirements for monitoring a modern app stack — part 1

monitorjain
Intelligent Observer
6 min read · Dec 15, 2018

An illustrated guide to becoming a better SRE and developer whilst embracing some contemporary monitoring tips

We’re nearing the end of 2018 and about to enter 2019. The newest electric cars charge in under 3 mins and Virgin Galactic’s supersonic plane has reached space, yet in the monitoring & DevOps world, teams are still struggling to launch a full-scale monitoring solution to manage a modern app stack. Thanks but no thanks to the assortment of CI/CD tools thrown our way and the forced advent of microservices, cloud & DevOps practices. Whilst they provide countless optimizations and benefits, the biggest pitfalls of these trends are convolution and their impact on performance, aka digital user experience. To a lot of DevOps and SRE leaders, commissioning a true state-of-the-art, modern monitoring center feels like the cryptic chemistry & geometry math formulas back at school shown in the photo here.

Photo: Geometry math formula back at school (Pixabay, CC0)

Fret not, not for long!

On a positive note, enabling full-stack observability is not as complicated as it used to be, but the secret sauce is still as well guarded as the Coke formula. Not for long, as promised: I shall demystify it for you in this series of posts.

Before getting to the formula, let’s take a step back, and define and visualize a modern app stack. I call this stage ‘understanding the beast’. To provide you with a reference, I built this simple wedding cake-like diagram that represents a modern app stack.

Let’s describe the modern app stack in the diagram (my own creation) layer by layer.

Photo: Modern application stack (Infra-app-UI)

Bottom-most layer: You’ll typically have your hybrid, private or public cloud infrastructure where you would most likely be using serverless services, microservices hosts like ECS, Fargate, AKS, EKS, GKE, GCE, GAE etc. Plus, you will have a lot of non-app managed or self-managed services like MongoDB, Consul, Varnish, Nginx, Redis etc.

The middle layer: It will typically be the decomposed or refactored app services — cloud functions, containers, and Java|Ruby|Python|Go|Node|PHP|.NET core app processes together with some legacy service bus, 3rd-party integration services, messaging systems, API-gateways etc.

The top layer: It will be your user services, which support a multi-channel user experience in the form of native and hybrid mobile applications, mobile web interfaces or web services, and/or desktop web single-page or hybrid HTML services.

The invisible layer: Let’s not forget an unaccounted layer, the code deployment, configuration, and infrastructure changes that keep propagating in our app ecosystem. Hence, I call this the invisible layer.

Rather intricate, huh! “One way to eat an elephant is one bite at a time”.

A majority of the customers I work with, regardless of the vertical they’re in, have the aforementioned application ecosystem which poses complications when deploying a monitoring strategy.

If you have a similar app ecosystem, next we will cover the best way to monitor such environments. Any monitoring platform requires a solid collection strategy: 1) out-of-band or 2) in-band collection, based on logs, APIs or agents.

In simple terms, in-band monitoring of apps uses the app request streams to collect and report performance telemetry, whereas out-of-band monitoring creates a separate, dedicated channel to report performance telemetry.
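To make the distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than any vendor's API: `handle_request` stands in for real application work, the `X-Elapsed-Ms` header is a made-up way of piggybacking telemetry on the response, and `OutOfBandReporter` plays the role of an agent that ships timings over its own channel (here just a background thread appending to a list).

```python
import queue
import threading
import time


def handle_request(payload):
    """Stand-in for real application work (hypothetical)."""
    return {"body": payload.upper(), "headers": {}}


def in_band(payload):
    """In-band: telemetry rides on the request/response stream itself."""
    start = time.perf_counter()
    response = handle_request(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Hypothetical header name; the timing travels with the response.
    response["headers"]["X-Elapsed-Ms"] = f"{elapsed_ms:.3f}"
    return response


class OutOfBandReporter:
    """Out-of-band: telemetry leaves through a separate, dedicated channel."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # stands in for an agent or collector endpoint
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Background worker: moves telemetry to the sink off the request path.
        while True:
            item = self._queue.get()
            if item is None:  # sentinel pushed by close()
                break
            self._sink.append(item)

    def record(self, name, elapsed_ms):
        self._queue.put({"name": name, "elapsed_ms": elapsed_ms})

    def close(self):
        self._queue.put(None)
        self._worker.join()


def out_of_band(payload, reporter):
    """Same work, but the response carries no telemetry at all."""
    start = time.perf_counter()
    response = handle_request(payload)
    reporter.record("handle_request", (time.perf_counter() - start) * 1000)
    return response
```

In the in-band variant the caller receives the timing along with the response; in the out-of-band variant the response stays untouched and the measurement reaches the sink independently of the request stream.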

My inclination is towards out-of-band, agent-based collection, although out-of-band log collection works wonders for serverless cloud functions.

Then simply break down the monitoring platform commissioning task into three stages. The first stage is full-stack collection (do this as quickly as possible and worry about visualization later). The second stage is the propagation of telemetry via the collection backbone (leave it to your monitoring vendor). The last is visualization, both responsive web (mobile + desktop) and mobile native (you don’t want to wake up from sleep and launch the UI on your laptop; restore some zzzzzs). Clearly, I’ve black-boxed the intimate details of each stage, the reason being that I believe in using ready-to-use platforms rather than building my own. Building a platform (using OSS) takes your focus away from the outcomes: root cause analysis aka SRE use-cases, cloud management, ‘DevOps culture’ based monitoring, and digital user experience management. You don’t want to be in the infrastructure management, upgrade, and capacity management business.

BECAUSE IT’S TIME-CONSUMING AND CUMBERSOME.

Any intelligence or analytics platform that can gracefully collect, process and visualize the modern app stack described above needs four must-have attributes (analogically speaking, you just need 4 stormtroopers to win the performance and service reliability war):

Photo: Stormtroopers representing the 4 key attributes of a monitoring platform (Pixabay, CC0)

1) Dynamic scaling aka SaaS collector backbone so that you are able to maintain a laser focus on monitoring outcomes to optimize mean-time-to-resolve/mean-time-to-identify/mean-time-to-failure.

2) Auto instrumentation (zero code-level instrumentation) with SDKs/APIs for extensibility. Code-level instrumentation slows you down and skips the rare transaction flows, which can be dangerous for operations management or SRE teams.

3) State-of-the-art monitoring outputs or patterns like full stack correlation map, out-of-the-box reports, request flow maps, unified alerting engine and role-based dashboarding, distributed traces (CCTV footage of the monitoring space) and machine intelligence for outliers and anomaly detection.

4) Change detection because, according to Gartner, 60–70% of all outages and service reliability problems happen due to deployments, config modifications and other types of changes. Netflix did a simple survey whereby they learned that a majority of the outages with their service happened on weekdays, with the highest concentration in the morning between 9 am and 12 pm. In a YouTube video, the director of operations joked that Netflix was literally paying employees to come to the office and break its services.
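As a rough illustration of the change detection attribute, the Python sketch below lists the change events that landed shortly before an incident. The function name `changes_before`, the event fields (`at`, `kind`, `target`), the one-hour lookback window and the sample events are all assumptions made for this example; a real platform would pull these events from deployment, config and infrastructure tooling.

```python
from datetime import datetime, timedelta


def changes_before(incident_time, changes, window=timedelta(minutes=30)):
    """Return change events within `window` before the incident, newest first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= incident_time - c["at"] <= window
    ]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)


# Illustrative, made-up change log covering the 'invisible layer'.
changes = [
    {"at": datetime(2018, 12, 14, 9, 5), "kind": "deploy", "target": "checkout-service"},
    {"at": datetime(2018, 12, 14, 9, 40), "kind": "config", "target": "api-gateway"},
    {"at": datetime(2018, 12, 14, 13, 0), "kind": "infra", "target": "redis"},
]

incident = datetime(2018, 12, 14, 9, 50)
suspects = changes_before(incident, changes, window=timedelta(hours=1))

for change in suspects:
    print(change["at"].isoformat(), change["kind"], change["target"])
```

With this sample data the two morning changes are flagged as suspects and the afternoon change is ignored because it happened after the incident. Sorting newest first reflects the assumption that the most recent change is the first one worth checking.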

Since a picture speaks a thousand words, this is what a modern full-stack monitoring solution should look like:

Photo: A pictorial representation of a utopian monitoring solution/platform

IMHO, DevOps is nothing but a ‘blameless culture of shared pain and responsibility’ and not a bunch of tools that will solve your problems. Trust me, I know; I’ve been on 100s of on-calls and war rooms. Hence, if you want to harness the power of DevOps culture, simply foster a blameless culture and say no to retaliation for speaking up. There is no better way to shift left quickly. Solutions first and no room for finger-pointing.

KEY TAKEAWAYS

  1. Definition and building blocks of a true modern app stack
  2. A simplified definition of DevOps
  3. The 4 stormtroopers of a full-scale, elastic and self-served monitoring platform
  4. The methodology of managing & monitoring a modern app stack

Our monitoring and analytics utopia should not look like the SpaceX monitoring center shown here. Simply staring down at screens won’t make the problem go away or make optimizing the monitoring M’s easy. There is a set recipe that we will unravel together in my next post.

Photo: SpaceX monitoring center (Pixabay, CC0)

STAY TUNED amigos.

P.S. I am a digital intelligence evangelist; these are my own opinions, not certified by or recommendations from any specific monitoring vendor.

Follow me on GitHub: monitorjain | Twitter: monitorjain | LinkedIn: nikmjain

monitorjain
Intelligent Observer

Value Engineering | SRE, Cloud, and Dev advocate | Tech enthusiast | Kaizen practitioner | Presales coach | Dad