How to build world-class observability based on DevOps principles?

monitorjain
Intelligent Observer
11 min read · Jan 22, 2019
An image depicting a utopian monitoring center (Pixabay: CC0)

First, welcome to 2019. I hope this is your year of rapid growth and achievement.

In this article, I will share the secret recipe for building a world-class observability unit: core tips on picking the right tools, following the right steps, and applying the best principles available in the industry.

I’ve stayed vendor-neutral in order to do justice to the widest possible group of teams, individuals, and companies seeking answers. I will let you be the judge of which tool works best for your unique use cases, challenges, and team dynamics.

Who is this for?

Personas that can leverage this article are Developers, Executives, Service Delivery folks and SREs (Pixabay: CC0)

If you are building a brand new service, decomposing monoliths into microservices (lifting and transforming), breaking traditional app services down into functions, or facing scalability issues that have forced you to stop innovating and switch to stability mode (especially due to growth bursts every few months), this is for you. In terms of roles, developers, SREs, service delivery, product owners, testers, DBAs, and CTOs/CIOs can all use this template.

Alright, let’s jump straight into it.

The 4 stages of setting up a world-class monitoring center (Pixabay: CC0)

IDEA Stage: Understanding your objectives and goals

Mind you, this recipe has been battle-tested over and over again: tens of times, probably even hundreds; I never really kept count.

The secret recipe for building a successful monitoring strategy is actually very simple. Start with the end goal in mind, aka enable a fast-growing business atop a user service that delivers a 5-star experience for its end users. Sounds trivial? Think again! The trick is to fully understand what this end goal means and to unpack it, in observability terms, into small, achievable milestones. If you’re out of ideas, read on; I have shared plenty here.

PLANNING Stage: Build your blueprint

“Be a woman/man/team with a plan or a blueprint”.

STEP 1: This is the most critical step — work on a proper FMEA!

List down potential issues and their criticality: this exercise is known as a Failure Mode & Effects Analysis (FMEA). To tickle your imagination, I will give you a small preview. Failure modes are the degradation scenarios and failures that can happen to your systems, for example: website down, API service down, database out of connection pool, sluggish UI response, etc.

The severity level: this column represents how critical a given failure mode is, meaning the degree of service disruption, revenue leakage, customer-experience turmoil, and angered customers it can cause. The value ranges across catastrophic, high-risk, medium-risk, and low-risk.

The expected or past frequency: this column represents the probability of occurrence; the higher the value, the more likely that failure mode will present itself in your production services. The value ranges across frequent, occasional, uncommon, and remote.

Detection capability: this column represents the probability of detection. The highest value means the fault will pass undetected to end users; the lowest means your tests and monitors will catch the failure mode.

FMEA Template (MBASkool.com)

Do the above for every failure mode (degradation and error type) that has been experienced in the past or that your DevOps/SRE teams could plausibly experience with mission-critical services. Your alerting and dashboarding strategy, aka your game plan, is based on this. This is inherently super important stuff! Without such a template or blueprint, your dashboards and alert patterns will be hit-or-miss.

In a nutshell, the higher the RPN (Risk Priority Number, the product of the severity, occurrence, and detection ratings), the higher the graphs/data/widgets for that failure mode should float on your dashboards, and the higher its priority for proactive alerting. The single biggest inhibitor for teams today is the lack of balance in their data, caused by the majority of teams engaging in data warfare. I wish the answer were as simple as “collect as much data as you can and you won’t have downtime”, but it isn’t.
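To make the RPN idea concrete, here is a minimal sketch (not tied to any particular tool) of encoding the FMEA in code and ranking failure modes. The 1–10 scoring scales and the example failure modes are illustrative assumptions; replace them with the ones from your own FMEA.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (frequent)
    detection: int   # 1 (almost always caught) .. 10 (passes undetected)

    @property
    def rpn(self) -> int:
        # Risk Priority Number = severity x occurrence x detection
        return self.severity * self.occurrence * self.detection

# Illustrative failure modes; replace with the ones from your own FMEA.
failure_modes = [
    FailureMode("Website down", severity=10, occurrence=3, detection=2),
    FailureMode("Database connection pool exhausted", severity=8, occurrence=5, detection=6),
    FailureMode("Sluggish UI response", severity=5, occurrence=7, detection=4),
]

# The highest-RPN failure modes float to the top of dashboards and alert priority.
for fm in sorted(failure_modes, key=lambda f: f.rpn, reverse=True):
    print(f"{fm.rpn:4d}  {fm.name}")
```

Ranking by RPN is what turns the FMEA from a spreadsheet exercise into the ordering logic for your dashboards and alerts.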

The brutal truth is that modern and hybrid app services such as functions (which are really just warm and cold containers, now you know!), microservices, service orchestration layers, and service meshes add observability challenges. At the same time, they bring benefits like lower TCO, better ROI on infrastructure, separation of duties, agility in service start-up, and cleaner engineering and operations processes.

STEP 2: WHAT TO TRACK? List of SLIs/KPIs to track both horizontally and vertically

In simple terms, this second list answers the humongous “what to track” question. This is rather simple if you make an effort to understand the app stack, the processes (the full app lifecycle), and the services involved in taking an application from the developer’s laptop to production.

In order to build and run first-class user services, you have to measure the entire app lifecycle. In order to measure across the lifecycle and fix issues early, you need to list what those issues have been for your company (aka capture the ghosts of the past), flesh out their criticality (which depends on the revenue leakage their occurrence has caused or could cause), and stop treating monitoring as an afterthought. But hey, the FMEA already has this information, which is why it should always be step 1.

Next, make a curated list of all the signals/patterns/KPIs/data/SLIs that can help you proactively alert on and display the bad symptoms (latencies and errors). Organize the KPIs horizontally and vertically: the horizontal arrangement will give you the names of your dashboards (no more death by dashboards!), and the vertical arrangement will give you the layers within each dashboard.

Horizontal arrangement means grouping KPIs by the stage or DevOps phase during which they are collected, aka their family or group. For instance: source tools like GitHub, Bitbucket, and GitLab; build tools; and deploy tools.

Vertical arrangement means grouping KPIs by stack order: if a typical full stack is made of 3, 4, or n layers, organize them by those layers. This is exactly how your alert policies should be mapped. For instance, the layers might be UI, microservices, API layer, backend legacy, infrastructure, non-app services, etc.

To tickle your imagination, I’ve furnished some examples for the KPIs/signals to include in your list:

HORIZONTAL COVERAGE

Horizontal coverage is inspired by our code pipeline stages (image sourced from VMware Inc. blogs)

Pre-production — for which the key consumers are developers, service delivery/DevOps, testers, and CI/CD squads.

Source — GitHub, GitLab, Bitbucket (e.g. KPIs: successful commits, commits per product, branches per user, files modified, etc.)

Build — Bamboo, Jenkins, CircleCI, TeamCity, etc. (successful and failed builds, response time per build, etc.)

Deploy — Codeship, the entire Atlassian suite, Nolio, etc. (successful and failed deployments, response time per deployment, GA vs canary release, response time, CPU %, memory %, and error rates for a given deployment, etc.)

Test — LoadRunner, BlazeMeter, Sauce Labs, Appium, Tosca, etc. (hotspots and issues, both functional, i.e. code-level or pool depletion, and non-functional, i.e. scalability issues)

Prod — K8s, containers, API services, frontend and backend services etc.

P.S. When aggregated, the above comes together to form your code management pipeline.

VERTICAL COVERAGE

Classic Modern App Stack + a monitoring solution

Production/Operations — for which the key consumers are SRE/operations engineers and the support tiers.

User Experience (page load time, business errors such as JavaScript errors or bad AJAX response codes, browser code releases, AJAX response time, SPA response time, browser traces, etc.)

Microservices, hybrid, or serverless backend (backend response time, database response time, external service time, code deployments, error rates, both code-level and HTTP response issues, etc.)

Infrastructure (OS-level monitoring: CPU %, config changes, cloud event changes, ephemeral host up/down state, memory, network, disk, etc.)

Cloud services + non-app services (K8s error events, pod failures, replication issues, deployment issues, container-level stats, Redis, Memcached, database monitoring, AWS/GCP/PCF/OpenShift/Azure native service metrics, etc.)

P.S. When aggregated, the above organically comes together to form a modern app stack.
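To show how the horizontal and vertical arrangements can drive your dashboards rather than the other way around, here is a minimal sketch. The stage and layer names and the KPI catalogue are illustrative assumptions drawn loosely from the lists above, not a prescribed taxonomy.

```python
# Horizontal stages (one dashboard per stage) and vertical layers
# (the rows inside each dashboard), loosely mirroring the lists above.
STAGES = ["source", "build", "deploy", "test", "prod"]
LAYERS = ["user_experience", "services", "infrastructure", "cloud_and_non_app"]

# Illustrative KPI catalogue: (stage, layer) -> list of SLIs/KPIs.
KPI_CATALOGUE = {
    ("prod", "user_experience"): ["page_load_time", "js_error_rate", "ajax_response_time"],
    ("prod", "services"): ["backend_response_time", "error_rate", "db_response_time"],
    ("prod", "infrastructure"): ["cpu_percent", "memory_percent", "disk_io"],
    ("build", "services"): ["failed_builds", "build_duration"],
}

def dashboard_spec(stage: str) -> dict:
    """One dashboard per horizontal stage, one row per vertical layer."""
    return {
        "name": f"{stage}-overview",
        "rows": [
            {"layer": layer, "widgets": KPI_CATALOGUE.get((stage, layer), [])}
            for layer in LAYERS
        ],
    }

for stage in STAGES:
    print(dashboard_spec(stage))
```

Generating dashboards from one catalogue like this keeps the “one dashboard per stage, one row per layer” rule enforceable instead of aspirational.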

Without a plan, you will end up using a haphazard approach. Follow a plan and we will make things happen (Pixabay: CC0)

See how beautifully the list of KPIs goes wide and deep, covering both pre-production and production services, and helps you envision digital intelligence across your entire enterprise, serving a multitude of mission-critical stakeholders: developers, service delivery, engineering management, CIO/CTO, executives, SRE leadership and its team members, etc. You’d be surprised how similar this output is to what you recorded under the FMEA.

TIP 1: The above is especially pertinent for a microservices or serverless app ecosystem, because the number of connections on the critical path, and the performance monitoring needed to cover them, grows exponentially. Tools that only provide module-level analysis won’t work and will hold you back from a truly best-in-class center. Choose a tool that includes distributed tracing among its monitoring outputs and that focuses on P99s and outlier detection. Sampling is fine: if the issues are either scalability or code related, the most common sampling types (tail-based, head-based, adaptive, rate-limiter based) should catch the problems. Don’t buy the FUD: full trace and transaction collection is expensive and causes overhead.
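To make the sampling point concrete, here is a minimal, tool-agnostic sketch of the difference between head-based sampling (decide up front, probabilistically) and tail-based sampling (decide at the end, keep the interesting traces). The sample rate and latency threshold are illustrative assumptions.

```python
import random

HEAD_SAMPLE_RATE = 0.1           # keep ~10% of traces, decided up front
TAIL_LATENCY_THRESHOLD_MS = 500  # keep any trace slower than this, decided at the end

def head_based_keep() -> bool:
    # Decision made when the trace starts, before we know how it ends.
    return random.random() < HEAD_SAMPLE_RATE

def tail_based_keep(duration_ms: float, had_error: bool) -> bool:
    # Decision made after the trace completes: keep errors and outliers.
    return had_error or duration_ms > TAIL_LATENCY_THRESHOLD_MS

# Example: a slow, error-free trace is always kept by the tail-based sampler,
# but only ~10% of the time by the head-based one.
print(head_based_keep(), tail_based_keep(duration_ms=820, had_error=False))
```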

TIP 2: Risk and criticality analysis is really powerful and comes in handy during the setup stage. Don’t forget: just as you can’t build muscle at the gym by working out one day a year for 10 hours straight, setting up this center will require incremental improvements as you learn something new. Hence, set up a bi-weekly workshop (30 minutes) to update alerts and dashboards based on the most recent incidents and post-mortem learnings (if during a post-mortem you find that a certain anomaly was detected poorly, jot it down and update your dashboards and alert conditions accordingly).

This is the answer that nobody wants to hear but is almost always applicable.

While building your monitoring center, have a strategy in mind for various roles and use-cases (Pixabay: CC0)

Once you’re at this stage, you should have risks and criticality fleshed out, and you should have the list of alerting/dashboard signals out of the way. Lastly, you can develop some quick runbooks and recipes that can be tested in calm conditions and will give you a set list of steps to follow when disaster strikes in the moment of truth (your peaks, your promotion periods, the festive season, etc.). Build these runbooks around your identified failure modes, as in the sketch below.
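A runbook does not have to be elaborate to be useful. Here is a minimal sketch of runbooks as ordered checklists keyed by the failure modes from your FMEA; the steps are made up for illustration and should be replaced by whatever your post-mortems surface.

```python
# Illustrative runbooks keyed by the failure modes identified in your FMEA.
RUNBOOKS = {
    "Website down": [
        "Confirm the outage with synthetic checks from at least two regions",
        "Check the most recent deployment and roll back if it correlates",
        "Fail over to the standby region if rollback does not help",
    ],
    "Database connection pool exhausted": [
        "Check active connection counts on the primary",
        "Identify the service holding connections open (APM / slow query log)",
        "Recycle the worst offenders; raise pool size only as a last resort",
    ],
}

def print_runbook(failure_mode: str) -> None:
    steps = RUNBOOKS.get(failure_mode, ["No runbook yet: write one after the next post-mortem"])
    for i, step in enumerate(steps, start=1):
        print(f"{i}. {step}")

print_runbook("Website down")
```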

Tip 3: Do not end up under-alerting and over-dashboarding. This is the most common mistake; we are not built to stare at dashboards all day. The beauty of alerts is that they make us accountable. Aim for like-for-like parity between the two monitoring outputs: if a dashboard displays a certain KPI/widget, there is an alert for it, and vice versa.
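The like-for-like rule is easy to verify mechanically. Here is a minimal sketch, assuming you can export (or maintain as code) the set of KPIs shown on dashboards and the set of KPIs with alert conditions; the inventories below are made up for illustration.

```python
# Illustrative inventories; in practice these would be pulled from your
# monitoring tool's API or from your dashboards/alerts-as-code repository.
dashboard_kpis = {"page_load_time", "error_rate", "cpu_percent", "db_response_time"}
alerted_kpis = {"page_load_time", "error_rate", "disk_io"}

dashboarded_only = dashboard_kpis - alerted_kpis  # on a dashboard, but nobody gets paged
alerted_only = alerted_kpis - dashboard_kpis      # paged, but no widget to look at

print("Dashboarded but never alerted:", sorted(dashboarded_only))
print("Alerted but not on any dashboard:", sorted(alerted_only))
```

Running a check like this in the bi-weekly workshop keeps the dashboards and alert conditions from drifting apart.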

STRATEGY Stage: The win on the culture front

Get your heads together (Pixabay: CC0)

If you’re with me so far, embrace DevOps practices while enabling monitoring. This is where engineering and SRE leaders need to join forces by building a simple bridge of unity and oneness, not just one defined by a chain of tools. The truth is that we can’t really copy the culture of a successful organization, but we can certainly copy its structures. This will give you a true advantage in enabling a blameless culture where you always lead with a “solution first” mindset. Once you get to this stage, it’s an easy road from here, just like the fast lanes of the German Autobahn. Automate to eliminate toil, and don’t blindly follow the DevOps loop that has monitoring as a core piece but only towards the tail end.

Tip 4: True end-to-end is not a single pane of glass. I know this because I’ve worked with hundreds of companies on this exact use case over the last 7 years, and the single pane of glass was always protracted: it’s music to everyone’s ears but rarely realized (for several reasons, a topic in its own right). When companies request end-to-end explicitly, they are implicitly asking whether you have them covered.

SUCCESS Stage: Time for action

Celebrate your success (Pixabay: CC0)

Onboarding: Now it’s game day; the fun part begins. Bake cross-stack agents (whether open-source or proprietary in flavor) right into the container, as a sidecar, via external mount points, or into VM images. For functions-as-a-service, enable tracing of API calls within the function code so that distributed tracing can be leveraged. Next, enable alerting and dashboarding via automation so you don’t have to build them after the agent installation lands in production; you could leverage Terraform, CF, or Ansible scripts and have them ready.

Dashboards & alerting patterns: Remember the criticality levels discovered in the FMEA? Plug them right into the alert condition names, so that when an alert triggers, its name announces the criticality. Also, the list of SLIs deconstructed from the FMEA and the horizontal-vertical coverage will turn into end-to-end dashboard apps, where each tab represents a stage of the app (the horizontal stages) and every dashboard has the vertical layers mimicking the app stack, with the highest-RPN (Risk Priority Number) failure modes at the top. Enable dashboard widgets that allow filtering by metadata for more powerful insights. Build dashboards that tell a story uniquely to a developer, SRE, product manager, executive, and every other important persona.
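As a sketch of what “plug the criticality into the alert name” can look like, independent of any vendor’s API, the snippet below maps FMEA severity scores to a label and bakes it into a naming convention. The severity bands and the naming format are assumptions, not a standard.

```python
# Map FMEA severity scores (1-10) to a label that is baked into every alert
# name, so the criticality is visible the moment the alert fires.
def criticality_label(severity: int) -> str:
    if severity >= 9:
        return "CATASTROPHIC"
    if severity >= 7:
        return "HIGH"
    if severity >= 4:
        return "MEDIUM"
    return "LOW"

def alert_name(failure_mode: str, severity: int, kpi: str) -> str:
    return f"[{criticality_label(severity)}] {failure_mode} :: {kpi}"

# Illustrative usage
print(alert_name("Database connection pool exhausted", 8, "active_connections"))
# -> "[HIGH] Database connection pool exhausted :: active_connections"
```

Because the names are generated rather than typed by hand, the same convention can feed whichever automation you use to provision the alert conditions.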

Machine intelligence for prescription and auto-RCA: Ideally, pick a monitoring tool with applied intelligence (machine learning) behind its alerting, so that you receive incident context and outlier detection insights. This is one of the few ways to get close to prescription rather than merely discovering symptomatic anomalies. I call this the SRE-in-a-button capability. There are several other correlation assets, like request flow maps and hybrid maps that display both infra and app performance in a single view; these are all super useful.
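Applied-intelligence tools do this with far more sophistication, but the underlying outlier idea is simple. Here is a minimal, illustrative sketch that flags latency samples sitting well above a rolling baseline; the window size and sigma threshold are arbitrary assumptions.

```python
from statistics import mean, stdev

def flag_outliers(latencies_ms, window=20, sigmas=3.0):
    """Flag points sitting more than `sigmas` standard deviations above
    the rolling mean of the previous `window` samples."""
    outliers = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        if sd > 0 and latencies_ms[i] > mu + sigmas * sd:
            outliers.append((i, latencies_ms[i]))
    return outliers

# Illustrative series: steady ~100 ms latency with one spike at the end.
series = [100 + (i % 5) for i in range(40)] + [450]
print(flag_outliers(series))  # -> [(40, 450)]
```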

The utopia I aim for is your ability to refocus your teams based on these insights, with fewer issues (mostly scalability-led, during your moments of truth) making it to your production services, and, whenever all hell breaks loose, all your ducks already in a row.

Hope you learned some contemporary tips and tricks here. I have helped several modern and well-established companies adopt these same principles; if you wish to learn more, feel free to send me an email at monitorjain@gmail.com. I am excited to answer your queries. Watch out for my next post on implementation best practices.

If you liked this article, please show your appreciation by “clapping” — click rapidly on the green hands icon below — so that other people can find it. Thank you.
