Show me the money! — Monitoring Production the “Jerry Maguire” way

Lior Avni
Jan 23 · 5 min read

Every Escalation engineer knows this simple truth — “if you find it faster, you will solve it faster” (“it” being the incident you want to avoid). To detect those pesky incidents, we invest a lot of money in various application performance monitoring (APM) tools, and most of us use the out-of-the-box functionality these tools offer.

But what if I told you that you could save hundreds of thousands of dollars a year by going a step further: creating transaction-specific alerts and keeping an eye on the third-party elements that affect your system? Would you like to know more? Then take the red pill and I’ll show you just how deep the rabbit hole goes… (Sorry, couldn’t help myself 😎)

“In the beginning, it is always dark…”

So you’re starting your journey in managing a production environment. You implemented an APM solution from one of the leading vendors (full disclosure — I use NewRelic in my production environment, but there are other vendors to choose from and you should always do your market research before choosing a tool), and you start defining the basic alerts: Apdex and Error rate.

Error rate is the percentage of requests to the monitored service that end in an error over a set period of time.

Apdex (Application Performance Index) is slightly more complex: it is an open standard for measuring the performance of software applications. Its purpose is to convert measurements into insights about user satisfaction by specifying an agreed-upon way to analyze and report how well measured performance meets user expectations.

In simple terms, it tells you whether your users are happy or dissatisfied with the service you provide them. Each of these basic alerts has a threshold that you set when you configure it and, as we say back home, “Andiamo”: you’re good to go. This is all well and good when you deal with a system that has 5, 10, or even 15 applications to monitor. But what happens when your “house of cards” grows even bigger, or to be clearer, when you start to break your monolith into dozens of microservices that intertwine in a delicate fabric of dependencies? The default monitors just aren’t enough. You need to take this a step further.
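For the curious, here is a minimal Python sketch of how these two numbers are typically calculated. It assumes the standard Apdex formula (satisfied requests plus half of the tolerating ones, divided by all requests, where "tolerating" means up to four times the target time T). The function names and sample values are mine, purely for illustration, and your APM tool may differ in details such as how it counts errors.

```python
from typing import Sequence


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Percentage of requests that ended in an error during the sampled period."""
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests


def apdex(response_times_s: Sequence[float], t: float) -> float:
    """Apdex score for a batch of response times (seconds) and a target time t.

    satisfied:  response time <= t
    tolerating: t < response time <= 4t
    frustrated: anything slower (counts as zero)
    """
    if not response_times_s:
        return 1.0  # assumption for this sketch: no traffic means nobody is unhappy
    satisfied = sum(1 for rt in response_times_s if rt <= t)
    tolerating = sum(1 for rt in response_times_s if t < rt <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_s)


# Illustrative numbers only, not real production data.
samples = [0.2, 0.3, 0.4, 0.9, 1.5, 2.5]
print(round(apdex(samples, t=0.5), 2))                    # 0.67
print(error_rate(total_requests=200, failed_requests=5))  # 2.5
```

With a target of 0.5 seconds, three requests are satisfied, two are tolerating and one is frustrated, which gives (3 + 2/2) / 6, or about 0.67.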

Crossing the threshold…

Why do we need the additional monitors? The simple answer is that when you’re alerted to an Apdex disturbance in a service, you’re blind to which service or element is causing that disruption. You’ll need to dig deep into the service monitors to find out what caused it, and in a real-time incident, every second counts when you’re racing to understand the cause.

You want to receive early notifications so you can

  • target the correct engineer, who will eventually solve the issue, and
  • shorten your incidents’ MTTX (Mean Time to Detect, Understand & Resolve).

There are 4 additional monitors you can apply out-of-the-box:

  • Response time for Web transactions
  • Response time for non-web transactions
  • Throughput of Web transactions
  • Throughput of non-web transactions

These act as an additional drill-down beneath your service’s “vital signs” monitor: they show not the general behaviour of the service but that of the individual elements within it, which may trigger before the “general storm”. Web transactions are the application server’s transactions, and non-web transactions are, well, everything else. 😇
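To make the idea concrete, here is a rough Python sketch of what such a monitor boils down to: take the last few per-minute samples for one kind of transaction and compare the averages against two thresholds. The `MinuteSample` type, the `evaluate` function and all the numbers are hypothetical. A real APM tool does this for you, so treat this as an illustration of the logic rather than anything you need to build.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class MinuteSample:
    kind: str              # "web" or "non-web"
    avg_response_s: float  # average response time during that minute
    requests: int          # transactions completed during that minute


def evaluate(samples: List[MinuteSample], kind: str,
             max_response_s: float, min_rpm: float) -> List[str]:
    """Return alert messages for one transaction kind over the given window."""
    window = [s for s in samples if s.kind == kind]
    if not window:
        return []
    alerts = []
    # Simple (unweighted) mean of the per-minute averages, good enough for a sketch.
    avg_response = mean(s.avg_response_s for s in window)
    rpm = mean(s.requests for s in window)
    if avg_response > max_response_s:
        alerts.append(f"{kind} response time {avg_response:.2f}s is above {max_response_s:.2f}s")
    if rpm < min_rpm:
        alerts.append(f"{kind} throughput {rpm:.0f} rpm is below {min_rpm:.0f} rpm")
    return alerts


# Illustrative window: web traffic slows down and drops off, background jobs look fine.
window = [
    MinuteSample("web", 0.3, 1000),
    MinuteSample("web", 0.9, 400),
    MinuteSample("web", 1.2, 100),
    MinuteSample("non-web", 2.0, 50),
]
for message in evaluate(window, "web", max_response_s=0.5, min_rpm=600):
    print(message)
```

Keeping web and non-web samples apart is the whole point: a slow batch job should not hide, or be hidden by, what your users are experiencing on the web side.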

But we can take this even further… come, take my hand, I don’t bite 🤡

Taking the road less traveled…

This is not an easy task. It involves a thorough investigation into your system, knowing its internals and knowing which team is in charge of which service. It also involves a deeper understanding of your APM solution and knowing how to query the database it uses to maintain the data (NewRelic, for example, has its NRQL for this purpose), but again — knowledge is power, and in this case, it can also save you a lot of money:

  • External services alerts: you can set a monitor on every external service that affects your mission-critical services. By doing so, you can foresee an incident almost a full minute before it starts. Most built-in monitors take a minimum of 5 minutes before they trigger. If you set this additional safeguard, you receive a pre-alert and can preemptively track and fix the affected area before the whole system is affected.
  • Specific transaction Apdex: each service has its critical transactions and its less critical ones. Each of these transactions has an Apdex that can be monitored, and an alert that fires when that Apdex degrades. By querying the database that holds the transaction data, you can save 2–3 minutes of precious alerting time and know in advance that “a storm is coming”.
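As a rough idea of what the queries behind these two alerts might look like, here is a small Python sketch that builds NRQL strings. It assumes NRQL's `apdex()` function and the `duration`, `externalDuration`, `appName` and `name` attributes of `Transaction` events. The helper functions, the application name and the transaction name are hypothetical, so check your own account's data before copying anything.

```python
def _quote(value: str) -> str:
    """Wrap a value in single quotes for use in a query string."""
    if "'" in value:
        # Escaping is deliberately left out of this sketch.
        raise ValueError("values containing single quotes are not handled here")
    return f"'{value}'"


def transaction_apdex_query(app_name: str, transaction_name: str, apdex_t: float) -> str:
    """Apdex for one specific transaction instead of the whole service."""
    return (
        f"SELECT apdex(duration, t: {apdex_t}) FROM Transaction "
        f"WHERE appName = {_quote(app_name)} AND name = {_quote(transaction_name)}"
    )


def external_time_query(app_name: str) -> str:
    """Average time a service spends waiting on external calls."""
    return (
        "SELECT average(externalDuration) FROM Transaction "
        f"WHERE appName = {_quote(app_name)}"
    )


# Hypothetical service and transaction names, for illustration only.
print(transaction_apdex_query("orders-service", "WebTransaction/Controller/orders/create", 0.5))
print(external_time_query("orders-service"))
```

Each resulting query can then back its own alert condition with its own threshold, so the notification already names the transaction or dependency that is misbehaving.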

Where is the money you talked about???

I did talk about saving money, right? So let’s get down to business — how is this going to save you money? Simple.

Downtime of a web application can cost a company $5,600 per minute and up to $300,000 per hour (according to a 2014 Gartner analysis).

A typical company can suffer 1–3 incidents per week, ranging from partial downtime to full service disruption. By applying these additional monitors and alerts, you can potentially save 1–3 minutes of investigation per incident! That’s roughly 3–10 minutes per week, 12–40 minutes per month, 36–120 minutes per quarter, and 144–480 minutes per year! Or, to put the low end of that range in financial terms at $5,600 per minute:

≈ $200K savings per quarter

which translates into

≈ $800K savings per year!!!
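If you want to check the math, here is the same back-of-the-envelope calculation as a few lines of Python. It uses the per-minute downtime cost quoted above and the low end of the estimate (3 minutes saved per week), and it assumes the same rounding as the text: 4 weeks per month, 3 months per quarter, 4 quarters per year. The function names are just for illustration.

```python
COST_PER_MINUTE_USD = 5_600  # the per-minute downtime figure quoted above
WEEKS_PER_MONTH = 4          # rounding used in the text, not a calendar fact
MONTHS_PER_QUARTER = 3
QUARTERS_PER_YEAR = 4


def minutes_saved(per_week: int) -> dict:
    """Scale a weekly saving up to a month, a quarter and a year."""
    per_month = per_week * WEEKS_PER_MONTH
    per_quarter = per_month * MONTHS_PER_QUARTER
    per_year = per_quarter * QUARTERS_PER_YEAR
    return {"week": per_week, "month": per_month, "quarter": per_quarter, "year": per_year}


def savings_usd(minutes: int) -> int:
    """Downtime cost avoided for a given number of minutes."""
    return minutes * COST_PER_MINUTE_USD


low = minutes_saved(per_week=3)
print(low["quarter"], savings_usd(low["quarter"]))  # 36 201600
print(low["year"], savings_usd(low["year"]))        # 144 806400
```

Plug in your own incident count and your own cost per minute; the structure of the calculation stays the same.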

Imagine what you can do with all that money… 🤔

This is only the beginning…

There’s a whole world out there for you to monitor. The more you utilise the tools at hand, the better you’ll understand your system and the better your control will be when handling incidents in real time.

It’s a never-ending story, my friends. So go, cross the rainbow and find the answer… (and if you understood the reference, then I’m not that old… 👨‍🦳)

Until we meet again, this is your friendly Gett Incident Management team signing off and reminding you, as always — Don't panic…🦉

Written by

Lior Avni

Global technical support & Incident manager at Gett. Working with customers for the better part of 20 years and enjoying every minute of it :-)

Gett Engineering

Code, stories, tips, thoughts, experimentations from the day-to-day work of our R&D team.
