Why do engineers not care about application monitoring?

Mar 30, 2019

Monitoring is easy. It’s a known fact. Bring up a Nagios box, put NRPE on a remote system, point a Nagios host config at NRPE’s TCP port 5666 and you’ve got monitoring.

That’s so easy that it’s almost painful. You’ve got basics like CPU, disk and RAM coming in via the defaults in Nagios and NRPE, but that’s not really “monitoring” per se. That’s keeping an eye on the basics.

(It’s typical to add PNP4Nagios, RRDtool and Thruk, set up Slack notifications and bolt straight to Nagios Exchange, but I’m glossing over that for now.)

Good monitoring, the kind where you actually know things about the application being monitored, is pretty hard.
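To make the difference concrete, here is a minimal sketch of an application-aware check, as opposed to a CPU or disk check. It assumes a hypothetical job queue whose depth you can read somehow; the function name and the thresholds are made up for illustration. The exit codes and the `text | perfdata` output shape follow the usual Nagios plugin convention.

```python
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_queue_depth(depth, warn=100, crit=500):
    """Return (exit_code, output_line) for a hypothetical job-queue depth.

    The output is human-readable text, then a pipe, then performance
    data in the form label=value;warn;crit so it can be graphed.
    """
    if depth is None or depth < 0:
        # The check itself failed, which is different from the app being unhealthy.
        return UNKNOWN, "UNKNOWN - could not read queue depth"
    if depth >= crit:
        code = CRITICAL
    elif depth >= warn:
        code = WARNING
    else:
        code = OK
    line = (
        f"{STATUS_NAMES[code]} - queue depth is {depth} "
        f"| queue_depth={depth};{warn};{crit}"
    )
    return code, line
```

A real plugin would print the line and exit with the code, for example `code, line = check_queue_depth(42)` followed by `print(line)` and `sys.exit(code)`. The hard part is not this script, it is knowing that queue depth is the number worth watching.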

Monitoring’s hard?

Any server, be it Linux or Windows, will, by definition, be serving a purpose. Apache, Samba, Tomcat, file store, LDAP: all these services are more or less unique in one or many ways. Each has its own function, its quirks and its own means of revealing its metrics, the KPIs (Key Performance Indicators) that you will care about when the server is under load.

(wish my metrics boards could be coloured in neon blue -sighs wistfully-.. ahem..)

Any service-providing software should have a mechanism to reveal these metrics. Apache has its mod_status module, which provides the server-status page. Nginx has its stub_status module. Tomcat has JMX or specific webapps that reveal key metrics. MySQL has its SHOW GLOBAL STATUS command, whose output can be scraped, and so on.
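As a rough sketch of what scraping one of these looks like, the snippet below parses the `Key: Value` text that Apache's server-status page returns in its machine-readable `?auto` form. The sample values are invented for illustration, and the function name is my own, not part of any library.

```python
def parse_status(text):
    """Parse 'Key: Value' lines into a dict, converting numbers where possible."""
    metrics = {}
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # skip lines that have no key/value separator
        key, value = key.strip(), value.strip()
        try:
            metrics[key] = float(value) if "." in value else int(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values (e.g. the scoreboard) as text
    return metrics


# Invented sample in the shape of a server-status?auto response.
SAMPLE = """Total Accesses: 1234
Total kBytes: 5678
Uptime: 3600
ReqPerSec: .342778
BusyWorkers: 3
IdleWorkers: 7
Scoreboard: _W_K......"""

status = parse_status(SAMPLE)
```

In practice the text would come from something like `urllib.request.urlopen("http://localhost/server-status?auto")`, assuming mod_status is enabled and reachable from wherever the scraper runs. `BusyWorkers` against `IdleWorkers` is the sort of number that tells you far more about an Apache box than its CPU graph does.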

So why do some developers not bake these measures into the applications they create?

Is it just developers doing this?

A certain level of apathy towards metric emission is not limited to developers. I’ve worked in companies where the devs create Tomcat apps that emit no custom metrics at all and no custom logs of activity on the service besides the generic Tomcat error logs. Some devs emit plenty of logs that don’t mean anything to any sysadmin unlucky enough to read them at 3.15am.

The system engineers that allow these products to go through to production have to take some responsibility for the situation as well. Not many system engineers have the time and care to drag meaningful metrics out of logs without any context for those metrics or the ability to interpret them in light of the application’s activity. Some do not consider what use metrics can be beyond indicating that “something is currently, or about to be, going wrong”.

The change in thinking about the need for metrics has to occur not just in the developers’ realm of concern, but also in the systems engineers’.

For any systems engineer who needs not just to respond to critical events but also to ensure they don’t happen in the first place, a lack of metrics is usually the blocker.

Systems engineers typically don’t go poking around in the code that makes money for their company, however. They need developer champions who understand the benefit of the systems engineer’s responsibilities: flagging problems, raising awareness of performance issues and the like.

This devops thing

The devops mentality describes a merging of the dev and ops mindsets. Any company stating that they “do devops” should:

be told that they probably don’t (cue a meme from the Princess Bride about “I do not think it means what you think it means”)
be encouraging an attitude of continuous improvement in the product.

You can’t improve a product and know it’s been improved unless you know how it’s currently doing. You can’t know how a product is doing unless you understand how its component parts are performing: all the services it relies on, its key pain points and its operational bottlenecks.

Unless you monitor these things, these potential bottlenecks, you won’t be able to perform the “Five Whys” in a postmortem, and you won’t be able to put together a single pane of glass to see how a product is doing or to know what “normal and happy” looks like.
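One very rough way to pin down “normal and happy” is to keep a window of recent samples and ask how far a new reading sits from their average. The sketch below does that with the standard library; the metric, the sample values and the three-standard-deviation tolerance are all assumptions for illustration, not a recommendation for any particular product.

```python
import statistics


def baseline(samples):
    """Return (mean, sample standard deviation) of recent readings."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_abnormal(value, samples, tolerance=3.0):
    """True when value is more than `tolerance` standard deviations from the mean."""
    mean, spread = baseline(samples)
    return abs(value - mean) > tolerance * spread


# Hypothetical response times in milliseconds from a quiet, happy period.
recent_response_ms = [100, 102, 98, 101, 99]
```

With those readings, `is_abnormal(103, recent_response_ms)` is false and `is_abnormal(120, recent_response_ms)` is true. None of this works without the history, which is the point: no metrics, no baseline, no idea what normal is.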

Shift left, LEFT, I SAID, LEEEEEEEEEEEE-

For me, one of the key tenets of devops is “Shift Left”. Shifting left, in this context, means moving the ability (not the responsibility, just the ability) to do the things systems engineers typically care about, like creating performance metrics and making better use of logging, further to the left of the SDLC (Software Development Life Cycle).

Software developers should have the ability and the training, in whatever monitoring products the company is using, to implement monitoring in all its forms (metrics, logging, monitoring interfaces) and, crucially, to see for themselves how their product is doing in production. You can’t get devs invested in monitoring unless they’re able to see the things and affect how they appear, how their product owner gets to present them to the CTO at the next briefing and so on.
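Baking metrics in at development time doesn’t have to be a big job. Here is a minimal sketch, using only the Python standard library, of an in-process registry of counters and timers that a status page could render. The `Metrics` class, the `place_order` function and the metric names are all hypothetical; a real team would more likely reach for whatever client library its monitoring product provides.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """A tiny thread-safe registry of counters and timing totals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._timings = defaultdict(float)

    def incr(self, name, amount=1):
        """Add `amount` to the named counter."""
        with self._lock:
            self._counters[name] += amount

    @contextmanager
    def timed(self, name):
        """Add the wall-clock duration of the wrapped block to a running total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self._timings[name + "_seconds_total"] += elapsed

    def render(self):
        """Return all metrics as sorted 'name value' lines for a status page."""
        with self._lock:
            items = {**self._counters, **self._timings}
        return "\n".join(f"{name} {value}" for name, value in sorted(items.items()))


metrics = Metrics()


def place_order(order_id):
    """Hypothetical business operation instrumented with a counter and a timer."""
    with metrics.timed("orders_place"):
        metrics.incr("orders_placed")
        return f"order {order_id} accepted"
```

After a couple of calls to `place_order`, `metrics.render()` gives plain text with one metric per line, which is trivial for a check or a scraper to pick up. The two extra lines inside `place_order` are the whole cost to the developer.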

Long and short of it

Lead the horse to water. Show developers how much pain they can avoid for themselves by helping them emit proper KPIs and metrics from their applications, so that they can get shouted at less by their product owner who has been shouted at by the CTO. Lead them to the light, gently and calmly. Failing that, bribe, threaten and cajole either them or their Product Owners into implementing these metric emissions from their apps as soon as possible, then graph the things. It’ll be hard, as it won’t be seen as a priority and the product roadmap will have plenty of revenue-generating projects waiting for implementation, so you will need a business case to justify the time and funds spent on implementing monitoring in the product.
Lead the system engineers to a sleep-filled night. Show them that enforcing a go-live checklist on any product going live is a good thing, and that making sure metrics are being emitted by any application in live will help towards a healthy night’s sleep by allowing devs to see exactly where systems are going wrong. That said, the surest way to annoy and frustrate any developer, product owner and CTO is to dig your heels in and be a stubborn blocker. Doing that at the last minute will hit the product release date, so again, shift left: get these concerns into the project roadmap for the product as soon as possible, and sneak your way into the product meetings if necessary. Wear a fake moustache and a fedora or something, it never fails. Make your concerns known, make the benefits obvious and evangelise.
Make sure that both dev and ops understand the ramifications and consequences of any metric on the product’s monitoring dashboard going into a “red zone”. Don’t leave ops as the sole gatekeepers of a product’s health; make sure that devs are invested as well (#productsquads).
Logs are great but so are metrics. Combine the two and don’t let your logs go to waste in a huge flaming ball of uselessness. Use demonstrations to get the devs to understand why no one else understands their logs, and show them how to look at things from the 3.15am view.
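On that last point, here is a sketch of what a log line written for the 3.15am reader might look like, using Python’s standard `logging` module. The `JsonFormatter` class, the logger name and the context fields (`order_id`, `customer_impact`, `next_step`) are my own inventions for illustration; the idea is simply that each line carries enough context to act on, in a shape a machine can also count and graph.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via `extra=` is attached to the record as attributes.
        for key in ("order_id", "customer_impact", "next_step"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: write to an in-memory buffer here; a real app would use a file or stdout.
buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "payment gateway timed out",
    extra={
        "order_id": 42,
        "customer_impact": "checkout failing",
        "next_step": "check gateway status page",
    },
)
```

Because every line is structured, the same logs can be counted by level or by message and turned into a metric, which is one way of combining the two rather than choosing between them.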