When to monitor your product?

Shift-left your software monitoring.

Learn how to observe your system effectively, early in the SDLC.

Karishma
Technogise

--


Whenever I consult on a project, my aim is to have good logging, monitoring and alerting in place for the entire infrastructure ASAP. Not post-production; not near release. As early as possible. Why?

Because with this mindset, any team would start thinking about building secure and reliable software from day one. System security, performance and alarm mechanisms would not be “afterthoughts”; rather, they would be considered as important as functionality.

The team would start analysing performance during development itself, proactively approach the security experts on the team, and have an incident response plan (IRP) ready. All of this ensures that monitoring is not stalled until after an incident.

This is analogous to engineering teams writing scalable and maintainable code, rather than just “working” code which achieves the desired functionality. It’s just good software development practice!

Think about it; would you want to install a fire alarm after a fire, or before? 🙂

Deferring these critical things for later results in a lot of firefighting at the eleventh hour. It can also breed the “Let’s get it working for now and revisit it later” syndrome, which causes an “inception of issues” and a domino effect waiting to happen.

Recently, I worked with a team that was happy with this approach of shifting software monitoring to the beginning of product development. So, I took it upon myself to set up the needed alarms.
The stack they used was AWS CloudWatch alarms, Terragrunt (a Terraform wrapper) and PagerDuty (PD).
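
For context, here is a minimal sketch of how such a setup can be wired together in Terraform: CloudWatch alarms publish to an SNS topic, and the topic forwards notifications to PagerDuty’s CloudWatch integration endpoint. The topic name and integration key below are placeholders, not the actual project’s values.

```hcl
# SNS topic that CloudWatch alarms publish to.
resource "aws_sns_topic" "alerts" {
  name = "cloudwatch-alerts" # hypothetical name
}

# Subscribe PagerDuty's CloudWatch integration endpoint to the topic.
# The integration key comes from the PD service's AWS CloudWatch integration;
# "YOUR_INTEGRATION_KEY" is a placeholder.
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn              = aws_sns_topic.alerts.arn
  protocol               = "https"
  endpoint               = "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue"
  endpoint_auto_confirms = true # PD confirms the subscription automatically
}
```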

PD helps put people on a rotation for first- and second-level support. Depending on the configuration, they get alerted via email, the PD mobile app, etc.
If the first- and second-level responders do not acknowledge and resolve the issue, the alarm is escalated up the specified hierarchy.
Alarms may sometimes get auto-resolved too. In such cases, the system needs close monitoring to better understand the underlying issue.
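
As a rough illustration, such an escalation chain can itself be codified with the PagerDuty Terraform provider. The users, schedule and timings below are hypothetical, just to show the shape of a two-level policy:

```hcl
# Hypothetical responders.
resource "pagerduty_user" "alice" {
  name  = "Alice Example"
  email = "alice@example.com"
}

resource "pagerduty_user" "bob" {
  name  = "Bob Example"
  email = "bob@example.com"
}

# Weekly on-call rotation for first-level support.
resource "pagerduty_schedule" "first_level" {
  name      = "First-Level On-Call"
  time_zone = "Europe/London"

  layer {
    name                         = "Weekly rotation"
    start                        = "2023-01-02T09:00:00Z"
    rotation_virtual_start       = "2023-01-02T09:00:00Z"
    rotation_turn_length_seconds = 604800 # one week per engineer
    users                        = [pagerduty_user.alice.id, pagerduty_user.bob.id]
  }
}

# Escalate to the second level if the first does not acknowledge in time.
resource "pagerduty_escalation_policy" "support" {
  name      = "Product Support Escalation"
  num_loops = 2 # repeat the whole chain twice before giving up

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.first_level.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.bob.id # hypothetical second-level responder
    }
  }
}
```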

Alarms usually consist of metrics and thresholds. Choosing them requires an understanding of both the business and the tech stack.

A threshold is a composite of three things: the value that should not be crossed, the period over which the metric is evaluated (say, every 5 minutes) and a statistic (say, a high percentile such as p95).
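
To make that concrete, here is what such a composite looks like as a CloudWatch alarm in Terraform. The API name and the numbers are illustrative, not recommendations:

```hcl
# p95 latency alarm for a REST API: the statistic, period and threshold
# together define when the alarm fires.
resource "aws_cloudwatch_metric_alarm" "api_latency_p95" {
  alarm_name = "orders-api-latency-p95-high" # hypothetical API name
  namespace  = "AWS/ApiGateway"
  metric_name = "Latency"
  dimensions = {
    ApiName = "orders-api"
  }
  extended_statistic  = "p95"                  # the statistic
  period              = 300                    # evaluated every 5 minutes
  evaluation_periods  = 3                      # must breach 3 periods in a row
  threshold           = 2000                   # the value not to cross (ms)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  ok_actions          = [aws_sns_topic.alerts.arn]
}
```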

Commonly observed metrics are latency, errors, throttled requests, iterator age, CPU utilisation, DB connections, throughput, and read & write capacity.
You should monitor all your components: API Gateway, Lambda, RDS, DynamoDB, etc.
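
In the same spirit, here are two more sketches, one alarm per component type. The function and database names are made up:

```hcl
# Error count on a Lambda function.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "checkout-fn-errors" # hypothetical function
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = "checkout-fn" }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# CPU utilisation on an RDS instance.
resource "aws_cloudwatch_metric_alarm" "rds_cpu" {
  alarm_name          = "orders-db-cpu-high" # hypothetical instance
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions          = { DBInstanceIdentifier = "orders-db" }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 80 # percent
  comparison_operator = "GreaterThanOrEqualToThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```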

To get started, you can do the setup in the lower environments first; this helps you fail fast. In production, observe the system for a brief period before and after an alarm is finalised, so you can fine-tune it. Once the product goes live, keep monitoring the system over time and adjust the alarms if needed.
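
One way to keep per-environment thresholds manageable with Terragrunt is to share a single monitoring module and vary only the inputs per environment. The repo layout and variable names below are assumptions for illustration:

```hcl
# live/dev/monitoring/terragrunt.hcl — hypothetical repo layout
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/monitoring" # hypothetical shared module
}

# Looser thresholds in dev so the team can experiment and fail fast;
# the prod terragrunt.hcl would pass stricter values to the same module.
inputs = {
  environment          = "dev"
  latency_threshold_ms = 5000
  error_threshold      = 20
}
```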

From personal experience, I can say:
After all the refinements, you’d have a more robust system and a better work-life balance 🙂. No more deferring fixes for those “trivial, intermittent” issues. No more frequent firefighting at the eleventh hour.

Later, the other product teams in that organisation took this up as a priority too. I hope you all will as well!

--

Karishma
Technogise

QA Architect | Ops practitioner | System Design enthusiast