A Practical Guide to Monitoring and Alerting

Pradyumna Challa
Hiver Engineering
Jun 28, 2024

In this blog, we will explore how to implement monitoring and alerting within an organization, covering the core concepts, implementation strategies, and best practices to keep your applications and infrastructure healthy.

As we all move towards more microservice-based architectures, the number of moving pieces to manage is growing exponentially. Instead of a service making a direct database call, we now have another service behind a load balancer, which is in turn linked to a DNS record. This increases the number of failure points from one to four: originating service to DNS, DNS to load balancer, load balancer to target service, and finally target service to database.

Detecting failures has also become harder, as each request from the end customer fans out into multiple requests across the microservices in our internal ecosystem. Creating a robust framework for monitoring and alerting helps us detect and fix failures faster, and better yet, detect them before they occur.

I’ve divided this article into three sections, walking you through how to create and implement a robust framework for alerting and monitoring in your organization, based on the lessons learned at Hiver.

  • Process
  • Tooling
  • Maintenance

Process:

The boring and time-consuming part comes first: the process, from both a tooling and a people perspective. This is not just relevant for DevOps folks, but also for engineering teams.


In a monitoring and alerting framework, the first and most important ingredient is metrics. We need metrics to know what is happening inside the systems we manage. On top of this, we create alerts to know if the metric is at an acceptable level or not.

Metrics can be broadly categorized into operational metrics and business metrics. Operational metrics are pretty straightforward and readily available; some examples are RAM, CPU, messages in a queue, connections on a database, disk IOPS, etc. Business metrics are unique to each company and have to be created from scratch.

Let’s dive into the process.

Metric Sources:

Identify the metric sources. For operational metrics, Kubernetes clusters, queueing systems, databases, and EC2 machines are prime examples. Business metrics are generated within the application code.
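To make "business metrics are generated within the code" concrete, here is a minimal sketch using the Python prometheus_client library; the metric name, labels, and port are hypothetical, not an actual Hiver metric.

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, start_http_server

# Hypothetical business metric: emails shared per mailbox.
EMAILS_SHARED = Counter(
    "emails_shared_total",
    "Number of emails shared through the product",
    ["mailbox"],
)

if __name__ == "__main__":
    # Expose metrics on :8000/metrics for Prometheus (or any scraper) to collect.
    start_http_server(8000)
    while True:
        # In real code this increment happens inside the business-logic path.
        EMAILS_SHARED.labels(mailbox="support").inc()
        time.sleep(random.uniform(0.5, 2.0))
```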

Metric Lifecycle:

Set up mechanisms for the generation, ingestion, storage & deletion of metrics across the organization.

This will involve a combination of systems working in tandem with each other. For operational metrics, this could mean setting up a Prometheus stack on Kubernetes clusters, using CloudWatch for managed services, and so on. Create standard libraries for all application teams to use for generating and transmitting metrics.
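One way to standardize metric generation across teams is a thin wrapper that enforces naming and label conventions. The sketch below is a hypothetical example built on prometheus_client, not an existing Hiver library; the metric names and label set are assumptions.

```python
from prometheus_client import Counter, Histogram

class ServiceMetrics:
    """Hypothetical shared wrapper that enforces a common label set
    (service, team, environment) on every metric a team emits."""

    def __init__(self, service: str, team: str, environment: str):
        self._labels = {"service": service, "team": team, "environment": environment}
        self.requests = Counter(
            "app_requests_total",
            "Requests handled by the service",
            list(self._labels) + ["endpoint", "status"],
        )
        self.latency = Histogram(
            "app_request_latency_seconds",
            "Request latency in seconds",
            list(self._labels) + ["endpoint"],
        )

    def observe_request(self, endpoint: str, status: int, seconds: float) -> None:
        # Every team records requests the same way, so dashboards and alerts
        # can be templated across services.
        self.requests.labels(**self._labels, endpoint=endpoint, status=str(status)).inc()
        self.latency.labels(**self._labels, endpoint=endpoint).observe(seconds)

# Usage: metrics = ServiceMetrics("billing-api", "payments", "prod")
#        metrics.observe_request("/invoice", 200, 0.042)
```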

The importance of metrics deteriorates over time, and at different rates for different metrics; the latency of a service has a longer useful lifetime than the status of a pod. At some point, storing and maintaining unnecessary metrics becomes costly in terms of both human and infrastructure resources, so defining a lifecycle for metrics is very important.

There are two ways to implement such a mechanism, and they are not mutually exclusive.

  1. Implement a hot and cold storage mechanism where recent metrics are readily available in hot storage, while older ones have to be restored from cold storage for querying. Timelines for both hot and cold storage have to be defined.
  2. Down-sample metrics after a certain time, keeping only coarser aggregates for older data (see the sketch after this list).
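As a rough illustration of down-sampling, the sketch below pulls a metric from the Prometheus HTTP API at a coarse step and writes the aggregate out for cheaper long-term storage. The Prometheus address, the query, and the file destination are assumptions; in practice this is often handled by recording rules or a long-term storage backend.

```python
# pip install requests
import json
import time

import requests

PROM_URL = "http://prometheus.internal:9090"  # assumed address

def downsample(query: str, days_back: int = 30, step: str = "1h") -> list:
    """Query Prometheus at a coarse resolution (e.g. hourly averages)
    instead of keeping every raw sample for old data."""
    end = time.time()
    start = end - days_back * 24 * 3600
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    coarse = downsample("avg_over_time(node_memory_MemAvailable_bytes[1h])")
    # Ship the coarse series to cold storage (S3, GCS, ...); a local file here.
    with open("memory_downsampled.json", "w") as f:
        json.dump(coarse, f)
```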

Centralized visualization:

Even though there are multiple metric sources, for the end users (developers, QA, and the leadership team) there should be one place where all of these metrics can be combined and visualized, to get a 10,000 ft view and dive deep when needed.
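If Grafana happens to be that central visualization layer, dashboards can be provisioned programmatically through its HTTP API so every team gets a consistent starting point. The sketch below is a minimal example; the URL, API token, and dashboard contents are assumptions.

```python
# pip install requests
import requests

GRAFANA_URL = "https://grafana.internal"  # assumed address
API_TOKEN = "REDACTED"                    # a Grafana service-account token

dashboard = {
    "dashboard": {
        "id": None,
        "uid": None,
        "title": "Service Overview (sketch)",
        "tags": ["auto-provisioned"],
        "timezone": "browser",
        "panels": [],          # panels would be defined here
        "schemaVersion": 16,
        "version": 0,
    },
    "folderId": 0,
    "overwrite": False,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=dashboard,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```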

Impact Levels:

Once we have all of the data and can visualize it, the next step is alerting on top of the metrics. Identify the metrics that would impact customers from both an operational and a business perspective. For each metric, identify the values at which the product would be unusable (P0), have some functionality not working (P1), or be working with only minor inconveniences (P2).
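A concrete way to encode impact levels is a threshold table per metric; the metric names and numbers below are purely illustrative, to be replaced with values agreed with your stakeholders.

```python
# Hypothetical thresholds mapping a metric value to an impact level.
# P0: product unusable, P1: some functionality broken, P2: minor inconvenience.
IMPACT_THRESHOLDS = {
    # metric name: [(threshold, level)] evaluated from most to least severe
    "error_rate_percent":  [(25, "P0"), (5, "P1"), (1, "P2")],
    "p95_latency_seconds": [(10, "P0"), (3, "P1"), (1, "P2")],
    "queue_backlog_msgs":  [(100_000, "P0"), (10_000, "P1"), (1_000, "P2")],
}

def impact_level(metric: str, value: float) -> str | None:
    """Return the impact level a metric value falls into, or None if healthy."""
    for threshold, level in IMPACT_THRESHOLDS.get(metric, []):
        if value >= threshold:
            return level
    return None

assert impact_level("error_rate_percent", 30) == "P0"
assert impact_level("p95_latency_seconds", 2) == "P2"
assert impact_level("queue_backlog_msgs", 500) is None
```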

Alert Routing:

Routing alerts at different impact levels to the right channels, where they will get the right attention, is paramount. Set up standards on where alerts are reported for each impact level. For example: all P0 alerts are sent to the on-call rotation, P1 alerts to a high-priority Slack channel, and P2 alerts to a low-priority Slack channel.
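Routing can be expressed as a simple severity-to-destination map. The sketch below posts P1/P2 alerts to hypothetical Slack incoming webhooks and hands P0 off to a placeholder paging function; in practice this logic usually lives in Alertmanager or your monitoring SaaS rather than custom code.

```python
# pip install requests
import requests

# Hypothetical destinations; real URLs would come from configuration.
SLACK_WEBHOOKS = {
    "P1": "https://hooks.slack.com/services/T000/B000/high-priority",
    "P2": "https://hooks.slack.com/services/T000/B000/low-priority",
}

def page_on_call(message: str) -> None:
    """Placeholder for the on-call integration (PagerDuty, Opsgenie, ...)."""
    print(f"PAGING ON-CALL: {message}")

def route_alert(level: str, message: str) -> None:
    # P0 goes to a human immediately; lower severities go to Slack channels.
    if level == "P0":
        page_on_call(message)
    elif level in SLACK_WEBHOOKS:
        requests.post(SLACK_WEBHOOKS[level], json={"text": f"[{level}] {message}"}, timeout=10)
    else:
        print(f"Unrouted alert ({level}): {message}")

route_alert("P0", "checkout service is down")
route_alert("P2", "disk usage at 75% on worker-3")
```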

Escalation Matrices:

Even though alerts are routed to the right channels and people, sometimes issues have to be escalated to get them resolved. This is where escalation matrices help. Work with stakeholders to set up the escalation matrices to follow when issues are not resolved within the expected timelines, and document those expected timelines while you are at it.
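An escalation matrix can be written down as plain data so it is both reviewable by stakeholders and usable by automation. Every role and timeline below is a made-up example.

```python
from datetime import timedelta

# Hypothetical escalation matrix: if an issue at a given impact level is not
# resolved within the cumulative timeline, it moves to the next contact.
ESCALATION_MATRIX = {
    "P0": [
        {"contact": "on-call engineer",    "respond_within": timedelta(minutes=15)},
        {"contact": "engineering manager", "respond_within": timedelta(minutes=30)},
        {"contact": "head of engineering", "respond_within": timedelta(hours=1)},
    ],
    "P1": [
        {"contact": "owning team channel", "respond_within": timedelta(hours=4)},
        {"contact": "engineering manager", "respond_within": timedelta(hours=8)},
    ],
    "P2": [
        {"contact": "owning team backlog", "respond_within": timedelta(days=2)},
    ],
}

def next_escalation(level: str, elapsed: timedelta) -> str:
    """Return who should own the issue given how long it has been open."""
    deadline = timedelta(0)
    for step in ESCALATION_MATRIX[level]:
        deadline += step["respond_within"]
        if elapsed < deadline:
            return step["contact"]
    return "executive escalation"

print(next_escalation("P0", timedelta(minutes=20)))  # engineering manager
```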

Document all of the above steps so that everyone is on the same page. Reaching alignment with all stakeholders will help you implement the standards and processes.

Tooling:

Now that we have clarity on what needs to be implemented, let’s take a look at how it needs to be implemented.


Tooling is very subjective, as it has to fit into the existing systems in place. Instead of suggesting tools, I’ll walk through some of the principles that need to be observed while selecting tools.

Process Fit:

Walk through all the steps in the process defined in the previous section and check whether the tool ticks all the boxes. This will help you avoid the costly mistakes of either going back to the drawing board or writing unnecessary automations.

Stability:

Obvious but often overlooked is the stability of the tool. Check for points of failure and whether the tool provides ways to deal with them. Does the tool provide clustering? Does it have an automatic backup feature? How much load can the tool take before failing?

Monitoring the monitor:

Failure of the monitoring and alerting tool is catastrophic, as we would be unaware of the status of the systems it is monitoring. Does the tool provide metrics about its own internals? Can we monitor the monitoring tool?
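As one small example, Prometheus exposes health and readiness endpoints (and its own metrics), so a second, independent check can probe them and raise the alarm through a path that does not depend on the monitoring stack itself. The address below is an assumption.

```python
# pip install requests
import requests

PROMETHEUS = "http://prometheus.internal:9090"  # assumed address

def monitor_the_monitor() -> bool:
    """Probe Prometheus's own health/readiness endpoints from an
    independent system (a cron job, a tiny second Prometheus, etc.)."""
    try:
        healthy = requests.get(f"{PROMETHEUS}/-/healthy", timeout=5)
        ready = requests.get(f"{PROMETHEUS}/-/ready", timeout=5)
        return healthy.status_code == 200 and ready.status_code == 200
    except requests.RequestException:
        return False

if not monitor_the_monitor():
    # Alert through a channel that does NOT depend on the monitoring stack,
    # e.g. a direct webhook or the on-call provider's API.
    print("ALERT: monitoring stack is unhealthy")
```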

Cost:

Be it cloud-based SaaS tools or self-managed tools, each has its own set of costs to bear. Most SaaS tools have usage-based pricing. Take an educated guess at how much usage you will have 2–3 years down the road, and evaluate whether the additional cost of a managed SaaS tool is worth the reduced human resources compared to running it yourself.
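A rough back-of-the-envelope comparison helps here. All the figures in the sketch below are placeholders; swap in your own vendor quotes, growth estimates, and maintenance effort.

```python
# All figures are illustrative placeholders, not real vendor pricing.
hosts_today = 200
annual_growth = 1.6              # assumed 60% yearly growth in monitored hosts
saas_cost_per_host_month = 20    # assumed SaaS list price per host
self_managed_infra_month = 1_500 # assumed infra bill for a self-hosted stack
engineer_cost_month = 10_000     # assumed loaded cost of the maintenance effort

for year in range(1, 4):
    hosts = round(hosts_today * annual_growth ** year)
    saas_yearly = hosts * saas_cost_per_host_month * 12
    self_managed_yearly = (self_managed_infra_month + engineer_cost_month) * 12
    print(f"Year {year}: ~{hosts} hosts | SaaS ~${saas_yearly:,} | self-managed ~${self_managed_yearly:,}")
```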

Integration Capabilities:

You would rarely use monitoring tools in a silo. Having programmatic access to all the functionality provided by the tool helps in writing automations and triggering the tool's various capabilities. Does the tool provide an API? Which integrations are pre-built into the tool? Does it provide webhooks?
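Webhooks are a common integration surface. The sketch below is a tiny receiver for alert webhooks; the payload shape loosely follows Alertmanager's webhook format, but treat both the fields and the choice of Flask as assumptions.

```python
# pip install flask
from flask import Flask, request

app = Flask(__name__)

@app.route("/alerts", methods=["POST"])
def receive_alerts():
    """Receive alert webhooks and fan them out to internal automation
    (ticket creation, status page updates, runbook triggers, ...)."""
    payload = request.get_json(force=True) or {}
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        status = alert.get("status", "unknown")
        print(f"Received alert {name!r} with status {status!r}")
        # Hypothetical follow-up: open a ticket, annotate a dashboard, etc.
    return {"ok": True}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```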

Access Controls:

Different roles require different levels of access to both metrics and visualizations: QA may need only read-only access, whereas a developer would need write access. Make sure that the tool supports various levels of access. Provisioning and de-provisioning accounts becomes a chore once the organization scales, so automated ways of managing accounts, such as linking to LDAP, should be available.

Maintenance:

Once we have implemented the tools and processes, the long journey of maintaining and updating them starts as the organization scales. Clusters scale, thresholds change … keeping up with the ever-changing landscape is key.


Measure the Noise:

Keep track of all the distractions that teams have to deal with on a day-to-day basis. Create a dashboard where you can quickly look at historical trends for the number of Slack messages per notification channel and on-call alerts raised per team.
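Even a very small script over an export of alert events gives you the historical trend. The CSV columns assumed below (timestamp, channel, team) are placeholders for whatever export your notification tooling provides.

```python
import csv
from collections import Counter
from datetime import datetime

# Assumed export format: timestamp,channel,team (one row per alert/notification).
def weekly_noise(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Bucket events by ISO-style week so trends are easy to chart.
            week = datetime.fromisoformat(row["timestamp"]).strftime("%Y-W%W")
            counts[(week, row["channel"])] += 1
    return counts

if __name__ == "__main__":
    for (week, channel), n in sorted(weekly_noise("alert_events.csv").items()):
        print(f"{week}  {channel:<25} {n}")
```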

Identify Patterns:

Set up regular check-ins to understand problem areas. Some of them will be an instant fix and some will need architectural changes. Having a ticketing system to keep track of all the changes required will go a long way.

Re-Calibrate:

Review the processes, tools, and thresholds every six months to a year with all stakeholders to keep up.

I hope this was helpful in giving you a better understanding of what to look forward to as you embark on the journey of monitoring and alerting.

Let’s connect to keep the discussion going: LinkedIn

Join us

At Hiver, we’re not just sharing emails; we’re building the future of communication with technology that bridges gaps and brings people closer, no matter where they are.

If you’re excited by the prospect of solving complex problems, diving deep into the world of distributed systems, and making a tangible impact on the efficiency and reliability of email-sharing workflows, we would love to hear from you. We believe in fostering a culture where creativity meets technology, and where individual contributions are valued and celebrated.

Discover the opportunities waiting for you at Hiver by visiting our careers page. Whether you’re a seasoned developer or just starting your journey in tech, we have a place for you. Together, we can shape the future of communication, one email at a time.
