Proactive monitoring: The why, what and how

Global Technology
McDonald’s Technical Blog
3 min read · Sep 28, 2022

Imagine the ability to solve an issue before the restaurant even realizes it has one! That is proactive monitoring – see first, solve fast.

by Zainab Tajmahal, Senior Manager, Product Engineering

I sit on a team that is responsible for the foundational Next Generation Restaurant Technology Platform (NGP), which enables Artificial Intelligence (AI) solutions. NGP provides a reliable, secure, and consistent CI/CD pipeline and granular visibility into daily operations, while enabling standardization at scale. These principles shorten time to market for deployments and help ensure a better customer experience.

Why
Traditional application monitoring has been focused on resolving problems after either an outage has been detected or a performance threshold has been breached. This is known as reactive monitoring.

In this model, by the time teams are made aware of the problem, critical minutes have passed, and customer-experience levels are negatively impacted.

How do application teams prevent this from happening? How do engineering teams maintain the highest levels of system availability? How can the business ensure customer experience levels remain as high as possible? Raising performance thresholds to artificially elevated levels will only generate alert noise, resulting in unactionable alerts and support fatigue.

For McDonald’s, the answer is: proactive monitoring.

What is proactive monitoring?
Proactive monitoring is the next evolutionary step beyond reactive monitoring. It focuses on alerting teams to potential problem behaviors before they escalate into failure conditions. Once a team is alerted, the next question to answer is, “What’s broken and why?”

Early identification of problem behavior is critical to heading off incidents that cause downtime, disrupt business operations, and create negative perceptions of application performance.

Proactive monitoring alerts can be broadly categorized into two types:

  1. Application Alerts: Configuring the right thresholds is key for application alerts. Product application teams should align with operations to baseline performance and business thresholds for warning indicators without overburdening support teams.
  2. Infrastructure Alerts: Setting infrastructure thresholds supports capacity planning and resource allocation, and helps detect server failures early.
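
To make the idea of baselined thresholds concrete, here is a minimal sketch in Python. It is not the NGP implementation. The function names, the "mean plus N standard deviations" policy, and the latency numbers are all assumptions chosen for illustration. The point is that a warning threshold sits below the critical one, so a team hears about drift before it becomes a failure condition.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Thresholds:
    warning: float   # early, proactive signal
    critical: float  # failure-level signal


def baseline_thresholds(samples, warn_sigmas=2.0, crit_sigmas=4.0):
    """Derive thresholds from samples taken during a healthy period.

    The sigma multipliers are a hypothetical policy, not a recommendation.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return Thresholds(
        warning=mu + warn_sigmas * sigma,
        critical=mu + crit_sigmas * sigma,
    )


def classify(value, thresholds):
    """Map a metric reading to an alert level."""
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "ok"


# Made-up order-latency samples (milliseconds) from a healthy period.
healthy_latency_ms = [100, 110, 90, 105, 95, 100, 108, 92]
limits = baseline_thresholds(healthy_latency_ms)

for reading in (105, 120, 150):
    print(reading, classify(reading, limits))
```

Tightening `warn_sigmas` catches problems earlier but produces more alerts, which is the trade-off application and operations teams need to agree on together.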

How do we do it?
Monitoring and observability are terms often used interchangeably since it is difficult to monitor a system without the tools and systems that allow for precise observations. Monitoring allows us to watch and detect known failures or incidents based on a predefined set of metrics and logs.
Observability is one of the core pillars of the NGP. It enables complex, modern, microservices-based solutions to maintain high availability by continuously collecting logs and traces, which are then used to improve performance, maintain uptime, and increase operational efficiency.

Core components of the NGP proactive monitoring stack collect metrics, errors, logs, and traces (MELT) to observe, monitor, and eventually predict events before they occur. We collect MELT data through:

  • Monitoring: Leveraging an infrastructure and application monitoring tool to view infrastructure and application data with built-in dashboards and queries.
  • Event Logging: Logging events as they occur, alongside monitoring of edge components and cluster configuration, with the ability to audit data seamlessly.
  • Cloud Tracing: Ingesting and analyzing logs to trace anomalies and understand the patterns and performance of restaurant systems.
  • Near Real-Time Alerting: Acting on detected anomalies or events by notifying persona-based users so they can react in time.
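
As a rough illustration of alerting on detected anomalies, the sketch below flags a metric reading that deviates strongly from a rolling window of recent values. This is an assumed technique (a simple rolling z-score), not a description of the tooling NGP uses. The class name, window size, and readings are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags readings that sit far from the recent rolling average."""

    def __init__(self, window=20, z_limit=3.0, min_samples=5):
        self.values = deque(maxlen=window)
        self.z_limit = z_limit
        self.min_samples = min_samples

    def observe(self, value):
        """Return True when the reading looks anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:
                is_anomaly = True
        # Anomalous readings are kept out of the window so that a
        # single spike does not distort the baseline for later checks.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly


# Made-up CPU utilization readings (percent) with one spike.
detector = RollingAnomalyDetector(window=10)
readings = [50, 52, 49, 51, 50, 48, 51, 95, 50]
flags = [detector.observe(r) for r in readings]

for reading, flagged in zip(readings, flags):
    if flagged:
        print(f"anomaly detected: {reading}")
```

In a real pipeline, the `True` result would feed a notification step that routes the alert to the right persona. That routing is omitted here because the source does not describe it.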

With automation and event correlation, our strategy is to progress from reactive to proactive to predictive monitoring. As they mature, application teams can fine-tune dashboards, experiment with thresholds, and group alerts by looking for common issue patterns to reduce noise.

A good signal-to-noise ratio translates into higher-quality alerts, which in turn helps reduce incident resolution times.
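
Grouping alerts by a common pattern can be sketched as follows. This is only an illustration under assumptions: the `Alert` fields, the grouping key (component plus symptom), and the store identifiers are invented, and a real event-correlation tool would use richer rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    restaurant_id: str
    component: str
    symptom: str


def group_alerts(alerts):
    """Collapse alerts that share a component and symptom into one group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.component, alert.symptom)].append(alert.restaurant_id)
    return dict(groups)


# Hypothetical raw alerts: three restaurants report the same kiosk issue.
alerts = [
    Alert("store-001", "kiosk", "high_latency"),
    Alert("store-002", "kiosk", "high_latency"),
    Alert("store-003", "kiosk", "high_latency"),
    Alert("store-001", "printer", "offline"),
]

grouped = group_alerts(alerts)
for (component, symptom), stores in grouped.items():
    print(f"{component}/{symptom}: {len(stores)} restaurant(s)")
```

Four raw alerts become two notifications here, and the grouped view also hints that the kiosk issue is probably a shared cause rather than three unrelated problems.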

Summary
By proactively monitoring, we:

  • Shift from reactive to proactive, increasing restaurants’ ability to ‘self-heal’ and resolve issues before they escalate.
  • Monitor products and the platform, both at the restaurant/location and in the cloud, in real time.
  • Help product owners and support teams exceed performance standards by providing decision-enabling key performance indicators.

So, what’s next?
We believe this technology can help restaurants to focus more on customer satisfaction and less on issue resolutions.

Adopting new enterprise tools, such as an incident-response platform that generates incident tickets for proactive alerts with runbook automation, moves us closer to our goal of simpler, more automated processes that enable us to stay one step ahead!
