In the Wild: Monitoring and Alerting on Applications in Production

Kelly Froelich
Ro Engineering Blog
Jan 17, 2020

You release your application to production. Now what? Here at Ro, we test and monitor at every stage, from the first steps of building a feature to when that feature is live in production.

Monitoring and alerting on applications in production is the backstop for ensuring a high-quality product. The goal of monitoring and alerting, to borrow a baseball metaphor, is not to prevent wild pitches from getting past the catcher (or bugs from getting into production), but rather to catch them early, so the result is a single stolen base instead of a steal of home plate.

Alerting

Our philosophy on alerting is that if everything is alerted on, then nothing is alerted on. Instead, we have 3–5 key business metrics for each application that indicate its health. These metrics have set thresholds that fire an email if breached.

The first firing is a “warning shot”, indicating that a threshold was softly breached and there may be a business-affecting bug. The second firing is the alert, indicating that a hard threshold was breached and that investigation is necessary to confirm the bug and, if it is a true positive, to communicate with the affected stakeholders. More on communication below.
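As a rough illustration, the two-tier check can be sketched like this. The thresholds, metric ratio, and send_email helper below are hypothetical stand-ins, not our actual implementation:

```python
# Minimal sketch of a two-tier threshold check (hypothetical values and
# helper names; not the production implementation described in this post).

WARNING_THRESHOLD = 0.95   # soft threshold: fire a "warning shot"
ALERT_THRESHOLD = 0.90     # hard threshold: fire the alert

def send_email(message: str) -> None:
    # Placeholder: in practice this would call an email/alerting service.
    print(message)

def evaluate_metric(name: str, value: float, baseline: float) -> None:
    """Compare a key business metric against its baseline and email if breached."""
    ratio = value / baseline
    if ratio < ALERT_THRESHOLD:
        send_email(f"[ALERT] {name} hard threshold breached: {value:.2f} "
                   f"vs baseline {baseline:.2f}. Investigate and notify stakeholders.")
    elif ratio < WARNING_THRESHOLD:
        send_email(f"[WARNING] {name} soft threshold breached: {value:.2f} "
                   f"vs baseline {baseline:.2f}. Possible business-affecting bug.")

# Example: a daily conversion count of 87 against a baseline of 100 would
# breach the hard threshold and fire the alert.
evaluate_metric("daily_conversions", 87, 100)
```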

The absence of alerts does not guarantee the absence of bugs. Rather, it means that any bugs in production are not materially affecting the business in the metrics we care about most. In addition, because these alerts fire proactively, the team can continue with the day-to-day without constantly having to refresh a dashboard of health metrics to see if anything has been breached.

Monitoring

While alerting is a science (if a metric falls below X, then alert), monitoring is an art. For each application, we have dashboards built in Looker that visualize key metrics, such as the hourly conversion rate compared to the average conversion rate of the previous three weeks, along with the percent delta between those numbers. Here, our team is able to identify patterns in the data and discern anomalous behavior from those patterns.
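For illustration only, that comparison might be computed along these lines. The events DataFrame, its columns, and the week-over-week baseline logic are assumptions for the sketch, not our actual Looker model:

```python
# Illustrative sketch: compute hourly conversion rate, compare it to the
# average of the same hour over the previous three weeks, and report the
# percent delta. Assumes `events` has a datetime `timestamp` column plus
# `visits` and `conversions` counts.

import pandas as pd

def hourly_conversion_delta(events: pd.DataFrame) -> pd.DataFrame:
    hourly = (
        events.set_index("timestamp")
              .resample("1H")[["visits", "conversions"]]
              .sum()
    )
    hourly["conversion_rate"] = hourly["conversions"] / hourly["visits"]

    # Average of the same hour across the previous three weeks (168 hours = 1 week).
    baseline = (
        hourly["conversion_rate"].shift(168)
        + hourly["conversion_rate"].shift(336)
        + hourly["conversion_rate"].shift(504)
    ) / 3

    hourly["baseline_rate"] = baseline
    hourly["pct_delta"] = 100 * (hourly["conversion_rate"] - baseline) / baseline
    return hourly
```

A dashboard built on this kind of output makes a small but persistent negative delta easy to spot, even when no threshold has been breached.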

For example, a few weeks ago, a teammate of mine noticed a dip in the number of renewals we had on a given day. The dip was not large, nothing to sound an alarm over, but after investigating we found that we were sending out a broken renewal link. Fix implemented, bug fixed. We had a business-affecting bug in the system for only 24 hours instead of, perhaps, 24 days.

Incident Response

Alerting and monitoring become null and void if nothing is done about them. Therefore, when we launched application health monitoring, we also launched a protocol for investigation. Our QA team leads the investigative process on a rotating on-call schedule. If an alert fires or the monitoring shows an anomaly, the point person conducts an investigation to determine whether there truly is a problem, drawing on the experts for the part of the application where the anomaly is occurring.

For instance, a low top of funnel may indicate a problem on our webpage, or simply an expected dip in traffic. To determine which it is, we rely on our domain experts and reach out directly to work together whenever we receive an alert or see a pattern in the data. If and when a problem is confirmed, it is time to act. For severe business-affecting bugs, we trigger our incident response protocol. For bugs that do not require incident response, we have a dedicated Slack channel where representatives from across the company, in addition to engineering, are present and are tagged specifically so that they are aware.

Our goal is to be the first in the company to catch a bug and the first to communicate it, without being known for crying wolf (though we err on the side of crying wolf rather than silence).

Communication

Communication is key. Just as alerting and monitoring are null and void if nothing is done, incident response is null and void if nothing is communicated. The first step in communication is ensuring the response team is equipped with all the information needed to tackle the issue, so we provide both high-level numbers on the scope of the problem (e.g. the number of people affected) and specific examples to investigate.

The second, and equally important, step in communication is with key stakeholders. Communication with stakeholders is not one-way, nor is it purely downstream. Rather, when we began building the application monitoring and alerting process, we drew on the expertise of these stakeholders to determine which metrics were most important to monitor. We reviewed the monitoring and alerting metrics with them for feedback, and gave them full access to the dashboards, along with training, so we can align on the data we are seeing. We created a two-way, instant communication channel in Slack so that, if these metrics are breached, we can discuss the root cause together. We also send “exterminator reports”: brief Slack messages that recap any bugs caught in the previous week and, more importantly, their business impact. Ultimately, we make sure we are all on the same team in ensuring the health of our applications.

System <> Business

Our next frontier is combining our system metrics monitoring (e.g. what is our latency?) with our business metrics monitoring (e.g. what is our conversion rate?). The value of this combination is that we will be able to understand how our system affects our business. “The site is down” evolves into “the site is down, so we lost Y potential customers”. As engineers, it is easy to stay in the frame of mind of the former, but it is the latter that lets us use our understanding of the business to build a better system.
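As a rough, hypothetical illustration of that translation (the numbers and function below are made up for the sketch, not real figures):

```python
# Back-of-the-envelope sketch of translating a system metric (downtime) into a
# business metric (lost potential customers). All values are hypothetical.

def estimated_lost_customers(downtime_minutes: float,
                             avg_visitors_per_minute: float,
                             conversion_rate: float) -> float:
    """Estimate conversions missed while the site was down."""
    return downtime_minutes * avg_visitors_per_minute * conversion_rate

# Example: a 30-minute outage with 20 visitors/minute and a 5% conversion rate
# works out to roughly 30 * 20 * 0.05 = 30 lost potential customers.
print(estimated_lost_customers(30, 20, 0.05))
```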
