There’s a CloudWatch on the Horizon

Clouds over downtown Austin (photo credit: Fredlyfish4, Wikimedia Commons)

Here at RetailMeNot, we’ve found that Amazon Web Services allows us to prototype and iterate on software at a pace that would otherwise not be achievable.

We regularly work with our technical account manager (TAM) to make sure we’re taking full advantage of AWS. This includes optimizing our usage of existing AWS features, of course — but our TAM also opens the door to a few forms of special treatment in the AWS ecosystem 😁.

Often our TAM offers beta access to a new feature that addresses an issue we’re facing. Sometimes we get sneak peeks at the AWS roadmap to help us decide which path to take for something we’re building. We can even provide feedback on a product directly to its developers. Maybe this feedback even influences the roadmap 😉.

I experienced this firsthand earlier this year.

The Inquiry

My team transitioned much of our alerting from Sensu to CloudWatch this year. For the most part, it’s been a smooth move, but we encountered a couple of behaviors that surprised us. I emailed our TAM about them (excerpt below).


Subject: Two CloudWatch Nice-to-haves

1. Configure specific state-to-state transitions

We use CloudWatch to alert us of issues in production. For each CloudWatch alarm, we get ALARM emails when action is required. This works great, and I’m glad to be using CloudWatch.

It would also be nice to receive OK emails when the system recovers on its own (e.g. from a load or latency spike). But when we configure our alarms to email us on OK transitions, we get flooded with OK messages. This isn’t a bug — it’s due to an expected CloudWatch behavior. When a CloudWatch alarm is created, it transitions from INSUFFICIENT_DATA to OK. So every time we create a CloudFormation stack, we get as many OK emails as we have alarms with “OkActions”.
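
For reference, a typical alarm of ours looks something like the boto3 sketch below; every name, namespace, and ARN here is a placeholder rather than our real configuration. The OKActions list is the part that sends the email on any transition into OK, including that initial INSUFFICIENT_DATA -> OK transition.

import boto3

cloudwatch = boto3.client("cloudwatch")

# A simplified sketch of one of our alarms; all names and ARNs are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="api-latency-high",
    Namespace="MyService",
    MetricName="Latency",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:team-alerts"],
    # OKActions fire on *any* transition into OK, including the
    # INSUFFICIENT_DATA -> OK transition when the alarm is first created.
    OKActions=["arn:aws:sns:us-east-1:123456789012:team-alerts"],
)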

Here are some of the emails generated from a multi-region deployment in our Production account:

We create and tear down CloudFormation stacks often during blue-green deployments, so we’ve worked around this by turning off OkActions. But it would be convenient, and useful long-term, to be able to configure this in the CloudWatch API: “Only send an OK email when the previous state was ALARM.”

The following line, from the alarm emails, gives me hope that this is possible:

State Change: INSUFFICIENT_DATA -> OK

If I’m reading this correctly, some part of the CloudWatch infrastructure is already keeping track of the previous state, so it could be exposed in the CloudWatch API.

I guess this would be an additional parameter to PutMetricAlarm — something like

"StateChangeActions": [
    {
        "OldState": "ALARM",
        "NewState": "OK",
        "Actions": ["arn1", "arn2"]
    }
]

To reproduce the existing CloudWatch behavior with this new parameter, you could leave “OldState” unspecified. Thus

"AlarmActions": ["arn1", "arn2"]

would be equivalent to

"StateChangeActions": [
    {
        "NewState": "ALARM",
        "Actions": ["arn1", "arn2"]
    }
]

2. Ability to “treat missing data points as 0”

Suppose that, every time an event happens, I publish metric data like

{
    "Value": 3,
    "Unit": "Count"
}
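
In boto3 terms, publishing one of those data points might look like the sketch below (the namespace and metric name are placeholders for whatever the event-producing code uses):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a single data point for one event.
cloudwatch.put_metric_data(
    Namespace="MyApp/Events",
    MetricData=[
        {
            "MetricName": "EventsProcessed",
            "Value": 3,
            "Unit": "Count",
        }
    ],
)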

Now suppose that there are occasional intervals where no events happen. Then the event graph is sparse:

For some applications this makes total sense. For other analysis/reporting needs, it’d be nice if we could choose to interpret the lack of data as “0”.

This would be especially useful for event-based systems like AWS Lambda, where there’s no persistent system that can publish “0” data points every minute.


Three days later, we were on a call with the CloudWatch team.

The Discussion

“Thank you for your email. We really enjoyed reading it.”

Given that I was criticizing (constructively) something they had built, I was relieved not to detect even a hint of irritation in his tone. He followed with more good news.

The team was already considering adding state-to-state alarm configuration to the CloudWatch roadmap, and he suggested a workaround we could use right away (sketched after the list below):

  • Instead of the alarm emailing the team directly, configure the alarm to notify an intermediate SNS topic.
  • Write an application that implements a simple filter based on the alarm’s current and previous states, and subscribe it to this intermediate topic.
  • If the alarm passes the filter, have the application forward it to a second SNS topic that emails the team.
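
Here’s a minimal sketch of that filter as a Python Lambda handler. The NOTIFY_TOPIC_ARN environment variable and the exact filtering rule are our own choices; the alarm fields (OldStateValue, NewStateValue, AlarmName) come straight from the JSON that CloudWatch puts in the SNS message body.

import json
import os

import boto3

sns = boto3.client("sns")
NOTIFY_TOPIC_ARN = os.environ["NOTIFY_TOPIC_ARN"]  # the second topic, which emails the team


def handler(event, context):
    for record in event["Records"]:
        # CloudWatch alarm notifications arrive as a JSON string in the SNS message body.
        alarm = json.loads(record["Sns"]["Message"])
        old_state = alarm["OldStateValue"]
        new_state = alarm["NewStateValue"]

        # Forward every transition into ALARM, but only forward OK
        # when the alarm is actually recovering from ALARM.
        if new_state == "ALARM" or (new_state == "OK" and old_state == "ALARM"):
            sns.publish(
                TopicArn=NOTIFY_TOPIC_ARN,
                Subject="{} is now {}".format(alarm["AlarmName"], new_state),
                Message=record["Sns"]["Message"],
            )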

We also discussed my request to “have missing data points treated as 0.” The CloudWatch developers identified this as a special case of another feature they were considering: the ability to treat missing data points as non-breaching from an alarm’s point of view. If an alarm were configured this way, a missing data point would implicitly count toward the alarm’s “OK” state, regardless of how the alarm threshold was defined!
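
To make that concrete, the configuration we were hoping for might look something like the boto3 sketch below. The parameter name and its value are our guess at an interface, not anything that existed in the API at the time of the conversation, and every other name is a placeholder.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical: an alarm on an error count where gaps in the data simply
# count toward OK, instead of pushing the alarm into INSUFFICIENT_DATA.
cloudwatch.put_metric_alarm(
    AlarmName="function-errors",
    Namespace="MyApp/Lambda",
    MetricName="Errors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:team-alerts"],
    TreatMissingData="notBreaching",  # our guess at a parameter: missing data counts toward OK
)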

The Epilogue

Ultimately, we followed their suggested workaround for state-to-state configuration. We understood this meant accruing some tech debt, but we were comfortable with the decision, since unwinding it will get much easier once CloudWatch supports state-to-state configuration natively.

Likewise, knowing that the CloudWatch team already had the “treat missing data as non-breaching” problem on their radar gave us confidence that we wouldn’t have to address it ourselves. Rather than spend time now devising a robust solution, we implemented a temporary workaround to handle our special case.

AWS constantly rolls out product updates that boost our development velocity. Having a window into the upcoming features is sort of the icing on the cake for us. It gives us that little bit of extra insight we sometimes need when deciding how we build something.