Creating PULSE — A developer-friendly server monitoring tool (Part 5)

Matt Kingshott
Dec 21, 2018 · 6 min read

This is part of a weekly development blog series, where I will document the creation of an application from the initial idea through to its deployment on a scalable architecture. Even as an experienced developer, I find these stories to be interesting and I usually pick up a tip or two, so if you’d like to come along, and hopefully benefit in some way, let’s dig in!

NOTICE: Pulse has now launched and is available to use. You can create an account by visiting https://pulse.alphametric.co


Building a smart notifier

If you consider how Pulse works, it could easily end up sending out thousands of notifications every hour. Depending upon the data it receives, many of those would be identical notifications to the same users, essentially turning Pulse into a paid spam service!

Obviously, this is far from ideal. Instead, we have to triage these notifications:

  1. We have to employ minimum time intervals between sending out the same notification, e.g. alert someone at most once an hour about high CPU usage.
  2. We have to enforce a daily limit for each monitor. A user doesn’t need to know the same thing more than X number of times per day.
  3. We have to group the monitor data for a server where possible to further reduce the number of notifications sent out.

Fortunately, Laravel’s collections allow us to design a step-by-step approach for this. Consider the following pseudocode:

// Retrieve the monitors which have "bad" statistics
monitors = getMonitorsWithProblems()
// Filter out monitors which are not permitted to receive a notification
monitors -> filter() -> hasEnoughTimePassed()
monitors -> filter() -> withinLimit()
// Create a new notice that a notification will be sent
monitors -> each() -> storeNotice()
// Group the monitors to reduce the number of notifications
monitors -> groupBy(channel, route)
// Send the notifications for the monitors
monitors -> each() -> notify()

Now, let’s break down how the actual implementation for this would work. We have a method that pulls in the monitors with data that violates the user-specified threshold (and therefore requires sending a notification). Then…


Filtering out inappropriate monitors

We need to filter out the monitors that we have already sent a notification about recently. This is fairly simple to implement. Just add an interval column to the monitors table and assign it a number (in minutes). Then add a notified_at timestamp column to the table and update it whenever a notification is sent:

Method to check if enough time has passed since the last notification
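
A minimal sketch of such a check, written in plain Python rather than Laravel so that it runs standalone. The `interval` and `notified_at` fields come from the text above; the `Monitor` class, the function name and the handling of a monitor that has never been notified are assumptions made for illustration, not Pulse's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Monitor:
    interval: int                     # minimum minutes between notifications
    notified_at: Optional[datetime]   # time of the last notification, if any


def has_enough_time_passed(monitor: Monitor, now: datetime) -> bool:
    """Return True when the monitor may be notified again."""
    if monitor.notified_at is None:
        # Assumption: a monitor that was never notified has nothing to wait for.
        return True
    return now - monitor.notified_at >= timedelta(minutes=monitor.interval)
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the check easy to test.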

However, that’s only part of the solution. If we left it at this, the system would continue to send out notifications, just spaced out by the interval. We also need to enforce a maximum number to send per day.

For that, we need a notices table that links to the monitors table and contains records with nothing other than timestamps. We also need to add another column to the monitors table which, like interval, is a number that specifies the maximum number of notifications that may be sent out per day.

Then all we have to do is check whether the number of notice records for a given monitor is less than the daily maximum:

Method to check if the daily limit of notifications has been reached
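
As a rough sketch, again in standalone Python, the check could look like the following. The text does not name the daily-maximum column, so `daily_limit` is a hypothetical name, and the list of timestamps stands in for the notices table. Counting per calendar day (rather than over a rolling 24 hours) is also an assumption.

```python
from datetime import date, datetime
from typing import List


def within_limit(notices: List[datetime], daily_limit: int, today: date) -> bool:
    """Return True while fewer than daily_limit notices were stored today."""
    # Only notices created on the given calendar day count towards the limit.
    sent_today = sum(1 for created_at in notices if created_at.date() == today)
    return sent_today < daily_limit
```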

If a monitor satisfies the criteria for both of those filters, we can move on.


Storing a notice for the monitor

As you probably guessed, the second filter, left as is, will always return true because nothing has created a notice record yet, which is not what we want. If a monitor passes both filters, we need to go ahead and store a new notice for it, as well as update the monitor’s notified_at timestamp:

Keeping tabs on the notifications being sent
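
One way to picture this step, sketched in standalone Python: the `notices` list is a stand-in for the notices table, and the class and function names are hypothetical. Only the `notified_at` field comes from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Monitor:
    notified_at: Optional[datetime] = None
    # Stand-in for the notices table: one timestamp per notification.
    notices: List[datetime] = field(default_factory=list)


def store_notice(monitor: Monitor, now: datetime) -> None:
    """Record that a notification is about to be sent for this monitor."""
    monitor.notices.append(now)   # counts towards the daily limit
    monitor.notified_at = now     # restarts the minimum-interval check
```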

Grouping notifications together

Suppose we have bad data for a server’s CPU, storage disk and inbound network traffic, and suppose the user has configured the monitors for each of those items to send a notification by email to the same address.

It wouldn’t make sense to send a separate email for each item. Instead, we should send one email containing a summary of all three. Fortunately for us, Laravel allows us to group the collection however we need to.

First, let’s group by the notification channel (email, Slack etc.) and then we’ll group those nested items by the notification route (e.g. email address):

Grouping the data by channel, then route
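
The two-level grouping can be sketched in standalone Python as below. In Laravel this would be done with the collection's grouping methods; here a nested dictionary plays that role. The `Monitor` fields and the example addresses are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Monitor(NamedTuple):
    name: str
    channel: str   # e.g. "email" or "slack"
    route: str     # e.g. an email address or a Slack channel


def group_by_channel_and_route(
    monitors: List[Monitor],
) -> Dict[str, Dict[str, List[Monitor]]]:
    """Group monitors by channel first, then by route within each channel."""
    grouped = defaultdict(lambda: defaultdict(list))
    for monitor in monitors:
        grouped[monitor.channel][monitor.route].append(monitor)
    # Convert to plain dicts so that missing keys raise instead of auto-creating.
    return {channel: dict(routes) for channel, routes in grouped.items()}


monitors = [
    Monitor("cpu", "email", "ops@example.com"),
    Monitor("disk", "email", "ops@example.com"),
    Monitor("network", "slack", "#alerts"),
]
groups = group_by_channel_and_route(monitors)
```

Each innermost list then corresponds to a single notification: here the CPU and disk monitors share one email, while the network monitor gets its own Slack message.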

Finally send the notifications

Now that we’ve satisfied all the filtering criteria and organised the statistical data into groups, we can finally send out the notifications! Here’s an example of the email notification, with the data nicely presented:

An email notification sent by Pulse

Taking stock of the changes

At this point, we’ve done a lot to move ourselves away from the basic process at the start of the article that would have mindlessly sent out notifications without a second thought. We can illustrate the change with some numbers:

Assuming we had bad data for a server’s CPU, storage disk and inbound network traffic, and that bad data was continuously coming in for 24 hours, then Pulse would have sent out 3 separate emails every 2 minutes for that period.

That’s a whopping 2,160 emails!

Aside from the financial cost of that, I’m pretty sure our mail provider would have disabled our account in the belief that we were sending out spam.

Now, let’s examine the same scenario, but with Pulse’s new smart notifier powering the decision making. Assuming that the monitor interval is 60 (so, once an hour) and the daily maximum limit is 5, then Pulse would have sent out 1 email, every 60 minutes, up to a maximum of 5 times.

That’s a reduction of 2,155 emails!
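
The arithmetic behind these figures can be checked with a few lines of Python. The function names are hypothetical, and the second function assumes that all three monitors share one channel and route, so each round produces a single grouped email.

```python
MINUTES_PER_DAY = 24 * 60


def emails_without_notifier(monitor_count: int, report_every_minutes: int) -> int:
    """One email per monitor for every batch of bad data received in a day."""
    return monitor_count * (MINUTES_PER_DAY // report_every_minutes)


def emails_with_notifier(interval_minutes: int, daily_limit: int) -> int:
    """One grouped email per interval, capped by the daily limit."""
    return min(MINUTES_PER_DAY // interval_minutes, daily_limit)


before = emails_without_notifier(3, 2)   # 3 monitors, data every 2 minutes
after = emails_with_notifier(60, 5)      # hourly interval, limit of 5 per day

print(before)          # 2160
print(after)           # 5
print(before - after)  # 2155
```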

We’re still conveying the same information, but in a manner which is both practical and far easier for the user to interpret.

Now, to be fair, the number could be higher if, for example, you assigned a different email address to each monitor. However, you’re unlikely to use that approach in a real-world setting.

A more realistic scenario is assigning higher-priority monitors, e.g. CPU, to a channel like Slack to get a faster response, while disk usage is probably less critical at that exact moment and is therefore fine with an email.

Ultimately, Pulse allows you to do what you want. Assign different channels for every monitor, or use one for all. It’s entirely up to you.


Wrapping Up

Well, that’s it for this week. Next up, we’ll be looking at another feature that Pulse will be supporting… silent servers. Our notifier works when it receives data, but what if it doesn’t get any? In that situation, we need to inform users that we’re not receiving statistics and that maybe their server is down.

All that is coming in next week’s article. In the meantime, be sure to follow me here on Medium, and also on Twitter for more frequent updates.

Thanks, and happy coding!

Matt Kingshott

Senior developer at @alphametric_co. Generally working with PHP / Laravel / Vue. Open source fanatic.