Facebook’s Prineville Data Server Center

Creating PULSE — A developer-friendly server monitoring tool (Part 6)

Matt Kingshott
Dec 27, 2018 · 5 min read

This is part of a weekly development blog series, where I will document the creation of an application from the initial idea through to its deployment on a scalable architecture. Even as an experienced developer, I find these stories to be interesting and I usually pick up a tip or two, so if you’d like to come along, and hopefully benefit in some way, let’s dig in!

NOTICE: Pulse has now launched and is available to use. You can create an account by visiting https://pulse.alphametric.co


Servers giving us the cold shoulder

So far, Pulse has been engineered to respond to data sent by servers. In other words, it is reacting to stimuli. However, there’s also another question that we should be asking ourselves:

What happens when a server doesn’t send data? Or, more specifically, what happens when it is silent?

Users will need to be able to rely on Pulse not just to monitor the individual components of their servers, but also their servers as a whole.

Before we dig into the technical solution for that, let’s take a look at some of the reasons why we might class a server as silent:

  1. The server may be shut down or powered off.

One or more of the above will mean that Pulse isn’t able to do its job. Since we’re not going to be running diagnostic software on the server, Pulse will instead need to take a blind approach and simply assume silent status regardless of the root cause.

Now that we have the background, let’s take a look at the implementation we can use to handle scenarios where servers are silent.


Discovering the silent servers

Laravel includes a task scheduling component, which we can use to fire off a job or console command at a pre-defined interval:

$schedule->command("app:silent-servers")->everyFiveMinutes();

As a side note, I prefer to place such functionality inside a console command as it allows me to easily trigger it through the Artisan CLI.

So, how do we discover which servers are silent? Well, fortunately, we have a database relationship between servers and logs to help us with that. Since our logs contain timestamps, we can simply retrieve the latest one we have and use it to infer when we last heard from a server.

The base query for finding silent servers

Let’s break this down a little. Firstly, we’re skipping servers that are already marked as silent, as well as servers which are new and yet to be fully set up by the user. Next, we’re pulling in the columns we need, and finally, we’re pulling in the latest available log for the server.

Once we have these results, we’re chunking them. This is a Laravel feature that helps to prevent memory exhaustion when dealing with potentially thousands of records. Rather than loading every record at once, it pulls in 25 records, passes them through the closure, then pulls in the next 25 with another query, and loops until complete.
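To make the chunking idea concrete, here is a minimal plain-PHP sketch of the pattern. It is not Laravel’s implementation and not Pulse’s query: `processInChunks` and `$fetchPage` are hypothetical names, and an in-memory array of 60 fake server IDs stands in for the database table.

```php
<?php

declare(strict_types=1);

/**
 * Plain-PHP illustration of chunked processing (not Laravel's own code).
 * $fetchPage is a hypothetical callable standing in for one database query:
 * it receives an offset and a limit and returns at most $limit records.
 * Returns the number of "queries" that were made.
 */
function processInChunks(callable $fetchPage, int $size, callable $handler): int
{
    $queries = 0;
    $offset = 0;

    do {
        $records = $fetchPage($offset, $size); // one query per chunk
        $queries++;

        if ($records !== []) {
            $handler($records); // at most $size records are held at a time
        }

        $offset += $size;
    } while (count($records) === $size); // a short page means we are done

    return $queries;
}

// Stand-in data source: 60 fake server IDs instead of a real table.
$servers = range(1, 60);
$fetchPage = fn (int $offset, int $limit): array => array_slice($servers, $offset, $limit);

$chunkSizes = [];
$queries = processInChunks($fetchPage, 25, function (array $chunk) use (&$chunkSizes): void {
    $chunkSizes[] = count($chunk);
});
```

With 60 records and a chunk size of 25, the handler sees batches of 25, 25 and 10. One thing to watch for with offset-based chunking: if the closure updates the same column the query filters on (as marking servers silent might), later pages can shift and rows can be skipped. Laravel’s documentation suggests `chunkById` for that situation.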


Reporting on silent servers

At this point, we don’t actually have confirmed silent servers. We only have potentially silent servers. We now need to review the timestamp of the most recent log and determine if the difference from the current time exceeds the limit that Pulse will tolerate before considering the server silent.

It is important that we not be too strict with the limit because, in the real world, delays happen. That said, we can’t be too liberal either. Finding the sweet spot is something that will likely be determined over time, but for now, we’ll be starting with ten minutes. If you have an opinion on this, I’d love to hear it!

Let’s take a look at the code that makes this work:

The process for dealing with silent servers

Let’s break this down and see what’s actually happening:

  1. We filter the collection to only include servers without a log, or servers with a latest log that was created ten or more minutes ago.
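The filtering step can be sketched in plain PHP as follows. This is an illustration under stated assumptions rather than Pulse’s actual code: `isSilent`, `SILENCE_LIMIT_SECONDS` and the sample server names are hypothetical, and a simple array of timestamps stands in for the servers and their latest logs.

```php
<?php

declare(strict_types=1);

// Assumption: the ten-minute limit from the article, expressed in seconds.
const SILENCE_LIMIT_SECONDS = 600;

/**
 * Hypothetical helper: decide whether a server counts as silent.
 * $latestLogAt is null when the server has no log at all.
 */
function isSilent(?DateTimeImmutable $latestLogAt, DateTimeImmutable $now): bool
{
    if ($latestLogAt === null) {
        return true; // no log on record, so we have never heard from it
    }

    // Silent when the latest log is ten or more minutes old.
    return ($now->getTimestamp() - $latestLogAt->getTimestamp()) >= SILENCE_LIMIT_SECONDS;
}

$now = new DateTimeImmutable('2018-12-27 12:00:00', new DateTimeZone('UTC'));

// Hypothetical records: server name => timestamp of its latest log (or null).
$servers = [
    'web-1' => $now->modify('-3 minutes'),
    'web-2' => $now->modify('-10 minutes'),
    'db-1'  => null,
];

// Keep only the servers that pass the silence check.
$silent = array_keys(
    array_filter($servers, fn (?DateTimeImmutable $at): bool => isSilent($at, $now))
);
```

Here `web-2` and `db-1` end up in `$silent`, while `web-1` reported recently enough to be left alone. Passing `$now` in as an argument, rather than reading the clock inside the function, keeps the check easy to test.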

Making things right again

You may be wondering: if this is all the code being used, won’t a server stay silent permanently once it has been marked that way? Thankfully, no.

Pulse’s logging system is independent of the silent server checking feature. As such, all a user needs to do to “fix” a silent server is correct whatever problem led to the server not sending data.

When Pulse starts receiving data again, it will parse it and then update the server’s status to good or bad. This is the same approach used by Pulse when you’re setting up a new server. Its status will remain new until Pulse starts receiving data, then it will simply update it.
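As a rough illustration of why a silent server recovers on its own, here is a small plain-PHP sketch. It is not Pulse’s code: `applyIncomingData` is a hypothetical function, a plain array stands in for the server model, and the `$hasProblem` flag is an assumption standing in for whatever checks Pulse runs on the parsed data.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: set a server's status when fresh data arrives.
 * The previous status ("new", "silent", "good" or "bad") is not consulted,
 * which is why a silent or new server needs no special recovery step.
 *
 * @param array{name: string, status: string} $server
 * @return array{name: string, status: string}
 */
function applyIncomingData(array $server, bool $hasProblem): array
{
    $server['status'] = $hasProblem ? 'bad' : 'good';

    return $server;
}

// A silent server that starts reporting healthy data again.
$recovered = applyIncomingData(['name' => 'web-2', 'status' => 'silent'], false);

// A new server whose first report reveals a problem.
$firstReport = applyIncomingData(['name' => 'db-1', 'status' => 'new'], true);
```

After these calls, `$recovered['status']` is `good` and `$firstReport['status']` is `bad`. The silent check never needs to know about either transition.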

If you’re curious, here’s the code Pulse uses when updating a server’s status:

The logic Pulse uses to set a server’s status

Wrapping Up

Well, that’s it for this week. Next up, we’ll be stepping away from the app itself and examining the shell script that users will run on their servers in order to send statistics to Pulse. We’ll also discuss the motivation behind this strategy, as well as sudo permissions and how they contributed to this approach.

All that is coming in next week’s article. In the meantime, be sure to follow me here on Medium, and also on Twitter for more frequent updates.


Thanks, and happy coding!

Matt Kingshott

Written by

Senior developer at @alphametric_co. Generally working with PHP / Laravel / Vue. Open source fanatic.
