Are Your Backups Still Working? Are You Sure?

Alex Quach
Mar 2, 2016

Taking regular backups of important data is a crucial safeguard against data loss. Lack of backups can kill your company: just ask JournalSpace.

What’s worse than not making backups at all is setting up a backup solution that silently breaks—you think you’re protected, but when you need those backups most, they’re missing. It’s like biking every day with a helmet, only to find out the hard way that the helmet’s defective.

Finding out the hard way happens more often than you’d think, because there’s often a long time between when you first set up backups and when you need them, which creates a long window of opportunity for breakage. Plus, you’re unlikely to check in on your backups since they’re rarely at the forefront of your attention.

An easy and automatic way to make sure your backups are still working is to use a watchdog timer. A watchdog timer listens for incoming pings and alerts when they stop coming in as expected. If your daily backup script pings the watchdog timer whenever it finishes, then the watchdog can alert you whenever it hasn’t heard a ping in more than a day. Thus, you can tackle the issue as soon as the backups stop coming in as expected, which avoids an unpleasant surprise down the road.
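To make the mechanism concrete, here is a minimal sketch of the check a watchdog performs on its side. The function name `is_overdue`, its epoch-second arguments, and the hour of slack in the example are illustrative assumptions, not the implementation of any particular service.

```shell
# is_overdue LAST_PING NOW MAX_AGE
# LAST_PING and NOW are Unix timestamps in seconds; MAX_AGE is in seconds.
# Succeeds (exit 0) when the last ping is older than MAX_AGE, meaning the
# watchdog should alert. Fails (exit 1) when the job checked in recently.
is_overdue() {
    last_ping=$1
    now=$2
    max_age=$3
    [ $((now - last_ping)) -gt "$max_age" ]
}

# Example: a daily job with an hour of slack (25 hours = 90000 seconds).
# if is_overdue "$last_ping" "$(date +%s)" 90000; then
#     echo "daily-backup has not reported in over 25 hours"
# fi
```

Note that the default state is "alert": the only thing that keeps the watchdog quiet is a sufficiently recent ping.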

How to set up a timer

The easiest way to set up a timer is to use a third-party monitoring service. If you already use a monitoring service that pings your systems to make sure they’re up, check whether it supports watchdog timers, also known as dead man’s switches or inbound kicks.

Once you’ve picked a service, it’s as easy as curling the watchdog URL whenever you’d like to send a ping. For example, for a simple backup script:

mysqldump -uroot -proot mydatabase > backup.sql && curl k.wdt.io/daily-backup

With this setup, the ping will be sent only if the backup was successfully made (exit code 0). In all other cases, such as the task not running or the dump failing, the ping will not be sent and the watchdog will alert you.
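The one-liner can be hardened a little. The sketch below wraps the same idea in a hypothetical `run_then_ping` helper; the helper name and the `PING_CMD` override (handy for dry runs) are assumptions made for illustration. The curl flags are standard: `-f` makes curl exit non-zero on an HTTP error, `-sS` keeps it quiet except for errors, `-m` sets a timeout, and `--retry` retries transient failures.

```shell
# run_then_ping URL COMMAND [ARGS...]
# Runs COMMAND and pings URL only if COMMAND exits with status 0.
# PING_CMD is a hypothetical override for dry runs; it defaults to curl.
run_then_ping() {
    url=$1
    shift
    "$@" || return $?   # job failed: propagate its status, send no ping
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors,
    # -m 10: give up after 10 seconds, --retry 3: retry transient failures.
    ${PING_CMD:-curl} -fsS -m 10 --retry 3 "$url" > /dev/null
}

# Usage, mirroring the one-liner above (the redirect needs its own shell):
# run_then_ping k.wdt.io/daily-backup \
#     sh -c 'mysqldump -uroot -proot mydatabase > backup.sql'
```

If the ping itself fails, the helper returns curl's exit status, so a cron wrapper can at least log that the backup succeeded but the check-in did not.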

Incidentally, you may have already set up some sort of error monitoring that alerts you whenever the dump fails—something like this:

mysqldump -uroot -proot mydatabase > backup.sql || <ALERT! SOMETHING WENT WRONG!>

This approach detects the case when the script runs but fails to create a backup. However, it does not detect the case when the script does not run at all, which can happen for many reasons. For example, maybe you accidentally renamed the script, so your cron job can’t execute it. The alert above would never be triggered, and you’d never realize that something was wrong. Watchdogs guard against this case by detecting the interruption in successful pings and alerting you.
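The two approaches are not mutually exclusive. As a rough sketch, a job can alert immediately on failure and also check in on success; the function name and its three callback arguments are hypothetical, and each would be a command or function you define yourself.

```shell
# run_job_with_monitoring JOB ON_FAILURE ON_SUCCESS
# Each argument is the name of a command or shell function.
run_job_with_monitoring() {
    job=$1
    on_failure=$2
    on_success=$3
    if "$job"; then
        "$on_success"   # heartbeat: keeps the watchdog quiet
    else
        "$on_failure"   # immediate alert: the job ran but failed
        return 1
    fi
    # If this function is never invoked at all (renamed script, broken cron),
    # neither branch runs. Only the missing heartbeat reveals that case.
}
```

The failure alert gives faster, more specific feedback when the job runs and breaks, while the watchdog covers everything else.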

At a higher level, watchdogs are superior to catching errors because they’re an end-to-end test that your whole system is functioning. Watchdogs are a pessimistic functionality check: things are assumed broken unless you send a ping. In contrast, catching errors is an optimistic functionality check: things are assumed fine unless you alert. Pessimism is the right mindset for something as critical as backups.

Watchdogs at Penny

At Penny, we use wdt.io, which has worked well for us. We have watchdogs set up for every scheduled job we run across our entire system, from weekly analytics reporters to webhook handlers. As soon as any job doesn’t report as expected, we get notified and can investigate. Bandwidth is precious for our small team, so we’ve found these watchdogs to be the perfect way to increase stability without taxing developer attention.

As it turns out, watchdogs can be used to monitor any operation that should be happening but isn’t, such as sending push notifications, handling API requests, or serving a page. If you expect to be sending many notifications every minute, then you can configure a watchdog to alert whenever you haven’t sent one in five minutes, which likely indicates an anomaly worth investigating. Push notifications are a particularly important case to instrument because a notification outage is much harder to detect than a site going down.
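For a high-frequency operation like this, pinging on every single event is wasteful; one ping per minute is plenty to satisfy a five-minute watchdog. Here is a minimal sketch of that throttling, shown in shell for consistency with the examples above. The `should_ping` helper, the stamp file path, and the `push-notifications` timer name are all assumptions, and the file-based check is not safe against concurrent writers.

```shell
# should_ping STAMP_FILE MIN_INTERVAL NOW
# Succeeds if at least MIN_INTERVAL seconds have passed since the timestamp
# stored in STAMP_FILE (or if the file does not exist yet), and records NOW
# as the new timestamp. Fails otherwise, leaving the file untouched.
should_ping() {
    stamp_file=$1
    min_interval=$2
    now=$3
    last=0
    [ -f "$stamp_file" ] && last=$(cat "$stamp_file")
    if [ $((now - last)) -ge "$min_interval" ]; then
        echo "$now" > "$stamp_file"
        return 0
    fi
    return 1
}

# Usage after each successfully sent notification (hypothetical timer name):
# if should_ping /tmp/push-notify.stamp 60 "$(date +%s)"; then
#     curl -fsS -m 10 k.wdt.io/push-notifications > /dev/null
# fi
```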

Watchdog timers are a critical component of system monitoring. Thanks to them, we can be sure that periodic jobs crucial to the system are still running. Hope you give them a shot!

If you enjoyed this, don’t be shy about 👏 for it!

As always, feedback is welcome: @alexquach or alex [at] pennyapp.io.

Alex, cofounder @ Penny.
