Backups Are Fun!

Rich Marshall
Wealth Wizards Engineering
6 min read · Apr 28, 2018

OK, so they’re not actually fun, but in a somewhat perverse way I’ve rather enjoyed my last week working on them.

For a while now we’ve had a couple of items on our backlog that we’ve been mostly pretending aren’t there because they say “backup”. We already fire-drill our backups on a monthly basis, but we were finding too often that a backup had failed to complete correctly, leaving it in an unhealthy state.

We were also starting to build up quite a collection of historical backups, part of a belt-and-braces approach to the retention policy (acceptable at the time, when the backup files were small enough).

This led to two tasks being raised: “Implement new backup policy” (thank you, GDPR, for ensuring this was required) and “Backup visibility”. You can see why we were falling over each other to be the one to pick them up.

We persist all our data in MongoDB, so our backup process really only concerns MongoDB at this point. To date we’d been using a slightly modified container found on GitHub. Essentially, the container is a cron process which triggers a MongoDB client to list the databases, back up each one and then hand over to s3cmd.

S3cmd would then encrypt each backup file and push it up to S3, where it was stored until needed. The S3 buckets also encrypt-at-rest so we’re pretty happy with the degree of protection afforded to the data in these backups.

The container runs inside Kube, and this is where our reliability problem started. The backup, encrypt, push-to-S3 process appeared to be memory-hungry, and this was causing the host OS to kill the container at the most critical moment; i.e. when the backup was taking place.

This was resolved by putting some memory and CPU limits (we already had requests in there) on the pod. It turns out that while the process would happily consume gobs of memory, it didn’t need to in order to run, and it’s actually happy with ~500MB.
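For illustration, the limits look something like the fragment below in the pod spec. The ~500MB figure is from our observation above; the CPU numbers and request values here are placeholders, not our actual settings.

```yaml
# Fragment of the backup pod/container spec (values illustrative)
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # roomy enough for the ~500MB the process actually needs
    cpu: "500m"
```

With the limit in place the kernel OOM-kills the process inside its own cgroup rather than letting it destabilise the host.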

So, original problem solved but we still need to address visibility and policy. Yawn.

Let’s start with policy. We push the encrypted backups to S3, which already has a pretty cool (OK, nerd-cool) lifecycle policy feature. This allows us to determine the retention policy for objects and discard them at the appropriate point. This is exactly what we need to implement our backup policy!

In order to get it to work with our backups we needed to name the backups with something the S3 lifecycle could use, either S3 tags or file paths. The backup container only date-stamped the files. This works fine when you want all files to be deleted after, say, 14 days (fine in a dev environment, for example). To meet our backup policy, we needed different retention periods: 9 days for “daily” backups, 40 days for weekly, and so on.

To achieve this, we put a little logic into the script:

# Start with a default "daily" path and update it to something more
# specific if the date logic matches
POLICY_CYCLE=daily

# Check if it's a Monday
WEEKDAY=$(date +%A)
case $WEEKDAY in
  "Monday")
    POLICY_CYCLE=weekly
    ;;
esac

# Check if it's the first of the month and assign the cycle accordingly
# (and override weekly, if Monday is the first of the month)
DAYMONTH=$(date +%d.%B)
case $DAYMONTH in
  "01.January")
    POLICY_CYCLE=yearly
    ;;
  "01.April"|"01.July"|"01.October")
    POLICY_CYCLE=quarterly
    ;;
  "01.February"|"01.March"|"01.May"|"01.June"|"01.August"|"01.September"|"01.November"|"01.December")
    POLICY_CYCLE=monthly
    ;;
esac

Sure, this could be formatted more neatly, but we now have a POLICY_CYCLE variable which allows us to label each backup file appropriately in order to apply a lifecycle policy to it.
We’ve decided to take the approach of using POLICY_CYCLE to define the path of the backup file.
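As a sketch of how that label ends up in the object key (the bucket name and file naming here are illustrative, not our real values), the path is just the policy cycle used as an S3 prefix:

```shell
# Build the S3 destination from the policy cycle, so the lifecycle
# rules can match on the "daily/", "weekly/", etc. prefix.
# Bucket and file names below are illustrative.
POLICY_CYCLE=daily
DB_NAME=mydb
BACKUP_FILE="${DB_NAME}-$(date +%Y-%m-%d).archive.gpg"
S3_PATH="s3://backup-bucket/${POLICY_CYCLE}/${BACKUP_FILE}"
echo "$S3_PATH"
# s3cmd put "$BACKUP_FILE" "$S3_PATH"   # the real script pushes here
```

Everything under `daily/` then picks up the daily retention rule, everything under `weekly/` the weekly rule, and so on.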

Now we can apply a simple lifecycle policy to the S3 bucket, and henceforth S3 will deal with the retention of our backups and also move them into cheaper long-term Glacier storage! Super cool! (Again, if you’re a nerd like me.)

{
  "Rules": [
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 9
      },
      "Filter": {
        "Prefix": "daily/"
      },
      "Expiration": {
        "Days": 9
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "ID": "Daily Retention"
    },
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 40
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "Filter": {
        "Prefix": "weekly/"
      },
      "Expiration": {
        "Days": 40
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "Transitions": [
        {
          "Days": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "Weekly Retention"
    },
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "Filter": {
        "Prefix": "monthly/"
      },
      "Expiration": {
        "Days": 365
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "Transitions": [
        {
          "Days": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "Monthly Retention"
    },
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 2190
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "Filter": {
        "Prefix": "quarterly/"
      },
      "Expiration": {
        "Days": 2190
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "Transitions": [
        {
          "Days": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "Quarterly Retention"
    },
    {
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 43800
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "Filter": {
        "Prefix": "yearly/"
      },
      "Expiration": {
        "Days": 43800
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      },
      "Transitions": [
        {
          "Days": 9,
          "StorageClass": "GLACIER"
        }
      ],
      "ID": "Yearly Retention"
    }
  ]
}
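For reference, a policy like this can be applied to the bucket with the AWS CLI (assuming the JSON above is saved as lifecycle.json; the bucket name here is illustrative):

```shell
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backup-bucket \
  --lifecycle-configuration file://lifecycle.json
```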

With the policy problem sorted, now we just had to figure out how to make the backup status more visible.

There were quite a few options we ruled out. We could have sent an email each time it ran, or each time the bucket was updated, but we don’t want to be notified every time something does what it should. Instead we wanted to know when it had failed, and to be able to refer to an at-a-glance dashboard for status.

We already use InfluxDB along with Grafana to chart a number of system metrics, and we were pretty sure we could do something with this to display our backup status.

The reasoning behind the approach was simple: emit an event to InfluxDB each time an instance backup completed successfully. If the backup failed, no event would be emitted. We then add a status page in Grafana showing all the events for instance backups in the past 24 hours. If there was no event for an instance, the status would be red.

Next we had to figure out how to achieve this!
It turns out sending events to InfluxDB is pretty simple, and we could achieve it with a simple curl request:

# Define the command
POST2INFLUX="curl -XPOST --data-binary @- https://influxhost.example.com/write?db=backups"
# Use it when the backup completes successfully:
echo -n "instance_backup_completed,instance=${REPLICA_SET} value=true" | $POST2INFLUX

InfluxDB is happy to accept a boolean value into a data series, as long as all values in the series are boolean. We only wanted to record if a status was true or false so this was perfect.

Now all we needed to do was build a page in Grafana to display this info. Fortunately Vonage have written a really neat status panel plugin for Grafana and InfluxDB. All we had to do now was figure out how to consume a boolean value to trigger a status change.

It took a while to get there, but it turns out you need to contort your logic into an odd combination of boolean, string and float values.
InfluxDB recognises the series as containing boolean values (we used true/false). Grafana will plot the data as 0 or 1 (I had to chart the data with a heatmap to figure this out). The status panel recognises the data as strings, but its threshold function treats the data as a float!

In the end we set a threshold of 0.5 for critical; i.e. if we had a “true” value in the past 24 hours, then the status was OK/green. If there was no value (i.e. 0), the status was Critical/red. We set the threshold at 0.5 to try to add some clarity around this logic.
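A minimal sketch of that threshold logic (this is our mental model of what the panel does with the data, not the plugin’s actual code):

```shell
# Map the count of "true" backup events seen in the past 24 hours to a
# panel state. Anything below the 0.5 threshold (i.e. zero events,
# since event counts are whole numbers) is critical.
backup_status() {
  if [ "$1" -lt 1 ]; then
    echo "Critical"
  else
    echo "OK"
  fi
}

backup_status 0
backup_status 1
```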

So now we have a simple, at-a-glance dashboard in Grafana to tell us if a backup failed, along with an alert (driven from the standard graph panel plugin, but with essentially the same logic). We’ve also extended this method so we can monitor successful snapshots of our Consul K/V store too, so it’s already proving to be pretty flexible.

Two backup stories completed in one week, and the best part? I actually quite enjoyed it all!

We’ll be aiming to open-source the MongoDB container shortly so others can benefit from this should they face the same kind of problem. The original container was pulled from a Tatum Labs repo on GitHub, but I’m no longer able to track it down, so my apologies. Please provide me with a link if this looks familiar!
