How to make Ubuntu backups using Duplicity and Google Cloud Storage

This is my first Medium article. I will mostly write about technical problems I encountered during my work as a developer at the Open State Foundation.

We recently had to change our backup strategy. We used to back up our servers to a server located on-site at our office using BackupPC. This became too expensive when our landlord changed to a new ISP, which drastically increased bandwidth and colo prices. We thus had to store our backups off-site. We thought about using a dedicated server, but Google Cloud Storage is cheaper given the amount of data that we want to back up (2 TB). This is mainly due to the prices Google specifically offers for backup/archival purposes via its Nearline storage ($0.01 per GB per month), though Amazon offers a similar service called Glacier. We could also select Coldline, but that requires you to store the data for at least 90 days before deletion, while Nearline requires just 30 days, which is enough for our backups.
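
As a rough back-of-the-envelope check of the storage price quoted above, the monthly bill for our 2 TB can be worked out with shell arithmetic. This is only a sketch: it assumes 1 TB = 1024 GB and ignores retrieval and network charges, which are billed separately.

```shell
# Rough monthly storage cost for Nearline, using the price quoted above.
# Assumption: 1 TB = 1024 GB; retrieval and network charges are not included.
size_gb=$(( 2 * 1024 ))      # 2 TB of backups
price_cents_per_gb=1         # $0.01 per GB per month
monthly_cents=$(( size_gb * price_cents_per_gb ))
printf 'Estimated storage cost: $%d.%02d per month\n' \
    $(( monthly_cents / 100 )) $(( monthly_cents % 100 ))
```

That works out to roughly $20 per month for storage alone.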

We want to encrypt our data before backing it up to Google’s servers and chose Duplicity as it is open source, is space efficient, has been under development since 2002 (though really since 2007) and supports Google Cloud Storage.

Creating backups

First, create a Google Cloud Platform (GCP) account, enable Google Cloud Storage (GCS) and create a new project. Then create a bucket and use a unique and nonsensitive name, as you share the namespace with everyone else using GCS! We select Nearline as the ‘Default storage class’. Take good care when selecting the location though! Some locations are more expensive than others, e.g., europe-west2 costs $0.016 per GB per month while europe-west1 costs $0.01. So determine on which continent you want to store your data and choose the cheapest location if the exact location doesn’t matter.
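
If you prefer the command line over the web console, the bucket can also be created with gsutil (its installation is covered further down). A minimal sketch, assuming the Nearline class and the europe-west1 location; create_bucket is a hypothetical helper name, not part of gsutil:

```shell
# Hypothetical helper: create a Nearline bucket in europe-west1.
# -c sets the default storage class, -l sets the location.
create_bucket() {
    gsutil mb -c nearline -l europe-west1 "gs://$1"
}

# Usage: create_bucket <bucket_name>
```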

To let Duplicity access GCS we are going to add a new member to the project (this has to be an existing Google account). Do this on the IAM & Admin page. We give this new member just two restrictive roles (i.e., permissions), Storage Object Creator and Storage Object Viewer. This makes sure that the files/backups can only be created from the server and cannot be overwritten or deleted (e.g., by a hacker who gained access to the server).

Now log in as the newly created member and go to ‘Storage’ > ‘Settings’ and select the ‘Interoperability’ tab. Click on ‘Enable interoperability access’ and create a new key.

Now SSH into your server and install Duplicity and boto:

sudo add-apt-repository ppa:duplicity-team/ppa
sudo apt-get update
sudo apt-get install -y duplicity
sudo pip install boto

Add environment variables with the Interoperability access keys you’ve just created and supply a secure passphrase (this is used to encrypt/decrypt your backups). Make sure though that the pam_env.so line in /etc/pam.d/sudo starts with session instead of auth; pam_env is the PAM module that reads /etc/environment. Then open /etc/environment and add the following lines:

GS_ACCESS_KEY_ID="<your_access_key>"
GS_SECRET_ACCESS_KEY="<your_secret>"
PASSPHRASE="<your_passphrase>"

Add the following daily cronjob, which will make a full backup of /home if none exists, perform daily incremental backups and make another full backup one month after the last full backup. Run sudo crontab -e and add the following line:

0 0 * * * sudo sh -c 'duplicity --archive-dir /root/.cache/duplicity/ --full-if-older-than 1M /home gs://<bucket_name> >> /var/log/duplicity.log'

Always use --archive-dir /root/.cache/duplicity/ in your Duplicity commands. This makes sure that Duplicity uses the same local cache even if another user runs Duplicity commands. If you don’t do this, Duplicity will rebuild the local cache by downloading files from the cloud storage which is something that you want to avoid as downloading from GCS Nearline incurs costs. Furthermore, you can exclude directories by adding the --exclude option. Finally, specifying a wrong bucket name results in the creation of a new bucket(!) on which the action will be applied! So always check that you specify the right bucket name!
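
The backup command from the cron job above could also live in a small wrapper script, which leaves room for a sanity check and an --exclude option. The following is only a sketch: check_backup_env and run_backup are hypothetical names, and the exclude pattern '/home/*/.cache' is just an example of a directory you might not want in your backups.

```shell
#!/bin/sh
# Hypothetical wrapper around the backup command from the cron job.

# Fail early if one of the variables Duplicity needs is missing or empty.
check_backup_env() {
    for name in GS_ACCESS_KEY_ID GS_SECRET_ACCESS_KEY PASSPHRASE; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
}

# Back up /home to the given bucket, skipping per-user cache directories.
# --exclude options have to come before the source directory.
run_backup() {
    bucket="$1"
    check_backup_env || return 1
    duplicity --archive-dir /root/.cache/duplicity/ \
        --full-if-older-than 1M \
        --exclude '/home/*/.cache' \
        /home "gs://$bucket"
}

# Usage (as root): run_backup <bucket_name>
```

Checking the variables first gives a clear message in the log when the credentials did not make it into the environment, rather than a failure somewhere inside Duplicity.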

Removing old backups

GCS Nearline requires you to store your data for at least 30 days (if you delete it earlier then you still pay for 30 days). Because we create a new full backup every month we will add a deletion policy for files older than 40 days. This gives us at least 10 days to notice a situation where we need to access our backups. We could use Duplicity for the removal of old backups, but we don’t want the server to be able to do this as a hacker could abuse this to mess with our backups.

Instead we use the lifecycle management offered by GCS. The following steps need to be executed on your own desktop/laptop, where you need to install gsutil (I recommend following the ‘Installing gsutil as part of the Google Cloud SDK’ section). Create a file called gcs-lifecycle-management.json with the following content:

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 40
        }
      }
    ]
  }
}

Set the lifecycle policy with this command:
gsutil lifecycle set gcs-lifecycle-management.json gs://<bucket_name>

You might want to set up logging and billing exports to get more insight into your backup process and costs. Regarding the billing export to BigQuery, have a look at this neat reusable dashboard.

Restoring backups

The following commands will show you how to restore just a single file or a whole backup. If you perform these actions on the server then skip the steps in this paragraph and continue to the Duplicity commands below. If you want to perform them on another machine (e.g., because the server is compromised and you need to restore to a different location) then follow these steps first. First install Duplicity and boto. Then create a file, let’s call it env.txt, where you add the following lines:

export GS_ACCESS_KEY_ID="<your_access_key>"
export GS_SECRET_ACCESS_KEY="<your_secret>"
export PASSPHRASE="<your_passphrase>"

Then run source env.txt and add -E after sudo in the restore commands below. You might want to shred the env.txt file once the restore is complete using shred -uz env.txt.
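
Because env.txt holds the passphrase in plain text, it may be worth creating it so that only your own user can read it. A minimal sketch, assuming a POSIX shell; the placeholders still have to be replaced with your real values:

```shell
# Create env.txt with owner-only permissions (the subshell keeps the
# stricter umask from leaking into the rest of the session).
(
    umask 077
    cat > env.txt <<'EOF'
export GS_ACCESS_KEY_ID="<your_access_key>"
export GS_SECRET_ACCESS_KEY="<your_secret>"
export PASSPHRASE="<your_passphrase>"
EOF
)
# Tighten the mode as well, in case env.txt already existed.
chmod 600 env.txt
```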

Show collection status
Use the following command to show the collection status (e.g., how many full and incremental backups are there and when were they made):

sudo duplicity collection-status --archive-dir /root/.cache/duplicity/ gs://<bucket_name>

How to restore a single file
First list all files to retrieve the path of the file you want to restore (unless you know the path of course):

sudo duplicity list-current-files --archive-dir /root/.cache/duplicity/ gs://<bucket_name>

Then use the path of the file you want to restore in the following command:

sudo duplicity restore --archive-dir /root/.cache/duplicity/ --file-to-restore <path> gs://<bucket_name> <filename_to_restore_to>

How to restore all files
Restoring all files is as easy as this:

sudo duplicity restore --archive-dir /root/.cache/duplicity/ gs://<bucket_name> <dir_to_restore_to>

The restore commands above let you restore from the latest backup. If you want to use an older backup listed by the collection-status command then add the --time option to specify that backup. See the TIME FORMATS section in man duplicity on how to format a time string.
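
As an illustration of the --time option, a full restore of an older state could be wrapped like this. It is a sketch only: restore_as_of is a hypothetical helper name, and the time strings in the usage lines are made-up examples (an interval such as 3D means "three days ago", a date such as 2017-06-01 means the state at that date).

```shell
# Hypothetical helper: restore the whole backup as it was at a given time.
# $1 = time string, $2 = bucket name, $3 = directory to restore to
restore_as_of() {
    sudo duplicity restore --archive-dir /root/.cache/duplicity/ \
        --time "$1" "gs://$2" "$3"
}

# Usage: restore_as_of 3D <bucket_name> <dir_to_restore_to>
#        restore_as_of 2017-06-01 <bucket_name> <dir_to_restore_to>
```

When restoring on another machine, remember to use sudo -E here as well, so the variables from env.txt are passed on.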

In extremely critical situations it might be useful to remove the lifecycle management policy that deletes files after 40 days. Simply use the same gsutil lifecycle set gcs-lifecycle-management.json gs://<bucket_name> command as shown above but this time enter an empty JSON object, {}, in gcs-lifecycle-management.json.
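
A possible way to script this, as a sketch: disable_lifecycle is a hypothetical helper, and it writes the empty policy to a separate file (gcs-lifecycle-disabled.json, a name chosen here for illustration) instead of overwriting the original one.

```shell
# Hypothetical helper: switch off lifecycle management for a bucket.
# The empty policy goes into its own file so the original 40-day policy
# file stays untouched and can be applied again later.
disable_lifecycle() {
    printf '{}\n' > gcs-lifecycle-disabled.json
    gsutil lifecycle set gcs-lifecycle-disabled.json "gs://$1"
}

# Usage: disable_lifecycle <bucket_name>
```

Once the situation is under control, running the original gsutil lifecycle set command with gcs-lifecycle-management.json puts the 40-day policy back in place.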

Obstacles and annoyances

GCS works great, but still needs some polishing. Trivial tasks can’t be performed directly via their website. E.g., you can’t sort the file listing on any of its columns, you can’t download multiple files at once and there is no select all button. It wasn’t trivial to figure out the correct roles to select and combine in order to get the behavior I wanted. The documentation is good but could be more detailed in places, with more examples and different use cases. Also, useful features are still in alpha/beta, e.g., custom roles, so I expect things to improve!

Making backups this way works smoothly for us. The one thing that still mystifies me is that the billing section of GCP doesn’t reflect the costs of our usage, as it is still at $0.00. I have been in contact with Google’s support for one week now and that has been an unsatisfying experience. While they are superduperfriendly, it seems they lack the skills to handle this issue efficiently. They repeatedly ask for the same information, and after I was finally upgraded to the Tech Team they started to suggest ignorant solutions as well (“If you have uploaded your data in the last two days the costs might not be reflected yet”… I have been in contact with you for one week already!).

I have had bad experiences with Google support for some of their non-paying consumer services, but I expected better support as a paying business customer. It would be really nice if Google redirected some of the billions in profit it makes each year to improving its support.

Update: After 1.5 weeks Google support came to the conclusion that usage costs of GCS Nearline are not supposed to be updated. Which is weird as there are several ways to view your current costs, but I could not find any disclaimers stating that costs of Nearline should not be displayed. Again, it would be nice if Google added this feature :).

Update 2: Finally, after more than three weeks, the billing issue has been resolved! I must say Google’s support team stuck with this problem and kept me updated on the progress every couple of days. Thanks! In the end I had to enable the Google Cloud Storage API, and together with some changes on their part I now finally see the costs that I’m incurring :D! Funnily enough, the costs were actually not registered until this fix, so I got 3–4 weeks of free storage XD. I’m glad I can now see the breakdown of my usage and the related costs though!