Automating and Scheduling a Linux Filesystem Sync with Google Cloud Storage

TL;DR: I learned how to migrate my legacy filesystem data to Google Cloud Storage using the Storage Transfer Service. Below is a script I wrote that will help you do the same.

Motivation: I have a bunch of content scattered on various servers and hosting services, as I’ve been putting stuff online for a few decades, well before cloud was generally available. The setup is typically a Linux filesystem, and the host either already runs an HTTP server or I can set one up. Files are still being added to and removed from some of these servers’ filesystems, and I want a way to get the result of this process (the filesystem contents) moved over to Google Cloud Storage. This gives the applications a migration path forward, as I can eventually migrate the services that modify the filesystem, put a CDN in front, etc.

It turns out that Google Cloud offers a Storage Transfer Service, and in addition to supporting imports from AWS S3 buckets, it can also load data from HTTP servers. You can read about the Transfer Service here: https://cloud.google.com/storage/transfer/ and, of relevance to this document, about the TsvHttpData format for describing objects to be imported over HTTP here: https://cloud.google.com/storage/transfer/create-url-list

Briefly, TsvHttpData requires a 3-column TSV file containing:

  1. URL of the object
  2. Size of the object, in bytes
  3. MD5 checksum of the object (Base64-encoded, per the URL-list spec)

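To make the format concrete, here is a small Python sketch (hypothetical helper names; the URL is made up) that produces one such TSV row for a local file. Per the spec, the MD5 is the Base64 encoding of the raw 16-byte digest, not the usual hex string, and the file itself starts with a `TsvHttpData-1.0` header line:

```python
import base64
import hashlib
import os

def tsv_line(url, path):
    """Return one TsvHttpData row: URL, size in bytes, Base64-encoded MD5."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        md5 = base64.b64encode(hashlib.md5(f.read()).digest()).decode("ascii")
    return f"{url}\t{size}\t{md5}"

def write_listing(rows, out_path):
    """Write a complete URL list: version header, then one row per object."""
    with open(out_path, "w") as f:
        f.write("TsvHttpData-1.0\n")
        f.write("\n".join(rows) + "\n")
```

For a file containing the five bytes `hello`, `tsv_line` would emit a size of `5` and the checksum `XUFAKrxLKna5cZ2REBfFkg==`.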
You can then use gcloud to create a scheduled job that retrieves the TSV file and imports any missing/updated objects.
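As a sketch of that gcloud step (the bucket name and TSV URL here are made up, and the flags may differ by gcloud version, so check `gcloud transfer jobs create --help`), a daily job pulling from a URL list might look like:

```shell
# Create a recurring transfer job that reads the URL list over HTTPS
# and imports the listed objects into a GCS bucket (names are illustrative).
gcloud transfer jobs create \
  https://my.org/THD.tsv gs://my-bucket \
  --schedule-repeats-every=1d
```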

I wrote a script that recurses through a directory tree, finds the files that have been modified since the last version of a TsvHttpData file was created, and updates that TsvHttpData file accordingly: adding newly created files, omitting files that no longer exist, and refreshing the MD5 checksums of modified files. It reduces I/O by not recalculating MD5 checksums for files that haven’t changed. Here’s the source in a GitHub gist:

The next thing to do with this script is put it on a cron schedule, along with a post-processing step that strips the last-modified timestamp I added as an extra (fourth) column to the TsvHttpData format (to save on I/O). The two commands look like:

# create the URL list, sizes, checksums, and last-modified timestamps
make_TsvHttpData.pl http://my.org/ ~/public_html/sync/ ~/THD.pre
# write the 3-column TsvHttpData file to a location visible over HTTP:
cut -f 1,2,3 ~/THD.pre > ~/public_html/THD.tsv
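Wired into cron, the two steps might look like this (the schedule and paths are illustrative; adjust them to your environment):

```shell
# Illustrative crontab entry: rebuild the listing nightly at 02:00,
# then publish the 3-column version only if the rebuild succeeded.
0 2 * * * $HOME/bin/make_TsvHttpData.pl http://my.org/ $HOME/public_html/sync/ $HOME/THD.pre && cut -f 1,2,3 $HOME/THD.pre > $HOME/public_html/THD.tsv
```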

Please let me know with a ❤ or a comment if you found this useful!