Automating and Scheduling a Linux Filesystem Sync with Google Cloud Storage
Motivation: I have a bunch of content scattered across various servers and hosting services, since I've been putting stuff online for a few decades, well before cloud storage was generally available. The typical setup is a Linux filesystem on a host that either already runs an HTTP server or can easily have one installed. Some of these filesystems still have files being added and removed, and I want the result of that process, i.e. the filesystem contents, mirrored to Google Cloud Storage. That gives the applications a migration path: eventually I can move the services that modify the filesystem, put a CDN in front, and so on.
It turns out that Google Cloud offers a Storage Transfer service which, in addition to importing from AWS S3 buckets, can load data from HTTP servers. You can read about the transfer service here: https://cloud.google.com/storage/transfer/ and, most relevant to this document, about the TsvHttpData format for describing objects to be imported over HTTP here: https://cloud.google.com/storage/transfer/create-url-list
Briefly, a TsvHttpData file starts with the header line TsvHttpData-1.0, followed by one tab-separated row per object with three columns:
- URL of the object
- Size of the object, in bytes
- Base64-encoded MD5 checksum of the object
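For illustration, here's how one such row can be computed in Python (the actual script discussed below is Perl; the function name and URL here are my own placeholders):

```python
import base64
import hashlib
import os

def tsv_row(url, path):
    """Build one TsvHttpData row: URL, byte size, Base64-encoded MD5."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    # Hash in chunks so large files don't need to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = base64.b64encode(md5.digest()).decode("ascii")
    return "\t".join([url, str(size), digest])
```

Note that the checksum is the Base64 encoding of the raw 16-byte MD5 digest, not of its hex representation.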
You can then use gcloud to create a scheduled job that retrieves the TSV file and imports any missing/updated objects.
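I won't reproduce an exact invocation here, but with the current gcloud transfer commands the job creation looks roughly like the sketch below. The bucket name and list URL are placeholders, and the flags reflect my reading of the docs, so double-check them before use:

```shell
# Create a recurring transfer job whose source is the public URL of the
# TsvHttpData list and whose destination is a Cloud Storage bucket.
gcloud transfer jobs create \
  https://my.org/THD.tsv gs://my-destination-bucket \
  --schedule-repeats-every=1d
```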
I wrote a script that recurses through a directory tree, finds the files that have been modified since the last version of the TsvHttpData file was created, and updates that file accordingly: it adds newly created files, drops files that no longer exist, and refreshes the MD5 checksums of modified files. To reduce I/O, it avoids recalculating MD5 checksums for files that haven't changed. Here's the source in a GitHub gist:
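The core idea can be sketched in Python (the real script is Perl; the function names, the tuple layout, and the use of size plus mtime as the change test are my own choices for this sketch):

```python
import base64
import hashlib
import os

def b64_md5(path):
    """Base64-encoded MD5 digest of a file, computed in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")

def update_listing(base_url, root, previous):
    """Walk `root` and return {url: (size, md5, mtime)}.

    `base_url` must end in "/". `previous` is the same mapping from the
    last run; entries whose size and mtime are unchanged reuse the cached
    MD5 instead of re-reading the file. Files deleted from disk simply
    never get re-added, so they drop out of the result.
    """
    listing = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            url = base_url + rel.replace(os.sep, "/")
            st = os.stat(path)
            prev = previous.get(url)
            if prev and prev[0] == st.st_size and prev[2] == int(st.st_mtime):
                listing[url] = prev  # unchanged: skip the expensive hash
            else:
                listing[url] = (st.st_size, b64_md5(path), int(st.st_mtime))
    return listing
```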
The next step is to run the script from cron, along with a post-processing step that strips the last-modified timestamp I added as an extra fourth column to the TsvHttpData format (again, to save on I/O). The two commands look like:
# create the URL list, sizes, checksums, and last-modified timestamps
make_TsvHttpData.pl http://my.org/ ~/public_html/sync/ ~/THD.pre
# write the 3-column TsvHttpData file to a location visible over HTTP
cat ~/THD.pre | cut -f 1,2,3 > ~/public_html/THD.tsv
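Wired into cron, the two steps above might look like the entry below (the 02:30 schedule and paths are just an example; ~ is spelled $HOME because cron doesn't do tilde expansion):

```shell
# m h dom mon dow  command
# Rebuild the 4-column listing nightly, then publish the 3-column
# TsvHttpData file where the HTTP server can see it.
30 2 * * * make_TsvHttpData.pl http://my.org/ $HOME/public_html/sync/ $HOME/THD.pre && cut -f 1,2,3 $HOME/THD.pre > $HOME/public_html/THD.tsv
```

One caveat: cron's default PATH is minimal, so either give the script's full path or set PATH at the top of the crontab.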
Please let me know with a ❤ or a comment if you found this useful!