Scheduled Mirror/Sync SFTP to GCS

Context

Applications often need loosely coupled ways to share files. Sometimes data files must be shared across companies, or across teams within a company, that do not share storage infrastructure. SFTP has been a common interface in such cases.

Why Sync files to Google Cloud Storage?

Applications on GCP can read directly from an SFTP server, but in several cases syncing the files to GCS is the preferred approach. Reasons to consider setting up a sync:

  • Consistent files source for application
  • Fine-grained access control lists (ACLs)
  • Easy lifecycle management of files
  • Mitigate network flakiness

Solution overview

  • gcsfuse to mount the bucket on a VM
  • A scheduled lftp command to mirror files periodically
  • No coding needed; only existing tools are used.

Let’s get started

If you have not used Google Cloud, you can head on over to https://console.cloud.google.com and register for a free account starting with $300 in credit.

1. Open Cloud Shell.

2. Create a new bucket with a unique name.

gcs_bucket="sftp_bucket_$(python3 -c 'import uuid; print(uuid.uuid4())')"
gsutil mb "gs://$gcs_bucket"
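
If Python is not available in your shell, the same unique name can be built with standard tools. The following is a minimal sketch, not part of gsutil: make_bucket_name is a hypothetical helper, and it assumes either uuidgen or the Linux file /proc/sys/kernel/random/uuid is present.

```shell
# Hypothetical helper: prints "<prefix><uuid>" in lowercase.
# The default prefix matches the one used in step 2.
make_bucket_name() {
  prefix="${1:-sftp_bucket_}"
  # Prefer uuidgen; fall back to the Linux kernel's UUID source.
  if command -v uuidgen >/dev/null 2>&1; then
    id="$(uuidgen)"
  else
    id="$(cat /proc/sys/kernel/random/uuid)"
  fi
  # Bucket names must be lowercase; some uuidgen builds print uppercase.
  printf '%s%s\n' "$prefix" "$(printf '%s' "$id" | tr '[:upper:]' '[:lower:]')"
}

gcs_bucket="$(make_bucket_name)"
echo "Bucket name: $gcs_bucket"
# Then create the bucket as in step 2:
#   gsutil mb "gs://$gcs_bucket"
```

The default prefix plus a 36-character UUID gives a 48-character name, which stays under the 63-character limit for bucket names.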

3. Create a new VM instance

The VM instance can be created either through the UI or with a gcloud command.
In the UI, set Access scopes -> Storage -> Full.
Alternatively, run the command below from Cloud Shell.

gcloud compute instances create sftp-test \
  --zone=us-central1-c \
  --machine-type=n1-standard-1 \
  --subnet=default \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --image-family=ubuntu-1604-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=10GB \
  --boot-disk-type=pd-standard \
  --boot-disk-device-name=sftp-test

The above command creates the VM in the default network and subnet.

4. SSH into VM

Run the command below in Cloud Shell to SSH into the VM.

gcloud compute ssh sftp-test --zone=us-central1-c

5. Install lftp on VM

The lftp utility is a Swiss Army knife for file transfers; it works with many protocols, including SFTP.

sudo apt-get install lftp

6. Install gcsfuse on VM

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
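
Before moving on, it can help to confirm that both tools from steps 5 and 6 are on the PATH. This is a small sketch; have_cmd is a hypothetical helper name, and the check only tests that the commands exist, not that they work.

```shell
# Hypothetical helper: succeeds if the named command is on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each tool installed in steps 5 and 6.
for tool in lftp gcsfuse; do
  if have_cmd "$tool"; then
    echo "ok: $tool found at $(command -v "$tool")"
  else
    echo "missing: $tool" >&2
  fi
done
```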

7. Mount bucket on VM

mkdir ~/gcsBucket
gcsfuse <replace-by-bucket-name> ~/gcsBucket

Replace <replace-by-bucket-name> with the name of the bucket created in step 2. Do not include the gs:// prefix.
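
To confirm the mount succeeded, one option is the mountpoint utility from util-linux. The sketch below assumes the ~/gcsBucket directory from this step; is_mounted and MOUNT_DIR are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: succeeds only if the directory is an active mount point.
is_mounted() {
  mountpoint -q "$1"
}

# MOUNT_DIR is an assumed override; the default is the directory from step 7.
mount_dir="${MOUNT_DIR:-$HOME/gcsBucket}"
if is_mounted "$mount_dir"; then
  echo "$mount_dir is mounted"
else
  echo "$mount_dir is NOT mounted; run gcsfuse again before mirroring" >&2
fi
```

If the directory is not mounted, anything written to it lands on the VM's boot disk instead of the bucket, so this check is worth running after a reboot.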

8. Schedule lftp command

This assumes the SFTP server authenticates with a username and password. Open the crontab for editing:

crontab -e

This opens a file in which cron jobs are scheduled. Add the following line:

0 0 * * * lftp sftp://<username>:<password>@<sftp-server-ip/domain>  -e "set sftp:auto-confirm yes;  mirror --verbose /path/on/sftp ~/gcsBucket ;  bye"

The above job runs at 00:00 every day (based on the VM time zone, usually UTC).
Replace <username> with the SFTP username.
Replace <password> with the SFTP password.
Replace <sftp-server-ip/domain> with the SFTP server's IP address or domain name.
Replace /path/on/sftp with the path on the SFTP server.
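
As an alternative to a single long crontab line, the same lftp commands can live in a small wrapper script, which keeps the password out of the crontab and skips the run when the bucket is not mounted. This is only a sketch under assumptions: the script name, the ~/.sftp_mirror.env file, and the variables SFTP_USER, SFTP_PASS, SFTP_HOST, REMOTE_DIR and LOCAL_DIR are all hypothetical and not part of lftp.

```shell
#!/bin/sh
# Hypothetical wrapper, for example saved as ~/sftp_mirror.sh.
# Assumed settings file defining SFTP_USER, SFTP_PASS, SFTP_HOST,
# REMOTE_DIR and LOCAL_DIR (keep it readable only by your user).
env_file="$HOME/.sftp_mirror.env"
if [ -f "$env_file" ]; then
  . "$env_file"
fi

# Build the command string passed to lftp -e; same commands as the cron entry.
build_lftp_script() {
  printf 'set sftp:auto-confirm yes; mirror --verbose %s %s; bye' "$1" "$2"
}

run_mirror() {
  # Refuse to run if the bucket is not mounted, so files do not fill the VM disk.
  if ! mountpoint -q "$LOCAL_DIR"; then
    echo "not mounted: $LOCAL_DIR" >&2
    return 1
  fi
  lftp -u "$SFTP_USER,$SFTP_PASS" \
    -e "$(build_lftp_script "$REMOTE_DIR" "$LOCAL_DIR")" \
    "sftp://$SFTP_HOST"
}

# Only mirror when called with --run, so sourcing the file has no side effects.
if [ "${1:-}" = "--run" ]; then
  run_mirror
fi
```

With the script made executable, the crontab entry could then become something like the line below, where flock -n (from util-linux) skips a run if the previous one is still in progress and the output is appended to an assumed log file:

0 0 * * * flock -n /tmp/sftp_mirror.lock "$HOME/sftp_mirror.sh" --run >> "$HOME/sftp_mirror.log" 2>&1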

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.