Scheduled Mirror/Sync SFTP to GCS
Context
Applications often require loosely coupled file-sharing systems. Sometimes data files need to be shared across companies, or across teams within a company, that do not share storage infrastructure. SFTP has been a common interface in such cases.
Why sync files to Google Cloud Storage?
Applications on GCP can read directly from an SFTP server, but in several cases syncing the files to GCS is the preferred approach. Reasons to consider setting up a sync:
- Consistent file source for applications
- Fine-grained access control lists (ACLs)
- Easy lifecycle management of files
- Mitigate network flakiness
Solution overview
- gcsfuse to mount the bucket on a VM
- A scheduled lftp command to mirror files periodically
- No coding needed; only existing tools are used.
Let’s get started
If you have not used Google Cloud, you can head on over to https://console.cloud.google.com and register for a free account starting with $300 in credit.
1. Open Cloud Shell.
2. Create a new bucket with a unique name.
gcs_bucket="sftp_bucket_$(python3 -c 'import uuid; print(uuid.uuid4())')"
gsutil mb gs://$gcs_bucket
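Bucket creation fails if the name breaks Cloud Storage naming rules, so it can help to check the generated name before calling gsutil mb. The following is a minimal sketch, assuming only the basic rules: 3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit. The function name is_valid_bucket_name is hypothetical and not part of gsutil, and the check does not cover every rule (for example, reserved prefixes or dotted names).

```shell
# Hypothetical helper: returns 0 if the name passes a simplified check
# of Cloud Storage bucket naming rules, 1 otherwise.
is_valid_bucket_name() {
  local name=$1
  local len=${#name}
  local LC_ALL=C   # make [a-z] match ASCII lowercase only

  # Length must be between 3 and 63 characters.
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1

  case $name in
    # Any character outside lowercase letters, digits, '.', '_', '-'.
    *[!a-z0-9._-]*) return 1 ;;
    # Must start and end with a letter or digit.
    [!a-z0-9]*|*[!a-z0-9]) return 1 ;;
  esac
  return 0
}
```

One possible use, with the variable from step 2: is_valid_bucket_name "$gcs_bucket" && gsutil mb gs://$gcs_bucket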
3. Create a new VM instance
The VM instance can be created either in the UI or with a gcloud command, as shown below.
If you use the UI, set Access scopes -> Storage -> Full.
Alternatively, run the command below from Cloud Shell.
gcloud compute instances create sftp-test --zone=us-central1-c --machine-type=n1-standard-1 --subnet=default --scopes=https://www.googleapis.com/auth/cloud-platform --image-family=ubuntu-1604-lts --image-project=ubuntu-os-cloud --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=sftp-test
The command above creates the VM in the default network and subnet.
4. SSH into VM
Run the command below in Cloud Shell to SSH into the VM.
gcloud compute ssh sftp-test
5. Install lftp on VM
The lftp utility is a Swiss Army knife for file transfers; it works with many protocols, including SFTP.
sudo apt-get install lftp
6. Install gcsfuse on VM
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
7. Mount bucket on VM
mkdir ~/gcsBucket
gcsfuse <replace-by-bucket-name> ~/gcsBucket
Replace <replace-by-bucket-name> with the name of the bucket created in step 2. Do not include the gs:// prefix.
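If gcsfuse fails to mount, or the VM reboots and the mount is not restored, ~/gcsBucket is just an empty local directory and anything written there stays on the VM disk. A small check like the one below can confirm the mount before files are written. It is a sketch, assuming a Linux VM where /proc/mounts lists active mounts; is_mounted is a hypothetical helper, not a gcsfuse command.

```shell
# Hypothetical helper: is_mounted DIR [MOUNTS_FILE]
# Returns 0 if DIR appears as a mount point in MOUNTS_FILE
# (default: /proc/mounts), non-zero otherwise.
# Note: paths containing spaces are escaped in /proc/mounts and
# are not handled by this simple comparison.
is_mounted() {
  local dir=$1
  local mounts_file=${2:-/proc/mounts}
  # The second whitespace-separated field of each line is the mount point.
  awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts_file"
}

# Example (not executed here):
#   is_mounted "$HOME/gcsBucket" || echo "bucket is not mounted" >&2
```

On most Linux systems, mountpoint -q ~/gcsBucket should give a similar answer without a custom function.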
8. Schedule lftp command
This step assumes the SFTP server authenticates with a username and password. Open the crontab for editing:
crontab -e
This opens a file in which cron jobs are scheduled. Add the line below to it.
0 0 * * * lftp sftp://<username>:<password>@<sftp-server-ip/domain> -e "set sftp:auto-confirm yes; mirror --verbose /path/on/sftp ~/gcsBucket ; bye"
The job above runs at 00:00 every day (based on the VM time zone, usually UTC).
Replace <username> with the SFTP username.
Replace <password> with the SFTP password.
Replace <sftp-server-ip/domain> with the SFTP server's domain name or IP address.
Replace /path/on/sftp with the path on the SFTP server.
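If the schedule is made more frequent, or a mirror takes longer than expected, two lftp runs could overlap. One way to guard against that is to wrap the command in a lock. The sketch below is illustrative only: the function names mirror_commands and run_exclusive, the lock file path, and the exit code 75 are assumptions, and it relies on the flock utility from util-linux being installed on the VM. mirror_commands builds the same lftp command string as the crontab entry above.

```shell
# Build the lftp command string used by the cron job.
# Arguments: remote path on the SFTP server, local target directory.
mirror_commands() {
  local remote_path=$1
  local local_dir=$2
  printf 'set sftp:auto-confirm yes; mirror --verbose %s %s; bye' \
    "$remote_path" "$local_dir"
}

# Run a command only if no other run currently holds the lock file.
# Arguments: lock file path, then the command and its arguments.
# Exits the subshell with status 75 (arbitrary choice) when the lock is busy.
run_exclusive() {
  local lock_file=$1
  shift
  (
    flock -n 9 || { echo "previous run still active, skipping" >&2; exit 75; }
    "$@"
  ) 9>"$lock_file"
}

# Example (not executed here; placeholders as in the crontab entry):
#   run_exclusive /tmp/sftp_mirror.lock \
#     lftp "sftp://<username>:<password>@<sftp-server-ip/domain>" \
#     -e "$(mirror_commands /path/on/sftp "$HOME/gcsBucket")"
```

Saved in a script on the VM, these functions would let the crontab line call the script instead of lftp directly, which also keeps the long command out of the crontab file.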