Azure Data Factory — Loading files into Google Cloud Storage and Amazon S3
Using Microsoft Azure's Data Factory you can pull data from Amazon S3 and Google Cloud Storage into your data pipeline (ETL workflow). However, Data Factory does not let you load (put/upload) files back into these platforms at the end of your extract, transform, load cycle.
To work around this and load files back into these platforms, you can use Data Factory's SFTP connector together with Couchdrop. Couchdrop is a cloud SFTP/FTP conduit that acts as a fabric on top of cloud storage and offers webhooks, an API, web portal uploads and more. It supports Google Cloud Storage, Amazon S3, SharePoint, Dropbox and anything in between. Using Couchdrop with Azure Data Factory you can pull data from any supported cloud storage platform, transform it and then load it to the same or a different platform, all over SFTP.
ETL Use Case Examples:
- As a vendor, have your clients send you files via SFTP (or another means such as the web portal); a webhook event on upload can then kick off your automated ETL process.
- As a client, you can expose your data to your vendor, who can then process the uploaded file when the webhook event fires.
The Steps:
- Step 1. Configure storage in Couchdrop
- Step 2. Configure user(s)
- Step 3. Configure webhooks (optional)
- Step 4. Configure Couchdrop’s SFTP in Data Factory
Configuring Couchdrop is straightforward and only takes a couple of steps. You can create users who are locked to specific buckets and limited to specific file operations (upload only, download only, read/write, etc.). You can also configure webhooks for upload/download events on specific folders, which lets you trigger different workflows based on the folder and user involved. Couchdrop also offers an API to help onboard users programmatically.
Step 1. Configure storage in Couchdrop
In the Couchdrop portal, navigate to the storage section and configure a new storage connector. Below we are configuring Google Cloud Storage.
Step 2. Configure user(s)
For this example we have created a user (gcsuser) who can only upload data to the GCS bucket we connected above. In practice this could be an external party uploading data. You could then create another user with read/write access that pulls the data down when the webhook fires for gcsuser's upload and extracts it into your workflow.
Step 3. Configure webhooks (optional)
In Couchdrop's SFTP virtual file system, open the folder you want to watch, select the event that should trigger the webhook, enter the destination URL and save.
Sample Couchdrop SFTP webhook output:
{
  "account": "demouser",
  "filename": "/demo/customers/bobsburgers/burgersaucereceipe.txt",
  "authenticated_user": "demo1",
  "storage_engine": "hosted",
  "storage_engine_id": "7e88f06d-3aa5-45d9-97c2-3c5fa28ca0b4",
  "event_type": "upload",
  "ip_address": "123.253.47.202",
  "success": true,
  "total_size": 40,
  "additional_info": "",
  "system": "sftp",
  "transaction_id": "836851c7-f745-4476-8a0a-b4df14c4cd0e",
  "region": "us1",
  "text": "File /demo/customers/bobsburgers/burgersaucereceipe.txt uploaded by demo1 via sftp from 123.253.47.202"
}
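If you are consuming these webhooks yourself, a small HTTP endpoint can filter for successful upload events and kick off the next stage of your pipeline. The sketch below is a minimal example using Flask; trigger_pipeline() is a hypothetical placeholder for however you start your Data Factory run (for example via its REST API), and the folder prefix is an example value. The field names match the sample payload above.

# Minimal webhook receiver sketch (assumes Flask is installed).
# trigger_pipeline() is a hypothetical placeholder for however you start
# your Data Factory pipeline run, e.g. via its REST API or SDK.
from flask import Flask, request, jsonify

app = Flask(__name__)

WATCHED_PREFIX = "/demo/customers/"  # folder to react to (example value)


def trigger_pipeline(path):
    # Placeholder: create a Data Factory pipeline run scoped to this file.
    print(f"Would trigger ETL pipeline for {path}")


@app.route("/couchdrop-webhook", methods=["POST"])
def couchdrop_webhook():
    event = request.get_json(force=True)

    # Only react to successful SFTP uploads under the watched folder.
    if (
        event.get("event_type") == "upload"
        and event.get("success")
        and event.get("filename", "").startswith(WATCHED_PREFIX)
    ):
        trigger_pipeline(event["filename"])

    return jsonify({"ok": True})


if __name__ == "__main__":
    app.run(port=8080)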
Step 4. Configure Couchdrop’s SFTP in Data Factory
Because Couchdrop works as a standard SFTP server, you simply need the hostname (sftp.couchdrop.io) and your Couchdrop user's credentials when setting up the SFTP connection in Data Factory.
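Before pointing Data Factory at it, you can sanity-check the credentials from any standard SFTP client. The snippet below is a quick sketch using the paramiko library in Python; the username, password and remote folder are placeholders for the Couchdrop user and storage path you configured earlier.

# Quick connectivity check against Couchdrop's SFTP endpoint using paramiko.
# The username, password and paths below are placeholders for your own setup.
import paramiko

HOST = "sftp.couchdrop.io"
USERNAME = "gcsuser"        # e.g. the upload-only user created in step 2
PASSWORD = "your-password"  # placeholder

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

# Upload a test file into the folder backed by your cloud storage connector.
sftp.put("local-test.csv", "/gcs-bucket/local-test.csv")
print(sftp.listdir("/gcs-bucket"))

sftp.close()
transport.close()

If the test file shows up in your bucket, the same hostname and credentials can be dropped straight into Data Factory's SFTP connector.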
To get up and running with Couchdrop’s cloud SFTP server and integrate it into your ETL process, navigate to Couchdrop’s website to sign up or learn more.
On a final note, Couchdrop is simply a conduit: it does not store data, nor does it 'sync' data to storage platforms. Transfers are processed in memory and streamed directly to your endpoint, and that memory is overwritten once the transfer completes.