Building Data Pipeline with Storage Trigger Using SFTP Access to Azure Data Lake Gen2

Liaquath Chowdhury
4 min read · Aug 13, 2022


A storage trigger is one of the most common scenarios when it comes to automating a pipeline in Azure Data Factory and Azure Synapse Analytics solutions. It connects activities in the data pipeline and covers events such as blob created and blob deleted. But when it comes to creating blobs, until recently Microsoft did not offer a fully managed SFTP service in Azure. Access to Azure Blob Storage was limited to the Azure Blob service REST API, Azure SDKs, and tools such as AzCopy, unless a custom solution was in place. Such custom solutions involve creating virtual machines in Azure to host an SFTP server, and then updating, patching, managing, scaling, and maintaining a complex architecture.

The good news is that Blob Storage now supports the SSH File Transfer Protocol (SFTP). This provides the ability to securely connect to Blob Storage accounts via an SFTP endpoint for file transfer and file management. In this project, we will use SFTP to securely connect and create a blob, which will then be used as an event to trigger a data pipeline in Azure Data Factory. As the focus here is mainly on SFTP access to Azure Blob Storage, a simple pipeline will be created just for the purpose of demonstration.

Setting up the storage account: To enable the SFTP service, we first need to enable the hierarchical namespace on our storage account. The hierarchical namespace organizes objects (files) into a hierarchy of directories and subdirectories. It scales linearly and doesn’t degrade data capacity or performance. We can then enable SFTP for the account.
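
For reference, the same setup can be scripted with the Azure CLI. This is a minimal sketch, assuming a recent CLI version; the account name, resource group, and location are placeholders:

# create a storage account with the hierarchical namespace (Data Lake Gen2) enabled
az storage account create \
  --name mystorageaccount \
  --resource-group my-resource-group \
  --location westeurope \
  --sku Standard_LRS \
  --enable-hierarchical-namespace true

# enable the SFTP endpoint on the account
az storage account update \
  --name mystorageaccount \
  --resource-group my-resource-group \
  --enable-sftp true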

Create a user for the service: Once the storage account is ready, we can create users. Azure Blob Storage doesn’t support Azure Active Directory for this service yet; instead, it introduces the concept of a local user. So let’s create a local user for this storage account to grant access. Local users support password-based and public-private key pair authentication, and access is limited to the container level for now. We will create a local user with password authentication and give it write access so it can create blobs.
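
As a sketch of how such a user could be created with the Azure CLI (the user name, container, and home directory below are hypothetical, reusing the placeholder names from the previous step):

# create a local user with create/write access to a single container,
# authenticating with an SSH password
az storage account local-user create \
  --account-name mystorageaccount \
  --resource-group my-resource-group \
  -n sftpuser \
  --home-directory mycontainer/uploads \
  --permission-scope permissions=rcw service=blob resource-name=mycontainer \
  --has-ssh-password true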

In the portal, the same is done from “SFTP (preview)” in the Settings menu of the storage account. Click “Add local user”. We need to provide a name and an authentication method in the “Username + Authentication” tab. Next, in “Container permissions”, select the container and the permissions, then select a “container/directory” as the home directory. This is where the user will have access. Click Add. Once done, we can get the username and host name (connection string) to use in our SFTP connection.
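
If the user was created from the CLI instead, the SSH password has to be generated explicitly, and the connection string follows the pattern <storage-account>.<local-user>@<storage-account>.blob.core.windows.net. A minimal sketch, again with hypothetical names; the password is only shown once, so store it securely:

# generate (or rotate) the SSH password for the local user
az storage account local-user regenerate-password \
  --account-name mystorageaccount \
  --resource-group my-resource-group \
  -n sftpuser \
  --query sshPassword --output tsv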

Prepare the Data Factory pipeline trigger: Now it’s time to add the event trigger to our pipeline. Go to the pipeline in Data Factory, then click “New/Edit” in the “Trigger” menu. For the type, choose “BlobEventsTrigger”. When we select the subscription, the storage account name and container name will pop up. Importantly, set “Blob path begins with” to the directory name inside the container, and “Blob path ends with” to the file extension, .zip in our case. For the event, choose “Blob created”, as our aim is to run the pipeline once a zipped file is uploaded into the container.
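
The same trigger can also be created programmatically. Below is a rough sketch using the Azure CLI datafactory extension; the factory name, pipeline name, container, directory, and subscription ID are placeholders, and the JSON approximates what the “New/Edit” dialog generates behind the scenes:

# define the trigger as JSON (all names and paths below are placeholders)
cat > trigger.json <<'EOF'
{
  "type": "BlobEventsTrigger",
  "typeProperties": {
    "blobPathBeginsWith": "/mycontainer/blobs/uploads/",
    "blobPathEndsWith": ".zip",
    "ignoreEmptyBlobs": true,
    "events": [ "Microsoft.Storage.BlobCreated" ],
    "scope": "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Storage/storageAccounts/mystorageaccount"
  },
  "pipelines": [
    { "pipelineReference": { "referenceName": "MyPipeline", "type": "PipelineReference" } }
  ]
}
EOF

# create the trigger, then start it so it begins listening for blob events
az datafactory trigger create \
  --resource-group my-resource-group \
  --factory-name my-data-factory \
  --name ZipUploadedTrigger \
  --properties @trigger.json
az datafactory trigger start \
  --resource-group my-resource-group \
  --factory-name my-data-factory \
  --name ZipUploadedTrigger

Note that “Blob path begins with” is expressed as /<container>/blobs/<directory> in the trigger definition, while the portal splits it into separate fields.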

All done! The pipeline is now ready to be run by the trigger. To upload files into the blob storage, the local user can use any tool such as FileZilla or the plain command line. Once a file is uploaded, the pipeline will run as expected. A sample session using the command line:

# connect to the Blob Storage SFTP endpoint
# username format: <storage-account>.<local-user>, host: <storage-account>.blob.core.windows.net
sftp mystorageaccount.sftpuser@mystorageaccount.blob.core.windows.net
Connected to mystorageaccount.sftpuser@mystorageaccount.blob.core.windows.net.
sftp> put filename.zip
Uploading filename.zip to /mycontainer/uploads/filename.zip
filename.zip 100% 12MB 1.7MB/s 00:06

We’ve established the connection to Azure Blob Storage via the SFTP protocol. This is a great feature in terms of connectivity and collaboration with third parties and multiple teams in a large organization, where secure access is key to many data pipeline solutions, avoiding manual or complex file transmission processes.
