File Transfer from Azure Blob Storage to AWS S3: Step-by-Step Guide
Nowadays, many organizations follow a multi-cloud approach, a strategy that leverages cloud computing services from at least two different providers. This gives organizations the flexibility to optimize performance, control costs, and avoid vendor lock-in. In such multi-cloud scenarios, we often reach a point where data must be transferred from one cloud platform to another.
Here, we will discuss one such approach: transferring data from Azure Blob Storage to an AWS S3 bucket.
At a high level, these are the steps:
- Create a Python script to transfer the file from Blob Storage to S3 (the script is given at the end of this post).
- Create an Azure Batch account and configure the Batch pool.
- Create an ADF pipeline with a Custom activity, and connect it to Azure Batch to run the data transfer script.
In more detail, the steps to follow are:
- Create an Azure Batch account by providing the necessary details.
- Go to the resource and, under the Pools tab, create a new pool by providing the required details (VM size, node OS image, and so on).
- In the next step, make sure to specify at least 2 target dedicated nodes.
- Next, configure the start task. The following link explains how to install Python on a Windows node using a start task: https://techcommunity.microsoft.com/t5/azure-paas-blog/install-python-on-a-windows-node-using-a-start-task-with-azure/ba-p/2341854
- After this, submit the details to create the pool.
- Once the nodes are up and running and the pool's allocation state becomes steady, the Azure Batch pool is ready to use.
- From the Keys section of the Azure Batch account, copy the primary access key.
- Now, in ADF, create a Custom activity and provide the Batch account details. The access key and endpoint are available under the Keys tab of the Azure Batch account.
- Place the Python script for transferring the file to S3 inside a container in Azure ADLS or Blob Storage, and specify that linked service and path in the Custom activity's Settings tab. Also provide the command to trigger the Python script (for example, python <script_name>.py) in the Command field.
- Note that the data copy script also requires additional Python packages for Azure Storage and AWS (boto3). In that case, modify the start task command as shown below:
cmd /c "python-3.11.4-amd64.exe /quiet InstallAllUsers=1 PrependPath=1 Include_test=0 && pip install azure-storage-blob && pip install s3fs && pip install pandas && pip install boto3"
- Also, if the above start task modification is made to an existing pool, make sure the compute nodes are rebooted for the change to take effect.
- Job run status can be tracked from the Jobs tab of the Azure Batch account.
- Error logs, if any, will be available in the task's stderr.txt file, and standard output (such as print statements) will be available in stdout.txt.
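Since a missing package only surfaces when the transfer script itself fails, a small pre-flight check can be run as a short Batch task first. Below is a sketch; the package list is taken from the start task command above:

```python
# Pre-flight check: verify the start task installed every package the
# transfer script imports. Run this as a short Batch task before the job.
import importlib

def find_missing(modules):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

required = ["azure.storage.blob", "pandas", "boto3", "s3fs"]
print("missing packages:", find_missing(required) or "none")
```

If anything is listed as missing, fix the start task command and reboot the nodes before submitting the transfer job.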
Python script
from azure.storage.blob import ContainerClient
import boto3

# --- Read the source file from Azure Blob Storage ---
print("Reading data from Azure Blob...")
src_container = "container_name"
src_file = "source_file_name"
connect_str = "DefaultEndpointsProtocol=https;AccountName=<account_name>;AccountKey=<*********>==;EndpointSuffix=core.windows.net"

container_client = ContainerClient.from_connection_string(conn_str=connect_str, container_name=src_container)
try:
    blob_data = container_client.download_blob(src_file)
    src_blob = blob_data.readall()
except Exception as e:
    print("Exception occurred while reading data from blob:", e)
    raise Exception("Data read exception") from e
print("Data read from Azure Blob successfully...")

# --- Write the file to the AWS S3 bucket ---
print("Writing data to AWS S3...")
tgt_bucket = "target_bucket_name"
tgt_directory = "target_directory_name"
tgt_file = "target_file_name"
try:
    access_key = "***********"
    secret_key = "*************************"
    aws_object_key = tgt_directory + "/" + tgt_file
    s3 = boto3.client("s3", aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    s3.put_object(Body=src_blob, Bucket=tgt_bucket, Key=aws_object_key)
except Exception as e:
    print("Exception occurred while writing data to bucket:", e)
    raise Exception("Data write exception") from e
print("Data written to AWS S3 successfully...")
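The script above hardcodes the connection string and AWS keys for clarity; in practice they are better supplied through environment variables (set on the Batch pool, the start task, or the ADF Custom activity). A minimal sketch follows, with hypothetical variable names; note that boto3 also picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment automatically:

```python
import os

def get_required_env(name: str) -> str:
    """Fetch a required secret from the environment, failing fast
    with a clear message when it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# The transfer script would then replace its hardcoded literals with:
# connect_str = get_required_env("AZURE_STORAGE_CONNECTION_STRING")
# access_key = get_required_env("AWS_ACCESS_KEY_ID")
# secret_key = get_required_env("AWS_SECRET_ACCESS_KEY")
```

Failing fast with a named variable makes a misconfigured pool show up clearly in the task's stderr.txt instead of as a vague authentication error.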