Ingest Data into Microsoft Fabric OneLake using AzCopy

Inderjit Rana
Microsoft Azure
7 min read · Oct 5, 2023

Objective


Microsoft Fabric is one of the most exciting products for Data & Analytics made available very recently. This blog post shares instructions on how to use the AzCopy utility to ingest (in simple words, copy) data into Microsoft Fabric OneLake. There are many ways to ingest data into OneLake, each with its pros and cons (some methods are still a work in progress); AzCopy is an easy-to-use, cross-platform tool, but this use is not very well documented, hence the topic for this article. I will go over the details of how to use it, the cases where it will not work, and some good reasons to use it for ingesting data into OneLake.

Scope

The scope here is to address scenarios where the need is to ingest relatively large files (hundreds of GBs or even TBs) into Microsoft Fabric OneLake. I also touch on scenarios where you have data in Azure Storage, but other cloud stores such as Snowflake are out of scope for this article.

Pre-Requisites

  • The solution shared here is for a specific task related to Microsoft Fabric, and it is assumed you are familiar with Microsoft Fabric and its components; this is not an introductory article.
  • A Fabric-enabled workspace and a Lakehouse where the files will be copied, in case you want to follow along.

Scenario — Data is On-Premises and needs to be copied to Fabric OneLake

There are quite a few tools that can be used to copy data to OneLake:

  • Azure Storage Explorer — This is a GUI tool which has been around a while and is very widely used, so it can get the job done well, but if the need is to script the copy process you are better off using AzCopy directly. Azure Storage Explorer uses AzCopy underneath; a little trick worth sharing is that the sample commands listed below were copied from Storage Explorer as shown in the screenshot below (a nice way to reverse engineer the command). You can find more detailed information in the public docs — Integrate OneLake with Azure Storage Explorer — Microsoft Fabric | Microsoft Learn
Azure Storage Explorer — Copy AzCopy Command
  • OneLake File Explorer — This is another easy-to-use tool; it integrates with Windows File Explorer but is relatively new, so there is not a whole lot of historical information. It is in Preview as of now and is available only on Windows machines. You can read more detail in the public docs here — Access Fabric data locally with OneLake file explorer — Microsoft Fabric | Microsoft Learn (do check out the limitations section).
  • Fabric Portal File Upload — The Fabric Portal web interface is pretty easy to use, but if you are uploading large amounts of data (hundreds of GBs or TBs) I would prefer the other tools.
Fabric Portal File Upload
  • Fabric Data Flow Gen2 with On-Premises Data Gateway — The Fabric Data Factory On-Premises Data Gateway (expected to be similar to the Self-Hosted Integration Runtime from Azure Data Factory) is on the roadmap and will be another option in some time; both are in the same category, so I am combining them in this bullet. These are advanced, connector-based tools where a software agent is installed on a Windows server on-premises and can connect to a variety of data sources in a secure manner. Usually these tools are used when the need is to run data pipelines on a regular schedule; they are great tools that give you end-to-end visibility into your data pipelines, where data is copied to Fabric and subsequent steps then process/transform it. The drawback is that they do require setting up Windows machines on-premises, and additional technical skills are needed as well. Read more in the public docs — How to access on-premises data sources in Data Factory — Microsoft Fabric | Microsoft Learn

AzCopy

Now comes AzCopy, which I really like for its simplicity and availability across multiple platforms (Windows, Linux, Mac, etc.); it has been around for a while and is widely used. Some of the reasons to use AzCopy:

  • Proof of concept where the need is to quickly ingest a large amount of data into Fabric OneLake.
  • Even in cases where the database store is on-premises, it might be a quick solution to export the data to files and then use AzCopy to transfer those files to Fabric OneLake.
  • The need for a scripted method to easily repeat the copy tasks (see the sketch after the reference links below).
  • Even in production scenarios, I have seen AzCopy used to transfer data to Azure on a regular schedule (using on-premises job schedulers), so some folks might just prefer AzCopy scripts over a more advanced solution.

A couple of reference links — Copy or move data to Azure Storage by using AzCopy v10 | Microsoft Learn and azcopy copy | Microsoft Learn
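For the scripted or scheduled scenarios above, the copy command can be wrapped in a small script and triggered by whatever scheduler you already use. Below is a minimal sketch in Python; the source path, workspace, lakehouse, and folder names are placeholders taken from the samples later in this article, and it assumes azcopy is on the PATH and that "azcopy login" has already been performed.

import subprocess

# Minimal sketch of a repeatable copy; workspace/lakehouse/folder names are placeholders.
# Assumes azcopy is on the PATH and "azcopy login" has already been performed.
SOURCE = r"C:\mytemp\*"
DEST = ("https://onelake.blob.fabric.microsoft.com/"
        "<WorkspaceNameXXXX>/corelakehouse1.Lakehouse/Files/mytest1/")

result = subprocess.run(
    [
        "azcopy", "copy", SOURCE, DEST,
        "--overwrite=true",  # overwrite without prompting, since the run is unattended
        "--recursive",
        "--trusted-microsoft-suffixes=onelake.blob.fabric.microsoft.com",
        "--log-level=INFO",
    ],
    text=True,
)
result.check_returncode()  # a non-zero AzCopy exit code lets the scheduler flag a failed run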

How? Sample Commands

The sample commands below show how AzCopy can be used to transfer local files to the Fabric Lakehouse Files area; these files can be in any format such as CSV, JSON, etc. The usual flow in the Lakehouse pattern is to upload the raw files to the Lakehouse Files area and then use Fabric Spark to create Lakehouse tables. A Lakehouse automatically comes with a SQL endpoint, which gives end users the ability to query the data using T-SQL without any interaction with Spark.
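As an aside, once the raw files have landed in the Files area, turning them into a Lakehouse table is typically only a few lines of Spark in a Fabric notebook attached to the Lakehouse (where the spark session is predefined). A minimal sketch, assuming CSV files under Files/mytest1 and an illustrative table name:

# Minimal PySpark sketch (run in a Fabric notebook attached to the Lakehouse);
# the folder "Files/mytest1" and table name "my_table" are illustrative assumptions.
df = spark.read.option("header", "true").csv("Files/mytest1/")
df.write.format("delta").mode("overwrite").saveAsTable("my_table")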

Step 1: Capture the Destination Folder Path in Lakehouse where the files need to be uploaded as shown in the following screenshots:

OneLake Destination Properties
OneLake Destination Folder Path
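The destination path used in the sample commands below follows this pattern (workspace, lakehouse, and folder names will be your own): https://onelake.blob.fabric.microsoft.com/<WorkspaceName>/<LakehouseName>.Lakehouse/Files/<FolderName>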

Step 2: Perform AzCopy login using the command "./azcopy.exe login"; if you have multiple AAD tenants, specify the tenant ID using the "--tenant-id" option.

Step 3: Run the Copy Command, samples below

#Copy all files from local folder to Fabric OneLake
./azcopy.exe copy "C:\mytemp\*" "https://onelake.blob.fabric.microsoft.com/<WorkspaceNameXXXX>/corelakehouse1.Lakehouse/Files/mytest1/" --overwrite=prompt --from-to=LocalBlob --blob-type BlockBlob --follow-symlinks --check-length=true --put-md5 --disable-auto-decoding=false --recursive --trusted-microsoft-suffixes=onelake.blob.fabric.microsoft.com --log-level=INFO;
#Copy single file from local to Fabric OneLake
./azcopy.exe copy "C:\Users\insinghr\Downloads\employee_handbook.pdf" "https://onelake.blob.fabric.microsoft.com/<WorkspaceNameXXXX>/corelakehouse1.Lakehouse/Files/mytest1/employee_handbook.pdf" --overwrite=prompt --from-to=LocalBlob --blob-type BlockBlob --follow-symlinks --check-length=true --put-md5 --disable-auto-decoding=false --recursive --trusted-microsoft-suffixes=onelake.blob.fabric.microsoft.com --log-level=INFO;

Note: As mentioned above, these commands were reverse engineered from Azure Storage Explorer, so some of the specified options might not be required but do no harm. One thing to pay attention to is the "--trusted-microsoft-suffixes" option, which is required; otherwise AzCopy gives an error, and this is a key piece of information.

Scenario — Data is in Azure ADLS Gen2 Storage Account and needs to be copied to Fabric OneLake

Before going into sample commands, it is important to point out that Shortcuts for ADLS Gen2 (Storage Account) are a great feature where you don't have to copy the data at all and can directly consume data residing in Azure Storage from Fabric compute, but there can be valid reasons to consolidate data in Fabric OneLake. In such cases AzCopy can be used to perform a remote copy, where data is copied directly between the ADLS Gen2 Storage Account and Fabric OneLake irrespective of the client where AzCopy is executed (which could be outside Azure, like a server in your data center or a laptop). You can read more about shortcuts in the public docs — OneLake shortcuts — Microsoft Fabric | Microsoft Learn

Note: It is important to point out that the remote copy method does not work if your Storage Account is network protected; in other words, remote copy will only work if the Storage Account network setting is Allow All Networks. At this point in time, even Shortcuts do not work with network-protected Storage Accounts. Network configuration is a common source of confusion, so for clarity I am sharing a screenshot of the exact network setting under which remote copy or Shortcuts work.

Storage Network Settings

So, the intent here is to use AzCopy for a remote copy between a Storage Account and Fabric OneLake, and the instructions below cover two scenarios (please see the previous section on how to capture the destination path and for the login command, as I am not repeating those steps in this section).

Source Storage is in the same Azure AD Tenant as Fabric

Since both source and destination are in the same AAD tenant, the logged-in user's credentials are used and there is no need to include any credentials explicitly in the command.

./azcopy.exe copy "https://<StorageXXXX>.dfs.core.windows.net/sampledata/*" "https://onelake.blob.fabric.microsoft.com/<WorkspaceXXXX>/corelakehouse1.Lakehouse/Files/sampledata1" --overwrite=prompt --check-length=true --disable-auto-decoding=false --recursive --trusted-microsoft-suffixes=onelake.blob.fabric.microsoft.com --log-level=INFO;

Source Storage Account is in a different AAD Tenant from Fabric

In this case you will log in to the Fabric AAD tenant using the AzCopy login command, so the destination will use AAD credentials, but a Shared Access Signature (SAS) needs to be specified for the source, as shown in the command below.

./azcopy.exe copy "https://<StorageXXXX>.dfs.core.windows.net/sampledata/*?<sas>" "https://onelake.blob.fabric.microsoft.com/<WorkspaceXXXX>/corelakehouse1.Lakehouse/Files/sampledata1" --overwrite=prompt --check-length=true --disable-auto-decoding=false --recursive --trusted-microsoft-suffixes=onelake.blob.fabric.microsoft.com --log-level=INFO;

Disclaimer: The Microsoft Fabric platform is evolving fast; as things change I will try my best to keep this article updated, so please pay attention to its publish date. Check out the Microsoft Fabric Blog for the most recent updates and the latest learning material.

Note: Sample Commands in this article were tested using AzCopy Version 10.21.0
