Efficient Data Copying from SQL Server to Azure Data Lake Storage Gen2: Stage-1

Rahul Gosavi
4 min read · Jul 16, 2024


Introduction

In today’s data-driven world, businesses often face the challenge of efficiently managing and storing large volumes of data while ensuring security and accessibility. Azure Data Lake Storage Gen2 (ADLS Gen2) offers a robust solution for storing and analyzing structured and unstructured data at scale. In this guide, we’ll walk through how to copy data from multiple SQL Server tables into separate files, stored in distinct folders on ADLS Gen2, using Azure Data Factory.

Prerequisites

Before diving into the steps, ensure you have the following prerequisites set up:

  • An Azure subscription with access to the Azure portal.
  • A basic understanding of SQL and Azure services.
  • Azure Data Factory provisioned in your Azure environment.
  • An Azure SQL Database containing the source tables.
  • An Azure Data Lake Storage Gen2 account for storing the destination files.

Step-by-Step Guide

1. Setting Up Azure Data Lake Storage Gen2

First, create an Azure Data Lake Storage Gen2 account through the Azure portal if you haven’t already done so (in practice, a standard storage account with the hierarchical namespace enabled). This account will serve as the destination for our data.

2. Designing the Solution

To efficiently copy data from a SQL Server database to ADLS Gen2, we’ll follow these key steps:

2.1 Creating the Metadata Table

Begin by creating a SQL database table (CopyMetadata) that stores essential information about each source table and its corresponding destination path on ADLS Gen2. Here’s an example of the SQL script to create this table:

CREATE TABLE CopyMetadata (
    Table_Name NVARCHAR(128),
    Source_SQL NVARCHAR(MAX),
    Destination_Path NVARCHAR(MAX),
    File_Format NVARCHAR(10)
);

2.2 Populating the Metadata Table

Insert records into the CopyMetadata table with details about each source table and its desired output format (e.g., CSV, Parquet, JSON). For example:

-- Option 1: populate the metadata table automatically from the base tables
-- that already exist in the source database

INSERT INTO CopyMetadata (Table_Name, Source_SQL, Destination_Path, File_Format)
SELECT TABLE_NAME,
       CONCAT('SELECT * FROM ', TABLE_SCHEMA, '.', TABLE_NAME),
       CONCAT(TABLE_SCHEMA, '/', TABLE_NAME),
       'CSV'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_TYPE = 'BASE TABLE'; -- exclude views

-- Option 2: insert the rows manually for each table you want to copy

INSERT INTO CopyMetadata (Table_Name, Source_SQL, Destination_Path, File_Format)
VALUES
('Table1', 'SELECT * FROM Table1', '/data/Table1/', 'CSV'),
('Table2', 'SELECT * FROM Table2', '/data/Table2/', 'PARQUET'),
-- Repeat for each table you want to copy
('Table10', 'SELECT * FROM Table10', '/data/Table10/', 'JSON');

3. Implementing the Data Copy Process

3.1 Define Linked Services

Configure linked services in Azure Data Factory for both Azure SQL Database (source) and Azure Data Lake Storage Gen2 (destination). Linked services establish connections to your data sources and destinations.

Screenshot: Azure SQL Database linked service
Screenshot: ADLS Gen2 linked service
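
If you prefer to author the linked services as JSON rather than through the ADF UI, they look roughly like the sketch below. The names (ls_azuresql_source, ls_adlsgen2_sink) and the server, database, and credential placeholders are illustrative, not taken from this walkthrough; in a real deployment you would typically reference a Key Vault secret or use a managed identity instead of embedding credentials.

{
    "name": "ls_azuresql_source",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-db>;User ID=<user>;Password=<password>;Encrypt=True;"
        }
    }
}

{
    "name": "ls_adlsgen2_sink",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<your-storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<storage-account-key>"
            }
        }
    }
}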

3.2 Create Datasets

Create datasets in Azure Data Factory to represent the source SQL tables and to define the file layout in ADLS Gen2. Each dataset should specify the schema and properties of the data to be copied; the sink dataset should also expose parameters for the folder path and file name so that the pipeline can set them per table (see step 3.4).

Screenshot: Azure SQL Database dataset (source)
Screenshot: DelimitedText dataset (sink)
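
A rough JSON sketch of the two datasets, assuming the linked-service names from the previous step and a container named landing5 (the container used for the output later in this post). The sink dataset exposes folderPath and fileName parameters so the pipeline can set them per table; these parameter names are my own convention, not something ADF requires.

{
    "name": "ds_sql_source",
    "properties": {
        "linkedServiceName": { "referenceName": "ls_azuresql_source", "type": "LinkedServiceReference" },
        "type": "AzureSqlTable"
    }
}

{
    "name": "ds_adls_csv_sink",
    "properties": {
        "linkedServiceName": { "referenceName": "ls_adlsgen2_sink", "type": "LinkedServiceReference" },
        "parameters": {
            "folderPath": { "type": "string" },
            "fileName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "landing5",
                "folderPath": { "value": "@dataset().folderPath", "type": "Expression" },
                "fileName": { "value": "@dataset().fileName", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}

Leaving the table name off the source dataset is deliberate: the Copy activity will supply the query from the CopyMetadata table at runtime. If File_Format varies per table (CSV, Parquet, JSON), you would create one sink dataset per format along the same lines.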

3.3 Design Copy Pipeline

Design a copy pipeline in Azure Data Factory to orchestrate the data movement process:

  • Activities: Use a Copy activity to specify the source dataset (the SQL table) and the sink dataset (the ADLS Gen2 file); a JSON sketch of the activity follows the screenshots below.
Screenshot: copy activity source settings
Screenshot: copy activity sink settings
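
Here is a sketch of what that Copy activity can look like in JSON, assuming the dataset names above and that the activity runs inside a loop over the CopyMetadata rows (each row is available as item()). The activity name and the .csv suffix are illustrative.

{
    "name": "CopyTableToADLS",
    "type": "Copy",
    "inputs": [
        { "referenceName": "ds_sql_source", "type": "DatasetReference" }
    ],
    "outputs": [
        {
            "referenceName": "ds_adls_csv_sink",
            "type": "DatasetReference",
            "parameters": {
                "folderPath": "@item().Destination_Path",
                "fileName": "@concat(item().Table_Name, '.csv')"
            }
        }
    ],
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": { "value": "@item().Source_SQL", "type": "Expression" }
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" },
            "formatSettings": { "type": "DelimitedTextWriteSettings", "fileExtension": ".csv" }
        }
    }
}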

3.4 Dynamic Content with Parameters

Use parameters in Azure Data Factory to dynamically generate the file path and file name for each copied file. This keeps the data in ADLS Gen2 organized to match your business requirements.
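
For example, the sink dataset parameters from step 3.2 can be set with expressions like these (the column names come from the CopyMetadata table; the date suffix is just one possible naming convention, not something the metadata table requires):

folderPath: @item().Destination_Path
fileName:   @concat(item().Table_Name, '_', formatDateTime(utcNow(), 'yyyyMMdd'), '.csv')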

3.5 Create Azure Data Factory Pipeline

Create an Azure Data Factory pipeline that incorporates the copy activities designed in the previous step. Use the ADF visual interface or write JSON scripts to define the pipeline structure.
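
One common way to wire the CopyMetadata table into the pipeline is a Lookup activity that reads the table, feeding a ForEach activity that runs the Copy activity from step 3.3 once per row. A trimmed sketch follows; the pipeline and activity names are my own, and the body of the nested Copy activity is elided for brevity.

{
    "name": "pl_copy_sql_to_adls",
    "properties": {
        "activities": [
            {
                "name": "LookupCopyMetadata",
                "type": "Lookup",
                "typeProperties": {
                    "source": { "type": "AzureSqlSource", "sqlReaderQuery": "SELECT * FROM dbo.CopyMetadata" },
                    "dataset": { "referenceName": "ds_sql_source", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachTable",
                "type": "ForEach",
                "dependsOn": [
                    { "activity": "LookupCopyMetadata", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "items": { "value": "@activity('LookupCopyMetadata').output.value", "type": "Expression" },
                    "activities": [
                        { "name": "CopyTableToADLS", "type": "Copy" }
                    ]
                }
            }
        ]
    }
}

Setting firstRowOnly to false makes the Lookup return every row of the metadata table, and the ForEach receives them through @activity('LookupCopyMetadata').output.value.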

Screenshot: output files in the landing5 container

Please have a look at approach 2 here.

Conclusion

In this guide, we’ve explored the efficient process of copying data from SQL Server tables into separate files stored in distinct folders on Azure Data Lake Storage Gen2 using Azure Data Factory. By leveraging Azure cloud services and robust data integration capabilities offered by ADF, organizations can streamline their data workflows, ensure scalability, and unlock valuable insights from their data stored in ADLS Gen2.
