From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 3: Design and Creation)

Tahnik Ahmed
11 min read · Apr 21, 2024


This blog post is a continuation of From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 2: Data Modeling and Schema Design).

Problem Context:

In the previous articles, From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 1: Architecture) and From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 2: Data Modeling and Schema Design), the scenario was to build a data warehouse tailored for companies focused on retail operations. A data warehouse is the foundational component for the most outcome-driven technical and analytical workloads, such as data analysis, machine learning integration, and predictive/descriptive analytics.

Now, let’s implement the solution with Azure data services. It’s going to be a long post, but it will be worth it. Hang in there.

Data Sources and Data Ingestion:

Data Warehouse Architecture for Small and Medium Businesses (Diagram: Author)

The data sources mentioned in the architecture (discussed thoroughly in the first part of this series) include the following:

  1. Legacy data warehouses or other cloud data warehouses: Data is ingested from these sources.
  2. SQL Server and other cloud database servers: Data is migrated or partially refactored from legacy databases and SSIS packages.
  3. Real-time data from streaming sources: This enters the system via Azure Event Hubs and is immediately analyzed using Azure Stream Analytics.
  4. Unstructured, semi-structured, and structured data: These are passed into Azure Data Lake Storage Gen2 for centralized storage and analysis.

Data from all of these sources is ingested into Azure Data Lake Storage Gen2, which serves as the centralized storage and analysis platform, while the real-time streams arriving through Azure Event Hubs are analyzed immediately with Azure Stream Analytics.

Data Lake Ingestion Architecture [x]

The entire ecosystem is tightly integrated with other Azure services, ensuring seamless data flow and accessibility for potential consumers. Throughout this process, Microsoft Entra Managed Identities play a crucial role in maintaining security and managing roles within the pipeline.

Data Modeling and Schema Design: Recap

See “From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 2: Data Modeling and Schema Design)” for a detailed walkthrough of the data model behind this warehouse. Understanding the data model is essential before creating its tables and relationships.

Data Model mentioned in “From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 2: Data Modeling and Schema Design)” (diagram: Author)

Setting Up the Azure Synapse Analytics Workspace:

Set up the Repository:

All of the automation scripts, SQL query code, and ARM templates can be found in the following repository:

Prerequisites:

An active Azure subscription in which you have administrative-level access

Working knowledge of Azure data fundamentals; if you are new to the topic, you can begin from here.

Provision an Azure Synapse Analytics workspace:

There are two ways to provision an Azure Synapse Analytics workspace:

1. Using the Azure portal UI to create the workspace within a resource group; you can provision it with a few clicks.

2. Using a combination of a PowerShell script and an ARM template. [9]

An ARM (Azure Resource Manager) template is a block of code that defines the infrastructure and configuration for your project. These templates use a declarative syntax and are written as JSON (JavaScript Object Notation) files. In this context, a PowerShell script and an ARM template are used together to provision an Azure Synapse Analytics workspace within a new Azure resource group: in most production scenarios, it is best practice to automate the provisioning of resources and workspaces with scripts and templates so that resource deployment can be incorporated into a repeatable development and operations (DevOps) process.

Running the PowerShell Script in the Azure Portal:

  1. Open the Azure portal at https://portal.azure.com
  2. Create a new Cloud Shell (click on the [>_] button above) in the Azure portal, selecting a PowerShell environment and creating storage if prompted. The cloud shell provides a command line interface in a pane at the bottom of the Azure portal:

3. Clone the repository mentioned above by entering the following command in the PowerShell pane:

git clone https://github.com/tahhnik/Designing-Large-Scale-Data-Warehouse-with-Azure-Synapse-Analytics.git dw-capstone-project
Cloning the Git Repo on Azure Portal

4. Once the repository has been cloned, enter the following commands to change to its directory and run the setup.ps1 script it contains:

 cd dw-capstone-project
./setup.ps1

5. When asked, select the subscription you wish to use. This occurs only if you have access to multiple Azure subscriptions.

6. When prompted, enter a suitable password for the Azure Synapse SQL pool.

7. Please be patient while the script finishes executing. Usually, this process takes approximately 15 minutes, although occasionally it may take longer.

Provisioning the Synapse Workspace with the ARM Template

Creating A Dedicated SQL Pool:

After the script has completed, follow these steps:
1. Open the Azure portal and navigate to the resource group named abisqits-xxxxxxx.

2. Select your Synapse workspace from the list of resources.

3. In the Synapse workspace’s Overview page, click on the Open Synapse Studio card and select Open to launch Synapse Studio in a new browser tab.

4. Once Synapse Studio has loaded, expand the menu on the left side using the ›› icon.

5. Navigate to the Manage page and select the SQL pools tab.

6. Select the sqlxxxxxxx dedicated SQL pool and use its ▷ icon to start it, confirming that you want to resume it when prompted.

7. Wait for the SQL pool to resume, checking its status periodically using the ↻ Refresh button. The status will show as Online when it is ready.

Creating The Warehouse Tables with Dedicated SQL Pool:

Once you’ve set up a dedicated SQL pool, you can manage its operational status from the Manage page of Synapse Studio. Pause the pool when it is not needed to avoid incurring unnecessary costs. While the pool is active, you can explore it on the Data page and create SQL scripts to execute against it.

If you want to create the tables directly from PowerShell and skip straight to the querying, uncomment the following code block in the setup.ps1 file and run it. It will create the tables and load the data into them, so you can focus on querying the data warehouse. The next and final part of this series discusses the querying process, and you can jump directly to that post and start querying.

# Create database
write-host "Creating the $sqlDatabaseName database..."
sqlcmd -S "$synapseWorkspace.sql.azuresynapse.net" -U $sqlUser -P $sqlPassword -d $sqlDatabaseName -I -i setup.sql

The script creates the tables defined in the setup.sql file (the fact and dimension tables mentioned earlier), and you can access them directly from the Synapse Analytics workspace.

We are going to use the dedicated SQL pool we have just created to load, store, and query the data. Azure Synapse Analytics offers virtually limitless scale, strong security and privacy controls, and a unified experience for data engineering, data science, and business analysis. In this article, you’ll see how a dedicated SQL pool in Azure Synapse Analytics is used to store and query data in a relational data warehouse.

Creating the Dimension Tables:

Now, we’re going to create all the dimension tables with their respective attributes as defined in the schema. Make sure the dedicated SQL pool has been resumed. Go to the Data pane in the workspace and select the SQL database with the same name as the dedicated SQL pool, sqlxxxxxxx.

Now, right-click the database and select “New SQL Script” > “Empty script”.

Creating Dimension Table: DimAccount

When creating a dimension table, we have to ensure that it includes surrogate and alternate keys, along with columns for the relevant dimension attributes. Using an IDENTITY column simplifies generating an incremental surrogate key.

Now, write the following code and run it.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[DimAccount]
(
[AccountKey] [int] NOT NULL,
[ParentAccountKey] [int] NULL,
[AccountCodeAlternateKey] [int] NULL,
[ParentAccountCodeAlternateKey] [int] NULL,
[AccountDescription] [nvarchar](50) NULL,
[AccountType] [nvarchar](50) NULL,
[Operator] [nvarchar](50) NULL,
[CustomMembers] [nvarchar](300) NULL,
[ValueType] [nvarchar](50) NULL,
[CustomMemberOptions] [nvarchar](200) NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
GO

The DimAccount table will be created, and its columns will appear in the “Tables” pane of the sqlxxxxxxx database.
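Note that DimAccount above loads its AccountKey values from the source system rather than generating them. If the warehouse were to generate the surrogate key itself, as described earlier, an IDENTITY column could be used. The trimmed variant below is a hypothetical illustration only (DimAccountAlt is not part of the actual schema); the snowflaked dimensions that follow use IDENTITY in exactly this way.

-- Hypothetical illustration only: a trimmed variant of DimAccount where the
-- dedicated SQL pool generates the surrogate key instead of loading it.
CREATE TABLE [dbo].[DimAccountAlt]
(
[AccountKey] [int] IDENTITY NOT NULL,
[AccountCodeAlternateKey] [int] NULL,
[AccountDescription] [nvarchar](50) NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
GO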

Creating Dimension Tables: DimCustomer and DimGeography

When implementing a snowflake schema, where dimension tables are interconnected, it’s essential to include the parent dimension key in the definition of the child dimension table so the relationship can be resolved. Let’s look at the DimCustomer table as an example: the following SQL moves the geographical details out of DimCustomer into a separate table named DimGeography, with DimCustomer keeping a GeographyKey reference to it.

CREATE TABLE dbo.DimGeography
(
GeographyKey INT IDENTITY NOT NULL,
GeographyAlternateKey NVARCHAR(10) NULL,
StreetAddress NVARCHAR(100),
City NVARCHAR(20),
PostalCode NVARCHAR(10),
CountryRegion NVARCHAR(20)
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimCustomer
(
CustomerKey INT IDENTITY NOT NULL,
CustomerAlternateKey NVARCHAR(15) NULL,
GeographyKey INT NULL,
CustomerName NVARCHAR(80) NOT NULL,
EmailAddress NVARCHAR(50) NULL,
Phone NVARCHAR(25) NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);

Creating Time Dimension Tables

Almost every data warehouse includes a time dimension table. It enables users and consumers to aggregate data across multiple hierarchical levels of time, so analyses can be performed at finer granularity. The following SQL code creates the time dimension table (an illustrative sample row follows the script):

CREATE TABLE dbo.DimDate
(
DateKey INT NOT NULL,
DateAltKey DATETIME NOT NULL,
DayOfMonth INT NOT NULL,
DayOfWeek INT NOT NULL,
DayName NVARCHAR(15) NOT NULL,
MonthOfYear INT NOT NULL,
MonthName NVARCHAR(15) NOT NULL,
CalendarQuarter INT NOT NULL,
CalendarYear INT NOT NULL,
FiscalQuarter INT NOT NULL,
FiscalYear INT NOT NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);
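To make the hierarchy concrete, here is a single illustrative row for DimDate. The integer YYYYMMDD form for DateKey and the fiscal values are assumptions for this sketch (many warehouses use these conventions, but yours may differ); in practice the rows are produced during data loading rather than inserted by hand.

-- Illustrative only: one DimDate row using an assumed YYYYMMDD integer DateKey.
-- DayOfWeek here assumes 1 = Sunday; fiscal values assume the fiscal year
-- matches the calendar year. Adjust both to your organization's conventions.
INSERT INTO dbo.DimDate
(DateKey, DateAltKey, DayOfMonth, DayOfWeek, DayName,
 MonthOfYear, MonthName, CalendarQuarter, CalendarYear, FiscalQuarter, FiscalYear)
VALUES
(20240421, '2024-04-21', 21, 1, 'Sunday', 4, 'April', 2, 2024, 2, 2024);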

The same has to be done for the other dimension tables: DimCurrency, DimDepartmentGroup, DimEmployee, DimOrganization, DimProduct, DimProductCategory, DimProductSubCategory, DimPromotion, DimReseller, and DimSalesTerritory. The attributes for each are listed in the previous blog post, where the schema for the data warehouse was designed; a trimmed sketch of the snowflaked product dimension chain is shown below.
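As an illustration of how the remaining snowflaked dimensions fit together, here is a minimal sketch of the product dimension chain. The column lists are trimmed to a few representative attributes and are assumptions for illustration only; use the full attribute lists from the Part 2 schema when creating the real tables.

-- Minimal illustrative sketch of the snowflaked product dimension chain.
-- Column lists are abbreviated; refer to the Part 2 schema for the full set.
CREATE TABLE dbo.DimProductCategory
(
ProductCategoryKey INT IDENTITY NOT NULL,
ProductCategoryAlternateKey INT NULL,
ProductCategoryName NVARCHAR(50) NOT NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimProductSubCategory
(
ProductSubCategoryKey INT IDENTITY NOT NULL,
ProductSubCategoryAlternateKey INT NULL,
ProductSubCategoryName NVARCHAR(50) NOT NULL,
ProductCategoryKey INT NOT NULL -- parent key: links to DimProductCategory
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimProduct
(
ProductKey INT IDENTITY NOT NULL,
ProductAlternateKey NVARCHAR(25) NULL,
ProductName NVARCHAR(100) NOT NULL,
StandardCost MONEY NULL,
ListPrice MONEY NULL,
ProductSubCategoryKey INT NULL -- parent key: links to DimProductSubCategory
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);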

Creating the Fact Tables:

As fact tables are the repositories for the numerical records of a given operation, transaction, or entity, their attributes should be as comprehensive as possible. I have already discussed the attributes of the two fact tables in this context, and the code to create them is as follows.

Creating Fact Table: FactInternetSales

CREATE TABLE [dbo].[FactInternetSales]
(
SalesOrderNumber NVARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
SalesOrderLineNumber TINYINT,
CustomerKey INT,
ProductKey INT,
OrderDateKey INT,
DueDateKey INT,
ShipDateKey INT,
PromotionKey INT,
CurrencyKey INT,
SalesTerritoryKey INT,
OrderQuantity SMALLINT,
UnitPrice MONEY,
ExtendedAmount MONEY,
UnitPriceDiscountPct FLOAT,
DiscountAmount FLOAT,
ProductStandardCost MONEY,
TotalProductCost MONEY,
SalesAmount MONEY,
TaxAmount MONEY,
FreightAmount MONEY,
CarrierTrackingNumber NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS,
CustomerPONumber NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS,
RevisionNumber TINYINT
)
WITH
(
DISTRIBUTION = HASH(SalesOrderNumber),
CLUSTERED COLUMNSTORE INDEX
);

Creating Fact Table: FactResellerSales

CREATE TABLE [dbo].[FactResellerSales]
(
SalesOrderNumber NVARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
SalesOrderLineNumber TINYINT,
ResellerKey INT,
ProductKey INT,
OrderDateKey INT,
DueDateKey INT,
ShipDateKey INT,
EmployeeKey INT,
PromotionKey INT,
CurrencyKey INT,
SalesTerritoryKey INT,
OrderQuantity SMALLINT,
UnitPrice MONEY,
ExtendedAmount MONEY,
UnitPriceDiscountPct FLOAT,
DiscountAmount FLOAT,
ProductStandardCost MONEY,
TotalProductCost MONEY,
SalesAmount MONEY,
TaxAmount MONEY,
FreightAmount MONEY,
CarrierTrackingNumber NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS,
CustomerPONumber NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS,
RevisionNumber TINYINT
)
WITH
(
DISTRIBUTION = HASH(SalesOrderNumber),
CLUSTERED COLUMNSTORE INDEX
);

Running these SQL scripts in the dedicated SQL pool creates the fact tables. The following GIF shows the state of the workspace after all of the fact and dimension tables have been created. With every table in place, the warehouse is ready for data to be loaded from data lake storage.

To avoid turning this into a full-fledged technical book and to keep the post at a reasonable size, I am not including the remaining code blocks here; they follow the same patterns shown above, and creating the remaining tables is a good practical exercise. A quick metadata check, sketched below, confirms which tables exist and how each one is distributed.
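As a sanity check once the scripts have run, a short query against the pool’s catalog views lists the tables that were created and the distribution strategy each one uses. This is an optional sketch and assumes nothing beyond the tables created above.

-- Optional sanity check: list the warehouse tables and their distribution strategy.
SELECT t.name AS table_name,
       dp.distribution_policy_desc AS distribution
FROM sys.tables AS t
JOIN sys.pdw_table_distribution_properties AS dp
    ON t.object_id = dp.object_id
ORDER BY t.name;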

Loading Data Warehouse Tables:

Fundamentally, populating a data warehouse involves incorporating fresh data from data lake files into tables within the warehouse. [3,4] The COPY statement is an efficient way to achieve this, as demonstrated here. The example below loads staged product files (ingested from multiple sources in this simulated context) from Azure Data Lake Storage Gen2 into a staging table; the same pattern is then used to load the fact and dimension tables such as DimAccount, as sketched afterwards:

COPY INTO dbo.StageProducts
(ProductID, ProductName, ProductCategory, Color, Size, ListPrice, Discontinued)
FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/products/*.txt'
WITH
(
FILE_TYPE = 'CSV',
MAXERRORS = 0,
IDENTITY_INSERT = 'OFF'
);

Remember to list all of the target columns in the COPY column list. The process should be repeated for all of the fact and dimension tables; a sketch for DimAccount follows.
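For example, a load for the DimAccount table created earlier could look like the sketch below. The storage account, container, and folder names in the FROM clause are assumptions for illustration; point them at wherever your staged account files actually live.

-- Illustrative sketch: load staged account files into DimAccount.
-- The storage account, container, and folder names below are assumptions.
COPY INTO dbo.DimAccount
(AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
 AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions)
FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/accounts/*.csv'
WITH
(
FILE_TYPE = 'CSV',
MAXERRORS = 0,
FIRSTROW = 2 -- skip the header row if the staged files include one
);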

Now the data warehouse is loaded and ready to serve data to consumers such as data analysts, data scientists, and machine learning engineers who need integrated datasets, whether they work through Azure Machine Learning, Power Apps, Azure Logic Apps, Azure Functions apps, Azure App Service web apps, or other web applications.

Synapse Analytics itself provides all the tools and capabilities needed to analyze these datasets and generate insights: serverless SQL pools, dedicated SQL pools, Synapse Spark pools, and integrated Power BI functionality. Together, these components support processing and analyzing very large volumes of data.

So, how to query and generate insights in Azure Synapse Analytics?

In the next part of this series, the querying and analysis processes are discussed and implemented thoroughly with Azure data services.

Please follow the next part of this article, “From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part Four: Querying the Data Warehouse)”, where I simulate the process of querying and analyzing data from a data warehouse on Azure Synapse Analytics for a retail company.

To read more stories and updates, or if you have any queries:

Let’s connect on LinkedIn.

For (almost) real-time updates about my work, follow me on Twitter/X.

All the very best for your endeavors!

References:

[1] https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-sql-pool-studio

[2] https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/create-data-warehouse-portal

[3] https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-sql-pool-portal

[4] https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/get-started-azure-data-studio

[5] https://learn.microsoft.com/en-us/azure-data-studio/quickstart-sql-dw

[6] DP-500-Azure-Data-Analyst (microsoftlearning.github.io)

[7] Quickstart: Create a dedicated SQL pool using the Azure portal — Azure Synapse Analytics | Microsoft Learn

[8] Tutorial: Get started analyze data with dedicated SQL pools — Azure Synapse Analytics | Microsoft Learn

[9] Templates overview — Azure Resource Manager | Microsoft Learn

Appendix:

Thumbnail: “From Design to Deployment: Data Warehousing with Azure Synapse Analytics (Part 3: Design and Creation)”
