Modern Data Warehouse on Azure

Erim YAGCI
2 min readNov 10, 2021

--

Microsoft’s services are easy to use and its applications are always improving in terms of both functionality and usability. Unfortunately, this continous improvement causes, technical documentation on web to become frequently out of date. Regardless of how big or small an organization is, it can make use of Azure.

Related Products and services on Azure

Azure Blob Storage: Original format data store; provides scalable, cost-effective cloud storage for all data and a wide variety of services for connecting and using the data. Blob storage is commonly used as a staging area for the source data before loading it into Azure Synapse.

Azure Synapse: Azure Synapse is a distributed system designed to perform analytics on large data. It supports massive parallel processing, which makes it suitable for running high-performance analytics.

Azure Data Factory: Data Orchestration Service, extracts data and publishes data from/to multiple data sources (a wide range of data source and target connectors and file formats supported). Provides a graphical interface to build, monitor, and manage data pipelines. It coordinates the various stages of the data transportation process.

Microsoft PowerBI : Interactive Data Visualization, enables visual exploration and analyzing of data all in one view. Provides a way to collaborate on and share customized dashboards and interactive reports.

Data Pipeline : A pipeline is a logical grouping of activities used to coordinate a task — in this case, loading and transforming data into Azure Synapse.

Reference Architecture

Features that are important for enterprise data warehousing scenarios are:

  • Automation of the pipeline
  • Incremental loading
  • Integrating multiple data sources

When you run an automated ETL or ELT process, it’s most efficient to load only the data that changed since the previous run which is called an incremental load. To perform an incremental load, you need a way to identify which data has changed. The most common approach is to use a high water mark value, which means tracking the latest value of some column in the source table, either a datetime column or a unique integer column.

It is important to create separate resource groups for production, development, and test environments. Separate resource groups make it easier to manage deployments, delete test deployments, and assign access rights.

And finally, you can use Azure Monitor for analyzing the performance of data warehouse and the entire Azure analytics platform for an integrated monitoring experience.

Bonus: Find architecture diagrams and technology descriptions for reference architectures, real world examples of cloud architectureson Azure here.

--

--

Erim YAGCI

Translate business problems into data and AI products. Coaching fantastic teams of data engineers, data scientists, analysts and operations teams