Building a Scalable Data Warehouse in the Cloud with PaaS(sion)
Cloud computing is the future, available in the present. It may seem rather complicated in the context of data warehousing, but nothing could be further from the truth!
Platform as a service (PaaS) is a cloud computing model that guarantees the availability of all key elements within one platform, directly in the cloud: from development tools, through deployment, monitoring, and management, all the way to data analysis and presentation. It is a novel approach that is still gaining popularity, although it sometimes requires a fresh look at concepts that have been proven for years.
When we think about a data warehouse and data analysis on a corporate scale, what usually comes to mind are traditional mechanisms based on on-premises architecture. The four most important platforms (SQL Server, Integration Services, Analysis Services, and Reporting Services) have helped us build great solutions for years. However, they are often not flexible enough to keep up with the needs of companies that are constantly evolving. In the case of servers, scaling up vertically (adding components) or scaling out horizontally (adding machines) sometimes becomes very expensive, takes too long, or is even completely impossible to implement or maintain. The decisive factor is therefore knowledge of alternative solutions that let us achieve similar goals (building on the know-how and experience gained from well-known mechanisms) with the help of new, cost-optimized technologies that bring possibilities on an unprecedented scale.
Their potential was recognized by one of our clients, the largest retailer of books, press, and music in Poland. By entrusting us with the task of creating a complete data warehouse solution in the PaaS approach, they knew that they could count on speed in designing and creating processes, flexibility in performance scaling and cost management, and further development of their platform based on a growing range of Azure cloud capabilities.
An extremely fast data integration process with Integration Runtime and Azure Data Factory version 2
Thanks to the Integration Runtime (IR) service, we can quickly and easily install software on the client's side that lets us communicate securely and efficiently with servers within their infrastructure. Whether we want to connect to a SQL Server, Oracle, MySQL, or PostgreSQL database, download a file, or call a REST API, all we need to do is configure the security keys, and we can connect to those sources from Azure services. The list of supported databases and formats, although already extensive, is still growing. In addition, if the need arises, we can even use a specially prepared environment in the cloud that allows us to run our existing Integration Services packages.
Azure Data Factory V2 (ADFv2) is a cloud equivalent of SSIS packages, enabling faster design of data flows and transformations from many sources located both on on-premises servers and in the cloud. Creating, managing, running, and monitoring the so-called “Pipelines” is carried out using a web browser on the dedicated Azure platform.
In comparison with the previous version (v1), preparing solutions is more convenient and intuitive here. The flow creation process is reminiscent of the well-known BIDS or Visual Studio and Integration Services components. We can also find here such functionalities as the use of Azure Databricks, Azure Data Lake, or even HDInsight for data analytics, Azure Batch Service for custom flows or transformations, and finally support for Git repositories and the related mechanisms of continuous integration and deployment. ADFv2 is a powerful tool that we can additionally manage using the REST API or directly in PowerShell.
Massively parallel processing — SQL Server on steroids
The technology of parallel data processing on many connected servers is not new. But in the case of a product like SQL Server, it can arouse interest, and rightly so! Imagine one non-trivial SQL query that has to return complex aggregates from hundreds of billions of rows. In Azure SQL Data Warehouse, it will first go to the primary server, the Control Node, where it is divided into tasks for 60 smaller databases called distributions, which are served by a specified number of compute servers known as Compute Nodes. The number of Compute Nodes depends on the performance level (DWU) we choose in Azure: in the smallest variant (DW100) a single Compute Node performs tasks on all 60 databases, while in the largest (DW6000) there are 60 Compute Nodes, each working on a single distribution.
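To benefit from this architecture, every table declares how its rows are spread across the 60 distributions. A minimal sketch, with hypothetical table and column names, of a hash-distributed fact table and a replicated dimension:

```sql
-- Hypothetical fact table: rows are spread across the 60 distributions
-- by hashing the SaleId column, stored as a clustered columnstore index.
CREATE TABLE dbo.FactSales
(
    SaleId    BIGINT        NOT NULL,
    StoreId   INT           NOT NULL,
    ProductId INT           NOT NULL,
    SaleDate  DATE          NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension table copied to every Compute Node, so joins
-- do not require data movement between distributions.
CREATE TABLE dbo.DimStore
(
    StoreId   INT           NOT NULL,
    StoreName NVARCHAR(200) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (StoreId)
);
```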
In addition, if our requirements grow even further, we can switch to the performance-optimized compute model (Compute Optimized Performance Tier) with just a few clicks, where the distribution data is kept in local storage caches and tempdb structures based on extremely fast NVMe SSDs. Their performance in action can be viewed here.
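Scaling does not have to go through the portal, either; it can be scripted. A minimal sketch of changing the performance level in T-SQL, assuming a hypothetical warehouse named MyDW:

```sql
-- Run against the master database of the logical SQL server hosting the warehouse.
-- 'DW1000' is just an example service objective; on the Compute Optimized tier
-- the levels carry a 'c' suffix (e.g. 'DW1000c').
ALTER DATABASE MyDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000');
```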
Data Factory v2 and Data Warehouse — an outstanding team!
The advantages of the MPP architecture are also used by the ADFv2 mechanisms. PolyBase support allows data to be loaded in parallel, directly into the distribution databases, bypassing the Control Node. The parameterization logic, based on the JSON format, lets you easily and quickly create universal solutions, and automatic column and type mapping helps you build generic components for loading data from different sources into the stage area.
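What ADFv2 does when the PolyBase option is enabled corresponds roughly to a hand-written PolyBase load. A minimal sketch of that idea, with hypothetical names for the storage account, data source, file format, and tables:

```sql
-- External data source pointing at blob storage (hypothetical account and container).
-- A DATABASE SCOPED CREDENTIAL holding the storage key would normally be referenced
-- here as well (CREDENTIAL = ...).
CREATE EXTERNAL DATA SOURCE AzureBlobStage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://stage@mystorageaccount.blob.core.windows.net'
);

-- Description of the flat files being loaded.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ';')
);

-- External table over the files; nothing is stored in the warehouse yet.
CREATE EXTERNAL TABLE stage.ExtSales
(
    SaleId   BIGINT,
    StoreId  INT,
    SaleDate DATE,
    Amount   DECIMAL(18,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureBlobStage,
    FILE_FORMAT = CsvFormat
);

-- CTAS reads the files in parallel on the Compute Nodes, bypassing the Control Node.
CREATE TABLE stage.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM stage.ExtSales;
```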
In the ELT approach (Extract, Load, Transform), all transformations are carried out in the warehouse, right after the data is loaded. This gives us complete freedom to manage changes in logic using pure T-SQL code, and it also delivers incomparably higher performance than traditional SQL Server or even Integration Services. The “CREATE TABLE AS SELECT” (CTAS) statement allows us to create tables easily and quickly. We can also add and reload partitions directly, without separate partition functions and schemes, thanks to the simplified syntax, as shown in the sketch below. We just need to remember that our data is already divided into 60 parts (after all, we have 60 different distributions), which matters for performance in columnstore tables. Of course, for smaller sets we can use a traditional clustered index or leave them as heaps.
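A minimal sketch of that pattern, again with hypothetical object names: a CTAS that builds a partitioned columnstore table from staged data, followed by partition switches that reload a single month.

```sql
-- Build the transformed data as a hash-distributed, partitioned columnstore table.
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01'))
)
AS
SELECT s.SaleId, s.StoreId, s.SaleDate, s.Amount
FROM stage.Sales AS s;

-- Reload January: move the old partition out of the live table into a stale copy,
-- then switch the freshly built partition in its place. All three tables are assumed
-- to share the same distribution, index, and partition layout.
ALTER TABLE dbo.FactSales     SWITCH PARTITION 2 TO dbo.FactSales_Old PARTITION 2;
ALTER TABLE dbo.FactSales_New SWITCH PARTITION 2 TO dbo.FactSales     PARTITION 2;
```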
The entire flow of loading and processing data can be run directly from Azure Data Factory v2. Thanks to error handling, timeouts, and retry configuration, we can quickly and effectively implement logic for unexpected situations.
Azure Analysis Services — tabular in the cloud
If you haven't had contact with the tabular model yet, don't worry! Its logic, modeling, and principles of operation are much simpler than what we find in traditional OLAP, so getting to know it is just a matter of time. In the multidimensional approach, we had to deal with the “peculiar” abstraction level of data cubes, which took time to get used to. In the tabular approach, the idea is similar to the relational model, and therefore more intuitive and natural for most of us. What is the secret of the efficiency of such a solution, you may ask? The in-memory VertiPaq engine is responsible for it, with a balanced compression ratio and advanced multithreaded processing algorithms. In addition, if there is a need, our model can be based on DirectQuery, where analytical queries are run directly against the source database rather than calculated in AAS.
Azure Analysis Services supports the latest tabular model compatibility level (1400), so many of the limitations of previous versions are now history. We can manage the model through the TOM library or with TMSL scripts, which express the instructions as JSON objects, making them much easier to write. Thanks to the available methods of processing selected partitions, we do not have to reprocess the entire model every time. The DAX language lets us define the required calculated columns and measures and query data from the model. Of course, we do not have to limit ourselves to it, as tabular also supports MDX.
The Azure portal now offers several Analysis Services pricing tiers, divided by purpose and hardware parameters. Each is billed for the time our server is running, which of course means that we pay only for actual use. The server can be started and stopped at any time, and we can likewise change its performance level whenever we need to. The data we store in memory is highly compressed, so depending on its characteristics we can accommodate up to hundreds of gigabytes of information. In our project, we also used the Azure Automation platform. It is a service that allows us to automate processes using PowerShell or Python. We can create flexible solutions for running pipelines, processing data models, automatically reloading partitions, or even starting and stopping other Azure services.
Power BI — the cherry on top
Efficient loading into Azure Data Warehouse and a well-prepared data model in Azure Analysis Services are only half the battle. Dry data brings no benefit without the tools to analyze and present it. Fortunately, Power BI works great here! In our solution, we use Power BI Embedded and embed reports directly in SharePoint Online, which gives us an easy way to share data with users directly in their company portal. Naturally, Excel and other desktop solutions are not a problem here either. Most importantly, in each of these tools, thanks to our precalculated in-memory tabular model in Azure, all of our interactions, cross-filtering, slicing, and so on are performed in the cloud, very quickly, and on sets that long ago exceeded billions of rows!
Learn it. Know it. Done.