Tutorial: Building a Solid Foundation for your Cloud Analytics Platform
This hands-on tutorial accompanies an article outlining the foundational principles and best practices of cloud analytics platforms. The content in this piece is built out in Azure and is meant to provide a simplified walkthrough of both the DevOps basics and the data services common to many cloud data architectures. Read the main article, which contains far more detailed information on the fundamentals of a modern cloud data architecture, here.
What’s Inside
This lab covers the basics of developing a cloud platform, drawn from experience helping a variety of organizations with their own efforts. The components included are common regardless of an organization’s cloud expertise: I have seen them form the core of high-cloud-maturity companies, and I have watched them benefit the people and processes of organizations that are brand-new to the cloud.
Additionally, while this article leverages Azure services, the components are so fundamental that equivalent services would be used regardless of cloud provider.
Modern Cloud Analytics Platform in 3 Parts
This section is covered in depth in the main article (linked above); here, we include only the information relevant to the tutorial.
We outline the following architectural components:
- Storage Layer
- Modeling Layer
- ETL Layer
Part 1: Storage Layer
A data lake serves as the storage layer of a modern data architecture. In cloud data ecosystems, data lakes are highly secure, virtually limitless in scale, and economically attractive.
Data Lake Services
Azure Data Lake Storage (ADLS) Gen2 is the recommended storage layer when working in Azure. It is built on top of Blob Storage, with additional functionality such as a hierarchical namespace with which to structure your data lake.
Part 2: Modeling Layer
The (enterprise) data model lives in the data warehouse; this is what enables business and operational processes to run.
Data Warehousing Services
Microsoft is making a big push towards Azure Synapse Analytics. Synapse appears to be the intended one-stop shop for all things cloud analytics, so it’s only natural that Synapse Dedicated SQL Pool is the warehousing solution offered by Azure. In fact, Synapse Dedicated SQL Pool was formerly Azure SQL Data Warehouse. Synapse also has a Serverless SQL Pool offering, but that is better suited to ad-hoc analytics over your enterprise model data than to serving as your dedicated warehouse engine (as the names suggest).
Part 3: ETL Layer
Modern data architectures are built on top of ETL/ELT processes that get data where it needs to be, in the format it needs to be in. Orchestration of these pipelines can often be handled by the pipeline services themselves, for example by running processes on a set schedule.
ETL Services
Microsoft offers three main options for ETL/ELT: Azure Data Factory (ADF), Synapse Pipelines, and Databricks. Each of these services has its own pros and cons as far as technical attributes, cost, usability, and maintenance are concerned. ADF and Synapse Pipelines are both Microsoft-built and owned, and have largely overlapping feature sets with some discrepancies between the two. For this tutorial, Synapse Pipelines are used because Synapse is already being leveraged as the data warehousing service. This decision simplifies the tutorial, but it does not necessarily mean Synapse Pipelines is the correct ETL service for every data architecture. I cannot overstate the importance of performing a detailed analysis and comparison of all candidate services in the context of your use case.
Let’s get started
This tutorial builds out a basic cloud analytics platform from a simulated on-premise environment into Azure. Note that we are not leveraging configuration-driven pipelines or infrastructure as code here (both are strongly recommended for your organization’s cloud analytics platform). Also, this tutorial does not outline security and governance best practices; please follow your organization’s guidelines when creating any new resources.
Prerequisite: an Azure DevOps project already set up and ready to be configured with the Synapse Workspace (once we’ve created it, of course).
Infrastructure Setup
- Create a resource group in which to store everything you’re about to build.
- Create a virtual network. If you ever need to connect different virtual networks, make sure their IP address spaces don’t overlap. You can use a website like this one to check: https://network00.com/NetworkTools/IPv4CheckOverlappingNetworks/ or run the quick local check shown after this list.
- Create a VPN Gateway. In this case, because I’m simulating the on-premise environment with another Azure resource group, we could have used vNet peering, but a VPN Gateway better simulates true on-premise-to-cloud connectivity.
- Once both VPN Gateways are in the available state, create a connection from the cloud VPN Gateway to the on-premise network’s gateway (you’ll need one installed on that side in order to create this link). This task cannot be performed while the gateways are updating, and they may remain in the updating state for a while.
- In order to connect to a non-Azure-native data source, you’ll need to install a self-hosted integration runtime (IR) on a VM in close networking proximity (for example, the same subnet) to your data sources. You can install the Microsoft self-hosted IR by first downloading it onto your VM from this website: https://www.microsoft.com/en-us/download/details.aspx?id=39717. Once downloaded, follow the setup wizard. You will need to register the self-hosted IR; however, this is only possible once the Synapse Workspace has been created and a few more steps have been followed, so we’ll come back to this in a moment.
- A Storage Account is necessary not only for your data lake, but also to store metadata for your Synapse Workspace. To take advantage of Azure Data Lake Storage Gen2, you’ll need to enable the hierarchical namespace option. You’ll probably also want to whitelist your client IP under the “Networking” options.
- Next, navigate to Synapse Analytics and create a new Synapse Workspace, leveraging the Storage Account created in the previous step for your file system. This is another service in which you’ll want to whitelist your client IP under “Networking” options. You will also want to ensure that both your user and the Synapse Workspace have sufficient permissions to interact with the Storage Account (AKA: the data lake).
- Finally, you can now register the self-hosted IR. Navigate to the Synapse Workspace and create a self-hosted IR there; in the final step of this process, Synapse will present two keys. Copy one of these and paste it into the appropriate field when registering the self-hosted IR on the host VM. You should see confirmation on both sides (the on-prem VM and the Synapse Workspace) that the registration has been successful.
- One additional note: if you want to land source data in certain file formats, such as Parquet, you will need the Java Runtime Environment installed on the host VM. This is a common issue which you can get ahead of by following the instructions in this documentation: https://docs.microsoft.com/en-us/troubleshoot/azure/general/error-run-copy-activity-azure.
- You will also want to set up Linked Services for the on-prem database and the data lake, as well as any other systems you need to move data in and/or out of. Additionally, you’ll need to set up a Private Endpoint to the Storage Account in which your data lake resides. This is all doable via the Synapse Studio interface, by clicking on the “Manage” (toolbox) icon on the left-hand side.
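As an alternative to the overlap-checking website mentioned above, here is a minimal local check using Python’s built-in ipaddress module. The two address ranges are hypothetical placeholders; substitute your own.

```python
import ipaddress

# Hypothetical address spaces for the cloud VNet and the simulated on-premise network
cloud_vnet = ipaddress.ip_network("10.1.0.0/16")
onprem_network = ipaddress.ip_network("10.2.0.0/16")

# overlaps() returns True if the two ranges share any addresses
if cloud_vnet.overlaps(onprem_network):
    print("Address spaces overlap -- choose a different range before connecting the networks")
else:
    print("No overlap -- safe to connect these networks")
```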
Now that your minimum viable infrastructure is set up, we can look at building some analytics into this cloud platform.
Sample Data Pipeline
Ok, we’re not actually going to do much analytics here but we are going to show you how to enable analytics in your organization using Azure Data Lake Storage (ADLS) Gen2 and Synapse Analytics. We’ll also use Synapse Pipelines as the ETL service.
Let’s start at the data lake. Navigate to your Storage Account, and then select “Containers” from the left-hand panel. At the top of the list of containers in the account, there is a button with a plus sign beside the word “Container.” Create containers for the zones you want to leverage. For the purposes of this article, we’ll have a Raw Zone and a Standard Zone.
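If you would rather script the container creation than click through the portal, a minimal sketch using the azure-storage-file-datalake and azure-identity packages might look like the following. The storage account name is a placeholder, and your identity needs an appropriate data-plane role on the account.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL -- substitute your own Storage Account name
ACCOUNT_URL = "https://<your-storage-account>.dfs.core.windows.net"

service_client = DataLakeServiceClient(
    account_url=ACCOUNT_URL, credential=DefaultAzureCredential()
)

# One container (file system) per data lake zone
for zone in ["raw", "standard"]:
    service_client.create_file_system(file_system=zone)
```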
Next, you can start creating a pipeline in the Synapse Studio. Select the “Integrate” (pipe) icon on the left-hand side. Then, select the plus icon in the left-hand menu and select “Pipeline” from the dropdown menu.
The main items in Synapse Pipelines are called “Activities.” An Activities menu will appear, in which you can search for the activities you want to add to your pipeline. One of the most common activities is the “Copy data” activity, in which data is moved from a source to a sink. Both the source and the sink need to be configured to point to the correct systems, and the sink format (e.g., Parquet) can also be specified. In this demo, we use a Copy data activity to bring data from the on-premise SQL Server into the Raw Zone via a self-hosted IR.
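To make that configuration a little more concrete, below is a rough sketch of what a Copy data activity looks like in the underlying pipeline definition, shown here as a Python dict. The activity name matches the source_to_raw activity used later in this tutorial, but the dataset reference names are hypothetical placeholders, and the exact properties will vary with your setup; in practice you generate this in Synapse Studio rather than writing it by hand.

```python
# Approximate shape of a Copy data activity in the pipeline definition.
# Dataset reference names are placeholders; Synapse Studio generates the real definition.
copy_activity = {
    "name": "source_to_raw",
    "type": "Copy",
    "inputs": [{"referenceName": "OnPremSqlServerTable", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "RawZoneParquet", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "SqlServerSource"},  # reads from on-premise SQL Server via the self-hosted IR
        "sink": {"type": "ParquetSink"},        # lands the data as Parquet in the Raw Zone
    },
}
```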
Another commonly-used activity is called a “data flow.” Data flows are a powerful activity type that actually run on Spark under the hood (and are therefore the most expensive activity type). They enable us to perform actions we can’t do with other activities, such as converting to and from Delta format, which requires Spark. For this example, we use two data flow activities: one to move data from the Raw Zone to the Standard Zone, and another to move data from the Standard Zone into the Dedicated SQL Pool. We need to use data flows here because we are working with Delta format. Additionally, between Raw and Standard, part of our data flow activity will involve adding a “lastupdated” timestamp column to our data.
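Data flows are configured visually rather than in code, but conceptually the Raw-to-Standard step does something like the following PySpark. This is a sketch only: the paths are placeholders, and `spark` is the session provided by a Synapse Notebook.

```python
from pyspark.sql.functions import current_timestamp

# Placeholder ADLS paths for the Raw and Standard zones
raw_path = "abfss://raw@<your-storage-account>.dfs.core.windows.net/star_classification/"
standard_path = "abfss://standard@<your-storage-account>.dfs.core.windows.net/star_classification/"

# Read the Parquet landed by the copy activity, stamp each row, and write it out as Delta
df = spark.read.parquet(raw_path)
df = df.withColumn("lastupdated", current_timestamp())
df.write.format("delta").mode("overwrite").save(standard_path)
```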
Technically, our pipeline could be complete with the above activities. However, in an effort to demonstrate the integration of Notebooks into pipelines, we’ll also add a Notebook to our activities. This Notebook will truncate and reload a managed Spark table (using Delta format) in the default Synapse Workspace storage with data from the Raw Zone.
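The Notebook itself is only a few lines of PySpark. The sketch below assumes the same placeholder Raw Zone path as above; using mode("overwrite") gives the truncate-and-reload behaviour against the managed Delta table.

```python
from pyspark.sql.functions import current_timestamp

# Placeholder path to the Parquet data in the Raw Zone
raw_path = "abfss://raw@<your-storage-account>.dfs.core.windows.net/star_classification/"

# Truncate and reload the managed Delta table in the default database
df = spark.read.parquet(raw_path).withColumn("lastupdated", current_timestamp())
df.write.format("delta").mode("overwrite").saveAsTable("default.star_classification")
```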
In order to have increased visibility into your pipeline’s operations as it runs, turn on “Data flow debug”. Then select “Debug” to run your pipeline. For more information on debugging your Azure ETL pipelines, check out this article:
Once I kick off my pipeline (using the integration runtime option), I can follow my pipeline’s activities as they occur. The activity’s status will start off as “queued” and then change to “in progress” before ending up as “succeeded” or “failed”.
Before source_to_raw was kicked off, the Raw Zone of our data lake was empty. Upon completion of the activity, we can see the data landed in our Storage Account in Parquet format.
The next activity is the parquet_to_delta data flow. Here, the source is the data in the Raw Zone in Parquet format. Before converting to Delta and landing the data in the Standard Zone, we add a column called “lastupdated”, populated with the current timestamp.
In addition to indicating the success of our activity, the above screenshot also provides an indication of what delta format data looks like in a data lake.
The next step in our pipeline is the spark_tables Notebook. While this step is redundant and unnecessary from a functional pipeline perspective, it provides an example of how we can leverage Notebooks as part of our Synapse Pipelines.
In order to run Notebooks containing Spark code on an ad-hoc basis, it is necessary to create a Spark Pool. In the above screenshot, you can see the words “Attach to” beside a dropdown menu containing the value “sparkpool”. Spark pools can be created in the Synapse Studio via the Manage tab. Because Spark pools are serverless, you only pay for your Spark pool while it is running. However, that does not mean Spark pools are inexpensive; quite the opposite, in fact. It is important to avoid runaway costs by limiting factors such as the node size and the maximum number of nodes available to your Spark pool, especially for non-production workloads.
That said, when the Notebook runs as part of a pipeline, it is executed on the underlying Synapse Spark engine, similar to how Data flows are executed. Upon completion of our Notebook activity, a table called “star_classification” will appear under the “default” database in the Synapse datastore. This table will also have a “lastupdated” column added to it, as per the code in the Notebook screenshot above.
The final step of the pipeline involves copying data from the Standard Zone, in Delta format, to the Dedicated SQL Pool. This is achieved via a Data flow, as we require the Spark engine to work with Delta format. Before this activity is initiated, there are no tables in the “cloud_synapse_dw” database.
Upon completion of the activity, we can query the data in our Dedicated SQL Pool. We can also gain confidence that we have the most up-to-date values loaded by leveraging the “lastupdated” timestamp, which was updated in the “delta_to_dw” activity.
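As a quick sanity check from outside Synapse Studio, you could query the Dedicated SQL Pool with pyodbc. The connection details and table name below are placeholders; only the database name (cloud_synapse_dw) comes from this tutorial.

```python
import pyodbc

# Placeholder connection details for the Dedicated SQL Pool
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-workspace>.sql.azuresynapse.net,1433;"
    "Database=cloud_synapse_dw;"
    "Uid=<sql-admin-user>;Pwd=<password>;Encrypt=yes;"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# Confirm the freshness of the load via the lastupdated column (table name is a placeholder)
cursor.execute("SELECT MAX(lastupdated) AS most_recent_load FROM dbo.star_classification;")
print(cursor.fetchone()[0])

conn.close()
```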
Fortunately, Synapse makes it easy for us to query the other data landed as part of this pipeline run, whether via auto-generated SQL scripts or Notebooks. By navigating to “Data” (database icon) on the left-hand side, you can leverage auto-generated queries which use Serverless SQL Pools on data in your Linked Services.
Expand the data lake Linked Service and select “raw.” Then, navigate to the Parquet file we landed as part of the pipeline and right-click on the file. A dropdown menu of options will appear; to generate the below screenshot, we selected “New SQL script” and then the “Select top 100 rows” option. Remember to select “Run”! Also worth noting is that there is no “lastupdated” column at this stage.
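The auto-generated script looks roughly like the following (the storage account and folder path are placeholders), shown here as a Python string; paste it into a SQL script in Synapse Studio to run it against the built-in serverless SQL pool.

```python
# Approximate shape of the auto-generated "Select top 100 rows" script over Parquet
select_top_100 = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<your-storage-account>.dfs.core.windows.net/raw/star_classification/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
"""
```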
We can also query the Delta format data we landed in the Standard Zone by following an almost identical set of steps. However, instead of navigating all the way to the Delta Lake files themselves (as we did for Parquet), we only need to navigate to the namespace containing those files and then right-click on it. This is indicated by the URL shown in the screenshot below. Here, we do have a “lastupdated” column.
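The generated query for the Delta data differs mainly in that it points at the folder rather than at individual Parquet files and uses the DELTA format (again, paths are placeholders):

```python
# Approximate serverless query over the Delta data in the Standard Zone
select_delta = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<your-storage-account>.dfs.core.windows.net/standard/star_classification/',
    FORMAT = 'DELTA'
) AS [result];
"""
```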
Finally, to query the data in our managed table created by the Notebook activity in the pipeline, we actually want to use a Notebook attached to a Spark pool. We can still leverage auto-generated code for this. Instead of navigating to “Linked” Data, navigate to the table under “Workspace” data. Right click on the table name, hover over “New Notebook” and select “Load to DataFrame.” The result of these steps (and then running the Notebook) is shown below.
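The generated code is roughly equivalent to the following, where `spark` and `display` are provided by the Synapse Notebook session:

```python
# Load the managed table into a DataFrame and preview a few rows
df = spark.read.table("default.star_classification")
display(df.limit(10))
```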
And there you have it! A demonstration of the basic components of a cloud analytics platform. The data used in this tutorial is the “Stellar Classification Dataset — SDSS17” provided by fedesoriano on Kaggle. You can access it here:
Again, this demo covers the fundamentals as well as a way to quickly spin up a POC for the core of your organization’s cloud analytics. With these components in place, it is possible to add all sorts of features, ranging from a configuration-driven pipeline to streaming data services and even machine learning operations.
It all starts out with a solid foundation.