Introduction to Azure Synapse Analytics

Jonathan Bogerd
Feb 21, 2022

In this 5-part series we will train and test a machine learning model using Azure Synapse Analytics and Azure Machine Learning. In this first article we will set up a Synapse workspace and a Machine Learning workspace and link them together. In the next articles we will extract, load and transform data (ELT), develop a machine learning model using AutoML and score a dataset with the developed model. In the last article we will go through the main advantages and disadvantages, for a data scientist, of using Synapse in combination with Azure Machine Learning.

Set up Synapse

In this section we will go over the steps required to set up the Synapse environment. We will also show how to set up Azure Machine Learning and link the two resources, because later articles rely on this link to use Synapse for data science purposes. The first step in setting up Azure Synapse Analytics is to enable it for the subscription, if this has not been done already. To do this, go to the subscription you want to use and open the Resource Providers tab. Here you can search for Microsoft.Synapse and register it.
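
The same registration can also be done from code. The following is a minimal sketch, assuming the azure-identity and azure-mgmt-resource Python packages are installed and that you are already logged in (for example through the Azure CLI). The helper names and the subscription ID argument are placeholders for illustration and are not part of the setup described in this article.

```python
def is_registered(registration_state: str) -> bool:
    """Return True when a resource provider reports the state 'Registered'."""
    return registration_state.strip().lower() == "registered"


def ensure_provider_registered(subscription_id: str,
                               namespace: str = "Microsoft.Synapse") -> str:
    """Register the resource provider if needed and return its reported state.

    Assumption: azure-identity and azure-mgmt-resource are installed. They are
    imported lazily so that this file can be loaded without them.
    """
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)
    provider = client.providers.get(namespace)
    if not is_registered(provider.registration_state):
        # Registration is asynchronous: the state can read 'Registering'
        # for a while before it becomes 'Registered'.
        provider = client.providers.register(namespace)
    return provider.registration_state
```

Calling ensure_provider_registered with your own subscription ID returns the state that Azure reports, so it can be called again later to confirm that registration has finished.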

For this article we will combine all resources in a new resource group. To create it, select the Resource Groups tab under Settings. This lists all resource groups that are present in the selected subscription. Click on Create, fill in the name of the resource group and select the appropriate region. In this article we will name the resource group rg-synapse-article and select (Europe) West Europe.

In the Tags section, you can add the tags you plan to use in your Azure environment. A common example is a tag that records the environment, with a value such as Production. More information on tags can be found in the Azure documentation on resource tags.
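
For readers who prefer scripting over the portal, the resource group and its tags can be sketched as follows. This assumes the azure-identity and azure-mgmt-resource packages. The function names are hypothetical, and the tag key environment with the value Production is only an illustration of the tagging idea above.

```python
def resource_group_parameters(location: str, tags: dict) -> dict:
    """Build the request body for creating a resource group."""
    return {"location": location, "tags": dict(tags)}


def create_resource_group(subscription_id: str, name: str, parameters: dict):
    """Create (or update) the resource group and return the SDK result.

    Assumption: azure-identity and azure-mgmt-resource are installed.
    """
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)
    return client.resource_groups.create_or_update(name, parameters)


# "westeurope" is the programmatic name of the West Europe region.
# The tag below is illustrative only.
PARAMETERS = resource_group_parameters("westeurope", {"environment": "Production"})
```

With a real subscription ID, create_resource_group(subscription_id, "rg-synapse-article", PARAMETERS) would create the resource group used in this article.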

After this is done, go to Review + create. Once validation has passed, click Create to create the resource group. Now navigate to the resource group we just created, either by selecting it from the list or by clicking on To Resource in the top right corner. The overview page should show an empty resource group.

We will start by creating the Synapse workspace. Click on Create in the top left corner and search for Azure Synapse Analytics (listed under Analytics). This leads to a form for creating Azure Synapse Analytics. Azure automatically creates a managed resource group for Synapse and names it for you. If you want to name it yourself, provide the name in the Managed Resource Group field. We will use mrg-synapse-article. The workspace name will be syn-workspace-weu01 and the region again West Europe.

The Synapse workspace has a primary Data Lake Storage account. We will use this storage account for storing data and even for storing a machine learning model. Create a new storage account by clicking on Create new. Note that this has to be a Data Lake Storage Gen2 account, that is, a storage account with hierarchical namespace enabled. We will use dlweu01 as the storage account name, because hyphens are not allowed in storage account names. We also need to provide a new container in this storage account: select Create new and name the container synapse. The box that assigns the Storage Blob Data Contributor role is ticked automatically. We will leave it ticked, as this role is required.
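
The naming restriction on storage accounts is easy to trip over, so here is a small sketch of a check. Azure storage account names must be 3 to 24 characters long and may contain only lowercase letters and digits. The function name is made up for this example.

```python
import re

# 3 to 24 characters, lowercase letters and digits only.
_STORAGE_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check a candidate name against the storage account naming rule."""
    return _STORAGE_ACCOUNT_NAME.fullmatch(name) is not None
```

The name dlweu01 passes this check, while a hyphenated variant such as dl-weu01 does not. Note that the check only covers the format: the name also has to be globally unique, which only Azure can verify.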

Once this is done, go to the next tab, Security. Here we provide the account details for the built-in SQL pool of Azure Synapse Analytics. Azure automatically generates the password, and we will leave this as is. In our setup, network access is not required. If you want to use double encryption with your own key, you can enable this setting. For this article, we will not use it.

On the networking tab, you can select virtual network options. For now, we will disable connections from all IP addresses for security purposes.

Now we have provided all necessary fields to validate and then create the workspace. The deployment of our newly created workspace will take a few minutes.

Set up Azure Machine Learning

While the deployment is ongoing, we will create the Azure Machine Learning resource. Go to the resource group, click on Create again and select Machine Learning. For the workspace name we will use aml-workspace-weu01, and for the region select West Europe.

Azure Machine Learning also requires a storage account. However, contrary to Azure Synapse Analytics, it must be a regular storage account with hierarchical namespace disabled, so not a Data Lake Storage Gen2 account. Therefore we cannot reuse the Synapse storage account, and we create a new one by selecting Create new. The name will be amldlweu01 and we will use locally-redundant storage (LRS), as this is the cheapest option. For production purposes you might want to select a different redundancy level, depending on your requirements.

AML also requires a Key Vault, in which the workspace stores its keys and other secrets. Select Create new and name the Key Vault kv-weu01. The last resource we create in this section is Application Insights, named appi-weu01. For now, we will not create a container registry. A container registry is used when deploying a model, but it can be created at that point, so it is not required yet. After this is all done, select Review + create and deploy the AML workspace.
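
The storage requirement differs between the two services: Synapse needs hierarchical namespace enabled, while Azure Machine Learning needs it disabled. A rough sketch of how one might verify this for an existing storage account is shown below, assuming the azure-identity and azure-mgmt-storage packages. The function names and the service labels "synapse" and "aml" are invented for this example.

```python
# Whether each service needs hierarchical namespace (HNS) on its storage account.
REQUIRES_HNS = {"synapse": True, "aml": False}


def storage_fits(service: str, is_hns_enabled) -> bool:
    """Return True if the HNS setting matches what the service needs.

    The SDK can report None when the setting was never enabled, so the
    value is normalised with bool() before comparing.
    """
    return bool(is_hns_enabled) == REQUIRES_HNS[service]


def hns_enabled(subscription_id: str, resource_group: str, account_name: str) -> bool:
    """Look up the HNS setting of an existing storage account.

    Assumption: azure-identity and azure-mgmt-storage are installed.
    """
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    account = client.storage_accounts.get_properties(resource_group, account_name)
    return bool(account.is_hns_enabled)
```

For example, storage_fits("aml", hns_enabled(subscription_id, "rg-synapse-article", "amldlweu01")) should be true for the storage account created in this section.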

Spark pool and Linked Service

Now that all required resources are created, it is time to open the Synapse workspace. Before we do this, though, we need to add our client IP address to the allowed IP addresses by going to Networking and clicking Add client IP. Do not forget to save this change. Finally, it is time to open the workspace! Click on the Synapse resource and click Open Synapse Studio. In the next article we will go through the options of Synapse; for now we only create a Spark pool. On the left-hand side, click on Manage and then on Apache Spark pools. In this tab we can create a Spark pool by specifying its name and configuration. For this article, we create a pool with the name sparkpool01. Currently, only Memory Optimized clusters can be used. Then pick the required size of the cluster. For this example, we will use the smallest Spark pool possible and disable Autoscale.

Next come the additional settings. In this tab, you can configure automatic pausing and select the Spark version. For this article, we will use Spark 2.4, as version 3.1 is not supported by Azure Machine Learning at the time of writing. Note that this version comes with Python 3.6.

We are almost done with the setup now. The only thing left is to give Synapse access to both the AML workspace and the AML storage. After that, AML and its storage can be added as Linked Services.

To do this, go to the AML storage account and click Access Control on the left-hand side. Click on Add role assignment and select the role Storage Blob Data Contributor. Then, on the next screen, click on Select members and search for the Synapse workspace by name. Click Select and then Review + assign. Now do exactly the same for the AML workspace, but select the role Contributor instead of Storage Blob Data Contributor.

Now that this is done, we can create both linked services. Click Manage in the Synapse workspace, then click on Linked Services, New, and find AML in the options. We will name the Linked Service LSAML1 (hyphens are not allowed). Using managed identity as the authentication method, select the appropriate subscription and workspace. Test the connection to make sure everything works before creating the linked service. Similarly, create a Linked Service LSDL1 to the storage account of AML, again using managed identity. For these changes to take effect, we need to publish them by clicking Publish all.
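
The two role assignments can also be scripted with the Azure CLI. The sketch below only builds the command strings, so nothing is executed against Azure. The helper names are hypothetical, and the principal ID is a placeholder for the object ID of the Synapse workspace's managed identity, which you would need to look up yourself.

```python
import shlex


def storage_scope(subscription_id: str, resource_group: str, account: str) -> str:
    """Resource ID of a storage account, used as the role assignment scope."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Storage/storageAccounts/{account}")


def aml_workspace_scope(subscription_id: str, resource_group: str, workspace: str) -> str:
    """Resource ID of an Azure Machine Learning workspace."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace}")


def role_assignment_command(principal_id: str, role: str, scope: str) -> str:
    """Build an 'az role assignment create' command for a managed identity."""
    return shlex.join([
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        # Managed identities are assigned roles as service principals.
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
    ])
```

Calling role_assignment_command once with Storage Blob Data Contributor and the storage scope, and once with Contributor and the AML workspace scope, yields the two commands that mirror the portal steps above.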

In order to be able to use the Spark pool cluster from Azure Synapse Analytics in Azure Machine Learning, we also have to link the Synapse workspace to AML. To do this, go to the AML workspace and open Linked Services. There, click on add integration to choose the Synapse workspace. For the Name we will use LSSYN01, for the other fields, select the appropriate subscription and workspace. In the next tab, select the Spark pool we just created and give the pool a compute name within AML. When this is done, select create to create the linked service and compute.
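
The same link can be created from Python instead of the AML studio. The sketch below assumes version 1 of the Azure Machine Learning SDK (the azureml-core package) and an already loaded Workspace object. The function name is made up, and the compute name synspark01 is a hypothetical choice, since the article does not prescribe one.

```python
# Names used in this article, plus one hypothetical compute name.
LINK = {
    "synapse_resource_group": "rg-synapse-article",
    "synapse_workspace": "syn-workspace-weu01",
    "linked_service_name": "LSSYN01",
    "spark_pool": "sparkpool01",
    "compute_name": "synspark01",  # hypothetical, not from the article
}


def link_synapse_to_aml(ws, link=LINK):
    """Register the Synapse workspace in AML and attach its Spark pool.

    Assumption: azureml-core (SDK v1) is installed and `ws` is an
    azureml.core.Workspace for the AML workspace.
    """
    from azureml.core import LinkedService, SynapseWorkspaceLinkedServiceConfiguration
    from azureml.core.compute import ComputeTarget, SynapseCompute

    config = SynapseWorkspaceLinkedServiceConfiguration(
        subscription_id=ws.subscription_id,
        resource_group=link["synapse_resource_group"],
        name=link["synapse_workspace"],
    )
    linked_service = LinkedService.register(
        workspace=ws,
        name=link["linked_service_name"],
        linked_service_config=config,
    )
    # Attach the Spark pool as a compute target inside AML.
    attach_config = SynapseCompute.attach_configuration(
        linked_service, type="SynapseSpark", pool_name=link["spark_pool"]
    )
    compute = ComputeTarget.attach(
        workspace=ws, name=link["compute_name"], attach_configuration=attach_config
    )
    compute.wait_for_completion()
    return compute
```

The portal route and this route should lead to the same result: a linked service in AML and a compute target backed by the Synapse Spark pool.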

This concludes this article, in which we created a Synapse workspace and a Machine Learning workspace, set up a Spark pool and linked the two Azure services. Now we are ready to load some data into our Synapse workspace and start the machine learning process. This is the topic of the next article.
