No Code Data Enrichment with Azure Synapse and Azure Machine Learning

Image for post
Image for post

TLDR; This post will walk through how to train and evaluate Azure ML AutoML Regressions model on your data using Azure Synapse Analytics Spark and SQL pools.

Before we get started let’s make sure we are all on the same page with the core Azure concepts needed to take your data to the next level.

If you are new to Azure you can get started a free subscription using the link below.

What is Azure Synapse Analytics?

Azure Synapse Analytics is an integrated service that accelerates extracting insightful across data warehouses and big data systems.

Image for post
Image for post

Azure Synapse ties together traditional relational SQL enterprise data warehousing, unstructured data stores and serverless Apache Spark , to enable limitless pipelines for ETL and ELT operations. Furthermore Synapse Studio provides a unified interface for data monitoring, coding, and security.

Image for post
Image for post
Synapse has deep integration with other Azure services such as Power BI, CosmosDB, and AzureML which makes it perfect for wrangling insight out of your data.

What is Azure Machine Learning Auto ML?

Image for post
Image for post
Auto ML, is the process of automating the time consuming, iterative tasks of machine learning model development.

Azure Machine Learning (Azure ML) is a cloud-based service for creating and managing machine learning solutions. It’s designed to help data scientists and machine learning engineers to leverage their existing data processing and model development skills & frameworks.

Image for post
Image for post
What is Azure Machine Learning

In Azure Machine Learning, Auto ML allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity all while sustaining model quality.

With Auto ML we can transform Synapse Analytics Data into actionable baseline models to enrich datasets at scale without writing a single line of machine learning code.

Regression is used to build models to forecast numeric values such as taxi fares based on learned input features.

In the next section, I will walk you through an end to end example of how to enrich you Synapse data by training and evaluating a model with the NYC Taxi Dataset. Once you get preform these steps you’ll be able to train and run your own Auto ML on models on any tabular dataset of your choosing.

Getting Started

Image for post
Image for post
Let’s ingest, train a model and enhance some data to predict taxi fares with the

Step 1: Set Up Azure Machine Learning and Synapse Workspaces

First if you do not have them already we need to create our Azure ML and Synapse workspaces.

Image for post
Image for post
Create Azure ML Workspace from the Portal or use the Azure CLI
A nice video from youtube to walk you through creating an Azure Synapse workspace easy to follow steps from the docs can be found here.

Step 2: Link Azure Machine Learning Workspace to Synapse Service

Once we have deployed our two workspaces we need to link them. Full steps for linking Azure ML and Synapse Workspaces can be found here

Image for post
Image for post
Follow these steps to create a service principle and link the azure ml and machine learning workspaces.

Step 3: Create Dedicated Serverless Apache Spark Spark and SQL Pools

To actually ingest and process our data we need to use pools. Azure Synapse Analytics offers various analytics engines to help you ingest, transform, model, analyze, and serve your data.

For this tutorial since we are using a toy dataset we can use the cheapest pools available for you own data you may want to configure your pools accordingly.

The serverless Apache Spark pools offers open-source big data compute capabilities. This is where the majority of our data processing and Auto ML code will run.

Image for post
Image for post
Steps for creating a Dedicated Spark Pool can be found here

A dedicated serverless SQL pool offers T-SQL based compute and storage capabilities. We will use this pool to store the data we want to enhance with our AutoML model.

Image for post
Image for post
Steps for creating a dedicated sql pool can be found here

One key advantage of Azure Synapse Analytics is if you configure a time out you only pay for the compute when it’s in use.

Step 4: Upload and run the Spark Taxi Data Notebook to create Spark Database and SQL Test Database

Once we have our serverless Spark and SQL pools up and running we can now ingest our data setup our Spark and SQL tables for training and testing respectively.

Download this Spark Create-Spark-Table-NYCTaxi- Data.ipynb notebook and import it into your workspace.

Image for post
Image for post
Image for post
Image for post

Once the notebook is uploaded, change the sql_pool_name value to match the name of your sql pool and then select the desired spark pool and run click all.

Step 5: Launch Auto ML Wizard and Train Regression Model with AutoML using NYC Taxi Spark Table.

Once the data is ingested we can use our spark nyc_taxi and Spark Pool to train an AutoML regression model for forecasting taxi fares.

Image for post
Image for post

Follow the three steps in wizard below to train your model.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Note: Be sure to set the target column to fareAmount and use the onnx model model compatibility option. For a more in depth explanation of the training steps check out the documentation here

This will kick off your Auto ML Regression training job. It should take about 2hrs to run when it is complete we can then evaluate it on our SQL test table.

Image for post
Image for post
Image for post
Image for post

Step 6: Enhance SQL Table with Trained Auto ML Model

Once we have the best model we can now evaluate it on our test SQL table using our SQL Pool.

First we need to select the table we want to enhance with the model we just trained.

Image for post
Image for post

Then we select our new Auto ML model, map the our input table columns to what the model is expecting and choose or create a table for storing our model locally.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

The wizard will generate a T-SQL script that evaluate our model against the test data and outputs the fare predictions.

Image for post
Image for post

There you have it all you need to know to train and test you own AutoML models and make them actionable.

Next Steps

Image for post
Image for post

Additional Synapse Documentation and Walkthroughs worth checking out can be found below:

Now that you’ve finished the steps above it is time to try them out on you own synapse data. Feel free to post in the comments if you have any questions and to share the cool models you make!

For more about Machine Learning and Azure Synapse check out this awesome episode from the AI Show

Look forward to seeing what AutoML and Azure Synapse can do for you!!

Acknowledgments

Thanks to Nellie Gustafsson, Yifan Song and Chang Xu from the Azure Synapse product team for their great documentation and support during the writing this post.

About the Author

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies and computational medicine. As an Open Source Engineer at Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community, to solve real world problems with game changing technologies that are then documented, open sourced, and shared with the rest of the world.

Microsoft Azure

Any language.

Aaron (Ari) Bornstein

Written by

<Microsoft Open Source Engineer> I am an AI enthusiast with a passion for engaging with new technologies, history, and computational medicine.

Microsoft Azure

Any language. Any platform. Our team is focused on making the world more amazing for developers and IT operations communities with the best that Microsoft Azure can provide. If you want to contribute in this journey with us, contact us at medium@microsoft.com

Aaron (Ari) Bornstein

Written by

<Microsoft Open Source Engineer> I am an AI enthusiast with a passion for engaging with new technologies, history, and computational medicine.

Microsoft Azure

Any language. Any platform. Our team is focused on making the world more amazing for developers and IT operations communities with the best that Microsoft Azure can provide. If you want to contribute in this journey with us, contact us at medium@microsoft.com

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store