Get started with Azure SQL in Databricks

- featuring pandas

Charlotte Patola
CodeX
6 min read · Oct 5, 2021


This tutorial will go through how to read and write data to/from Azure SQL Database using pandas in Databricks. If you want to learn the basics of Databricks, you can check out this post.

This is part 2 in a series about Databricks:

  1. Get Started with Pandas in Databricks
  2. Get started with Azure SQL in Databricks
  3. Get Started with Azure Blobs in Databricks

Prerequisites

  1. Some familiarity with Python pandas
  2. An instance of Databricks — preferably via Azure
  3. An instance of Azure SQL Database. Make sure that the connectivity settings allow access from Databricks. If you have your Databricks instance on Azure, you can set “Allow Azure services and resources to access this server” (see picture below) to Yes. Be sure to use a strong password or other secure authentication method, as the database is now accessible from the outside world. If you would like a higher security level, have a look at this guide.
Azure SQL Access Settings

Our Demo Case

As demo data, we have an SQL file holding course evaluations. You will find the file here on GitHub. We will insert this data into our Azure SQL server, connect to it in Databricks, transform it with pandas and write it back as a new table into Azure SQL. When the process is finished, we will schedule it to run once a month.

Setup and Preparations

To get down to business, we need a cluster that is up and running and an empty python Notebook. If you do not know how to set this up, check out step 1 and step 3 in this post.

You also need to create a table in Azure SQL and populate it with our sample data. Do this by (for example) going to the Query Editor in the Azure Portal and entering this query in the Query Editor pane.

Query Editor in Azure SQL

We will connect to the database with JDBC/PySpark and then save the data to a pandas dataframe. After this, we make the transformations in pandas. This is possible because we are currently working with a small dataset. If you need to handle big data, you can use PySpark or PySpark pandas (the pandas API on Spark) instead.
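
As a sketch of that big-data alternative (assuming Spark 3.2 or later, where the pandas API on Spark ships with PySpark), the Spark dataframe we read further below could be kept distributed instead of being collected into local pandas:

# Sketch, Spark 3.2+: convert the Spark dataframe (Spdf, created later in this
# tutorial) into a distributed pandas-on-Spark dataframe instead of local pandas
psdf = Spdf.to_pandas_on_spark()
psdf.head()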

In order for the connection to work, you need the database name and the server name, as well as your username and password (marked within <> signs in the code further below). When you go to the database in Azure, you will see the database name and server name on the Overview page.

Azure SQL Connection Info

It is good practice not to write usernames and passwords in notebooks. In this tutorial we make an exception and do it, but for production databases you should use a key vault or similar. A key vault lets you give an alias to your sensitive information, and you can then refer to the alias in your notebooks, for example like this: dbutils.secrets.get(scope="azurekeyvault_secret_scope", key="sqldbpwd"). Here is info on how to set up a Key Vault on Azure and here on how to connect it to Databricks. After you have done this, you can access the secrets from the Key Vault from within your notebooks.
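
If you have set up such a secret scope, the connection properties used later in this tutorial could be built from secrets instead of literal strings. A minimal sketch, where the scope and key names are placeholders for whatever you configured:

# Sketch: fetch credentials from a Key Vault-backed secret scope
# (scope and key names below are placeholders)
jdbc_user = dbutils.secrets.get(scope="azurekeyvault_secret_scope", key="sqldbuser")
jdbc_password = dbutils.secrets.get(scope="azurekeyvault_secret_scope", key="sqldbpwd")
connectionProperties = {
    "user": jdbc_user,
    "password": jdbc_password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}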

You find the whole code for our demo below and here on GitHub (course_feedback_database).

Extract

We start by importing the required libraries and creating a SparkSession.

import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pandas to spark").getOrCreate()

We then configure the JDBC connection with our database information. Input variables are marked within <> signs.

jdbcHostname = "<database_server_name>.database.windows.net"
jdbcDatabase = "<database_name>"
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": "<database_username_or_keyvault_alias>",
    "password": "<database_password_or_keyvault_alias>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

We can now extract the data and inspect it. Note that we need to define the table we want to extract from, here COURSE_FEEDBACK.

Spdf = spark.read.jdbc(url=jdbcUrl, table="COURSE_FEEDBACK", properties=connectionProperties)
display(Spdf)
Extracted Spark dataframe

Transformation

In order to do the transformations in pandas, we need to convert the Spark dataframe into a pandas dataframe.

course_feedback = Spdf.toPandas()
course_feedback.info()
Extracted pandas dataframe

We transform the dataframe by adding a new column with the row-wise mean of the feedback columns Setup, Material and Teacher. This equals the overall grade each student gave the course.

# Create Overall column
course_feedback['COURSE_EVALUATION'] = course_feedback.iloc[:, 4:7].mean(axis=1).round(2)

# Drop not needed columns
course_feedback.drop(['SETUP', 'MATERIAL', 'TEACHER'], inplace=True, axis=1)

# Inspect
course_feedback.head(5)
Transformed pandas dataframe
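
As a side note, if the column positions in the source table are not guaranteed, the mean in the first line above can instead be computed by selecting the feedback columns by name; a sketch of that alternative:

# Sketch: compute the overall grade by column name instead of iloc positions
# (run in place of the iloc line, before the columns are dropped)
course_feedback['COURSE_EVALUATION'] = course_feedback[['SETUP', 'MATERIAL', 'TEACHER']].mean(axis=1).round(2)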

Load

In order to load the new dataframe back to Azure SQL, we need to convert it into a Spark dataframe again.

Spdf_overall = spark.createDataFrame(course_feedback)

Now we can write it to the database as a new table. Note that we define the mode as overwrite. This means that the table will not be updated with new rows each time the code runs, but the whole table will be replaced. You find more mode alternatives here.

Spdf_overall.write.jdbc(url=jdbcUrl, table="COURSE_FEEDBACK_OVERALL", mode="overwrite", properties=connectionProperties)
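
For comparison, if each run should instead add that month's new rows to an existing table, the same write with the mode set to append would do it (a sketch, not what this demo uses):

# Sketch: append rows to an existing table instead of replacing it
Spdf_overall.write.jdbc(url=jdbcUrl, table="COURSE_FEEDBACK_OVERALL", mode="append", properties=connectionProperties)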

Confirm

We can confirm that the new table was created by (for example) using the Query Editor in the Azure Portal.

Inspect new Table in Azure SQL
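
You can also confirm the result directly from the notebook by reading the new table back over the same JDBC connection; a minimal sketch:

# Sketch: read the newly written table back and inspect it
check_df = spark.read.jdbc(url=jdbcUrl, table="COURSE_FEEDBACK_OVERALL", properties=connectionProperties)
display(check_df)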

Schedule

When we have confirmed that the code works as intended, it is time to schedule it. In our demo case, new course evaluations come in every month and the evaluation for each month closes on the first of the following month. We will schedule the code to run a few days after this every month.

Select Jobs in the left menu in Databricks and then Create Job. The window below will pop up. Fill in Task name and choose your Notebook. We will use a New Job Cluster for the scheduled runs, so we go ahead and specify it under Cluster/Edit.

Schedule Job, 1

You can configure the cluster the way you want, but for this little demo task, we go for one worker.

Schedule Job, 2

When the cluster is set up, click Edit Schedule in the right menu and fill in the preferred schedule in the window that pops up. If you want to, you can also create email alerts if your code fails to run. This is done under Edit Alerts in the right menu.

Schedule Job, 3
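
Databricks job schedules are based on Quartz cron expressions, so the same monthly schedule can also be expressed as cron. As an illustration, the exact day and time below are just an assumption for this demo:

# Quartz cron expression (sketch): run at 06:00 on the 3rd day of every month
# Field order: seconds minutes hours day-of-month month day-of-week
monthly_schedule = "0 0 6 3 * ?"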

Status

We have now connected to an Azure SQL database, extracted data with PySpark, transformed it with pandas and loaded it back into the database with PySpark. We have also scheduled our code to run once a month.
