Building Data with bricks: Microsoft Azure Databricks
Our love affair with Azure Synapse has suddenly come to an end. Today we’ll be learning about another great service provided by Microsoft. Databricks has its advantages: it processes data stored in services such as Azure Blob Storage, Azure Data Lake Store, data warehouses, and databases, and it also supports data streaming.
Transitioning from Synapse to Databricks shouldn’t be that big a deal. Databricks offers similar basic functionality: ingesting data, transforming it, and exporting the finished result. In this article, we will go through a basic ETL process in Databricks, just like we did in Synapse.
In Databricks, we work with notebooks. Databricks offers a tutorial, the Quickstart Notebook, which gives you an introductory lesson to the application. You can use different programming languages — R, Python, Scala, and Spark SQL — in one notebook, but in different cells. You switch between these languages using magic commands.
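For instance, a notebook whose default language is Python can still run SQL or Scala in individual cells. A sketch of what those cells look like (the table name `sales` is just an invented example):

```
# Cell 1 — default language (Python):
df = spark.table("sales")

# Cell 2 — the %sql magic switches this one cell to Spark SQL:
%sql
SELECT COUNT(*) FROM sales

# %scala, %r, and %md work the same way for Scala, R, and Markdown cells.
```

The magic command must be the first line of the cell, and it only affects that cell.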
To begin a task in Databricks, create a new notebook and attach it to a cluster. The type of cluster to use depends on how demanding your task is. The Databricks Runtime for Machine Learning (e.g. DBR 7.4 ML) is an environment used to create GPU clusters for machine learning tasks; a cluster with 24 cores and 168 GB of memory is suitable for machine learning on large data sets. Other options exist for different types of jobs.
To begin working on data, you will need to ingest it from a data source. This is the first part of the ETL Process — extraction.
In this article, we will query databases using JDBC drivers. JDBC stands for Java Database Connectivity, and it is the Java API that allows us to connect to a database and issue queries. An alternative is the C-based ODBC (Open Database Connectivity) API.
To query, define the following with the appropriate credentials:
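A minimal sketch of those connection settings — every hostname, database name, and credential below is a placeholder you should replace with your own values:

```python
# Placeholder credentials -- substitute your own Synapse/SQL details
jdbcHostname = "yourworkspace.sql.azuresynapse.net"
jdbcPort = 1433
jdbcDatabase = "your_database"
jdbcUsername = "your_user"
jdbcPassword = "your_password"

# Build the JDBC URL and the properties dict that Spark's JDBC reader expects
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase}"
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
```

In practice you would pull the password from a Databricks secret scope rather than hard-coding it.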
Our data here is stored in a database in the Synapse workspace, backed by blob storage. To get access to this database, sign in to your Azure portal and locate your Synapse workspace in your resources. In the Overview tab, find your dedicated SQL endpoint and copy it as your jdbcHostname. The rest of the fields should be pretty straightforward.
Next, specify the access key for your data lake storage account, and set Arrow to enabled (it’s disabled by default).
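A sketch of that configuration, to run in a notebook cell on the cluster (the storage account name and key are placeholders):

```
# Placeholder storage account name and key -- replace with your own
spark.conf.set(
    "fs.azure.account.key.yourstorageaccount.blob.core.windows.net",
    "your-access-key",
)

# Enable Arrow to speed up Spark <-> pandas conversions (off by default).
# On older runtimes the key was "spark.sql.execution.arrow.enabled".
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```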
After running the above in your notebook, your data is imported into Databricks as a Spark dataframe. The different options for reading it are as follows:
Replace “Spark_dataframe” with the name of the Spark table you want to read.
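The read and the inspection options might look like this sketch, which assumes `jdbcUrl` and `connectionProperties` were defined as above and runs on a Databricks cluster:

```
# Pull the table over JDBC into a Spark dataframe
df = spark.read.jdbc(url=jdbcUrl, table="Spark_dataframe",
                     properties=connectionProperties)

# Options for inspecting it:
display(df)        # Databricks-rendered, sortable table
df.show(5)         # plain-text preview of the first rows
df.printSchema()   # column names and types
```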
Another way to load the table from blob storage is this:
This method displays a list of the available files you can select from.
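One way to get such a listing is with `dbutils` in a notebook cell; the path here is an assumption, so use whatever mount point or folder holds your files:

```
# Browse what's already in DBFS / the mounted blob container
display(dbutils.fs.ls("/FileStore/tables"))
```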
Don’t be confused by the use of the terms dataframe and table. They are used interchangeably here, although dataframe is mostly associated with pandas.
Transformation of the data will be done using Python, so there’s a need to convert your Spark dataframe to a pandas dataframe. Notice how we are manipulating the same data with different languages.
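The conversion itself is one call (assuming `df` is the Spark dataframe loaded earlier; with Arrow enabled, as above, this runs much faster):

```
# Convert the Spark dataframe to a pandas dataframe
pandas_df = df.toPandas()
```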
From here, after conversion, you are free to use Pandas to manipulate your data how you wish to produce desired results.
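As a small, self-contained illustration of that freedom, here is a typical pandas transformation. The data and column names are invented for the example, standing in for the converted dataframe:

```python
import pandas as pd

# Invented sample data standing in for the converted dataframe
pandas_df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "sales":  [100, 250, 150, 300],
})

# A typical transformation: total sales per region
totals = pandas_df.groupby("region", as_index=False)["sales"].sum()
```

Anything pandas can do — filtering, joins, pivots, new derived columns — is available at this point.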
Any tables derived from the transformation can be saved as CSV files; this begins the loading process. There are two ways I know of, and I’ll show you both.
We will need to convert our pandas dataframe to a PySpark dataframe.
where desired_csv is the pandas dataframe to be exported as a CSV. This displays the table along with an export button to download all rows.
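Method one might be sketched like this, assuming `desired_csv` is your pandas dataframe and you are running in a Databricks notebook:

```
# Convert back to a Spark dataframe, then display it;
# the rendered table includes a download/export button
spark_df = spark.createDataFrame(desired_csv)
display(spark_df)
```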
For this alternative, getting hold of the CSV file requires a few more steps than the former. To retrieve the file, select the Data tab on the left-hand side of the browsing window. At the top right corner of the window that opens, click Create Table.
In the Data Source section, select the DBFS option. Select “FileStore”, then “df”, select your generated CSV file, then click on the fourth ‘part…’ file as in the picture below. Notice that the path we specified when saving, “dbfs:/FileStore/df/Name_to_be_savedAs.csv”, shows us where to locate the file in Databricks.
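The write that produces those ‘part…’ files is the standard Spark CSV writer; a sketch, where `spark_df` is the PySpark dataframe from the conversion and “Name_to_be_savedAs” is a placeholder:

```
# Spark writes a folder of part-files, not a single CSV
(spark_df.write
    .format("csv")
    .option("header", "true")
    .save("dbfs:/FileStore/df/Name_to_be_savedAs.csv"))
```

This is why you end up clicking into a ‘part…’ file afterwards: the “.csv” path is actually a directory containing one file per partition.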
After clicking on your file, copy the path name of the file from the bottom of the page.
Here comes the tricky part. Brace yourself.
Replace the ‘/FileStore’ in the copied text with ‘/files’. Paste the new text after the trailing slash of your workspace URL in the browser’s address bar, press Enter, and your file downloads automatically. Voila!
To delete a CSV file, type this in a notebook cell:
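For instance (the path is the placeholder from earlier; the second argument makes the deletion recursive, which is needed because the saved “.csv” is really a folder of part-files):

```
# Remove the saved folder and its part-files; returns True on success
dbutils.fs.rm("dbfs:/FileStore/df/Name_to_be_savedAs.csv", True)
```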
The desired output is ‘True’, indicating that the file has been deleted successfully. A ‘False’ output means the file was not deleted; cross-check the spelling of the file name, make sure the file exists, and try again. Also, when loading to blob storage, don’t use table names that start with numbers.
This article helped me learn how to save and delete the csv files. Check it out for another fresh perspective.
What’s not so great about Databricks:
- In Synapse Analytics, you can open multiple notebook tabs inside one Synapse browser tab. In Databricks, to work on multiple notebooks simultaneously you have to open multiple browser tabs.
- As seen above, downloading a CSV file from Databricks can be tedious, depending on the method used.
- Databricks has a practical limit on the number of cells in a notebook; running over 100 cells can affect the notebook’s performance.
- Databricks cannot run without internet access.