The simplest ETL stack in Azure — Data Engineering
Starting a Data Engineering project can be daunting, especially for newcomers. One of the initial challenges is understanding the flow of activities required to begin processing data effectively. This post aims to provide an example of a fundamental ETL process using Excel files, orchestrated with Data Factory, Databricks, and Data Lake.
I’m not going to explain how to create the resources in Azure, since that part is mostly self-explanatory in the Azure Portal, but we will go over the configuration needed in each resource.
Data Lake Storage
We need to create the following containers:
raw: Used to get the data “as-is” from our sources.
staging: Used to put the data once it has been processed. This is the data that’s going to end up in our Warehouse Database.
The container “simulatingasource” is only used to host our Excel file as a data source.
You can download data samples from this site: https://www.contextures.com/xlsampledata01.html
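If you’d rather script this step than click through the Portal, a minimal sketch with the azure-storage-file-datalake Python SDK could look like the following (the storage account name matches the one used later in this post; the key is a placeholder):
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder credential: use your own Storage Account key (or, better, an Azure AD credential)
service = DataLakeServiceClient(
    account_url="https://stdemoana001.dfs.core.windows.net",
    credential="<storage-account-key>",
)

# Create the three containers used in this post
for container in ["raw", "staging", "simulatingasource"]:
    service.create_file_system(file_system=container)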
Databricks
You must create a cluster; its access mode must be “No isolation shared” for it to work with Data Factory.
We also need to generate an access token to create the linked service in Data Factory.
Data Factory
We need to create the linked services for our Databricks and Storage Account resources.
Once we have this configuration set, we can start our ETL process:
We are going to create a Binary dataset for the Excel file; we won’t use the Excel format because we’re going to process the data with Databricks.
We’ll use parameters to make the Dataset dynamic.
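For example, the Binary dataset’s file path can be built from a dynamic content expression such as @dataset().fileName (the parameter name here is just illustrative), so the pipeline can decide at run time which file to pick up.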
We are also going to create a Parquet dataset as our processed file is going to be in that format.
After that, we’ll create a pipeline, and the first activity will be to copy the data from the source to our raw container.
Then, we’ll debug the pipeline so we can check if the activity works as intended.
Finally, we are going to check both the source and the sink folders in our datalake.
We can see that the Excel file has been copied between the two folders.
Our next step will be extracting the data from the Excel file and using that to create a parquet file:
We need to create a notebook in our workspace, then we’ll reference that notebook in a Databricks Notebook activity in Data Factory.
To read the Excel file in the Databricks notebook, we are going to use the Storage Account key, as it’s the quickest way to connect, but I recommend using a Service Principal in a real project.
After setting the configuration, we’ll execute the notebook to see if we can reach the file.
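A minimal sketch of that configuration cell could look like this, assuming the key is kept in a Databricks secret scope (the scope and secret names are placeholders; pasting the key directly also works for a quick test, but avoid that outside of a demo):
storage_account = "stdemoana001"
# Placeholder secret scope/key names
account_key = dbutils.secrets.get(scope="demo-scope", key="storage-account-key")

# Tell Spark how to authenticate against the Storage Account (ABFS driver)
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# Quick check that the raw container is reachable
display(dbutils.fs.ls(f"abfss://raw@{storage_account}.dfs.core.windows.net/"))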
Once the connection is set, we’ll load the data from the file into a pandas DataFrame (it’s easier to work with Excel files in pandas, but if you need to handle a large amount of data, you’ll have to use Spark DataFrames).
We’ll add the following code to our notebook:
# pandas API on Spark, so we can use pandas-style syntax on the cluster
import pyspark.pandas as pd

# Read the "SalesOrders" sheet of the Excel file in the raw container
file_location = "abfss://raw@stdemoana001.dfs.core.windows.net/sales/SampleData.xlsx"
df = pd.read_excel(file_location, sheet_name="SalesOrders", engine="openpyxl")
display(df)
To be able to execute this code, we need to install the “openpyxl” library in our cluster.
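If you prefer a notebook-scoped install instead of a cluster library, running this in a cell at the top of the notebook also works:
%pip install openpyxl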
Usually, we would filter our data at this step, but this time we’ll convert the whole dataset into a Parquet file.
# Write the DataFrame to the staging container as snappy-compressed Parquet
file_destination = "abfss://staging@stdemoana001.dfs.core.windows.net/sales/SampleData.parquet"
df.to_parquet(file_destination, compression="snappy")

# Read the file back to verify the write
pd.read_parquet(file_destination)
In the Data Lake, we can see that there’s a parquet file (folder) with our data.
As a final step, we’ll simulate loading the Parquet file into our data warehouse; in this case, we’ll use a .csv file as the destination.
Since our .parquet file is actually a folder, we’ll have to use a wildcard to load all the contents.
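For example, the wildcard in the Copy activity source settings can be configured along these lines (the folder path matches the destination used in the notebook; treat it as a sketch):
Wildcard folder path: sales/SampleData.parquet
Wildcard file name: *.parquet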
Finally, we’ll create a CSV dataset, and use it as the Sink destination.
Once we debug the complete pipeline, we’ll see the final result in the .csv file in our Data Lake.
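If you want to double-check from the notebook, a quick read like the one below also works (the exact file name depends on how you configured the sink, so treat the path as a placeholder):
# Placeholder path: adjust it to the sink file name configured in Data Factory
display(pd.read_csv("abfss://staging@stdemoana001.dfs.core.windows.net/sales/SalesOrders.csv"))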
This guide will provide you with a solid foundation to kickstart your Data Engineering projects. However, it’s important to recognize that there are several essential steps you’ll need to undertake before you can successfully deploy an ETL solution. A crucial initial task is addressing the significant security weakness in this guide: the inadequate protection of credentials such as the Storage Account key and the Databricks access token. But these tasks and more will be explained in future posts. Feel free to ask questions in the comments, until next time.