Azure End-To-End Data Engineering Project: Harnessing the GitHub Student Pack for Students and Beginners
Bring your Hot Chocolate and Enjoy Reading ☕
👋 Hey, future data wizards and Azure enthusiasts! Ready to dive into the captivating world of Azure data engineering but not sure where to begin? I come bearing fantastic news! Thanks to the GitHub Student Pack, you can open the door to a full year of thrilling Azure adventures, no strings attached.
This guide is tailored specifically for students and Azure beginners. I'll take you on an exhilarating journey through an entire end-to-end Azure data engineering project. Even better, I'll reveal the secrets to securing your very own Azure account for a full year by leveraging the incredible benefits of the GitHub Student Pack. Just imagine: a whole year dedicated to hands-on learning and experimenting with Azure's potent data tools, all right at your fingertips!
What you will learn:
- Access Azure via GitHub Student Pack for One Year.
- Data Ingestion with Azure Data Factory (ADF).
- Exploring Azure Data Lake Storage (ADLS) and Its Capabilities.
- Performing Data Transformation and Advanced Analytics with Databricks.
- Using Azure Synapse Analytics for Data Warehousing and Analytics Capabilities.
- Harnessing Power BI for Data Visualization and Business Intelligence.
Unlocking Azure with the GitHub Student Pack:
- Go to the GitHub Education page (GitHub Student Developer Pack - GitHub Education) and sign in or create an account if you haven't already.
- Confirm your student status by providing necessary identification.
- Once verified, go to the "GitHub Student Developer Pack" section and browse the available offers.
- Search for the Azure benefit within the pack.
- You will likely be directed to the Azure portal. Follow the instructions provided to activate your Azure subscription using the GitHub Student Pack offer.
- Once your Azure subscription is activated, log in to the Azure portal using your newly acquired credentials and start exploring the wide range of services and tools available.
General overview of the project:
Beginning with the 2021 Olympics data stored in GitHub's CSV files (accessible here: kaoutharElbakouri/2021-Olympics-in-Tokyo-Data (github.com)), this project employs Azure Data Factory (ADF) to smoothly bring this information into the raw layer of Azure Data Lake Storage (ADLS). Moving forward, Azure Databricks takes the lead, refining the dataset and storing the processed data in ADLS's transformed layer. Azure Synapse Analytics steps in, primarily for robust data warehousing and detailed analysis, allowing deeper exploration and insights. Finally, Power BI visualizes these insights, marking the completion of a step-by-step process and providing a rich and comprehensive view of the 2021 Olympics dataset.
Here's the diagram which shows the workflow of the project we're going to build:
We are going to split this project into the following sections:
1️⃣. Create a Resource Group…
The initial step under the Azure for Students subscription is to create a resource group, specifying its name and region.
This is our created resource group, which will act as a container to hold related Azure resources for efficient management and organization:
2️⃣. Create a Storage Account…
Building upon the resource group, the next step is to create Azure Data Lake Storage (ADLS): a storage account with the hierarchical namespace enabled.
- Select the created resource group.
- Click on the "Create a resource" button within the resource group.
- In the search bar, type "Storage account" and select the option for creating a new storage account.
- Enter the necessary details for the storage account, such as:
✔️ Subscription: Choose your Azure for Students subscription.
✔️ Resource group: Select the created resource group.
✔️ Storage account name: Enter "tokyoolympicdatastorage".
✔️ Region: Choose the appropriate region (e.g., West Europe).
✔️ Performance, Replication, and Access tier: Select as per your requirements.
✔️ Hierarchical namespace: Enable this option to create Azure Data Lake Storage (ADLS).
The next step is to create a container. Inside the ADLS account, locate and select the "Containers" section, click on "New Container", enter the desired name for the container (for example, "tokyoolympicdata"), then click Create.
The next step is to create two directories within this container. Navigate to the container and click "Add Directory". Enter the directory name "raw-data" and confirm the creation. Click "Add Directory" again, enter the directory name "transformed-data", and confirm the creation. This is a common way to organize data, and the same setup can also be scripted, as the sketch below shows.
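For readers who prefer code over portal clicks, here is a minimal sketch, assuming the azure-storage-file-datalake Python package and an account key for our storage account (the <account-key> placeholder is hypothetical):
# Create the ADLS container and its two directories via the Python SDK
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://tokyoolympicdatastorage.dfs.core.windows.net",
    credential="<account-key>",  # placeholder: your storage account access key
)

# In ADLS Gen2 terms, a container is a "file system"
file_system = service_client.create_file_system(file_system="tokyoolympicdata")
file_system.create_directory("raw-data")
file_system.create_directory("transformed-data")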
3️⃣. Initiating Data Ingestion… Transferring CSV Files from GitHub to the ADLS "raw-data" Directory Using ADF:
Now that we've established the "raw-data" directory, we're set to initiate the ingestion process using Azure Data Factory (ADF). This involves transferring data from CSV files stored in GitHub to Azure Data Lake Storage (ADLS).
To do so, we need to create an ADF resource:
✔️ From the Azure portal, in the search bar, type "Data Factory" and select "Data Factory" from the search results.
✔️ Click "Create" to begin the setup of a new Azure Data Factory resource.
✔️ Enter the necessary information:
- Name: Provide a unique name for your ADF resource.
- Subscription: Choose your subscription (Azure for Students).
- Resource Group: Select the already created resource group.
- Version: Choose the ADF version you want to use.
✔️ Once everything is validated, click "Create" to initiate the provisioning of your new Azure Data Factory resource.
Here is our created ADF resource:
Once your Azure Data Factory (ADF) resource is created, you can begin working on it by launching ADF Studio.
Our data source in GitHub comprises five CSV files.
Letโs concentrate on ingesting the โAthletesโ CSV file from our GitHub data source into Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF). The same steps will be replicated for the ingestion of the other files as well.
✔️ Create the necessary Linked Services:
To effectively handle each CSV file, it's essential to create two critical connections: a source linked service and a destination linked service. Our journey begins by crafting a linked service that establishes the crucial connection between Azure Data Factory (ADF) and the diverse data sources involved.
Under the ADF Manage hub, you can find the option for "Linked Services". By clicking "New" and choosing "HTTP" as the connection type (our data is on GitHub), you'll initiate the process of creating a new linked service:
Once you've selected the "HTTP" connection type, proceed by naming the linked service, providing the base URL, selecting "Anonymous" for authentication, and opting for "AutoResolveIntegrationRuntime" for the integration runtime setting.
For the Base URL, open the Athletes CSV file in the GitHub repository, click "Raw", and copy the link (you can sanity-check it as shown below).
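Before wiring the copied link into the linked service, a quick check that it actually serves the CSV can save a debugging round. A small sketch, where the exact file path inside the repository is an assumption; paste the link you copied:
# Verify that the raw GitHub URL returns the CSV content
import requests

url = ("https://raw.githubusercontent.com/kaoutharElbakouri/"
       "2021-Olympics-in-Tokyo-Data/main/Athletes.csv")  # hypothetical path
response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.text.splitlines()[0])  # should print the CSV header row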
Let's create the Azure Data Lake Storage linked service by following the same steps as for the previous linked service, but instead of HTTP as the connection type, choose Azure Data Lake Storage Gen2 and provide the necessary details:
Here are our created linked services:
✔️ Create the Datasets:
Now that our source and destination linked services have been created, we can create the Datasets that will be referenced by the ADF activities. For that, under the ADF Author hub, locate the Datasets section and create a new Dataset.
For the source Dataset, we need to choose HTTP as the data store type.
Then choose CSV for the format type:
Then name your Dataset and specify the already created Linked service (do not forget to publish the Dataset):
Now let's create the destination Dataset with the same steps, but with ADLS Gen2 as the data store type instead of HTTP, and set the required properties for our Dataset.
Now that our two Datasets have been created, we can proceed with creating the ADF activity to do the ingestion from source to destination.
✔️ ADF Ingestion Activity:
Under the Author hub, locate the Pipelines section and create a new pipeline, then select the Copy activity from the Activities section.
The next step is to configure the source and sink for our activity:
Now we can start debugging our pipeline by clicking on Debug.
Our pipeline succeeded, and we can see the Athletes data loaded into the ADLS container (raw-data).
After following the same steps for the remaining CSV files, all five CSV files should now sit in the raw-data folder within our ADLS container (a programmatic double-check is sketched below):
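As a quick check outside the portal, the ingested files can also be listed with the same azure-storage-file-datalake package used earlier (the <account-key> placeholder is again hypothetical):
# List everything that landed in the raw-data directory
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://tokyoolympicdatastorage.dfs.core.windows.net",
    credential="<account-key>",  # placeholder
)
file_system = service_client.get_file_system_client("tokyoolympicdata")
for path in file_system.get_paths(path="raw-data"):
    print(path.name)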
4️⃣. Data Processing and Transformation with Databricks… Enriching the ADLS Transformed Layer
Once data has been ingested into the raw layer of Azure Data Lake Storage (ADLS), the next stage involves processing this data using Databricks and then storing it in the transformed layer within ADLS.
For that, we first need to create a Databricks resource. From the Azure portal, search for "Databricks" and select "Azure Databricks" from the available services. Click "Create" and fill in the required information, such as resource group, workspace name, region, and pricing tier.
Once created, click Launch Workspace and you will be directed to the Databricks workspace. Once the workspace is launched, locate the Compute section (this is where you manage clusters) and click Create to set up a cluster with your desired configuration.
For me, a cluster is already created and ready to execute Spark code:
Now, we can write simple code to get data from the ADLS raw-data layer, transform it, and put it into the ADLS transformed-data layer. To do so, we need to create a notebook by clicking "New Notebook" and make sure the created Spark cluster is selected:
We have to create the connection between Databricks and ADLS so we can easily access the data. For that, from the Azure portal, search for App registrations and click on "New registration" to register a new application by providing the name of the app:
This is our registered app. From it, we need the Application (client) ID and the Directory (tenant) ID:
Within this application, locate the "Certificates & secrets" section and create a secret that we will need for the connection between Databricks and ADLS:
Keep the value of the created secret for later use:
Now, we can use the three credentials (client ID, tenant ID, and the secret value) to connect Databricks to ADLS:
# OAuth configuration for the registered app (the real IDs and secret are
# replaced with placeholders here; never hard-code the secret in a notebook,
# a Databricks secret scope is safer)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",          # Application (client) ID
    "fs.azure.account.oauth2.client.secret": "<client-secret>",  # value of the secret key
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container: abfss://<container>@<storage-account>.dfs.core.windows.net
dbutils.fs.mount(
    source="abfss://tokyoolympicdata@tokyoolympicdatastorage.dfs.core.windows.net",
    mount_point="/mnt/tokyoolymic",
    extra_configs=configs,
)
And to have permission to access the files stored in ADLS, we need to explicitly grant this access using the credentials of the registered application. So, we need to give this app permission to access the Data Lake. Under our ADLS container, click on Access Control (IAM), then click Add role assignment:
Select the "Storage Blob Data Contributor" role, click Next, then click Select members and select the registered app (app1):
To check that we were able to successfully create the connection to ADLS, we can check with this code within our notebook:
%fs
ls "/mnt/tokyoolymic"
Now, we can read our CSV files:
# Read the five raw CSV files into Spark DataFrames, using the header row and inferring schemas
athletes = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/athletes.csv")
coaches = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/Coaches.csv")
entriesgender = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/EntriesGender.csv")
medals = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/Medals.csv")
teams = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/Teams.csv")
We can preview the athletes data:
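For instance, with standard Spark DataFrame actions:
# Show the first rows and the inferred schema of the athletes DataFrame
athletes.show(5)
athletes.printSchema()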
We can go over the Apache Spark functions and apply whatever transformations we need to our data (the goal here is not the transformations themselves, but just to show how we can use Databricks, build the connection…).
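As one illustrative sketch (the Female and Male column names come from the EntriesGender file of this dataset; adjust them to your actual schema), we could cast the gender counts to integers and recompute a total column:
# Cast the entriesgender counts to integers and derive a Total column
from pyspark.sql.functions import col

entriesgender = (
    entriesgender
    .withColumn("Female", col("Female").cast("integer"))
    .withColumn("Male", col("Male").cast("integer"))
    .withColumn("Total", col("Female") + col("Male"))
)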
The next step is to write the transformed data to the transformed-data layer of ADLS:
# repartition(1) collapses each DataFrame into a single CSV part file per folder
athletes.repartition(1).write.mode("overwrite").option("header", "true").csv("/mnt/tokyoolymic/transformed-data/athletes")
coaches.repartition(1).write.mode("overwrite").option("header", "true").csv("/mnt/tokyoolymic/transformed-data/coaches")
entriesgender.repartition(1).write.mode("overwrite").option("header", "true").csv("/mnt/tokyoolymic/transformed-data/entriesgender")
medals.repartition(1).write.mode("overwrite").option("header", "true").csv("/mnt/tokyoolymic/transformed-data/medals")
teams.repartition(1).write.mode("overwrite").option("header", "true").csv("/mnt/tokyoolymic/transformed-data/teams")
Here are the folders that contain the data written to storage, along with the metadata:
For example, within the athletes folder we have the following (the first three files contain metadata and the last one holds the data):
Now that our data is transformed and stored in the transformed-data layer of ADLS, the next step is to load it into Azure Synapse Analytics, where we can use SQL to analyze the data and derive insights, or build a dashboard on top of it.
5️⃣. Loading Data from Azure Data Lake Storage (ADLS) to Azure Synapse Analytics for Advanced Data Analytics:
From the Azure portal, search for Azure Synapse Analytics, click on "Create", and provide the required details:
Once the deployment is complete, we can move forward and load our data into Synapse Analytics. To do so, open Synapse Studio, click on the Data hub, then click the + button and choose Lake database:
Create the Database and name it:
Once the database is created, we can create the table and load the data (choose "Create table from data lake"):
Then provide the table details: the name, the linked service (choose the default one), and for the input file, navigate to the file that contains the athletes data inside the transformed-data layer of ADLS (make sure to choose the file that contains the data, not the metadata).
After following the identical procedure for the additional data files, our Synapse database now contains the following tables, as expected:
These tables provide the foundation for running SQL queries to perform analysis and derive insights:
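Such queries can be run directly in Synapse Studio; they can also be issued from Python against the serverless SQL endpoint. A hedged sketch with pyodbc, where the server, database, login, and column names are all assumptions to substitute with your own:
# Query the Synapse serverless SQL endpoint from Python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace-name>-ondemand.sql.azuresynapse.net;"  # your serverless SQL endpoint
    "DATABASE=<lake-database-name>;"
    "UID=<user>;PWD=<password>;"
)
query = """
SELECT TOP 10 Country, COUNT(*) AS athlete_count
FROM athletes
GROUP BY Country
ORDER BY athlete_count DESC;
"""
for row in conn.cursor().execute(query):
    print(row.Country, row.athlete_count)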
6️⃣. Connecting Power BI to Azure Synapse for Data Visualization and Analysis
The final step involves establishing a connection between Power BI and Synapse, enabling the use of Power BI for comprehensive data analysis and visualization based on the Synapse database tables.
Open Microsoft Power BI Desktop, click on "Get Data", search for Azure Synapse Analytics SQL as a data source, then connect:
Go to the Synapse Analytics workspace overview and copy the "Serverless SQL endpoint":
In Power BI, insert the copied string for the server name and Connect:
Finally, we can select the tables we want, load them into Power BI, and then start our analysis and dashboarding work!
Made with love by Kaouthar El Bakouri 👩‍💻