Building an end-to-end data pipeline using Azure Databricks (Part-6)
Data Enrichment
In this article, we will join our tables, compute aggregations, apply the upsert technique, and finally query the resulting tables.
Step 1 — Preparing scenario
Create a new folder in your workspace named enrichment.
Inside enrichment, create two new Python files named:
- customer.py
- loanTrx.py
Inside the set-up folder, create a new Python file named:
- database.py
Step 2— Creating database
Copy the code below into the database notebook and execute it. Once execution finishes, your database will be mounted on the gold container.
Note: If you drop your database which is mounted on your gold container, all the files inside that container are going to be deleted too.
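As a rough idea of what the database notebook does, the sketch below builds a CREATE DATABASE statement that points the database at a storage location. The database name loan_gold and the mount path /mnt/datalake/gold are hypothetical placeholders, not values from this series, so replace them with your own.

```python
def create_database_sql(db_name: str, location: str) -> str:
    """Build a CREATE DATABASE statement whose tables are stored under `location`."""
    return f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION '{location}'"


# Hypothetical names used only for illustration.
DB_NAME = "loan_gold"
GOLD_PATH = "/mnt/datalake/gold"

statement = create_database_sql(DB_NAME, GOLD_PATH)
print(statement)

# In a Databricks notebook the `spark` session is predefined.
# Uncomment the next line there to actually create the database:
# spark.sql(statement)
```

Because of IF NOT EXISTS, running the statement a second time leaves an existing database untouched.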
Step 3— Enrichment Customer
Copy the code below into your customer notebook inside the enrichment folder. Do not execute it yet, because we are going to test it later.
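The upsert step in a notebook like this one is typically a Delta Lake merge: rows whose keys already exist in the target table are updated, and the rest are inserted. The following is a minimal sketch under assumptions, not the exact code of this series: the helper names merge_condition and upsert, and the key column customer_id, are hypothetical, and the target Delta table is assumed to exist already.

```python
from typing import List


def merge_condition(keys: List[str], target: str = "t", source: str = "s") -> str:
    """Build the ON clause that matches target and source rows on the key columns."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert(spark, source_df, table_name: str, keys: List[str]) -> None:
    """Merge source_df into an existing Delta table: update matches, insert the rest."""
    # Imported lazily so merge_condition can be used without Delta Lake installed.
    from delta.tables import DeltaTable

    (
        DeltaTable.forName(spark, table_name)
        .alias("t")
        .merge(source_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


# Hypothetical key column, shown only to illustrate the generated condition.
print(merge_condition(["customer_id"]))
```

Inside the notebook, a call such as upsert(spark, enriched_df, "loan_gold.customer", ["customer_id"]) would apply the merge, where enriched_df and the table name are again placeholders.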
Step 4— Enrichment Loan Transactions
We are going to generate two tables from the loan transaction data: a feature table and an aggregate table. Copy the code below into your Python notebook.
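To make the difference between the two tables concrete, here is a hedged sketch that builds one Spark SQL query per table. A feature table keeps one row per transaction and adds derived columns, while an aggregate table collapses transactions into one row per group. The column names (customer_id, loan_id, trx_date, amount) and the view name loan_trx_silver are assumptions made for illustration only.

```python
def feature_query(source: str) -> str:
    """One row per transaction, enriched with derived date columns."""
    return (
        "SELECT customer_id, loan_id, trx_date, amount, "
        "year(trx_date) AS trx_year, month(trx_date) AS trx_month "
        f"FROM {source}"
    )


def aggregate_query(source: str) -> str:
    """One row per customer and date, with the total amount and transaction count."""
    return (
        "SELECT customer_id, trx_date, "
        "sum(amount) AS total_amount, count(*) AS trx_count "
        f"FROM {source} GROUP BY customer_id, trx_date"
    )


# Hypothetical temporary view holding the cleaned loan transactions.
SOURCE_VIEW = "loan_trx_silver"

print(feature_query(SOURCE_VIEW))
print(aggregate_query(SOURCE_VIEW))

# In a Databricks notebook, each query can be materialized, for example:
# features_df = spark.sql(feature_query(SOURCE_VIEW))
# aggregates_df = spark.sql(aggregate_query(SOURCE_VIEW))
```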
Step 5— Testing Pipeline
It’s time to uncomment our last two code blocks. You can run your script for the following p_file_date values:
- "2022-09-10"
- "2022-09-11"
- "2022-09-12"
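If you prefer to loop over the three dates instead of changing the parameter by hand, something like the sketch below could work. It assumes p_file_date is passed to the notebook as a parameter, and the notebook path ./customer and the 600-second timeout are hypothetical. Note the plain hyphens in the dates: a value typed with typographic dashes fails to parse.

```python
from datetime import datetime

FILE_DATES = ["2022-09-10", "2022-09-11", "2022-09-12"]


def parse_file_date(value: str) -> str:
    """Return the date normalized to YYYY-MM-DD, or raise ValueError if it does not parse."""
    return datetime.strptime(value, "%Y-%m-%d").strftime("%Y-%m-%d")


for file_date in FILE_DATES:
    arguments = {"p_file_date": parse_file_date(file_date)}
    print(arguments)
    # In Databricks, uncomment to run the notebook once per date
    # (path and timeout are placeholders):
    # dbutils.notebook.run("./customer", 600, arguments)
```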
Step 6— Querying database tables
Inside your utilities folder, create a SQL notebook, which we will use to query the tables in our database. SQL is also handy for exploratory analysis of your data.
Copy the code below to your notebook and run it.
That’s everything for this article (part-6). Now we are ready to move on to Azure Data Factory.