Building an end-to-end data pipeline using Azure Databricks (Part-6)

Alonso Medina Donayre
3 min read · Sep 16, 2022


Data Enrichment

In this article, we are going to join our tables, compute aggregations, apply the upsert technique, and finally query our tables.

Step 1 — Preparing the scenario

Create a new folder in your workspace named enrichment.

Inside enrichment, create two new Python files named:
- customer.py
- loantTrx.py

Inside the set-up folder, create a new Python file named:
- database.py

Step 2 — Creating the database

Copy the code below into the database notebook and execute it. Once execution finishes, your database will be created with its location set to the mounted gold container.
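As a reference, here is a minimal sketch of such a database notebook; the mount path and the database name loan_db are assumptions, not necessarily the article's exact values.

```python
# Sketch of set-up/database.py: create a database whose location is the
# mounted gold container, so its tables are stored there.
# The mount path and database name below are placeholders.
# `spark` is provided by the Databricks runtime.

gold_folder_path = "/mnt/<your-storage-account>/gold"  # assumed mount point from Part 3

spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS loan_db
    LOCATION '{gold_folder_path}'
""")
```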

Note: if you drop a database whose location is the gold container, all the files inside that container will be deleted too.

Step 3 — Customer enrichment

Copy the code below into your customer notebook inside the enrichment folder, but do not execute it yet, because we are going to test it later.
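As a rough guide, a minimal sketch of what such an enrichment notebook can look like follows; the table names, columns, and mount paths are assumptions, and the upsert uses the standard Delta Lake MERGE pattern.

```python
# Sketch of enrichment/customer.py: join silver customer data and upsert the
# result into a gold Delta table. Table names, columns and paths are placeholders.
# `spark` and `dbutils` are provided by the Databricks runtime.
from pyspark.sql import functions as F
from delta.tables import DeltaTable

dbutils.widgets.text("p_file_date", "2022-09-10")
p_file_date = dbutils.widgets.get("p_file_date")

silver_path = "/mnt/<your-storage-account>/silver"  # assumed mount point

customers_df = spark.read.format("delta").load(f"{silver_path}/customers")
addresses_df = spark.read.format("delta").load(f"{silver_path}/addresses")

# Join the customer data with a related table and tag it with the load date
enriched_df = (
    customers_df
    .join(addresses_df, on="customer_id", how="left")
    .withColumn("file_date", F.lit(p_file_date))
)

# Upsert (MERGE) into the gold table, keyed on customer_id
table_exists = (
    spark.sql("SHOW TABLES IN loan_db")
         .filter("tableName = 'customer_enriched'")
         .count() > 0
)

if table_exists:
    target = DeltaTable.forName(spark, "loan_db.customer_enriched")
    (target.alias("t")
           .merge(enriched_df.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
else:
    enriched_df.write.format("delta").saveAsTable("loan_db.customer_enriched")
```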

Step 4 — Loan transactions enrichment

We are going to generate two tables from the loan transaction data: one will be a feature table and the other will be an aggregate table. Copy the code below into your Python notebook.
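A minimal sketch of the shape of this notebook follows; the column names, feature logic, and paths are assumptions, and the upsert into the gold tables would follow the same MERGE pattern shown in the customer sketch.

```python
# Sketch of enrichment/loantTrx.py: build a per-transaction feature table and a
# per-customer aggregate table. Column names, feature logic and paths are placeholders.
# `spark` and `dbutils` are provided by the Databricks runtime.
from pyspark.sql import functions as F

dbutils.widgets.text("p_file_date", "2022-09-10")
p_file_date = dbutils.widgets.get("p_file_date")

silver_path = "/mnt/<your-storage-account>/silver"  # assumed mount point

loan_trx_df = (
    spark.read.format("delta")
         .load(f"{silver_path}/loan_trx")
         .filter(F.col("file_date") == p_file_date)
)

# Feature table: one row per transaction with derived columns
loan_trx_features_df = loan_trx_df.withColumn(
    "is_high_value", F.col("amount") > 10000
)

# Aggregate table: one row per customer with summary metrics
loan_trx_agg_df = (
    loan_trx_df.groupBy("customer_id")
               .agg(F.count("*").alias("trx_count"),
                    F.sum("amount").alias("total_amount"))
               .withColumn("file_date", F.lit(p_file_date))
)

# Both DataFrames are then upserted into gold tables using the same
# MERGE pattern as in the customer sketch above.
```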

Step 5 — Testing the pipeline

It’s time to uncomment our last two code blocks. You can run your scripts for the following p_file_date values (see the sketch after this list):
- “2022-09-10”
- “2022-09-11”
- “2022-09-12”
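If you prefer to drive both enrichment notebooks from a single driver notebook instead of changing the widget value by hand for each date, a minimal sketch like the one below works; the relative notebook paths are assumptions based on the folder layout from Step 1.

```python
# Sketch of a driver cell: run both enrichment notebooks for each file date.
# dbutils.notebook.run(path, timeout_seconds, arguments); 0 means no timeout.
for p_file_date in ["2022-09-10", "2022-09-11", "2022-09-12"]:
    dbutils.notebook.run("../enrichment/customer", 0, {"p_file_date": p_file_date})
    dbutils.notebook.run("../enrichment/loantTrx", 0, {"p_file_date": p_file_date})
```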

Step 6 — Querying the database tables

Inside your utilities folder, create a SQL notebook; we will use it to query the tables of our database. You can use SQL for exploratory analysis on your data.

Copy the code below to your notebook and run it.
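As an example, assuming the placeholder table names used in the sketches above, the exploratory queries could look like the following; they are shown here via spark.sql so the snippet stays in Python, while in the SQL notebook you would write the SQL statements directly.

```python
# Sketch of exploratory queries; table names are the placeholders used above.
# `display` and `spark` are provided by Databricks notebooks.

display(spark.sql("SELECT * FROM loan_db.customer_enriched LIMIT 10"))

display(spark.sql("""
    SELECT file_date, COUNT(*) AS customers
    FROM loan_db.customer_enriched
    GROUP BY file_date
    ORDER BY file_date
"""))

display(spark.sql("""
    SELECT c.customer_id, a.trx_count, a.total_amount
    FROM loan_db.customer_enriched c
    JOIN loan_db.loan_trx_agg a ON c.customer_id = a.customer_id
"""))
```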

That’s everything for this article (Part 6); now we are ready to move on to Azure Data Factory.

  1. Requirements
  2. Set up azure services
  3. Mount azure storage containers to Databricks
  4. Use case explanation
  5. Data Ingestion and Transformation
  6. Data Enrichment
  7. Pipeline using Data Factory
