Building an end-to-end data pipeline using Azure Databricks (Part-5)

Alonso Medina Donayre
4 min read · Sep 15, 2022


Data Ingestion and Transformation

In this article, we will load our files into the bronze container and process them into the silver container.

Step 1 — Download required files

You can download the data files from the following repository.

Step 2 — Data Dictionary

I will describe the columns of each file so you can have a better understanding of the data.

Customer: This data needs to be treated with an UPSERT technique because the source only sends new and modified records daily (a sample Spark schema for this file is sketched after the dictionary).
- customerId: unique identifier
- firstName: name
- lastName: last name
- phone: home phone number
- email: email
- gender: Male or Female
- address: place where the customer lives
- is_active: flag that indicates whether the customer is still active with us

Customer Drivers: This data is generated daily by the RiskModeling area. Each file that is sent is a snapshot of that day, and it will be loaded incrementally.
- date: date the data was generated by the RiskModeling area
- customerId: unique identifier of the customer
- monthly_salary: monthly salary in USD
- health_score: score that indicates how important the customer is to the bank
- current_debt: current debt that the customer has with our bank
- category: segment of the customer

Loan Transactions: This data corresponds to the transactions performed on a specific date. It will be loaded incrementally.
- date: date of the transaction
- customerId: unique identifier of the customer
- paymentPeriod: term of the loan
- loanAmount: amount requested by the customer
- currencyType: currency of the loan (USD, EUR)
- evaluationChannel: channel by which the loan was sold
- interest_rate: rate of the loan
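
For reference, the customer file could be described with a Spark schema like the one below. This is only a sketch: the data types are assumptions based on the descriptions above, so verify them against the actual files.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType

customer_schema = StructType([
    StructField("customerId", IntegerType(), False),
    StructField("firstName", StringType(), True),
    StructField("lastName", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("email", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("address", StringType(), True),
    StructField("is_active", BooleanType(), True),
])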

Step 3 — Upload data to bronze container

For our solution we are not going to use Data Factory or any Python code to load the data from the local workspace into the bronze container, but you can challenge yourself and do it that way.

We are going to use Azure Storage Explorer to upload and manage the files in our containers. It’s a very useful and easy-to-use tool, and you can download it from here. After you sign in, you will see your storage account with its containers. Once Storage Explorer is set up, you just need to drag and drop the folders into the corresponding container.

Step 4 — Data Ingestion on Databricks Cluster

Now that our files are in our bronze container, it’s time to start developing the ingestion code.

The structure and execution order of our notebooks is the following:

  • In your Databricks workspace, create a folder called ingestion and, inside it, the following notebooks:
    - customer.py
    - customerDriver.py
    - loanTransaction.py
  • Also in your Databricks workspace, create a folder called includes and, inside it, the following notebooks:
    - common_functions.py
    - configurations.py
  • In the common_functions file we are going to keep all the functions we can re-use across the different Python files.
  • In the configurations file we are going to keep general configuration values, such as the paths to our containers (sketches of both includes notebooks are shown after this list).
  • Now that we have our raw customer data in our bronze container, we need to transform it and save it to the silver container. Copy the code into your customer Python file (a sketch follows this list).
  • Copy the corresponding code into the customerDriver Python file (see the sketch after this list).
  • Finally, copy the code for the loanTransaction Python file; it follows the same incremental pattern (sketched after this list).
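
A minimal sketch of the two includes notebooks. The mount paths assume the containers mounted in Part 3, and the storage account name (bankstoragedl) is a placeholder, so replace it with your own mount points.

# configurations.py — shared paths for the three layers (placeholder mount names).
bronze_folder_path = "/mnt/bankstoragedl/bronze"
silver_folder_path = "/mnt/bankstoragedl/silver"
gold_folder_path = "/mnt/bankstoragedl/gold"

# common_functions.py — helpers re-used by the ingestion notebooks.
# The function below is illustrative of the kind of helper worth sharing, not the exact original code.
from pyspark.sql import DataFrame
from pyspark.sql.functions import current_timestamp

def add_ingestion_date(df: DataFrame) -> DataFrame:
    # Append an ingestion timestamp so every silver record is traceable.
    return df.withColumn("ingestion_date", current_timestamp())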
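
A hedged sketch of customer.py. It assumes CSV input, a Delta table in the silver layer, and that the includes notebooks are pulled in first (in Databricks, with %run ../includes/configurations and %run ../includes/common_functions); adjust formats and paths to your own files.

# customer.py — UPSERT load for the customer data.
# spark and dbutils are provided by the Databricks notebook environment;
# bronze_folder_path, silver_folder_path and add_ingestion_date come from the includes notebooks.
from delta.tables import DeltaTable

customer_df = (spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"{bronze_folder_path}/customer"))

customer_final_df = add_ingestion_date(customer_df)

silver_customer_path = f"{silver_folder_path}/customer"

if DeltaTable.isDeltaTable(spark, silver_customer_path):
    # The source only sends new and modified customers, so merge on customerId.
    (DeltaTable.forPath(spark, silver_customer_path).alias("tgt")
        .merge(customer_final_df.alias("src"), "tgt.customerId = src.customerId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First run: create the silver Delta table.
    customer_final_df.write.format("delta").mode("overwrite").save(silver_customer_path)

dbutils.notebook.exit("Success")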
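
A sketch of the incremental load used by customerDriver.py; the p_file_date widget, the daily folder layout and the CSV format are assumptions, so adapt them to how you organised the bronze container. loanTransaction.py is identical except that it reads from and writes to the loanTransaction folders.

# customerDriver.py — incremental (append) load of the daily snapshot.
dbutils.widgets.text("p_file_date", "2022-09-15")
file_date = dbutils.widgets.get("p_file_date")

drivers_df = (spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"{bronze_folder_path}/customerDriver/{file_date}"))

# Append the daily snapshot to the silver Delta table.
(add_ingestion_date(drivers_df).write
    .format("delta")
    .mode("append")
    .save(f"{silver_folder_path}/customerDriver"))

dbutils.notebook.exit("Success")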

Step 5 — Testing our files

  • A very easy way to test our files is to create another Python notebook wherever you want; I recommend creating it inside another folder called “utilities”.
  • Copy the code below into that notebook and run it; you should get Success messages.
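
Here is a sketch of such a utilities notebook. It assumes the folder layout above, that each ingestion notebook ends with dbutils.notebook.exit("Success"), and the p_file_date parameter used in the ingestion sketches.

# Run each ingestion notebook and print the value it returns via dbutils.notebook.exit.
result = dbutils.notebook.run("../ingestion/customer", 0)
print(result)

result = dbutils.notebook.run("../ingestion/customerDriver", 0, {"p_file_date": "2022-09-15"})
print(result)

result = dbutils.notebook.run("../ingestion/loanTransaction", 0, {"p_file_date": "2022-09-15"})
print(result)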

If you want to know more about dbutils and how we are using it in our ingestion, review the following commands; they can be executed in your Python files.

dbutils.widgets.help()
dbutils.notebook.help()
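
As a quick illustration, these are the calls the ingestion sketches above rely on; the widget name and default value are placeholders.

dbutils.widgets.text("p_file_date", "2022-09-15")   # declare a text widget with a default value
file_date = dbutils.widgets.get("p_file_date")      # read the widget value inside the notebook
dbutils.notebook.exit("Success")                    # return a status message to the calling notebook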

That’s everything for this article (Part 5); I hope you enjoyed it. You can find the rest of the series below:

  1. Requirements
  2. Set up azure services
  3. Mount azure storage containers to Databricks
  4. Use case explanation
  5. Data Ingestion and Transformation
  6. Data Enrichment
  7. Pipeline using Data Factory
