End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization

Patrick Nguyen
5 min read · Jun 4, 2023


This is a series of four articles demonstrating the end-to-end data engineering process on the Azure platform, using Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI and Spark. In this article, the final part of the series, we will discuss how data analysts/BI developers analyze the datasets that were ingested earlier by the data engineering team. We will also learn how to load data from the Databricks Hive metastore into Power BI for data visualization.

Please review the related articles in the series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

8. Data Analysis

Now we have transformed data ready for data consumers to analyze and visualize. Hold on, though: if you remember, the format we use in the presentation layer is Parquet. Most data end users are familiar with tables and SQL, which is how they visualize data in Power BI/Tableau/Looker. Even though those tools can read Parquet, the setup is not straightforward for end users. So, as data engineers, we may need to offer more convenient formats for data consumers in the presentation layer.

In this step, I will create a database and tables in Databricks, stored in the Hive metastore. A Hive metastore is a database that holds metadata about our data, such as the paths to the data in the data lake and the format of the data (Parquet, Delta, CSV, etc.).

First, create a database named formula1_db:

create database if not exists formula1_db;
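If you prefer to stay in a Python cell rather than a SQL cell, the same statement can be run through spark.sql (a minimal sketch, assuming you are in a Databricks notebook where the spark session is already available):

# Create the database from Python instead of a SQL cell
spark.sql("CREATE DATABASE IF NOT EXISTS formula1_db")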

Then create three tables from the three Parquet datasets in the presentation folder: race_results, driver_standings and constructor_standings. The sample code for the first table is below:

presentation_folder_path = "/mnt/formula1datalake133/presentation"

raceresults_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

raceresults_df.write.mode("overwrite").format("parquet").saveAsTable("formula1_db.race_results")
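The same pattern applies to the other two datasets. A small loop keeps the notebook tidy; this sketch assumes the presentation folder contains sub-folders named exactly race_results, driver_standings and constructor_standings, matching the table names above:

presentation_folder_path = "/mnt/formula1datalake133/presentation"

# Read each Parquet dataset from the presentation layer and register it
# as a Parquet table in the formula1_db database
for table_name in ["race_results", "driver_standings", "constructor_standings"]:
    df = spark.read.parquet(f"{presentation_folder_path}/{table_name}")
    df.write.mode("overwrite").format("parquet").saveAsTable(f"formula1_db.{table_name}")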

Make sure the three tables are present in the formula1_db database:
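One quick way to verify is to list the registered tables and inspect the metadata that the Hive metastore keeps for one of them, such as its schema, format and location (a minimal sketch using standard Spark SQL commands):

# List the tables registered in the formula1_db database
spark.sql("SHOW TABLES IN formula1_db").show()

# Show the metadata the metastore holds for one table (schema, provider/format, location)
spark.sql("DESCRIBE TABLE EXTENDED formula1_db.race_results").show(truncate=False)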

Alright, the data engineering team has completed its tasks. Let's look at an example in which the data analysis team comes in and wants to do some analysis and visualization. The data analysts want to query the dominant drivers and teams. Let's walk through their steps.

First, they review the data from the race_results table and find that position points differ across periods/years. Therefore, they may want to standardize points across years, or simply focus on dominant drivers over a 10-year window where position points are consistent.

To simplify the problem for this article, I only choose the period from 2010 to 2020, where the position points are consistent.

Then they extract the top 10 drivers with the most total points over this period:
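A possible version of that query is sketched below; the column names driver_name, race_year and points are assumptions about the race_results schema produced in the transformation step, so adjust them to match your own tables:

# Top 10 drivers by total points between 2010 and 2020 (column names assumed)
top_drivers_df = spark.sql("""
    SELECT driver_name,
           COUNT(*)    AS total_races,
           SUM(points) AS total_points
    FROM formula1_db.race_results
    WHERE race_year BETWEEN 2010 AND 2020
    GROUP BY driver_name
    ORDER BY total_points DESC
    LIMIT 10
""")

# display() renders the result as an interactive table in a Databricks notebook
display(top_drivers_df)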

9. Data Visualization

From the table result, you can simply click the plus (+) sign to create different visualizations. Here is a sample:

You can even create a quick dashboard in Databricks. That’s pretty cool and convenient, right?

Having said that, the data visualization features in Databricks are limited. For advanced features and self-service dashboards for data consumers, you may want to use Power BI. Let's connect Databricks as a data source for Power BI dashboards:

After the connection is set up, you can view your Databricks Hive metastore with its tables. Select and load driver_standings into Power BI.

Then data analysts can make nice reports and dashboards and present them to management for decision-making. You can review my other post about Strategies to turn data into insights

That completes the end-to-end data engineering project about Formula 1 on Azure. To recap, here are the steps that we've completed:

  1. Gathered requirements and collected source data
  2. Designed solution architecture for our requirements
  3. Set up environments on Azure and initiated a Databricks cluster
  4. Granted Databricks access to Azure Data Lake
  5. Loaded source data into Azure Data Lake Storage
  6. Ingested and transformed data into Parquet formats on Databricks
  7. Created data pipelines in Azure Data Factory
  8. Scheduled and monitored data pipelines on Azure Data Factory
  9. Created database and tables from Parquet files in Databricks
  10. Analyzed data in Databricks and visualized data in Power BI

That's a long process, and congratulations to all of you for getting to this point. In the next articles, I will write more about data engineering topics that we have not touched on in this series, such as incremental load, integrating data sources using ADF, dealing with streaming data, and data governance. Please make sure you subscribe to get notified about my new articles.

Patrick N

If you like what you read, consider joining Medium and reading many more articles. A portion of your fee goes to support authors like me. Click here to join.

