Data Engineering Hands-on with Databricks and Python in 30 minutes

Patrick Nguyen
5 min read · Aug 18, 2023


In today’s tech landscape, you’re surrounded by buzzwords like Big Data, Data Engineering, data ingestion and transformation, Python, Spark, and Databricks. You may have read numerous articles and definitions, yet still find yourself wondering about their practical applications. In this piece, you’ll get a hands-on opportunity to actively code these concepts and personally engage with them in under 30 minutes. You’ll be pleasantly surprised by how straightforward the coding process is.

First, what is Databricks? Databricks has been gaining popularity in the data engineering community and among companies, and is considered an industry-leading, cloud-based data engineering tool for processing and transforming massive quantities of data. Databricks is a unified platform containing many tools and concepts that may be confusing to beginners.

The primary objective of this article is to offer beginners a rapid hands-on experience with Databricks by coding a data ingestion and transformation process. We’ll become acquainted with terms like PySpark and Spark SQL along the way. This article streamlines the process, requiring no dataset downloads or environment setup on your part. I’ll provide step-by-step instructions to help you grasp the power and versatility of Databricks.

If you are already familiar with the Databricks basics, you can check out another article here: End-to-end Azure data engineering project

Let’s get started. First, you need to sign up for Databricks Community Edition (the free version of Databricks): https://docs.databricks.com/en/getting-started/community-edition.html. The steps are as straightforward as an online email registration; you don’t have to pay anything, and you will have immediate access to Databricks.

After signing in, you can create a new notebook in the workspace as in the screenshot below. A notebook is simply a place in Databricks where you write and execute code:

You can rename the notebook; the default language is Python:

Make sure that you have a cluster attached to your notebook (top right corner). This is where the concept of parallel, large-scale data processing comes in: a cluster is a pool of machines working together to execute your code in parallel.

In this project, we will use the popular New York Taxi Trip dataset available on Kaggle. However, to reduce processing time, we will use a subset of the data: the NYC yellow taxi trips from May 2023.

Run the code below in your notebook: copy it into a cell and click Run Cell (or use the Shift + F5 shortcut).

The May 2023 NYC data will be downloaded to a temporary folder and then moved to the raw folder. I also create a processed folder to store the transformed data later. Note that the file is in Parquet format, which is the default file format in Databricks.

# Imports
import urllib.request

# Download the May 2023 yellow taxi file to the driver's local /tmp folder
urllib.request.urlretrieve('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet', '/tmp/yellow_data.parquet')

# Simulate raw and processed layers in DBFS
dbutils.fs.mkdirs('dbfs:/nyc_taxi/raw')
dbutils.fs.mkdirs('dbfs:/nyc_taxi/processed')

# Move the downloaded file from the local filesystem into the raw folder
dbutils.fs.mv("file:/tmp/yellow_data.parquet", "dbfs:/nyc_taxi/raw/yellow_data.parquet")

Then run the code below to validate that the file exists in the raw folder:

# Validate that the file exists in the raw folder
display(dbutils.fs.ls('dbfs:/nyc_taxi/raw/'))

If you see the file as in the screenshot below, congrats! You have completed one of the most complicated processes in data engineering: data ingestion. You see how easy it is, right? If you don’t see the file, just follow the steps above again.

Next, you can read the file in the raw folder into a DataFrame. A DataFrame is a Spark data structure that stores data in columns and rows, similar to a database table.

# Read data from raw folder
df = spark.read.parquet('dbfs:/nyc_taxi/raw/')
display(df)

You will see the data displayed in a table format:

You can count the number of records; here we have around 3.4 million records in the raw data:
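If you want to reproduce that number yourself, a simple count() call on the DataFrame is enough (the exact figure depends on the dataset version you downloaded):

# Count the number of records in the raw DataFrame (around 3.4 million for May 2023)
df.count()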

Next, I will refine this raw dataset. Business partners only want to analyze trips with passenger_count greater than 0, total_amount greater than 10, and trip_distance greater than 1. The code below filters the data accordingly:

# Filter the raw data to the trips the business wants to analyze
processed_df = df.where(
    (df.passenger_count > 0) &
    (df.total_amount > 10) &
    (df.trip_distance > 1)
)
display(processed_df)

After filtering, the processed DataFrame contains around 2.4 million records.
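To verify this yourself, you can compare the raw and processed counts; the snippet below is a small sketch of that check:

# Compare raw and processed record counts to see how many trips were filtered out
raw_count = df.count()
processed_count = processed_df.count()
print(f"Raw: {raw_count:,} | Processed: {processed_count:,} | Filtered out: {raw_count - processed_count:,}")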

Then we can write the final processed data to the processed folder:

# Write data to folder
processed_df.write.mode("overwrite").parquet('dbfs:/nyc_taxi/processed')

Note that what we’ve done until now is read the data, process it, and save it in Parquet format. Neither Parquet files nor DataFrames let end consumers query the data with SQL directly. We have to create tables or views in order to use SQL. The easiest way to do that is the createOrReplaceTempView function:

processed_df.createOrReplaceTempView("trip_data")

We’ve created the view trip_data in the code above, so now we can use SQL statements. Databricks allows us to use different languages in the same notebook; just start a cell with the %sql magic command.
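For example, a SQL cell like the one below (my own illustrative query, not from the original article) counts trips per vendor directly against the view:

%sql
-- Query the temp view from a dedicated SQL cell
SELECT VendorID, COUNT(*) AS trip_count
FROM trip_data
GROUP BY VendorID
ORDER BY trip_count DESC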

Another way to use SQL in Databricks is Spark SQL via the spark.sql function. The code below selects the count of trips per vendor and converts the result to a pandas DataFrame; you can then use pandas functions such as plot to create a pie chart:

# Create a visualization of trips per vendor (pandas/Matplotlib)
df_viz1 = spark.sql("SELECT COUNT(VendorID) AS count_vendor, VendorID FROM trip_data GROUP BY VendorID").toPandas()
display(df_viz1)
df_viz1.set_index('VendorID').plot.pie(y='count_vendor', legend=False, figsize=(10, 10))

You’ve completed your first Databricks project. As I promised earlier, the whole process took less than 30 minutes. This article is meant to give you a quick, hands-on experience with the Databricks data engineering process.

Patrick N

If you like what you read, consider joining Medium and reading many more articles. A portion of your fee goes to support authors like me. Click here to join.
