Data Engineering Zoomcamp Project

Oladayo
Published in CodeX
3 min read · Dec 20, 2022

Hi Everyone and Happy Holidays,

For the past 2 months, I have been learning Data Engineering from the Zoomcamp videos playlist on the DataTalksClub channel on YouTube.

Why Data Engineering? Every project I have worked on has involved extracting, transforming and loading data in some form, which I have come to realize is one of the key responsibilities of a Data Engineer.

I have also used Cloud services (tools) such as Cloud Storage, Cloud Functions, Cloud Scheduler and App Engine on those projects, which are some of the skills required of a Data Engineer.

With that said, I want to share the project I worked on.

Dataset

I used the San Francisco Bay Area Bike Share dataset on Kaggle. It can be found below:

Project Architecture

The ideal architecture for the project would have been as shown below, but I had to deviate from it:

Ideal Project Architecture

The challenge I had was that I could not download the CSV files in the dataset using a Python script. So after many tries, I decided to bypass that step by downloading the files manually and storing them in a Google Cloud Storage bucket.

Project Architecture for the Data Engineering Zoomcamp Project
  • The first storage bucket: where I stored the CSV files I downloaded manually.
  • The second storage bucket: where the same files are stored, but this time as Parquet files rather than CSV. The advantage of Parquet is that the files take up less storage space and queries against them run faster than against CSV files. You can read more on this below:
  • BigQuery: the data warehouse the Parquet files in the second storage bucket were moved to for storage (loading) and transformation.

The entire workflow (moving the files (data) from the first storage bucket to BigQuery) was orchestrated using the Airflow environment in Google Cloud Composer.

The advantage of using Cloud Composer for the Airflow environment is that I can orchestrate the workflow without worrying about the underlying infrastructure.
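A workflow like this is defined in Airflow as a DAG. The sketch below shows roughly what the load-to-BigQuery step could look like; it is a configuration sketch, not the project's actual DAG, and the bucket, dataset and table names are placeholders I made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="bike_share_to_bigquery",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,  # triggered manually in this sketch
    catchup=False,
) as dag:
    # Load the Parquet files from the second bucket into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_parquet_to_bigquery",
        bucket="parquet-bucket",  # hypothetical second bucket
        source_objects=["trips/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="my_project.bike_share.trips",
        write_disposition="WRITE_TRUNCATE",
    )
```

In Cloud Composer, dropping a file like this into the environment's DAGs bucket is enough for Airflow to pick it up and schedule it.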

With the data now in BigQuery, I used dbt Cloud to transform the datasets into the models (tables) I need for the dashboard. The data lineage looks something like this:

data lineage of the transformations

I used the Plotly library for the data visualization and the Dash library to build the web framework. Finally, I deployed the web app (dashboard) to Google App Engine.

The dashboard can be viewed at the link below. It's best viewed on a large screen for now; I'm working on making it flexible across different screen sizes.

Screenshot of the web app

The repository for the project can be found below;

I tweeted through the project. You can see the thread here;

https://twitter.com/oladii1/status/1596128470748352514?t=3LGxDXCKXzGv6-beKmQQRw&s=03

Thank you for reading.
