Data Engineering Zoomcamp Project

Oladayo
Published in CodeX
3 min read · Dec 20, 2022

Hi Everyone and Happy Holidays,

For the past 2 months, I have been learning Data Engineering from the Zoomcamp videos playlist on the DataTalksClub channel on YouTube.

Why Data Engineering? Every project I have worked on has involved extracting, transforming and loading data in some form, which I have come to realize is one of the key responsibilities of a Data Engineer.

I have also used Cloud services (tools) such as Cloud Storage, Cloud Functions, Cloud Scheduler and App Engine on those projects, which are some of the skills required of a Data Engineer.

With that said, I want to share the project I worked on.

Dataset

I used the San Francisco Bay Area Bike Share dataset on Kaggle. It can be found below:

Project Architecture

The ideal architecture for the project would have been as shown below, but I had to deviate from it:

Ideal Project Architecture

The challenge I had was that I could not download the CSV files in the dataset using a Python script. So after many tries, I decided to bypass that step by downloading the files manually and storing them in a Google Cloud Storage bucket.

Project Architecture for the Data Engineering Zoomcamp Project
  • The first storage bucket: where I stored the CSV files I downloaded manually.
  • The second storage bucket: where the same files are stored, but this time as Parquet files rather than CSV. The advantage of Parquet is that the files take up less storage space and queries against them run faster than against CSV files. You can read more on this below:
  • BigQuery: the data warehouse the Parquet files in the second storage bucket were moved to for storage (loading) and transformation.

The entire workflow (moving the files (data) from the first storage bucket to BigQuery) was orchestrated using the Airflow environment in Google Cloud Composer.

The advantage of using Cloud Composer for the Airflow environment is that I can orchestrate the workflow without worrying about the underlying infrastructure.
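A workflow like this is defined in Airflow as a DAG. The sketch below shows roughly what the load-to-BigQuery step could look like; it is a configuration sketch, not the project's actual DAG, and the bucket, dataset and table names are placeholders I made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="bike_share_to_bigquery",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,  # triggered manually in this sketch
    catchup=False,
) as dag:
    # Load the Parquet files from the second bucket into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_parquet_to_bigquery",
        bucket="parquet-bucket",  # hypothetical second bucket
        source_objects=["trips/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="my_project.bike_share.trips",
        write_disposition="WRITE_TRUNCATE",
    )
```

In Cloud Composer, dropping a file like this into the environment's DAGs bucket is enough for Airflow to pick it up and schedule it.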

With the data now in BigQuery, I used dbt Cloud to transform the datasets into the models (tables) I need for the dashboard. The data lineage looks something like this:

data lineage of the transformations

I used the Plotly library for the data visualization and the Dash library to build the web framework. Finally, I deployed the web app (dashboard) to Google App Engine.

The dashboard can be viewed at the link below. It's best viewed on a large screen for now; I'm working on making it flexible across different screen sizes.

Screenshot of the web app

The repository for the project can be found below;

I tweeted through the project. You can see the thread here;

https://twitter.com/oladii1/status/1596128470748352514?t=3LGxDXCKXzGv6-beKmQQRw&s=03

Thank you for reading.
