(Review) Udacity Data Engineer Nanodegree

Nicolas Soria
3 min read · Apr 27, 2020


A.K.A. a journey to become a modern data engineer.

Current background

I am currently working for a fin-tech startup in a data engineering role, which means some of the concepts explained throughout the course were already familiar to me.

However, I felt I could use this quarantine to exercise my mind by taking this course and challenging my understanding of data engineering concepts and technologies.

Section reviews

Data Modeling

In this section we learn the basics of working with data: how to model it for relational databases (PostgreSQL) and non-relational databases (Apache Cassandra).

Concepts of normalization and denormalization are introduced through hands-on projects. You’ll get to build your first ETL and later analyze the results in a Jupyter Notebook, on both databases.
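To give a flavor of what that first ETL looks like, here is a minimal sketch of loading a cleaned record into PostgreSQL with psycopg2. The database, table, and values are hypothetical placeholders, not the course’s actual dataset.

```python
import psycopg2

# Hypothetical database and star-schema table; names and values are illustrative only.
conn = psycopg2.connect("host=localhost dbname=music_db user=student password=student")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id   TEXT PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id TEXT,
        year      INT,
        duration  FLOAT
    );
""")

# A tiny "transform + load" step: insert one cleaned record, skipping duplicates.
song = ("S0001", "Example Song", "A0001", 2019, 215.5)
cur.execute(
    "INSERT INTO songs (song_id, title, artist_id, year, duration) "
    "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING;",
    song,
)

conn.commit()
cur.close()
conn.close()
```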

Cloud Data Warehousing

This section presents the concepts of data warehousing and the main approaches around it (Kimball, Inmon, hybrid), OLAP vs OLTP, data marts, etc. I found this very useful because there are times I forget these concepts since I focus too much on technologies.

Later, the benefits of hosting a warehouse with a cloud provider, in this case AWS, are shown. Then we start using services such as EC2, S3, Redshift, and RDS.

We conclude by developing a project that loads raw data from AWS S3 into AWS Redshift, models it for analytical purposes, and shares the results through a Jupyter Notebook.
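The core of that load step is Redshift’s COPY command, which pulls files from S3 into a staging table in parallel. A minimal sketch, with a placeholder cluster endpoint, bucket, and IAM role, might look like this:

```python
import psycopg2

# Placeholders: cluster endpoint, credentials, table, bucket, and IAM role are illustrative only.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    dbname="dev", user="awsuser", password="REPLACE_ME", port=5439,
)
cur = conn.cursor()

# COPY loads raw JSON files from S3 into a Redshift staging table in parallel.
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")

conn.commit()
cur.close()
conn.close()
```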

Data Lakes and Apache Spark

This chapter covers the most interesting, and therefore my favorite, part of the nanodegree: big data concepts and technologies. Both the Hadoop ecosystem and Apache Spark are introduced by answering the main question: why do we need these when mature data warehouse concepts and technologies already exist?

We learn concepts such as distributed processing, storage, schema flexibility, and different file formats.

Along with that comes Apache Spark in Python: how to use it within the AWS environment and why it is currently so widely adopted. The chapter project was about creating an ETL pipeline for a data lake, using data stored in AWS S3 in JSON format.
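A minimal sketch of such a pipeline could look like the following; the bucket paths and column names are placeholders, not the course’s dataset.

```python
from pyspark.sql import SparkSession

# Hypothetical buckets and columns, for illustration only.
spark = (
    SparkSession.builder
    .appName("data-lake-etl")
    .getOrCreate()
)

# Extract: read raw JSON files straight from S3.
songs = spark.read.json("s3a://my-input-bucket/song_data/*.json")

# Transform: keep only the columns we need and drop duplicates.
songs_table = (
    songs.select("song_id", "title", "artist_id", "year", "duration")
         .dropDuplicates(["song_id"])
)

# Load: write back to S3 as partitioned Parquet, the typical data lake layout.
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://my-output-bucket/songs/")
```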

Data Pipelines and Apache Airflow

The last chapter is about how to orchestrate all the concepts and technologies we learned. Data pipelines and DAGs are explained alongside Apache Airflow, the open-source tool originally developed at Airbnb, and its operators, sensors, and plugins.
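To illustrate, a toy DAG with two dependent tasks might look like the sketch below. It assumes Airflow 2.x import paths (the course uses an earlier version), and the task names are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    print("pulling raw data")


def load():
    print("loading into the warehouse")


default_args = {
    "owner": "me",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# A DAG is just Python code: tasks (operators) plus the dependencies between them.
with DAG(
    dag_id="toy_etl",
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract first, then load
```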

Another great concept explained with Airflow was data lineage. Data lineage and data governance are topics that will become commonplace in the coming years.

The chapter project consists of creating a data pipeline that runs ETL against AWS Redshift, focusing on how to use the tool.

Capstone Project (Pending)

The capstone project is a sum-up of all topics covered, as expected. You have to build an entire ETL/ELT system using the technologies learned.

Requirements are:

  1. Choose at least two data sources with a minimum of 1k records each.
  2. Publish a dashboard or notebook analyzing the data (the latter is my choice, just to do a full-stack project).

I will probably be building the capstone project around COVID-19, given the current situation.


Review, conclusion and thoughts

I’ve really enjoyed this course. I’ve refreshed some concepts and learned some new ones. It took me around one week to complete.

Overall, it is a good introduction to how the data engineering world works and to the importance of handling data that will later be delivered to business users.

Udacity’s quality is OK: some chapters could be improved, and the content of others could go more in depth.

The project code was uploaded to my GitHub. Personally, I’ll move on to learning more about Apache Spark.
