A Beginner’s Guide to Data Engineering — Part II

Data Modeling, Data Partitioning, Airflow, and ETL Best Practices

12 min read · Feb 20, 2018

Image Credit: A transformed modern warehouse at Hangar 16, Madrid (Courtesy of Iñaqui Carnicero Arquitectura)

Recapitulation

In A Beginner’s Guide to Data Engineering — Part I, I explained that an organization’s analytics capability is built layer upon layer. From collecting raw data and building data warehouses to applying Machine Learning, we saw why data engineering plays a critical role in all of these areas.

One of the most sought-after skills for any data engineer is the ability to design, build, and maintain data warehouses. I defined what data warehousing is and discussed its three common building blocks: Extract, Transform, and Load, which is where the name ETL comes from.

For those who are new to ETL processes, I introduced a few popular open-source frameworks built by companies like LinkedIn, Pinterest, and Spotify, and highlighted Airbnb’s own open-sourced tool, Airflow. Finally, I argued that data scientists can learn data engineering much more effectively with the SQL-based ETL paradigm.

Part II Overview

The discussion in Part I was somewhat high-level. In Part II (this post), I will share more technical details on how to build good data pipelines and highlight ETL best practices. Primarily, I will use Python…

Written by Robert Chang