A Beginner’s Guide to Data Engineering — Part II

Data Modeling, Data Partitioning, Airflow, and ETL Best Practices

Robert Chang
12 min readFeb 20, 2018

--

Image Credit: A transformed modern warehouse at Hangar 16, Madrid (Cortesía de Iñaqui Carnicero Arquitectura)

Recapitulation

In A Beginner’s Guide to Data EngineeringPart I, I explained that an organization’s analytics capability is built layers upon layers. From collecting raw data and building data warehouses to applying Machine Learning, we saw why data engineering plays a critical role in all of these areas.

One of any data engineer’s most highly sought-after skills is the ability to design, build, and maintain data warehouses. I defined what data warehousing is and discussed its three common building blocks — Extract, Transform, and Load, where the name ETL comes from.

For those who are new to ETL processes, I introduced a few popular open source frameworks built by companies like LinkedIn, Pinterest, Spotify, and highlight Airbnb’s own open-sourced tool Airflow. Finally, I argued that data scientist can learn data engineering much more effectively with the SQL-based ETL paradigm.

Part II Overview

The discussion in part I was somewhat high level. In Part II (this post), I will share more technical details on how to build good data pipelines and highlight ETL best practices. Primarily, I will use Python, Airflow, and SQL for our discussion.

First, I will introduce the concept of Data Modeling, a design process where one carefully defines table schemas and data relations to capture business metrics and dimensions. We will learn Data Partitioning, a practice that enables more efficient querying and data backfilling. After this section, readers will understand the basics of data warehouse and pipeline design.

In later sections, I will dissect the anatomy of an Airflow job. Readers will learn how to use sensors, operators, and transfers to operationalize the concepts of extraction, transformation, and loading. We will highlight ETL best practices, drawing from real life examples such as Airbnb, Stitch Fix, Zymergen, and more.

By the end of this post, readers will appreciate the versatility of Airflow and the concept of configuration as code. We will see, in fact, that Airflow has many of these best practices already…

--

--

Robert Chang

Data @Airbnb, previously @Twitter. Opinions are my own.