A Beginner’s Guide to Data Engineering — Part II

Data Modeling, Data Partitioning, Airflow, and ETL Best Practices

Robert Chang
Feb 20, 2018 · 12 min read
Image Credit: A transformed modern warehouse at Hangar 16, Madrid (Courtesy of Iñaqui Carnicero Arquitectura)

Recapitulation

Part II Overview

Data Modeling

Image Credit: A star schema, when used correctly, can be as beautiful as the actual sky

Data Modeling, Normalization, and Star Schema

The star schema organizes tables in a star-like pattern, with a fact table at the center surrounded by dimension tables
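
To make the layout concrete, here is a minimal sketch of the DDL behind such a schema, written as HiveQL held in Python strings so it could later be handed to a scheduler task. The table and column names (fct_bookings, dim_users, dim_listings) are invented for illustration, not taken from a real warehouse.

```python
# Hypothetical HiveQL DDL for a star schema: a central fact table recording
# transactions, plus a dimension table describing one of the entities involved.
CREATE_FACT_TABLE = """
CREATE TABLE IF NOT EXISTS fct_bookings (
    booking_id    BIGINT,
    user_id       BIGINT,   -- foreign key into dim_users
    listing_id    BIGINT,   -- foreign key into dim_listings
    booking_value DOUBLE
)
PARTITIONED BY (ds STRING);
"""

CREATE_DIM_TABLE = """
CREATE TABLE IF NOT EXISTS dim_users (
    user_id     BIGINT,
    signup_date STRING,
    country     STRING
)
PARTITIONED BY (ds STRING);
"""
```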

Fact & Dimension Tables

Normalized tables can be used to answer ad-hoc questions or to build denormalized tables
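
As a sketch of that second use, the hypothetical query below joins the normalized fact and dimension tables from the previous sketch into a single denormalized, analysis-friendly table; {{ ds }} is Airflow's built-in template variable for the run's datestamp.

```python
# Hypothetical query that flattens the star schema above into one wide table
# analysts can query without writing joins.
BUILD_DENORMALIZED_BOOKINGS = """
INSERT OVERWRITE TABLE denorm_bookings PARTITION (ds = '{{ ds }}')
SELECT
    f.booking_id,
    f.booking_value,
    u.country        AS user_country,
    l.neighborhood   AS listing_neighborhood
FROM fct_bookings f
JOIN dim_users u
  ON f.user_id = u.user_id AND u.ds = '{{ ds }}'
JOIN dim_listings l
  ON f.listing_id = l.listing_id AND l.ds = '{{ ds }}'
WHERE f.ds = '{{ ds }}';
"""
```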

Data Partitioning & Backfilling Historical Data

Data Partitioning by Datestamp

A table that is partitioned by ds
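
A minimal sketch of why this matters, reusing the hypothetical fct_bookings table from earlier: because the table is partitioned by ds, a filter on ds lets the engine read a single day's partition instead of scanning the entire table.

```python
# Because the table is partitioned by ds, this query prunes to one partition
# rather than scanning all of history. '{{ ds }}' is Airflow's template
# variable for the run's datestamp, e.g. '2018-02-20'.
DAILY_BOOKINGS_COUNT = """
SELECT
    ds,
    COUNT(*) AS n_bookings
FROM fct_bookings
WHERE ds = '{{ ds }}'
GROUP BY ds;
"""
```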

Backfilling Historical Data
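
Backfilling is far less painful when each run writes exactly one ds partition keyed off its own execution date, because replaying past dates is then idempotent. A minimal sketch under that assumption, with placeholder table names and dates:

```python
# Each run overwrites only its own ds partition, so replaying past dates is
# safe. In Airflow 1.x a date range can then be backfilled from the CLI, e.g.:
#
#   airflow backfill -s 2018-01-01 -e 2018-01-31 <dag_id>
#
BACKFILL_SAFE_AGGREGATION = """
INSERT OVERWRITE TABLE agg_bookings PARTITION (ds = '{{ ds }}')
SELECT
    listing_id,
    COUNT(*) AS n_bookings
FROM fct_bookings
WHERE ds = '{{ ds }}'
GROUP BY listing_id;
"""
```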

The Anatomy of an Airflow Pipeline

Defining the Directed Acyclic Graph (DAG)

Source: A screenshot of Airbnb’s Experimentation Reporting Framework DAG
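
A minimal sketch of how a DAG like this is declared in Airflow 1.x, the version current when this post was written; the owner, dag_id, and schedule below are placeholders rather than Airbnb's actual configuration.

```python
from datetime import datetime, timedelta

from airflow import DAG

# Default arguments are applied to every task in the DAG.
default_args = {
    'owner': 'data-eng',                 # placeholder owner
    'depends_on_past': False,
    'start_date': datetime(2018, 2, 1),  # backfills begin from this date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# The DAG object groups tasks together and tells the scheduler when to run them.
dag = DAG(
    dag_id='experiment_reporting',       # placeholder name
    default_args=default_args,
    schedule_interval='@daily',          # one run, and one ds partition, per day
)
```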

Operators: Sensors, Operators, and Transfers
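
A rough map of where each family lives in Airflow 1.x (import paths have moved in later releases); the transfer family is only named in a comment here, and the toy example below shows a sensor and an operator in action.

```python
# Sensors keep poking until a condition is true, e.g. an upstream partition
# or file has landed.
from airflow.operators.sensors import HivePartitionSensor, S3KeySensor

# Operators trigger an action: run a bash command, call a Python function,
# or execute a HiveQL statement.
from airflow.operators.bash_operator import BashOperator
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.python_operator import PythonOperator

# Transfers move data from one system to another, e.g. MySqlToHiveTransfer
# copies the result of a MySQL query into a Hive table.
```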

A Simple Example

Graph View of the toy example DAG
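
The toy pipeline itself can be sketched as follows, with placeholder dag, task, and table names: a sensor waits for the day's upstream partition, an operator builds a downstream aggregate, and the bitshift syntax declares the dependency that the Graph View renders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.sensors import HivePartitionSensor

dag = DAG(
    dag_id='toy_example',
    start_date=datetime(2018, 2, 1),
    schedule_interval='@daily',
)

# Sensor: wait until the upstream fct_bookings partition for this ds exists.
wait_for_bookings = HivePartitionSensor(
    task_id='wait_for_bookings',
    table='fct_bookings',
    partition="ds='{{ ds }}'",
    dag=dag,
)

# Operator: once the upstream data has landed, build the day's aggregate.
build_agg = HiveOperator(
    task_id='build_agg_bookings',
    hql="""
        INSERT OVERWRITE TABLE agg_bookings PARTITION (ds = '{{ ds }}')
        SELECT listing_id, COUNT(*) AS n_bookings
        FROM fct_bookings
        WHERE ds = '{{ ds }}'
        GROUP BY listing_id;
    """,
    dag=dag,
)

# The Graph View above is simply a rendering of dependencies declared like this.
wait_for_bookings >> build_agg
```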

ETL Best Practices To Follow

Image Credit: Building your craft takes practice, so it’s wise to follow best practices

A skeleton of a stage-check-exchange operation (aka a “Unit Test” for data pipelines)

Source: Maxime, the original author of Airflow, talking about ETL best practices
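
A minimal sketch of the stage-check-exchange idea, again as HiveQL held in Python strings with placeholder table names (staging_metrics, prod_metrics): write to a staging table, validate the staged partition, and only then swap it into the production table so downstream consumers never see partial data.

```python
# 1. STAGE: build the day's data in a staging table, never directly in prod.
STAGE = """
INSERT OVERWRITE TABLE staging_metrics PARTITION (ds = '{{ ds }}')
SELECT
    listing_id,
    COUNT(*) AS n_bookings
FROM fct_bookings
WHERE ds = '{{ ds }}'
GROUP BY listing_id;
"""

# 2. CHECK: a simple sanity check on the staged partition; wired to something
#    like Airflow's CheckOperator, the task fails (and the exchange below
#    never runs) if the staged partition is empty.
CHECK = """
SELECT COUNT(*) FROM staging_metrics WHERE ds = '{{ ds }}';
"""

# 3. EXCHANGE: atomically move the validated partition into the prod table.
EXCHANGE = """
ALTER TABLE prod_metrics EXCHANGE PARTITION (ds = '{{ ds }}')
WITH TABLE staging_metrics;
"""
```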

Recap of Part II


Written by Robert Chang

Data @Airbnb, previously @Twitter. Thoughtfully opinionated, weakly held. Opinions are my own.