A Beginner’s Guide to Data Engineering — Part II

Data Modeling, Data Partitioning, Airflow, and ETL Best Practices

Image Credit: A transformed modern warehouse at Hangar 16, Madrid (courtesy of Iñaqui Carnicero Arquitectura)


Part II Overview

Data Modeling

Image credit: Star Schema, when used correctly, can be as beautiful as the actual sky

Data Modeling, Normalization, and Star Schema

The star schema organizes tables in a star-like pattern, with a fact table at the center, surrounded by dimension (dim) tables
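As a rough illustration of the pattern in the caption above, the sketch below builds one fact table and two dim tables in an in-memory SQLite database and answers a question by joining the fact table to a dimension. The table names (fct_bookings, dim_users, dim_listings), the columns, and the sample rows are all hypothetical and chosen only for this example; a real warehouse would likely use Hive, Presto, or a similar engine rather than SQLite.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table surrounded by two dim tables."""
    conn.executescript(
        """
        CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE dim_listings (listing_id INTEGER PRIMARY KEY, market TEXT);
        -- The fact table stores events plus foreign keys into each dimension.
        CREATE TABLE fct_bookings (
            booking_id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES dim_users(user_id),
            listing_id INTEGER REFERENCES dim_listings(listing_id),
            amount REAL
        );
        """
    )
    # Hypothetical sample data, for illustration only.
    conn.executemany("INSERT INTO dim_users VALUES (?, ?)", [(1, "US"), (2, "FR")])
    conn.executemany(
        "INSERT INTO dim_listings VALUES (?, ?)", [(10, "Paris"), (11, "Tokyo")]
    )
    conn.executemany(
        "INSERT INTO fct_bookings VALUES (?, ?, ?, ?)",
        [(100, 1, 10, 120.0), (101, 2, 10, 80.0), (102, 1, 11, 200.0)],
    )
    conn.commit()


def revenue_by_market(conn):
    """Join the fact table to a dimension to slice a metric by an attribute."""
    return conn.execute(
        """
        SELECT l.market, SUM(f.amount)
        FROM fct_bookings AS f
        JOIN dim_listings AS l ON l.listing_id = f.listing_id
        GROUP BY l.market
        ORDER BY l.market
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    print(revenue_by_market(connection))
```

The fact table stays narrow (keys and measures), while descriptive attributes such as market or country live in the dim tables and are pulled in by a join only when a query needs them.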

Fact & Dimension Tables

Normalized tables can be used to answer ad-hoc questions or to build denormalized tables

Data Partitioning & Backfilling Historical Data

Data Partitioning by Datestamp

A table that is partitioned by ds
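To make the idea of a ds-partitioned table more concrete, here is a minimal sketch that emulates datestamp partitions with a plain ds column in SQLite, which has no native partitioning; an engine such as Hive would declare the partition key in the table definition instead. The table name fct_events, the helper names, and the extract callback are assumptions made up for this example. Each load overwrites exactly one ds, so re-running a day or backfilling a date range is idempotent.

```python
import sqlite3
from datetime import date, timedelta


def create_table(conn):
    """Create a table whose rows are grouped by a datestamp column, ds."""
    conn.execute(
        "CREATE TABLE fct_events (ds TEXT NOT NULL, event_id INTEGER, value REAL)"
    )
    conn.execute("CREATE INDEX idx_fct_events_ds ON fct_events (ds)")


def overwrite_partition(conn, ds, rows):
    """Replace all rows for one ds, so that re-running a day is idempotent."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM fct_events WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO fct_events (ds, event_id, value) VALUES (?, ?, ?)",
            [(ds, event_id, value) for event_id, value in rows],
        )


def date_range(start, end):
    """Yield ISO datestamps from start to end, inclusive."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


def backfill(conn, start, end, extract):
    """Recompute every ds partition in [start, end] using extract(ds)."""
    for ds in date_range(start, end):
        overwrite_partition(conn, ds, extract(ds))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_table(connection)
    # Hypothetical extract step: two fixed events per day.
    backfill(
        connection,
        date(2018, 1, 1),
        date(2018, 1, 3),
        lambda ds: [(1, 1.0), (2, 2.0)],
    )
    print(connection.execute("SELECT DISTINCT ds FROM fct_events").fetchall())
```

Because each run touches only the partition for its own ds, a failed day can be retried or a whole historical range recomputed without duplicating rows.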

Backfilling Historical Data

The Anatomy of an Airflow Pipeline

Defining the Directed Acyclic Graph (DAG)

Source: A screenshot of Airbnb’s Experimentation Reporting Framework DAG

Operators: Sensors, Operators, and Transfers

A Simple Example

Graph View of the toy example DAG

ETL Best Practices To Follow

Image Credit: Building your craft takes practice, so it’s wise to follow best practices

A skeleton of a stage-check-exchange operation (aka “Unit Test” for data pipelines)

Source: Maxime, the original author of Airflow, talking about ETL best practices
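The stage-check-exchange operation mentioned in the captions above can be sketched roughly as follows: write new data to a staging table, run data quality checks against it, and only then swap it into the production table. This is a minimal illustration using SQLite, not the skeleton from the original figure; the table names (bookings, bookings_staging), the specific checks, and the DataQualityError class are all assumptions made for this example.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when staged data fails a check and must not reach production."""


def stage(conn, rows):
    """Stage: load the new data into a staging table, not production."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS bookings_staging")
        conn.execute("CREATE TABLE bookings_staging (booking_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO bookings_staging VALUES (?, ?)", rows)


def check(conn):
    """Check: validate the staged data before anyone can read it."""
    total, null_ids, negative = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(booking_id IS NULL), 0),
               COALESCE(SUM(amount < 0), 0)
        FROM bookings_staging
        """
    ).fetchone()
    if total == 0:
        raise DataQualityError("staging table is empty")
    if null_ids:
        raise DataQualityError(f"{null_ids} staged rows have a NULL booking_id")
    if negative:
        raise DataQualityError(f"{negative} staged rows have a negative amount")


def exchange(conn):
    """Exchange: replace production contents with the staged data atomically."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM bookings")
        conn.execute(
            "INSERT INTO bookings SELECT booking_id, amount FROM bookings_staging"
        )


def stage_check_exchange(conn, rows):
    """Run the three steps in order; a failed check leaves production untouched."""
    stage(conn, rows)
    check(conn)
    exchange(conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE bookings (booking_id INTEGER, amount REAL)")
    stage_check_exchange(connection, [(1, 100.0), (2, 50.0)])
    print(connection.execute("SELECT * FROM bookings").fetchall())
```

The point of the ordering is that a bad load fails loudly at the check step, while downstream consumers keep reading the last known-good production data.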

Recap of Part II

Data @Airbnb, previously @Twitter. Opinions are my own.
