Designing Incremental Data Extraction When the Source Is a SQL Query

Nnaemezue Obi-Eyisi · Published in Geek Culture
5 min read · Jan 18, 2023


The goal of this article is to describe how to design and implement an incremental ETL pipeline that extracts data from a source query (rather than the more common case of a table) and loads it into a target table.

Most data engineers are familiar with designing incremental data extraction (ETL) pipelines that pull only the recently changed records from tables in a source database and apply them to a target database. Please refer to my previous article about incremental ETL pipelines.

However, many people struggle with performing an incremental load when the data source is not a table but a SQL query.

For example, let’s imagine you have a source database with 5 tables, and you need these tables to refresh a target data mart database that feeds a reporting tool. As a data engineer, you notice that all the source tables have a “LastModifiedDate” field that is maintained by the source system and can be relied on to track any inserts or updates to the tables. Hence, you use this “LastModifiedDate” field as your watermark field to create an incremental (delta) extraction data pipeline. You also decide to first create a staging database to land all your source data in a central location, because you could have multiple data sources in the future and want a centralized storage location.
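The table-level watermark pattern described above can be sketched as follows. This is a minimal illustration using SQLite in Python; the table name (Orders), its columns, and the watermark value are all hypothetical, and in a real pipeline the watermark would be persisted in a control table rather than a variable.

```python
import sqlite3

# Hypothetical source and staging databases (in-memory for the sketch).
src = sqlite3.connect(":memory:")
stg = sqlite3.connect(":memory:")

src.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, Amount REAL, LastModifiedDate TEXT)")
src.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [
    (1, 10.0, "2023-01-01T00:00:00"),
    (2, 20.0, "2023-01-05T00:00:00"),
    (3, 30.0, "2023-01-10T00:00:00"),
])

stg.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, Amount REAL, LastModifiedDate TEXT)")

# Watermark recorded by the previous pipeline run.
last_watermark = "2023-01-03T00:00:00"

# Extract only the rows modified after the last watermark.
changed = src.execute(
    "SELECT Id, Amount, LastModifiedDate FROM Orders WHERE LastModifiedDate > ?",
    (last_watermark,),
).fetchall()

# Upsert the delta into staging, then advance the watermark to the
# maximum LastModifiedDate seen in this run.
stg.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?) "
    "ON CONFLICT(Id) DO UPDATE SET Amount = excluded.Amount, "
    "LastModifiedDate = excluded.LastModifiedDate",
    changed,
)
new_watermark = max(row[2] for row in changed) if changed else last_watermark
print(len(changed), new_watermark)
```

Only the rows whose LastModifiedDate falls after the stored watermark are transferred, which is what makes the extraction incremental rather than a full reload.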

In order to populate your data mart Fact table, you leverage some business logic that uses the tables in the staging…
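When the source of a load is a query over several staging tables rather than a single table, each result row can change whenever any contributing table row changes. One common approach (an assumption here, not a method the article has stated) is to expose the greatest LastModifiedDate across the joined tables as the query's own effective watermark column. The table and column names below are illustrative:

```python
import sqlite3

# Hypothetical staging database with two tables feeding a fact query.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Orders    (OrderId INTEGER, CustomerId INTEGER, Amount REAL, LastModifiedDate TEXT);
CREATE TABLE Customers (CustomerId INTEGER, Region TEXT, LastModifiedDate TEXT);
INSERT INTO Orders    VALUES (1, 100, 10.0, '2023-01-02'), (2, 200, 20.0, '2023-01-08');
INSERT INTO Customers VALUES (100, 'East', '2023-01-03'), (200, 'West', '2023-01-01');
""")

last_watermark = "2023-01-05"

# A joined row is "changed" if ANY of its contributing rows changed, so the
# effective watermark of the query is the MAX of the per-table watermarks.
# (SQLite's scalar MAX(a, b) returns the greater of its arguments.)
rows = db.execute("""
    SELECT o.OrderId, c.Region, o.Amount,
           MAX(o.LastModifiedDate, c.LastModifiedDate) AS EffectiveModifiedDate
    FROM Orders o
    JOIN Customers c ON c.CustomerId = o.CustomerId
    WHERE MAX(o.LastModifiedDate, c.LastModifiedDate) > ?
""", (last_watermark,)).fetchall()
print(rows)
```

Here only order 2 qualifies: although neither of order 1's contributing rows changed after the watermark, order 2's own LastModifiedDate did, so it is picked up even though its customer row is old.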

