Beginner’s Guide: Extract, Transform, Load (ETL) Playbook — Full and Incremental Load
Purpose: This article is an introductory guide to the basics of ETL as it relates to data engineering.
One of the most important and often overlooked facets of data engineering is the creation of ETL pipelines. With the popularity of AI and ML projects and the concentration of demand on data scientists, it is easy to dismiss ETL as an old-fashioned approach to modern data analytics. I have seen many training programs overlook or give little attention to this subject when teaching students about data engineering, yet ETL (or ELT) is fundamental: it must be in place before any AI/ML or data analytics project can be kicked off. The larger the organization, the more important ETL jobs are to the company’s data analytics teams.
Now let us talk about some of the core principles of building an ETL pipeline. These concepts are tool agnostic but essential.
Extract and Load
The most basic part of ETL is the act of moving data from one or more source systems to one or more destination systems.
What are we extracting and loading?
Data of almost any form can be copied from a source to a destination. For example, copying tables from a SQL Server database to data lake storage.
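To make the idea concrete, here is a minimal extract-and-load sketch in Python. It is illustrative only: a local SQLite database stands in for the SQL Server source, a local folder stands in for data lake storage, and the `orders` table and its columns are made-up names. A real pipeline would use a SQL Server driver and a cloud storage SDK instead.

```python
# Minimal extract-and-load sketch. SQLite stands in for the SQL Server
# source and a local folder stands in for the data lake; the "orders"
# table and its columns are hypothetical.
import csv
import sqlite3
from pathlib import Path


def extract(conn: sqlite3.Connection, table: str):
    """Extract: pull every row from one source table (a full load)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


def load(columns, rows, lake_dir: Path, table: str) -> Path:
    """Load: land the rows as a flat file in the 'lake' folder."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    out = lake_dir / f"{table}.csv"
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # data rows
    return out


# Demo with an in-memory stand-in for the source system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])
path = load(*extract(conn, "orders"), Path("lake/raw"), "orders")
```

The shape is the same regardless of tooling: read from the source, write to the destination, and keep the two steps decoupled so either side can be swapped out.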
Why do we extract and load from source to destination systems in analytical projects?
1. We want to reduce resource contention on the source system when we run analytics, aggregations, or computations. Therefore, we prefer to move the data from the source system, where it is originally stored, to another system that is optimized for analytics.
2. We want to combine data from multiple source systems into a single target destination database so that we can perform more insightful analytics across those sources.
3. Most source systems are not designed with analytics in mind. Therefore, running analytical queries directly on the source system can cause serious performance issues.