Building an Efficient ETL/ELT Process for Data Delivery

Mariusz Kujawski
15 min read · Apr 22, 2024


“Data is the new oil” — a statement that has underscored our approach to data management over the past decade. We’ve dedicated ourselves to gathering, processing, and analyzing vast volumes of data to fuel our journey towards becoming a data-driven organization. However, with the rise of ChatGPT, which gained over one million users in just five days, the landscape has shifted. This phenomenon has sparked new ideas around leveraging machine learning (ML) and artificial intelligence (AI) to drive our business forward. As we transition towards an AI-driven organization, our decision-making must be informed not only by current data but also by forecasts and predictions.

The evolving needs and the advent of cutting-edge tools like ML and AI require robust data platforms capable of delivering timely and comprehensible data to support our evolving data processes. Similar to managing other resources within our organization, we must construct infrastructure and processes that ensure efficient data delivery.

In this post, I’ll outline strategies for constructing a resilient ingestion and ETL/ELT (Extract, Transform, Load / Extract, Load, Transform) process to facilitate seamless data delivery for our data platform.

Illustrated by author

Data Transformation Types

ETL — Extract, Transform, Load

ETL is the most common approach to processing data in an on-premises data platform. In this scenario, an ETL tool is utilized to extract data from sources such as relational databases, flat files, and APIs, among others. The extracted data is then transformed on the fly and saved to a data warehouse. Popular ETL tools include Informatica PowerCenter (IPC), Microsoft SQL Server Integration Services (SSIS), SAP Data Services, Talend, and IBM DataStage. Alternatively, data transformation can be achieved using custom scripts in Python or another programming language.

One common setup involves having a separate ETL server dedicated to processing data and loading it into a database serving as a data warehouse. This segregation ensures that heavy transformation queries do not burden the data warehouse.

ETL process

Another approach to ETL processing involves utilizing a staging area to store extracted data before transformation. This allows for more efficient data processing and enables two variants:

  • Staging data for the ETL tool's own purposes, reducing the memory and compute power it requires.
  • A hybrid approach involving initial data preprocessing using an ETL tool followed by further transformation using SQL queries before loading the data into a data model in the data warehouse. This approach is particularly useful when the ETL tool encounters performance issues or lacks sufficient resources.
ETL process with staging

ETL Tools

Advantages of ETL Tools:

  • Data flow visualization in graphical form facilitates easier understanding of the process.
  • Simple processes can be quickly created using ready-to-use components via a graphical interface.
  • One tool for data integration and transformation simplifies the workflow.
  • Monitoring and error handling capabilities enhance data management.

Disadvantages of ETL Tools:

  • Transformations stored in XML or JSON can pose challenges when working with version control software.
  • Difficulty in unit testing data flows.
  • Significant effort required to cover missing features.
  • Limited scalability.
  • Potential vendor lock-in.

ELT — Extract, Load, Transform

The ELT scenario is commonly employed in a two-tier architecture consisting of the Data Lake and the Data Warehouse. The Data Lake serves as a repository for collecting data from various sources, storing it in structured, semi-structured, or unstructured formats. This flexibility enables the storage of data formats that may not be effectively stored in a traditional database. In this approach, data ingestion tools are used to import data from sources to the Data Lake. Many ingestion tools are available on the market, including open-source options, those provided by cloud providers, licensed software, and pay-per-usage tools.

Once the data is in the Data Lake, the next step is to load it into a database. This can be achieved using built-in loaders, custom applications, or applications provided by cloud providers. Another option is to use external tables that create an SQL interface to access data in files, making it easy to query data from the Data Lake using SQL.

Data transformation is then performed by a database engine using SQL to create a data model. In the cloud, Massive Parallel Processing (MPP) engines such as Azure Synapse, AWS Redshift, BigQuery, or Snowflake can be utilized for efficient data transformation. Modern frameworks like dbt or GCP Dataform can also be employed for managing SQL-based transformations.
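
To make the load-then-transform split concrete, below is a minimal sketch of the pattern in Python. It uses SQLite purely as a stand-in for a warehouse engine, and the stg_orders and fct_daily_sales tables are illustrative assumptions; in practice the same SQL would run inside Synapse, Redshift, BigQuery, or Snowflake, often managed by dbt.

```python
# Stand-in ELT sketch: load raw rows into a staging table, then let the
# database engine build the reporting model with SQL. sqlite3 is used here
# only so the example runs anywhere; table names are hypothetical.
import sqlite3

conn = sqlite3.connect("warehouse.db")

# "Load": land raw, untransformed rows in a staging table.
conn.execute(
    "CREATE TABLE IF NOT EXISTS stg_orders "
    "(order_id INT, customer_id INT, amount REAL, order_ts TEXT)"
)
conn.execute("INSERT INTO stg_orders VALUES (1, 10, 99.90, '2024-04-01T10:00:00')")

# "Transform": the heavy lifting happens inside the database engine.
conn.execute("""
    CREATE TABLE IF NOT EXISTS fct_daily_sales AS
    SELECT DATE(order_ts)  AS order_date,
           customer_id,
           SUM(amount)     AS total_amount,
           COUNT(*)        AS order_count
    FROM stg_orders
    GROUP BY DATE(order_ts), customer_id
""")
conn.commit()
```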

ELT with a Data Lake

An alternative method of data ingestion to a Data Lake is the Push method, where external source systems push data directly to the Data Lake. However, this approach may result in limited control over data extraction, requiring coordination with teams responsible for data sources in cases of missing or corrupted data.

Advantages of ELT:

  • Separation of ingestion and transformation processes, enhancing scalability and flexibility.
  • Good scalability in the Cloud environment, allowing for vertical and horizontal scaling.
  • Transformations written in SQL can be easily stored in a repository, tracked for changes, and subjected to code review techniques.
  • Possibility to utilize different tools for different activities, such as using paid tools for extraction from specific sources and SQL or Apache Spark for data transformation.
  • Limited vendor lock-in due to the use of SQL or open-source software for data transformation.
  • More generic and metadata-driven processes, enabling the creation of generic pipelines for extraction based on metadata descriptions.

Disadvantages of ELT:

  • Complexity introduced by multiple tools in the data processing process.
  • No data anonymization or masking: data is extracted as-is.

EtLT — Extract, transform, Load, Transform

While the ELT process facilitates fast data extraction to a data platform, it comes with potential downsides, especially concerning sensitive information. Data extracted from sources may contain confidential data that should not be accessible to unauthorized individuals, or organizations may need to comply with standards such as SOC 2, GDPR, CCPA, and HIPAA. In such cases, it becomes necessary to take measures to remove, mask, or encrypt sensitive data before moving it to the data lake or data warehouse.

To address these concerns, we can implement an “EtLT” process, which includes an additional transformation step before loading data into the data platform. This “light” transformation occurs during the extraction process and involves masking, filtering, or encrypting sensitive information. Unlike heavy transformations like joins or aggregations, this step focuses solely on specified data sources and fields, ensuring that sensitive information is protected without introducing significant processing overhead.

By incorporating these “light” transformations into the extraction process, organizations can prevent the storage of sensitive information in their Data Lake or Lakehouse, thereby enhancing data security and compliance with regulatory standards.
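
As an illustration, here is a hedged Python sketch of such a "light" transformation applied during extraction: one identifier is pseudonymized with a salted hash and an e-mail address is masked before the file is written onward. The column names, file paths, and salt handling are assumptions for the example, not a prescribed implementation.

```python
# EtLT sketch: pseudonymize an ID and mask an e-mail during extraction, so the
# raw sensitive values never land in the data lake. File, column, and salt
# handling are illustrative assumptions.
import csv
import hashlib

SALT = "replace-with-a-secret-from-a-vault"   # assumption: kept in a secret store

def pseudonymize(value: str) -> str:
    """Deterministic salted hash, so downstream joins on the ID still work."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    """Keep only the domain: 'j.doe@example.com' -> '***@example.com'."""
    _, _, domain = email.partition("@")
    return f"***@{domain}" if domain else "***"

def light_transform(row: dict) -> dict:
    row["customer_id"] = pseudonymize(row["customer_id"])
    row["email"] = mask_email(row["email"])
    return row

# Extract -> light transform -> load the masked file to the landing area.
with open("crm_export.csv", newline="") as src, \
     open("crm_export_masked.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(light_transform(row))
```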

EtLT process

Stream Processing

Stream processing enables real-time data ingestion and analysis from various sources, including IoT devices, sensors, events, applications, and relational databases. Depending on the nature of the data, incoming streams can be transformed through aggregation, time window analysis, or individual record processing. This real-time data can be stored in files, databases, or utilized for live reporting and analytics.
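
The sketch below shows one possible shape of such a pipeline using Spark Structured Streaming: events are read from a Kafka topic and aggregated in five-minute tumbling windows. The broker address, topic name, schema, and Delta sink are illustrative assumptions.

```python
# Spark Structured Streaming sketch: read sensor events from Kafka and compute
# five-minute averages per sensor. Broker, topic, schema, and sink paths are
# illustrative assumptions; the Delta sink requires the delta-spark package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sensor-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Tumbling five-minute windows; the watermark bounds how late events may arrive.
windowed = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "sensor_id")
    .agg(F.avg("value").alias("avg_value"))
)

query = (
    windowed.writeStream
    .outputMode("append")
    .format("delta")                               # or "parquet", a database sink, etc.
    .option("checkpointLocation", "/chk/sensor_agg")
    .start("/lake/silver/sensor_agg")
)
query.awaitTermination()
```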

Streaming

Lambda Architecture

The Lambda architecture is a versatile data deployment model that combines both batch and stream processing pipelines to handle dynamic data requirements effectively.

Speed Layer: This layer processes continuous streams of data in real-time or near real-time using technologies like Apache Spark or Storm. Data processed in this layer is swiftly analyzed within milliseconds to seconds, enabling real-time analytics and insights.

Batch Layer: In contrast, the batch layer divides incoming data into smaller batches for processing. While batch processing may be slower, taking several hours to complete, it excels in processing large datasets and providing detailed insights. Processed batch data is then stored in the serving layer for storage and querying.

CDC — Change Data Capture

Change Data Capture (CDC) is an innovative approach to real-time data transformation, particularly useful for capturing and processing database changes on the fly. By monitoring the database transaction log, CDC software identifies new records or modifications (such as INSERT, UPDATE, or DELETE operations) and converts them into streaming events. These events can be consumed by applications and services for further processing, transformation, and loading into data warehouses or data lakehouses. This real-time data ingestion enables timely data analysis and facilitates integration with machine learning models for predictive analytics.
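
For intuition, here is a small, hedged Python sketch of how a consumer might apply change events to a target. The event layout loosely follows the Debezium convention of an "op" code with "before"/"after" row images, but the field names and the in-memory target are assumptions made for the example.

```python
# CDC consumer sketch: apply change events to a target keyed by primary key.
# The event layout loosely mimics Debezium's "op"/"before"/"after" shape; field
# names and the in-memory target are assumptions for illustration.
from typing import Optional

def apply_change_event(event: dict, target: dict) -> None:
    op = event["op"]                      # 'c' = insert, 'u' = update, 'd' = delete
    before: Optional[dict] = event.get("before")
    after: Optional[dict] = event.get("after")

    if op in ("c", "u") and after is not None:
        target[after["id"]] = after       # upsert the latest row image
    elif op == "d" and before is not None:
        target.pop(before["id"], None)    # drop the deleted row

# One insert followed by an update, as captured from the transaction log.
table: dict = {}
apply_change_event({"op": "c", "before": None, "after": {"id": 1, "name": "Alice"}}, table)
apply_change_event({"op": "u", "before": {"id": 1, "name": "Alice"},
                    "after": {"id": 1, "name": "Alicia"}}, table)
print(table)                              # {1: {'id': 1, 'name': 'Alicia'}}
```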

CDC

The Data Processing Process in the Lakehouse

The Lakehouse architecture combines the strengths of both data lakes and data warehouses, offering a unified platform for data storage, processing, and analysis. In the Lakehouse, Apache Spark serves as the primary tool for data processing and exploration, capable of extracting data from various sources and performing complex transformations.

Data in the Lakehouse is stored in file formats such as Delta or Iceberg, which support data modification and transactions, while also enabling the separation of computing and storage. This architecture ensures flexibility and scalability, allowing organizations to efficiently manage large volumes of data.

To populate the Lakehouse platform with data, organizations can either extract data directly using Apache Spark or utilize ingestion tools to extract data to the Bronze area of the Lakehouse. Subsequently, Apache Spark is employed to preprocess the data, perform filtering, cleaning, enhancement, and store it in the Silver layer. Finally, data transformation, similar to traditional data warehouses, is performed to populate a business data model.
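
A minimal PySpark sketch of that Bronze-to-Silver-to-Gold flow is shown below. The paths, column names, and the choice of the Delta format are illustrative assumptions; it presumes a Spark session already configured for Delta Lake.

```python
# Lakehouse sketch: Bronze (raw) -> Silver (cleaned) -> Gold (business model).
# Paths, columns, and the Delta format are illustrative assumptions; a Spark
# session configured for Delta Lake is presumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

# Bronze: raw files landed by the ingestion tool, kept as-is.
bronze = spark.read.json("/lake/bronze/orders/")

# Silver: de-duplicated, filtered, and enriched records.
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders/")

# Gold: a business-level aggregate comparable to a warehouse fact table.
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue/")
```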

Unlike the two-tier architecture where data is moved to another system after extraction, in the Lakehouse, all data remains stored in storage as files. This design choice enables seamless access to data through various solutions capable of reading file formats accessible in storage. Additionally, a metadata layer stores information about files, facilitating data querying and management by Apache Spark.

The Lakehouse architecture offers organizations a robust and versatile solution for managing and analyzing data, combining the flexibility of data lakes with the structured querying capabilities of data warehouses.

Data Loading and Ingestion Strategies

The effectiveness of data loading and ingestion strategies depends on several factors, including the type of data source, data volume, network capabilities, source response speed, and the nature of connectors utilized. When undertaking the task of obtaining data from a source, it is important to take into account the time required by our ETL/ELT processes (especially when adhering to specific requirements for data availability in the case of batch processing).

Furthermore, it’s crucial to ensure that the ingestion process does not overload the source system with heavy queries. To achieve this delicate balance, thorough testing of various connectivity and ingestion strategies becomes essential. In some scenarios, opting for a full dump to CSV utilizing a native database tool may prove to be more beneficial than selectively extracting data using a connector, particularly when dealing with sizable datasets.

Full load

The simplest way to ingest data from a source is by importing all data from a table in a database or a file. This approach is utilized when selecting only part of the data is not feasible, such as when using a file as a source or a table where changes and new records cannot be easily identified. Due to these limitations, all rows from the source need to be imported. Depending on the complexity of transformations, this method may be the easiest way to insert data into warehouse tables, as tables are reloaded from scratch every time by overwriting the destination table. However, it can be costly in terms of resources when dealing with a large number of rows.
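
A minimal full-load sketch in Python might look like the following; the connection strings, schema, and table names are placeholders, and pandas is used only for brevity (bulk loaders or native dump tools are often preferable for large tables).

```python
# Full-load sketch: read the entire source table and overwrite the destination.
# Connection strings and table names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:password@source-host/sales")
warehouse = create_engine("postgresql://user:password@dwh-host/analytics")

# Extract everything, then rebuild the destination table from scratch.
df = pd.read_sql("SELECT * FROM public.customers", source)
df.to_sql("customers", warehouse, schema="staging", if_exists="replace", index=False)
```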

Incremental load

For systems where new rows can be identified and data is immutable, such as accounting systems, transaction systems, and logs, it's possible to ingest only new records from the source using a query that returns rows not yet present in the data warehouse. To maintain this process, a watermark needs to be stored, such as the maximum ID of the last imported record. In this way, only new records are added to the data warehouse tables through insert or append operations, making the process efficient and fast.
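
Below is a hedged sketch of the watermark pattern: read the highest ID already loaded, fetch only newer rows, and append them. Table, column, and connection names are assumptions for illustration.

```python
# Incremental-load sketch: load only rows above the stored watermark (here the
# maximum ID already in the warehouse). Names and connections are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:password@source-host/sales")
warehouse = create_engine("postgresql://user:password@dwh-host/analytics")

# 1. Read the current watermark from the warehouse.
with warehouse.connect() as conn:
    last_id = conn.execute(
        text("SELECT COALESCE(MAX(transaction_id), 0) FROM staging.transactions")
    ).scalar()

# 2. Extract only the new, immutable rows from the source.
new_rows = pd.read_sql(
    text("SELECT * FROM public.transactions WHERE transaction_id > :last_id"),
    source,
    params={"last_id": last_id},
)

# 3. Append them to the warehouse table.
new_rows.to_sql("transactions", warehouse, schema="staging",
                if_exists="append", index=False)
```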

Delta load

Delta loading is a data ingestion strategy employed when a source table contains metadata indicating record modifications along with corresponding timestamps. This scenario often arises in tables storing client information within Customer Relationship Management (CRM) applications. When changes occur in the CRM data, metadata columns reflecting these modifications are updated accordingly. Leveraging this metadata, a query can be sent to the source system to selectively extract records modified since the last extraction.

Maintaining this process necessitates storing information about previous extractions, such as timestamps or record identifiers, to accurately identify and extract delta changes. However, delta loading typically incurs higher costs from a data warehouse perspective. This is primarily due to the need for update operations, which can be computationally intensive and may require additional resources such as finding modified records for update, rebuilding tables or indexes in certain database architectures, or modifying files in delta formats.

To implement delta loading effectively, merge, update, and insert commands are commonly utilized. While more resource-intensive, delta loading presents a viable solution for handling large tables where optimizing the retrieval and loading of data is essential for maintaining efficiency. By selectively ingesting only modified records, delta loading minimizes redundant data transfer and ensures that only relevant updates are incorporated into the data warehouse, thus saving time and resources in data processing pipelines.
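
As one possible implementation, the sketch below uses the Delta Lake MERGE API in PySpark to upsert changed records into a target table. The paths and the customer_id key are illustrative assumptions, and it presumes a Spark session configured with the Delta Lake extensions; on a classic warehouse the same logic would be a SQL MERGE statement.

```python
# Delta-load sketch with the Delta Lake MERGE API: changed records extracted
# since the last run are upserted into the Silver table. Paths and the
# customer_id key are assumptions; a Delta-enabled Spark session is presumed.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-load").getOrCreate()

# Records modified in the source since the previous extraction.
changes = spark.read.parquet("/lake/landing/customers_changed/")

target = DeltaTable.forPath(spark, "/lake/silver/customers/")

(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows seen for the first time
    .execute()
)
```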

Transformation Types

Data Cleansing

Data cleansing is the process of detecting and correcting errors, inconsistencies, and inaccuracies in raw data. This includes identifying and removing duplicate records, correcting misspellings and typos, standardizing formats, and filling in missing values. It is a common transformation step used to prepare raw data for use. After extracting data from a source, we can encounter many different issues in it. Usually, analyzing the extracted data gives insight into what should be updated, replaced, or filtered out to improve its quality and usability for machine learning and analytical purposes.
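
The short pandas sketch below illustrates a few typical cleansing steps on a toy dataset: dropping duplicates, standardizing a country code, and filling missing values. The columns and fill strategies are assumptions chosen for the example.

```python
# Cleansing sketch: drop duplicates, standardize a country code, and fill
# missing values. Columns and fill strategies are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "country":     ["PL", "PL", "pl ", None],
    "revenue":     [100.0, 100.0, None, 250.0],
})

clean = (
    raw
    .drop_duplicates()                                    # remove exact duplicates
    .assign(
        country=lambda d: d["country"].str.strip().str.upper().fillna("UNKNOWN"),
        revenue=lambda d: d["revenue"].fillna(0.0),       # or impute mean/median
    )
)
print(clean)
```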

Data Normalization

Data normalization is a technique within database design aimed at structuring data to minimize redundancy and improve data integrity. By decomposing a database into multiple tables and establishing relationships among them, normalization ensures that each entity is stored uniquely, thus mitigating duplication and data inconsistencies. For instance, in a JSON file received from a source system containing client information and transactions, normalization involves creating separate tables for clients, transactions, and products. Without normalization, modifications to product names would require changes across every associated order, introducing potential errors and inconsistencies.
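
To make this concrete, the sketch below normalizes a small nested JSON document into separate client and transaction tables with pandas; the field names are illustrative assumptions.

```python
# Normalization sketch: split a nested document into client and transaction
# tables linked by a foreign key. Field names are illustrative assumptions.
import pandas as pd

document = {
    "client_id": 42,
    "name": "Acme Ltd.",
    "transactions": [
        {"transaction_id": 1, "product": "Widget", "amount": 19.99},
        {"transaction_id": 2, "product": "Gadget", "amount": 5.49},
    ],
}

# One row per client, without the repeated nested data.
clients = pd.DataFrame([{"client_id": document["client_id"], "name": document["name"]}])

# One row per transaction, carrying client_id as the foreign key.
transactions = pd.json_normalize(document, record_path="transactions", meta=["client_id"])

print(clients)
print(transactions)
```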

Aggregation

Aggregation represents a transformative process where data is consolidated to meet desired granularity levels. Particularly beneficial when dealing with extensive datasets where detailed data isn’t necessary, aggregation enhances performance and standardizes data granularity. Implementing data aggregation can significantly boost the efficiency of analytical queries across various scenarios.
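
For example, the pandas sketch below rolls event-level records up to daily granularity per store; the column names are assumptions made for illustration.

```python
# Aggregation sketch: roll event-level rows up to daily totals per store.
# Column names are illustrative assumptions.
import pandas as pd

events = pd.DataFrame({
    "event_ts": pd.to_datetime(["2024-04-01 10:00", "2024-04-01 15:30", "2024-04-02 09:10"]),
    "store_id": [1, 1, 2],
    "amount":   [20.0, 35.0, 12.5],
})

daily = (
    events
    .assign(day=events["event_ts"].dt.date)
    .groupby(["day", "store_id"], as_index=False)
    .agg(total_amount=("amount", "sum"), transactions=("amount", "count"))
)
print(daily)
```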

Orchestration

Orchestration plays a pivotal role within the ETL/ELT process, particularly in the context of ELT where dependencies between steps in different tools must be managed. For example, employing Python scripts for data ingestion and SQL procedures for transformation necessitates an orchestrator capable of sequencing these steps accurately. The orchestrator must trigger the ingestion process and await its completion before commencing data transformation steps in the correct order. Additionally, it should incorporate essential features such as scheduling and retry mechanisms.
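
A minimal sketch of that sequencing, using Apache Airflow as one example of a Python-based orchestrator, is shown below. It assumes Airflow 2.x with the common SQL provider installed, and the callable, connection id, schedule, and stored procedure are placeholders.

```python
# Airflow sketch (Airflow 2.x, common SQL provider assumed): run the Python
# ingestion first, then the SQL transformation, on a daily schedule with retries.
# The callable, connection id, and stored procedure are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

def ingest_sources():
    """Placeholder for the extraction script that lands files in the data lake."""
    ...

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",          # every day at 02:00
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_sources, retries=2)

    transform = SQLExecuteQueryOperator(
        task_id="transform",
        conn_id="warehouse",
        sql="CALL analytics.build_daily_model();",
    )

    ingest >> transform            # transformation waits for ingestion to finish
```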

Fortunately, a plethora of orchestration tools are available on the market, including those provided by cloud vendors, offering seamless integration of diverse tools. However, challenges may arise when attempting to integrate a non-native tool with a cloud’s native orchestration environment.

In addition to managing orchestration, these tools often offer robust logging, process monitoring, and alerting functionalities, which are indispensable in complex data ecosystems. This enables swift identification of errors within the ETL/ELT process and facilitates timely process restarts when necessary.

Orchestration tools can be categorized into several types:

  • GUI-based with drag and drop predefined activities: These tools enable pipeline definition using a graphical user interface, with pre-defined activities such as data copy, lookup, and SQL commands.
  • Python-based Frameworks: Frameworks in this category provide ready-to-use Operators for activities like data import, shell commands, and data export. Custom code can also be written for specific activities.
  • YAML-based workflows: Workflows are defined using YAML definitions, leveraging pre-defined activities for orchestration.

Moreover, orchestration solutions may be server-based or serverless, with varying pricing models. For instance, Azure Data Factory charges based on execution time, GCP Workflows on the number of executions, and GCP Composer for compute and service availability time. Open-source orchestrators can be installed on a virtual machine, though this entails maintenance and patching responsibilities. Alternatively, third-party providers offer serverless orchestrators like Prefect.

Selecting the most suitable orchestrator requires consideration of factors such as pricing, data volume, integration capabilities, flexibility, and portability. In many cases, opting for a cloud-native orchestrator may suffice, given its seamless integration with cloud services and simplified management.

Data Quality

Data quality is an indispensable aspect of data processing, ensuring the reliability and accuracy of data delivered to a data platform. Whether generating reports or training machine learning models, ensuring the integrity of data is paramount. Missing, corrupted, or duplicated data can impede operational efficiency and pose significant risks, particularly when reporting measures to regulators. To address these challenges, various data quality checks can be integrated into the ETL/ELT process within data lakes, data warehouses, or data lakehouses:

  • Data Validation: Employing custom scripts or frameworks, validation rules can be established to verify the integrity of incoming data. These rules assess whether data are in the correct format, ensure columns are not empty, validate the length of numerical values, or confirm that numerical values are non-negative. By conducting such tests at different stages of transformation, data quality can be improved at a granular level (see the sketch after this list).
  • Data Profiling: Utilizing data profiling methods, general statistics are collected about processed sources, including the number of rows, averages, sums, standard deviations, and more. Analyzing this information enables the identification of missing or corrupted data within specific sources, facilitating proactive data quality management.
  • Data Reconciliation: Following the transformation of data into a required data model, data reconciliation processes can be implemented to ensure consistency and accuracy. By comparing aggregated values, such as the sum of money in client accounts between the source system and the data platform, potential discrepancies can be identified, confirming the fidelity of the data.
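
A hedged sketch of simple, custom validation checks of the kind listed above is shown below; the thresholds and column names are assumptions, and frameworks such as Great Expectations or dbt tests express the same idea declaratively.

```python
# Validation sketch: custom checks run against a batch after extraction.
# Column names and rules are illustrative assumptions.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    errors = []
    if df["customer_id"].isnull().any():
        errors.append("customer_id contains empty values")
    if (df["amount"] < 0).any():
        errors.append("amount contains negative values")
    if not df["email"].str.contains("@", na=False).all():
        errors.append("email contains malformed addresses")
    return errors

batch = pd.DataFrame({
    "customer_id": [1, 2, None],
    "amount":      [10.0, -5.0, 7.5],
    "email":       ["a@example.com", "broken-address", "c@example.com"],
})

issues = validate(batch)
if issues:
    # In a real pipeline this would fail the load or route rows to quarantine.
    raise ValueError("Data quality checks failed: " + "; ".join(issues))
```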

Additionally, Machine Learning algorithms, such as Clustering, can be leveraged for data quality enhancement. By utilizing clustering techniques, potential outliers — indicative of data quality issues — can be identified within data sets. Integrating such techniques into data processing pipelines offers an alternative approach to detecting and addressing potential data quality issues.

Clustering

In essence, robust data quality practices are essential for maintaining the reliability and integrity of data within organizations. By implementing a combination of validation, profiling, reconciliation, and advanced techniques such as clustering, businesses can enhance data quality assurance processes and mitigate risks associated with erroneous or incomplete data.

Master Data Management

In today’s business landscape, effective data management is paramount, and Master Data Management (MDM) stands out as a guiding principle for navigating the complexities of data governance.

At its core, MDM serves as a centralized repository for storing master data, focusing on two fundamental concepts: golden records and reference data.

  • Golden Records: These records epitomize data accuracy and consistency, acting as the ultimate source of truth for critical information such as customer details or product specifications. In scenarios where multiple source systems collect customer data, inconsistencies often arise. To address this challenge, organizations designate a “gold source” system with the highest data quality, or employ mechanisms to unify data within the MDM framework.
  • Reference Data: Reference data provides essential context and clarity, offering insights into hierarchies, classifications, and relationships within the data ecosystem. Typically provided by business users, reference data enriches imported data from source systems, facilitating analytical purposes and informed decision-making.

MDM empowers organizations to fully leverage their data assets by enhancing data quality and enriching data with essential business information. By harmonizing the data ecosystem and providing a unified view of data across the entire organization, MDM facilitates informed decision-making and drives business success.

Summary

In the realm of advanced analytics, business intelligence, AI, and machine learning, the criticality of data delivery cannot be overstated. It's more than just establishing a pipeline; a well-designed process comprises a series of crucial steps, spanning from data extraction and transformation to data governance and quality assurance. Whether initiating a new implementation or enhancing an existing data platform, it's important to keep in mind the steps and techniques outlined in this post. By prioritizing effective data delivery strategies, organizations can ensure the seamless flow of high-quality data to power their analytical efforts and drive informed decision-making processes.

If you found this article insightful, I invite you to express your appreciation by clicking the ‘clap’ button or liking it on LinkedIn. Your support is greatly valued. For any questions or advice, feel free to reach out to me on LinkedIn.
