The Different Types of Loads in ETL

Ankush Singh
4 min readJul 26, 2023

--

ETL load

The world of data warehousing and ETL (Extract, Transform, Load) processes can sometimes feel like a dense forest with many branching paths. One pivotal decision every data engineer faces is determining how to load data into a target system. The approach chosen can significantly impact the efficiency, accuracy, and costs of your ETL processes. Let’s navigate this landscape by exploring the different types of loads in ETL.

1. Full Load:

When we talk about the most straightforward type of loading, ‘Full Load’ reigns supreme.

  • What is it? As the name suggests, every time the ETL process runs, it loads the entire data set from the source system into the target.
  • When to use it? Best for when the data volume is not overwhelming, or when setting up a new data warehouse or target system.
  • Example: A startup begins its operations and decides to consolidate all its historical data from various Excel sheets into a centralized data warehouse. It performs a full load to get all this data into the system at once.

2. Incremental Load (or Delta Load):

This is the minimalist’s approach to loading.

  • What is it? Only new or modified records since the last ETL cycle are moved from the source to the target.
  • When to use it? Perfect for vast datasets, where loading everything repeatedly isn’t practical.
  • Example: A retail chain collects daily sales data. After an initial full load, it decides to add only the new sales data to the warehouse daily. Hence, every night, only the sales data from that particular day gets transferred.

3. CDC (Change Data Capture) Load:

For those who need a keen eye on changes, CDC comes to the rescue.

  • What is it? A technique to identify and capture changes in source data. It tracks inserts, updates, and deletes to ensure the target system remains in sync with the source.
  • When to use it? When maintaining real-time or near-real-time synchronization between source and target is crucial.
  • Example: A CRM system tracks client interactions. To ensure a data warehouse remains updated with all CRM changes, including updates and deletions, a CDC mechanism is used.

4. Append Load:

Some data is written in stone — or at least should be.

  • What is it? New records are added to the target, but existing ones remain untouched.
  • When to use it? Ideal for immutable datasets, like log file processing.
  • Example: An application logs user activities. The logs are immutable, and so each new log entry is simply appended to the data warehouse.

5. Mirror Load:

For those who crave an exact reflection.

  • What is it? The target is an exact replica of the source. Any change in the source, even deletions, reflects in the target.
  • When to use it? When a complete, mirrored backup of the source is necessary.
  • Example: A backup system is set up to be an exact replica of the primary system. Any changes, additions, or deletions on the primary are immediately reflected in the backup.

6. History-Preserving Load:

History buffs, rejoice!

  • What is it? It retains a trail of all data changes. Instead of simply updating a record in the target, a new, timestamped record is added.
  • When to use it? When tracking the historical evolution of records is essential, especially for audits or trend analyses.
  • Example: An employee database tracks salaries. Instead of overwriting a salary when it changes, a new record is created with a timestamp. This way, salary trends over time can be observed.

7. Snapshot Load:

Capturing a moment in time.

  • What is it? It captures the entire current state of data at periodic intervals.
  • When to use it? When point-in-time reporting is essential, such as understanding the state of inventory at month-end.
  • Example: A financial institution captures the state of all investments at the end of each month. At each month-end, the complete state of investments is saved as a snapshot.

8. Restatement Load:

Everyone deserves a second chance, and so does data.

  • What is it? Portions of previously loaded data are corrected or reloaded because of errors or modifications.
  • When to use it? When data accuracy is paramount and errors in the previously loaded data are discovered.
  • Example: An e-commerce platform found an error in its discounting logic for the past month. They decide to reload the affected data for that month to correct the error.

9. Upsert Load (Update + Insert):

Making decisions on-the-fly.

  • What is it? If a record exists in the target, it’s updated; if not, a new one is inserted.
  • When to use it? When you’re unsure about the existence of records and need to avoid duplicates.
  • Example: A user profile system. If a user updates their profile picture, that record is updated. If a new user signs up, a new record is added.

10. Bulk Load:

Moving homes? Get the largest truck.

  • What is it? Loads a vast amount of data swiftly, often by bypassing some regular checks.
  • When to use it? During initial setups or when there’s a need to move a significant volume of data quickly.
  • Example: An online forum decides to migrate all its historical forum posts and user data to a new system. They use a bulk load to swiftly transfer this vast amount of data.

Conclusion

Understanding the various types of loads in ETL is essential for making informed decisions that best align with your organization’s goals, operational demands, and budget constraints. Each method has its benefits and challenges, so carefully evaluate your specific needs and the nature of your data to choose the optimal approach.

--

--

Ankush Singh

Data Engineer turning raw data into gold. Python, SQL and Spark enthusiast. Expert in ETL and data pipelines. Making data work for you. Freelancer & Consultant