Data Engineering — ETL or ELT

Gaurav Sharma
4 min readSep 4, 2021

--

Business landscape in today’s world is changing at pace and data is extremely important element for business growth and success. The data is found is raw form and is of immense value for business. Traditionally business used to keep their online transaction databases separate from reporting or BI databases commonly known as data warehouses. Data warehouses are repositories for structured and filtered data that has been processed for a specific purpose. Typically (not always though), an Extraction Transformation and Load (ETL) process populates the data in Data Warehouse. However, the change in technology and business landscape has resulted in increased amount of data to be analyzed and changed the target audience of data. Data is no longer used by just the operational users now a days; the data is also used by data scientists for predictive modelling and statistical analysis.

Both the above factors (along with few others) have led to creation of a slightly different data storage repository called Data Lakes. A Data Lake is a storage repository that stores a large amount of structured, semi-structured and unstructured data. The data is usually stored in its raw or native format. Typically (not always though), an Extraction Loading and Transformation (ELT) process is used to populate and make use of data in Data Lakes.

In both the approaches viz. ETL and ELT, three steps are performed.

Extract — Data is extracted from source system(s). It can be done from structured databases or from unstructured data sources having images, videos, etc.

Transform — In this step, the data is cleaned, processed and then converted into the format in line with the existing format in the storage destination.

Load — Data is loaded into the storage destination

The only albeit very important difference between ETL and ELT is in the sequence of transformation and loading. In ETL, the data is transformed before loading whereas in the ETL, the data transformation is done after loading.

Figures below are to illustrate the sample diagrams for ETL and ELT which should help in understanding the concept better.

Let’s discuss the advantages of both the approaches

ETL:

1. Availability of processed data — With ETL, we obtain a ready for analysis data warehouse for quick data analysis as the transformation happens before loading the data into data warehouse. ETL is best suited for dealing with smaller data sets that require complex transformations.

2. Compliance and Regulations — Standards like GDPR and HIPPA are easier to implement with ETL given that data engineers can omit any sensitive data prior to loading in the destination data warehouse.

3. Well established framework — ETL has been in industry for couple of decades now and there are well developed tools and processes available. Industry is no short of professions having solid experience in ETL

ELT:

1. Flexible — With ELT, we can store any type of data (structured, semi-structured or unstructured) in the data lake even if it cannot be transformed due to complexity or time constraints for now. Relevant data loading processes can be developed as required.

2. Efficient — With ELT, we need lesser time and resources as the data can be transformed and loaded in parallel. It can handle huge data size for big-data analytics.

3. Availability of raw data — With ELT, we can load all data immediately, and analysts/users can determine later which data to transform and analyze.

4. Cost — With ELT, we can load and save the data without incurring large fees, then apply transformations as needed. This can save money on initial costs if we just want to load and save information for initial analysis before agreeing upon the transformed data requirements.

5. Resilience — The raw data is stored in Data Lakes for larger retention period, and hence the recovery process in event of an issue with transformation pipeline/logic will be relatively easier in comparison to ETL where the data would need to be obtained from source systems as the staging areas are typically not retained for more than just a couple of days.

The above data points make us familiar with the factors which should be considered while choosing the one of two. As already mentioned, the increase in amount of data to be analyzed, increased appetite of performing data analysis on semi-structured and unstructured data, and evolution of Cloud based data warehouses are all making the data engineers to start evaluating and following the ELT approach.

There is another hybrid approach known as ‘EtLT’ also being adopted by organizations now a days which does a light data transform before storing the data in data lake or data warehouse. This helps especially those use-cases where the data to be loaded is sensitive in nature and should go in-line with compliance mandates like GDPR and HIPPA. This first ‘t’ of EtLT doesn’t integrate two data sources and this stage is added only to allow data cleansing and masking in order to stay compliant whilst handing on data source at a time. Again, there is no right or wrong approach and which one to follow purely depends upon the requirements we are working for.

--

--