From ETL to ELT
In the early 2000s, we had a limited amount of data and few data source types to work with: basic transaction data from operations such as purchases, clicks, and sign-ups, plus master data about customers. Database and data warehouse admins wrote transformation scripts in Perl and Bash, embedding SQL to clean, transform, and store the data in databases for one or two departments to use in reporting and data science projects.
Today things have changed! Every single day we produce more data than the day before. Not only do we produce more data, but we also produce different types of data: images, video, text, speech, tabular data, and more. Every department in the organization has started to take advantage of the stored data.
Since data production and consumption patterns have changed over the years, the way we move data from point A to point B had to change as well. Today, we are going to take a look at how ETL lost its crown to ELT.
Extract, Load and Transform
Whether we go with the ETL or the ELT approach, the actions to be taken are identical; the only difference is the order in which we execute them. Before diving into the comparison, let's take a look at the steps in the process (a minimal sketch of the ELT ordering follows the list).
- Extract: Data is extracted from various sources, such as databases, data warehouses, file systems, and APIs, and may arrive in different formats and structures.
- Load: The data is loaded into the target system, such as a database, data warehouse, or data lake, making it available for downstream uses such as reporting, advanced analytics, data science, and machine learning.
- Transform: The extracted data is then cleaned, enriched, and transformed into a standardized format, ensuring consistency and compatibility with the target system. This step might involve data validation, aggregation, filtering, table joins, conversion of data types or units, creating new tables, etc.
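To make the ordering concrete, here is a minimal sketch of an ELT flow in Python. It uses pandas and an in-memory SQLite database as a stand-in for the target warehouse; the table, column, and source names are hypothetical, not from any particular tool.

```python
import sqlite3

import pandas as pd

# Extract: pull raw data from a source system. A hard-coded frame stands in here;
# in practice this could come from an API, an operational database, or a file drop.
raw_orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer_id": [10, 10, 20],
        "amount": ["19.90", "5.00", "42.50"],  # still text, not yet cleaned
        "created_at": ["2024-01-05", "2024-01-06", "2024-02-01"],
    }
)

# Load: land the data as-is in the target system (SQLite stands in for a warehouse).
warehouse = sqlite3.connect(":memory:")
raw_orders.to_sql("raw_orders", warehouse, index=False)

# Transform: clean and aggregate inside the warehouse with SQL, producing a table
# that reporting and analytics can query directly.
warehouse.execute(
    """
    CREATE TABLE monthly_revenue AS
    SELECT strftime('%Y-%m', created_at) AS month,
           SUM(CAST(amount AS REAL))     AS revenue
    FROM raw_orders
    GROUP BY month
    """
)

print(pd.read_sql("SELECT * FROM monthly_revenue", warehouse))
```

In an ETL flow, the same cleaning and aggregation would happen in the script before the load, and only the finished monthly_revenue table would land in the warehouse.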
Why Did ETL Dominate the Market in the Past?
ETL was the procedure every database and data warehouse admin used over the years. We developed ETL pipelines, we wrote job descriptions seeking people who could build ETL pipelines, and we educated our stakeholders that when the ETL pipelines fail, they won't see new data in their reports. But why did ETL dominate in the past?
Cloud
In the early 2000s, almost all databases and data warehouses were deployed on-premises. You had to buy the servers, disks, and CPUs, and set up the whole infrastructure with your vendor and the network people in your company.
Cost
Closely related to the lack of cloud, cost was the second biggest problem. Disks were getting cheaper every day, but RAM and CPU were quite expensive for data warehouses. Increasing RAM or CPU by 25% might cost almost as much as setting up the server from scratch.
Not many use cases
With little variety in data sources and types, there was limited room for creativity. Only a few teams, such as finance, marketing, and sales, were asking for data to analyze and benefit from. Also, most of the use cases were basic reports.
When there was not much room to increase the resources in the data warehouse, due to on-premises bottlenecks and high costs, and there were not many different requests and use cases, we tried to put all the transformation steps into a single batch before loading.
Why Did ELT Overthrow ETL?
Demand always shapes supply, in every market and under every condition!
As data size and variety increased over the years, different teams and departments became interested in data. Demand was created! Database and data warehouse admins changed their titles to Data Engineer, with better tools in their toolkit. Supply is on the way, right? Not really. Things got worse and worse for Data Engineers. Let's see why.
End of One-Size-Fits-All ETL Jobs
The easiest way to prove that you are a hands-on data person is to talk about your KPI alignment battles between the Marketing, Sales, and Product departments. The definition of a customer is different for each of them. The definitions of retention and churn are different for each of them. When there is no alignment between the teams, how are you going to create a single ETL job that aggregates everything to show the revenue?
The more departments started to use data to make decisions, the more discrepancies were created. In a discussion where everyone is correct but no one wants to align with the others, you end up creating a different solution for each of them. This triggered the need for ELT! Load the data into the data warehouse and let each team transform it the way they need, as sketched below.
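As an illustration of that freedom, here is a hypothetical sketch in which the raw data is loaded once and each team gets its own definition as a separate view in the warehouse. SQLite again stands in for the warehouse, and the table names, view names, and business rules are made up for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Raw customer data loaded once, untransformed (ELT: load first).
warehouse.executescript(
    """
    CREATE TABLE raw_customers (customer_id INTEGER, signed_up_at TEXT, first_order_at TEXT);
    INSERT INTO raw_customers VALUES
        (1, '2024-01-01', '2024-01-03'),
        (2, '2024-01-10', NULL),
        (3, '2024-02-02', '2024-03-15');

    -- Marketing counts anyone who signed up as a customer.
    CREATE VIEW marketing_customers AS
    SELECT customer_id FROM raw_customers;

    -- Finance only counts customers who have placed at least one order.
    CREATE VIEW finance_customers AS
    SELECT customer_id FROM raw_customers WHERE first_order_at IS NOT NULL;
    """
)

# Each team queries its own definition without touching the raw data or each other's logic.
for team in ("marketing", "finance"):
    count = warehouse.execute(f"SELECT COUNT(*) FROM {team}_customers").fetchone()[0]
    print(team, count)
```

Because the raw table stays in the warehouse, adding a third definition later is just another view, not a change to a shared ETL job.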
Increasing Business Logic Complexity
Every single day, business logic complexity increased and increased! As it kept getting more complex, data engineers became pipeline developers who add, remove, or modify transformation logic. Over time, the ETL code became almost impossible to refactor and maintain. We found ourselves in situations where changing a single column definition would break the whole ETL and analytics pipeline. The solution was to stop doing ETL, dump the data into the data warehouse, and run the transformations there, in separate data pipelines.
Cloud
Today, spinning up a data warehouse takes less than 30 minutes, and increasing its resources to whatever you need takes less than 5. Serverless data warehouses scale up and down automatically depending on the load, without human intervention, as long as the boundaries are set. In such a situation, companies became less conservative about resources. Every team started to write its own transformation scripts in the data warehouse and build its own pipelines. The principle of parsimony became a legend!
What Does the Future Look Like?
The future is ELT! I think we have reached a stage where the freedom gained with ELT cannot be taken back from the teams. It is not just a methodology we are talking about; companies around the world have been investing in ELT as well. Stitch, Fivetran, dbt, Airbyte, and many other startups focus mainly on empowering ELT capabilities.
This doesn't mean that ETL is dead! ETL will be around as long as data movement is still a problem. It will just be used for smaller data movements and specific use cases, and less frequently.