ETL Engineering Trends in 2024

Thibaut Gourdel
3 min read · Jun 24, 2024

Hi, I’m Thibaut, and I write about data engineering and ETL. If you enjoy my content, consider following me on Medium.

ETL has been around for decades, and many successive innovation cycles have shaped the world of data engineering. ETL engineering, as a subset of the broader data engineering field, is experiencing renewed interest due in part to the rise of AI, where ETL processes have become paramount. Let’s take a look at a few key trends in the ETL space in 2024.

🐍 Python Hegemony

I’ve touched on this subject in my previous articles. ETL has long been dominated by Java-based tools and workloads. Python regained popularity thanks to its use in academic and research settings, boosted by successive AI/ML waves (machine learning, deep learning, generative AI, etc.). Because most AI/ML libraries are Python-first, the data world had to adapt. A glance at job descriptions for data scientist, data engineer, and even data analyst roles shows that Python is now one of the most requested skills in the data world, alongside SQL.

🪶 Small but Mighty Data

After the 2010s wave of Big Data and its mixed success came the realization that not everything is a big data problem. Jordan Tigani explained this well in his popular article “Big Data is Dead.” At the same time, the rise of powerful tools in the Python ecosystem, first Pandas and then Polars and DuckDB, pushed the boundaries of what can be done on a single machine. These tools can take you very far (for a fraction of the cost) before you need to distribute workloads across multiple machines. Trends aside, using the right tool for the job (and for you) should always be your first concern.
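To make the single-machine idea concrete, here is a minimal sketch of an in-process extract, transform, and load step. It uses only Python's standard library, with `sqlite3` standing in for an embedded analytical engine; DuckDB's Python API follows a similar connect-and-execute pattern, but this is not DuckDB code. The sample data, table name, and `run_pipeline` function are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would be a file or an API response.
RAW_CSV = """order_id,country,amount
1,FR,120.50
2,FR,79.50
3,US,200.00
4,US,
5,DE,50.00
"""


def run_pipeline(raw_csv: str) -> list[tuple[str, float]]:
    # Extract: parse the CSV text into dictionaries keyed by column name.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    # Transform: drop rows with a missing amount and cast types.
    clean = [
        (int(r["order_id"]), r["country"], float(r["amount"]))
        for r in rows
        if r["amount"]
    ]

    # Load: write into an in-memory, in-process database, then aggregate in SQL.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
    result = con.execute(
        "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    ).fetchall()
    con.close()
    return result


revenue_by_country = run_pipeline(RAW_CSV)
print(revenue_by_country)
```

Everything runs in one process with no cluster or server to manage, which is the point of the "small data" argument: the same shape of pipeline scales a long way on one machine before distribution is needed.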

📄 Unstructured Data

With the rise of Generative AI, it has become possible to process vast amounts of unstructured data that companies have mostly left untapped. For instance, Retrieval-Augmented Generation (RAG) pipelines enable companies to index enterprise documents and feed them to LLMs so they can answer specific questions more precisely. LLMs can also extract relevant information from those documents and provide it in a structured format for analytical usage. Overall, this opens many more opportunities for companies to leverage their proprietary data for various use cases, both internal and external.
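The retrieval half of a RAG pipeline can be sketched in a few lines. This is a toy illustration, not a production design: it ranks documents with a bag-of-words cosine similarity, whereas real pipelines typically use an embedding model and a vector store, and it stops at building the prompt rather than calling an LLM. The documents, `retrieve`, and `build_prompt` are all hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter

# Hypothetical enterprise documents, keyed by a made-up identifier.
DOCS = {
    "hr_policy": "Employees accrue 25 days of paid vacation per year.",
    "it_policy": "Laptops must be encrypted and passwords rotated every 90 days.",
    "expenses": "Travel expenses are reimbursed within 30 days of submission.",
}


def tokenize(text: str) -> Counter:
    # Lowercase word counts; a stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, docs: dict[str, str], k: int = 1) -> list[str]:
    # Rank document ids by similarity to the question and keep the top k.
    q = tokenize(question)
    ranked = sorted(docs, key=lambda name: cosine(q, tokenize(docs[name])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: dict[str, str], k: int = 1) -> str:
    # Assemble the retrieved text into the context that would be sent to an LLM.
    context = "\n".join(docs[name] for name in retrieve(question, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "How many vacation days do employees get?"
top = retrieve(question, DOCS)
prompt = build_prompt(question, DOCS)
print(top)
print(prompt)
```

The resulting prompt would then be passed to whichever LLM you use; grounding the model in retrieved text is what lets it answer questions about proprietary documents it was never trained on.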

💡 ETL Development Powered by GenAI

Generative AI affects ETL in many ways, one of which is that it significantly lowers the barrier to developing pipelines that extract and transform data. LLMs are particularly good at writing code, especially Python code, due to the large amount of Python in the open-source corpus they were trained on. For example, they are effective at generating Selenium code (Selenium is a browser automation framework widely used for scraping websites) to extract data and structure it correctly. They can also write SQL queries, which helps data analysts and is a game-changer for data engineers who need to write complex SQL queries faster.

🏠 Lakehouses and the Table Format War

As demonstrated by Databricks’s acquisition of Tabular and Snowflake’s new release of Polaris, table formats are all the rage. The Lakehouse architecture is gaining adoption, and data vendors are closely monitoring the situation and adapting promptly. Catalogs are getting more mature, and integration into common libraries and tools is largely underway. We should witness maturation, adoption, and perhaps consolidation in table formats in the next few years.

Some of these trends have been underway for several years already and will continue to mature in the coming years. Others, like GenAI for data and ETL engineering, are still in their infancy and are evolving quickly. In any case, adaptability and continuous learning are the only constants in this space!

Amphi ETL is a low-code and Python-based ETL tool for both structured and unstructured data. It allows you to develop data pipelines graphically and generate Python code that you own and can deploy anywhere. Amphi is free and open. Give it a try!

GitHub: https://github.com/amphi-ai/amphi-etl

Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code, Python-based ETL tool for data manipulation and transformation.