Published in Whispering Data

The End of ETL As We Know It

If you’re as sick of this three-letter phrase as I am, you’ll be happy to know there is another way.

Take a Look Around You…

If you work in data in 2021, the acronym ETL is everywhere.

Ask certain people what they do, and their whole response will be “ETL.” On LinkedIn, there are thousands of people with the title ETL Developer. It can be a noun, verb, adjective, and even a preposition. (Yes, a mouse can ETL a house.)

Standing for “Extract, Transform, and Load,” ETL refers to the general process of taking batches of data out of one database or application and loading them into another.

Data teams are the masters of ETL as they often have to stick their grubby fingers into tools and databases of other teams — the software engineers, marketers, and operations folk — to prep a company’s data for deeper, custom analyses.

The good news is that with a bit of foresight, data teams can take most of the ETL onus off their plate. How is this possible?

Replacing ETL with Intentional Data Transfer

The path forward is IDT, or Intentional Data Transfer. You see, the need for ETL arises because no one builds their user database or CMS with downstream analytics in mind.

Instead of making the data team run select * from purchases_table where event_date > now() - interval '1 hour' every hour, you can add logic to the application code that first processes events and then emits them in a pub/sub model.
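To make that concrete, here is a minimal sketch of what the application side might look like. The function names (build_purchase_event, record_purchase) and the injected publish callable are my own placeholders, not part of any library, and the field names simply mirror the sample purchase event shown later in this post.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def build_purchase_event(user_id: int, action_object: str) -> dict:
    """Build the event payload at the moment the purchase is processed."""
    return {
        "event_name": "transaction",
        "user_id": user_id,
        "event_action": "purchase",
        "action_object": action_object,
        "event_timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def record_purchase(user_id: int, action_object: str,
                    publish: Callable[[str], None]) -> dict:
    """Handle a purchase, then emit it as a JSON message via `publish`."""
    # The application's usual work (writing to purchases_table, etc.) would go here.
    event = build_purchase_event(user_id, action_object)
    publish(json.dumps(event))  # hand the message to the topic; subscribers take over
    return event


# Stand-in for a real topic: collect published messages in a list.
sent = []
record_purchase(12345, "gift_card", publish=sent.append)
print(sent[0])
```

On AWS, the publish callable could plausibly wrap boto3's SNS client, something like lambda msg: sns.publish(TopicArn=topic_arn, Message=msg), where topic_arn is whatever topic your teams agree on. Passing the publisher in as an argument also keeps the application code testable without any cloud credentials.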

Example IDT architecture on AWS with a real-time Lambda consumer + durable storage to S3 | Image by author

With no wasted effort, the data team can set up a subscriber process to receive these events and process them in real time (or store them durably in S3 for later use). All it takes is one brave soul on the data team to muster the confidence to ask this of the core engineering squad.
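A subscriber along these lines could be sketched as a Lambda-style handler. I am assuming here that SNS invokes the function directly, in which case each record carries the published message as a string under Records[i]["Sns"]["Message"]. If you put an SQS queue in between, the envelope looks different. The make_handler and extract_events helpers are hypothetical names for illustration.

```python
import json
from typing import Callable, Iterator


def extract_events(sns_event: dict) -> Iterator[dict]:
    """Yield each published payload from an SNS-to-Lambda invocation event."""
    for record in sns_event.get("Records", []):
        yield json.loads(record["Sns"]["Message"])


def make_handler(process: Callable[[dict], None]):
    """Return a Lambda-style handler(event, context) that feeds payloads to `process`."""
    def handler(event, context):
        count = 0
        for payload in extract_events(event):
            process(payload)  # real-time logic, or a durable write, goes here
            count += 1
        return {"processed": count}
    return handler


# Local dry run with a hand-built invocation event.
received = []
handler = make_handler(received.append)
fake_invocation = {
    "Records": [
        {"Sns": {"Message": json.dumps({"event_name": "transaction", "user_id": 12345})}}
    ]
}
print(handler(fake_invocation, None))  # {'processed': 1}
```

For the durable-storage path, process could be a function that writes each payload to a bucket, for example with boto3's S3 put_object. The bucket name and key layout are left to you.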

Ten years ago, I get it. Data teams were beginning to establish their capabilities and needs, and such an ask might be met with a healthy dose of resistance.

A decade later, however, that excuse no longer flies. And if you are on a data team doing solely traditional ETL on internal datasets, it’s time you upped your game.

Beyond the obvious benefit of avoiding costly, inefficient ETL processes, there are several other benefits to IDT.

1. IDT Forces Upfront Agreement on a Data Model Contract

How many times has one team changed a database table's schema, only to later learn the change broke a downstream analytics report? Any analytics veteran will tell you it’s a data tale as old as time!

Frankly, it is difficult to establish the cross-team communication and awareness necessary to avoid these issues when you have ETL scripts running directly against raw database tables.

With IDT, by contrast, each event is published with a set of fields that are agreed upon and documented in advance, and those fields are always present. For example, a purchase by a customer might look like this:

{
  "event_name": "transaction",
  "user_id": 12345,
  "event_action": "purchase",
  "action_object": "gift_card",
  "event_timestamp": "2021-01-02T03:04:05+01:00",
  ...
}

Any changes engineering makes in the purchases table schema will not affect the fields in the IDT publisher's events. And everyone should know that any change to this JSON model contract needs to be communicated first.
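One way to keep everyone honest about the contract is to check events against it in code, on the publisher side, the subscriber side, or both. The sketch below is only an illustration: PURCHASE_CONTRACT and contract_violations are names I made up, the expected types are inferred from the sample event above, and extra fields are allowed since the sample itself is elided.

```python
from datetime import datetime

# Hypothetical contract: field name -> expected Python type.
PURCHASE_CONTRACT = {
    "event_name": str,
    "user_id": int,
    "event_action": str,
    "action_object": str,
    "event_timestamp": str,
}


def contract_violations(event: dict, contract: dict = PURCHASE_CONTRACT) -> list:
    """Return human-readable problems; an empty list means the event conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            actual = type(event[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # The timestamp must also parse as ISO 8601, not just be a string.
    ts = event.get("event_timestamp")
    if isinstance(ts, str):
        try:
            datetime.fromisoformat(ts)
        except ValueError:
            problems.append("event_timestamp: not an ISO 8601 timestamp")
    return problems


good = {
    "event_name": "transaction",
    "user_id": 12345,
    "event_action": "purchase",
    "action_object": "gift_card",
    "event_timestamp": "2021-01-02T03:04:05+01:00",
}
print(contract_violations(good))  # []
print(contract_violations({"event_name": "transaction", "user_id": "12345"}))
```

A check like this could run in the publisher's test suite, so that a change breaking the contract fails a build instead of a downstream report. A schema language such as JSON Schema would be a more formal alternative to this hand-rolled dictionary.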

2. IDT Removes Data Processing Latencies

Most frequently, ETL jobs are run once per day overnight. But I’ve also worked on projects where they’ve run incrementally every 5 minutes. It all depends on the requirements of the data consumers.

What doesn’t change is that there will always be some latency between an event occurring and the data team receiving it. This latency limits what one can do with the data and introduces tricky edge cases into any data application.
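To put rough numbers on that latency, here is a small back-of-the-envelope calculation. It assumes batch runs start on a fixed schedule (first_run, first_run + interval, and so on), that an event landing exactly on a run start misses that run, and it ignores how long the job itself takes. The function name is my own.

```python
from datetime import datetime, timedelta


def batch_pickup_delay(event_time: datetime, interval: timedelta,
                       first_run: datetime) -> timedelta:
    """Time an event waits until the next scheduled batch run starts."""
    if event_time < first_run:
        return first_run - event_time
    runs_started = (event_time - first_run) // interval  # whole intervals elapsed
    next_run = first_run + (runs_started + 1) * interval
    return next_run - event_time


event_time = datetime.fromisoformat("2021-01-02T03:04:05+01:00")
midnight = datetime.fromisoformat("2021-01-02T00:00:00+01:00")

# Nightly job: the 03:04:05 purchase waits for the next midnight run.
print(batch_pickup_delay(event_time, timedelta(days=1), midnight))     # 20:55:55
# Five-minute incremental job: the same purchase waits under a minute.
print(batch_pickup_delay(event_time, timedelta(minutes=5), midnight))  # 0:00:55
```

Shrinking the interval shrinks the wait, but it never reaches zero, and every run still pays the cost of querying the source table.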

With IDT, however, events are published immediately as they happen. Using real-time services like Amazon SNS, SQS, and Lambda, they can be responded to immediately.

IDT does not require you to implement a streaming-based process, but at least you have the option.

Taking The First Steps

Moving from ETL to IDT isn’t a transformation that will happen for all your datasets overnight. Such an all-encompassing change would be overwhelming.

Taking one dataset to start, though, and setting up a pub/sub messaging pattern for it is extremely doable. Make a hackathon out of it if you need to.

My advice is to find a use case that would most benefit from real-time data processing — whether it’s a feed of users’ current locations or cancellation events — then transition it from ETL to the IDT pattern.

And maybe one day, the phrase ETL will never be uttered in your presence again!

Whispering Data is a Medium publication for all the data & productivity secrets you wish you knew years ago!

Paul Singman

DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
