A semi-fictional history of ETL
In the beginning we lovingly hand-coded our ETL jobs in our favourite IDE.
We scheduled our jobs to run once a day, week or month.
And life was good.
It wasn’t long before managers demanded their reports more frequently, which meant “their” data needed to be updated more frequently.
So we refactored our code and pushed our ETL jobs to the limit, running them as frequently as the hardware and software would allow.
Data volumes continued to grow, and many of us switched from ETL to ELT, reducing the number of moving parts but pushing all the heavy lifting onto the data warehouse. Some of us ended up with expensive proprietary real-time ETL solutions and moved to template-driven GUI tools that generated our ETL code for us. Others bought expensive hardware appliances or in-memory options. We parallelised our jobs where possible, scaled up the hardware to whatever the budget would allow, and again we pushed our ETL (or ELT) to the limit.
Data volumes were relentless but life was better… for now.
Managers were happy… for a while.
Until they demanded that every dashboard and every report must be real-time, and mashed up with every open-source dataset under the sun. Oh, and let’s reduce TCO while we’re at it!
And the very fabric of the data warehouse was again put under pressure, because neither the star schema nor the RDBMS was optimised for highly concurrent writes. They were, by design, optimised for reads, and the epic battle for system resources raged on.
Life was not good. Something wasn’t right. We had built a monolithic, GUI-driven, proprietary-bound beast.
Around that time, open-source and big data tools were becoming known to the data warehousing community and, dare I say, growing in popularity. Developers were happy to regress 15 years by hand-coding their ETL jobs again, trading ACID guarantees for eventual consistency, using languages such as Python or Java. Jokes were made about the technologies named after pet toys and farmyard animals, and data was unthinkably stored in file systems in unheard-of formats. There was a new kid on the block too, called Spark, a distributed cluster-computing framework which supposedly allowed you to run massive ETL workloads in memory across thousands of nodes. A new term even emerged: ETL offloading. However, little mention was ever made of real-time ETL until we started to hear about streaming, lambda and kappa architectures. Then, finally, streaming ETL examples began to emerge, like these from Netflix and Confluent. Confluent was in fact so passionate about this that they were bold (and impartial) enough to state, “ETL is dead: long live streams”.
Baffled by which direction to take, we took one look at Spark Streaming (the most popular and readily available option at the time), but we saw the challenges of learning RDDs and DStreams (let alone Scala) and the expertise required to deploy and manage a Spark cluster. So, despondent and still disillusioned, we went back to the monolithic beasts we had created, and life continued sub-optimally.
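To give a flavour of what put us off, here is a minimal sketch of a classic Spark Streaming (DStream) job in Scala. The socket source on localhost:9999, the comma-delimited key and the ten-second batch interval are purely illustrative assumptions, not anyone’s recommended pipeline; the point is how much plumbing sat between you and “real-time” ETL.

```scala
// A minimal sketch of a DStream-based Spark Streaming job (illustrative only).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingEtlSketch {
  def main(args: Array[String]): Unit = {
    // Micro-batch context: each DStream is really a sequence of RDDs,
    // one per batch interval (10 seconds here).
    val conf = new SparkConf().setAppName("StreamingEtlSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Extract: read raw lines from a hypothetical socket source.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transform: a toy aggregation, counting events per key within each batch.
    val counts = lines
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)

    // Load: print to stdout; a real job would write to a proper sink.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Even this “streaming” job is really micro-batching over RDDs, and you still had to stand up and babysit the cluster it ran on.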