Sponsored by: Corsair’s TradeCraft (launching soon)

The Practical Explainer — ETL

Simplifying the landscape of analytics — one concept at a time

Decision-First AI
Published in
4 min readFeb 13, 2019

--

Do a quick Google search on “Data Collection” and you will find it is “more commonly known as ETL”. ETL is an acronym for Extract, Transform, and Load. The definition of common here is clearly subjective. It is fair enough to note that most IT shops use the term ETL. It is equally fair to note that term is not nearly as commonly understood as the Wikipedias of the world might insinuate.

It quickly gets more complicated. Most things in analytics do. It seems that quite a few folks want to change ETL to ELT. In their minds, order counts. This article will not debate that argument. If you want more — go here.

So lets start with:

ETL is data collection. Data Collection is ETL or ELT.

Then let’s break it down:

ETL was coined in the late 80’s, around the same time people started talking about data warehouses. I love history lessons, but this isn’t one. If you want more — go here. Just know it is a fairly recent term for a process that has existed since… well, Aristotle.

EExtraction. To be honest, not a great start. To be fair, data in the digital age isn’t really collected either. There are many common terms here — extracted, collected, scraped, mined, etc. It is a bit like a dental visit. Only these terms seem to imply that the data is taken from the source… that is almost never true. It is really just copied. It is a ctrl-c function — sometimes quite literally!

LLoad. I am not endorsing acronyms. It just easier to go here next. Once you copy the data, you want to put it somewhere where people or platforms can make use of it. Sometimes that never happens, but extracted data goes somewhere and that is the load portion.

Most often it is loaded into a data warehouse, a data lake, a data environment, or a data set. The nuances of these names are lost on even the most knowledgeable people. We will come back to that in a later article.

TTransform. Or is it transport? Oracle, for one, includes both. Between the E & L, the data is moved. Depending on the process, it may remain otherwise unchanged, be changed in transit, or be changed after arrival. Transform = changed. That is the real key.

How is the data changed? That is a great analytic question. The short answer is it may be filtered, formatted, defaulted, or any other number of manipulations done to either clean the data or make it easier to store. As an analyst — this part of the process should make the hair on the back of you neck stand up. It is where things often get broken… with the best of intentions.

Want the longer answer? Read more — here.

In practice:

If you highlight a table on a website, hit ctrl+c, and paste it into Excel — you just performed a very basic ETL function. Some would call it screen scraping and say it was only the E. But Excel can function as a poor man’s warehouse — so you cover the L as well. If you subsequently adjust the cell formats, you just did it all… though technically ELT.

https://www.talend.com/resources/etl-tools/

The tag ETL is most often connected to the IT tools that help businesses do this work. Gartner goes ahead and adds a new monicker in Data Integration Tools. Data Integration is just Data Collection with benefits. As soon as these tools started doing more than just the E, the L, and the T — someone thought it best to coin a new term. It does help avoid that ELT / ETL debate!

While Informatica, IBM, and others have proprietary software tailored for this sort of work, many other software products in the data & analytics space have ETL functionality built in. This makes it rare for most companies to have exclusively one technology, though using multiple technologies on the same data is rare as well. So you may work for an “Informatica Shop”, that doesn’t guarantee that Informatica is used on everything.

Some final associations:

Data is often extracted using scripts written in languages like Perl or Python. Often automation is used. API calls are made. Error handling and special characters need to be considered. It is more difficult to ETL unstructured data. These are all concepts we can feature in later articles.

So just remember:

ETL is data collection. It is about copying data from the source and transferring it to a usable location.

Thanks for reading. More Practical Explainer coming soon…

--

--

Decision-First AI

FKA Corsair's Publishing - Articles that engage, educate, and entertain through analogies, analytics, and … occasionally, pirates!