ETL — Code, Build or Both?
Data has been growing at an exponential rate. Interestingly, it is not just the volume of data but also the kinds of data that have been increasing. Who would have imagined, just a few years ago, the many dimensions of data that IoT devices or social channels like Twitter & Facebook would come to provide?
There was a time when organizations across the world were enamored with ever-increasing data volumes, which strengthened their faith in the data patterns that emerged and made them very confident about the business decisions the data was supporting. Though this holds true even today, businesses have moved on from merely ‘USING’ data to ‘QUESTIONING’ data. And it is this shift that has fueled the boom of the Data Visualization industry. Tableau is a prime example of a technology that has enabled businesses to ask questions of the data at their disposal.
If you read papers and articles on ETL from a few years back (circa 1995–2000), you would quickly realize that ETL was mostly about consolidating data from heterogeneous sources and pushing it into a dimensional model. The dimensional model itself was a strictly conventional data warehouse that mostly STARRED out the data into facts and dimensions. The model was built to support huge data volumes, since data professionals had begun to sense the data explosion by then. It was all about ensuring superior performance in slicing and dicing the data. To that end, ETL tools played a major role in making developers' lives easier by providing a framework and a set of ready-made stages that performed oft-used transformations. This approach helped streamline and visualize the data flow better while also making data movement monitoring and future process modifications a breeze. Several posts have been written glorifying commercial ETL tools, and while I agree with much of what they say, I struggle to see their relevance in the current era.
In my opinion, ETL tools have 3 major problems. Let’s just call them the 3 C’s:
- Cost
- Capability (or performance)
- Complexity
Businesses invest heavily in data management and have also accepted that analytical operations can never be as fast as transactional operations. It is safe to say that the world has largely come to terms with the first two, Cost and Capability.
The third C, ‘Complexity’, has come as more of a revelation to data experts than anything else. Transformational complexity was thought to be a base that all ETL tools had covered in good measure. But few expected the horizontal explosion of data, which demands dynamic ETL processes that identify new metadata at run time and allow the business to interact with and manage the data warehouse, thereby allowing complex questions to be asked of the data.
Vertical explosion of data is an increase in data volume; horizontal explosion is an increase in data attributes.
While working on a supply chain engagement for a leading life sciences company, I saw the value that something as simple as inventory snapshots could provide. The process the organization had in place to manage its supply chain was static. It expected a certain metadata from the different sources feeding the DWH. Static transformations were in place to modify the data to suit the language and grain demanded by the business, and that was it. Whenever the business wanted anything new from a source, the whole infrastructure team had to step in to modify the ETL process. The supply chain was receiving data from as many as 100 different distributors and thousands of retail customers, each of whom spoke a different data language. When customer A says they have 1 unit of a product, it could mean 1 BOX of the product; customer B could mean a single item; still another could be referring to 1 lb of the product. This language could change over time as well. Further still, some customers started sending more detail about their inventory as their internal data management matured. How was the ETL process expected to handle this level of complexity?!
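To make the "data language" problem concrete, here is a minimal sketch of how a metadata-driven unit-of-measure normalization might look. The customer names, conversion factors, and the "each" base unit are all illustrative assumptions, not details from the actual engagement; the point is that the mapping lives in data (a reference table the business maintains), so a change in a customer's language is a data update rather than a code change.

```python
# Conversion metadata: (customer, reported_unit) -> multiplier into base units.
# In a real warehouse this would live in a reference table, not a literal.
UOM_MAP = {
    ("customer_a", "unit"): 12,   # customer A's "1 unit" is a box of 12 eaches
    ("customer_b", "unit"): 1,    # customer B's "1 unit" is a single item
    ("customer_c", "unit"): 16,   # customer C's "1 unit" is 1 lb (16 oz eaches)
}

def normalize(record):
    """Convert a raw inventory record to base units using the metadata map."""
    factor = UOM_MAP.get((record["customer"], record["uom"]))
    if factor is None:
        raise ValueError(f"No UoM mapping for {record['customer']}/{record['uom']}")
    return {**record, "qty_base": record["qty"] * factor, "uom": "each"}

raw = [
    {"customer": "customer_a", "uom": "unit", "qty": 1},
    {"customer": "customer_b", "uom": "unit", "qty": 1},
]
clean = [normalize(r) for r in raw]
```

When customer A's language changes, only the `UOM_MAP` row changes; the transformation logic itself is untouched.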
This lack of flexibility is one of the major issues with commercial ETL tools, which either do not allow or only partially allow dynamic data flows (IBM DataStage, for instance, offers a feature called RCP, or Runtime Column Propagation).
This is why traditional ETL has to be replaced with a more flexible approach like T-ETL, ETL-T, T-ETL-T, or ELT. The catch here is that the additional layer of transformation I am referring to is not simple data transformation but METADATA-driven data transformation. And this can be achieved only by handwritten code. This code could be PL/SQL procedures, Python/UNIX scripts, Java transformations leveraging source APIs, etc. An ETL tool can only provide a placeholder for this layer, and that is where it stops.
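As a rough illustration of what "metadata-driven" means here, the sketch below discovers a record's attributes at run time and applies whatever rules are registered for them, passing unknown attributes through untouched (much like RCP). The rule names and fields are hypothetical, not any tool's actual API; the idea is simply that a source adding a new attribute requires registering one rule, not rebuilding the flow.

```python
# Transformation rules keyed by attribute name, maintained as metadata.
RULES = {
    "qty":        lambda v: int(v),
    "price":      lambda v: round(float(v), 2),
    "lot_expiry": lambda v: v[:10],  # keep only the ISO date portion
}

def transform(record):
    """Apply whichever rules exist for the attributes actually present."""
    out = {}
    for field, value in record.items():        # metadata discovered at run time
        rule = RULES.get(field, lambda v: v)   # default: pass through unchanged
        out[field] = rule(value)
    return out

row = {"sku": "A-100", "qty": "3", "price": "19.999", "lot_expiry": "2024-06-30T00:00"}
print(transform(row))
```

A hard-wired ETL stage would reject `lot_expiry` the day a customer first sent it; here it is one new entry in `RULES`.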
Reuse where you can; Code where you must; Never compromise.
Data professionals in the post-2000 era have had it so easy that most are frightened by the thought of custom coding and scripting. They eventually grow into architects who strongly advocate against handwritten code because it supposedly leads to poor maintainability and carries the risk of ‘bug-ridden code’. The sad part is that, thanks to commercial ETL technologies, ETL processes are only being BUILT and not CODED anymore. Make no mistake: I’m not blaming ETL tools at all! All I’m saying is that they are limited in their capability, and to address the increasing complexity of analytics, custom handwritten code is a MUST. I would go so far as to compliment ETL vendors like IBM & Informatica, who have provided a plethora of integration points for XML, Java, UNIX, and PL/SQL code. Young data management professionals have to start opening up to the idea of true coding.