Pedro Fidalgo
Published in techburst
Dec 29, 2017 · 5 min read


ETL Architecture and Techniques Overview

Data Warehouse is almost a synonym for ETL: no Business Intelligence project will see the light at the end of the tunnel without some ETL processes being developed. ETL stands for Extract, Transform and Load, and although it is quite a broad concept, it is indispensable in this kind of project. You should assume data will be extracted from one place (or several), transformed and inserted into its final destination.

For more details on ETL, see the ETL article on Wikipedia.
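In its simplest form, the pattern can be sketched in a few lines of Python. This is a minimal, illustrative skeleton, not a prescription; the file name, the cleaning rule and the target table are all assumptions:

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw records from a source, here a CSV file.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: clean and reshape each record for the target model.
        for row in rows:
            yield (row["id"], row["customer"].strip().upper(), float(row["amount"]))

    def load(rows, connection):
        # Load: insert the transformed records into the destination.
        connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        connection.commit()

    target = sqlite3.connect("dw.db")
    target.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, customer TEXT, amount REAL)")
    load(transform(extract("sales_export.csv")), target)  # hypothetical source file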

Things have changed a lot, and nowadays data sources are so varied that the first part of the equation covers a very broad range of possibilities. It may be necessary to load simple CSV files, pull data from a web service, or connect to cloud-based databases on Azure or Amazon Web Services. And, even worse, imagine merging or cross-referencing data from all those different sources.
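As a hedged sketch of what crossing two such sources might look like in Python; the CSV file and the JSON endpoint are hypothetical stand-ins for whatever sources a real project faces:

    import csv
    import json
    from urllib.request import urlopen

    # Source 1: a plain CSV export, keyed by customer id.
    with open("customers.csv", newline="") as f:
        customers = {row["id"]: row["name"] for row in csv.DictReader(f)}

    # Source 2: a (hypothetical) web service returning JSON orders.
    with urlopen("https://api.example.com/orders") as response:
        orders = json.load(response)

    # Cross the two sources: enrich each order with the customer name.
    merged = [
        {**order, "customer_name": customers.get(order["customer_id"], "UNKNOWN")}
        for order in orders
    ]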

The transformation part puts the information into the format required by the decision support database model, usually a highly denormalized one (since response speed is important).
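Denormalization here means trading storage for read speed: instead of joining several normalized tables at query time, the transformation step pre-joins them into one wide row. A small illustrative sketch, with invented table contents:

    # Normalized source rows, as they might arrive from an ERP system.
    products = {10: {"name": "Widget", "category": "Hardware"}}
    stores = {3: {"city": "Lisbon"}}
    sales = [{"product_id": 10, "store_id": 3, "amount": 99.9}]

    # One wide, denormalized fact row per sale: no joins needed at query time.
    fact_sales = [
        {
            "amount": s["amount"],
            "product_name": products[s["product_id"]]["name"],
            "product_category": products[s["product_id"]]["category"],
            "store_city": stores[s["store_id"]]["city"],
        }
        for s in sales
    ]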

All this makes developing processes that can extract data from such different data sources much more challenging, and requires Data Warehouse developers to extend their know-how. Nowadays Data Warehouse projects are no longer limited to the traditional database scope, receiving data only from plain relational ERP systems with fully structured data. This trend is easy to observe in the sheer number of data sources and data transformation objects present in tools like Microsoft SQL Server Integration Services (SSIS), nowadays developed inside SQL Server Data Tools (SSDT). SSIS even gives developers access to the entire .NET Framework through script components, which lets them produce really complex processes. Other ETL tools, like Talend, build much of their concept on the Java framework, and even recent IBM DataStage versions allow developers to use Java components inside data flows. Developers should indeed consider spending some time acquiring pure software development and algorithm design skills.

That is why the trend nowadays is Data Science and Data Scientists, as more know-how is required, and languages like R are today part of the SQL Server installation package (although, in my opinion, massive data processing is faster in plain SQL). And then you also have Big Data …

On the other hand, Business Analysts usually work at a higher level, mostly discussing and analyzing business rule requirements and making sure the implementations satisfy the defined specifications. Their main aim is not making sure the sales data is properly analysed or that the CFO's financial reports run fast. Yet even when Business Analysts work mostly on the software development process, possible ETL requirements should be taken into consideration from the very first planning step as well.

The usual business process analysis (the one that leads to a proposed solution) is in most cases directed at day-to-day business management. The Business Analyst should be able to see and understand the full information cycle, and should understand that the specification she defines (which may, for instance, be used to develop a set of web services) will condition the data sources available to future decision support systems.

How can we minimize the risk of developing highly complex business rules that turn ETL development into a nightmare? I once worked on a project that was supposed to simplify data migration from an Oracle EBS platform into a proper SQL Server decision support system. In the end it would have allowed developers to define data flows easily and build the Data Warehouse dimensional models through some sort of wizard. Let me tell you, it was suspended. Why? Complexity.

So, when dealing with small businesses, regular ETL processes may be replaced by direct data migrations or database mirroring techniques. Sometimes the underlying data model supporting the business processes is similar enough to a proper Data Warehouse dimensional model that producing online reports becomes possible, mostly thanks to the much greater processing power machines have nowadays.

I think that, for small business process systems, it is possible to use a non-invasive "triggering" or independently scheduled process approach which, executed alongside the main system, extracts the relevant data from it (such processes should connect to the system, not depend on it). That said, since machines have become so powerful, copying several thousand records without complex transformations can take just a few seconds (IBM DataStage, for instance, can run ETL processes that achieve over 150~200K rows per second).
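A minimal sketch of such a non-invasive, independently scheduled pull in Python: the process keeps its own high-water mark and opens the operational database read-only, so the main system never depends on it. The database path, table and column names are assumptions:

    import sqlite3
    import time

    WATERMARK = "1970-01-01 00:00:00"  # would be persisted between runs in practice

    def pull_new_rows(source_path, since):
        # Read-only connection to the operational system: we connect to it,
        # we do not make it depend on us.
        source = sqlite3.connect(f"file:{source_path}?mode=ro", uri=True)
        rows = source.execute(
            "SELECT id, updated_at, amount FROM sales WHERE updated_at > ?",
            (since,),
        ).fetchall()
        source.close()
        return rows

    while True:
        rows = pull_new_rows("erp.db", WATERMARK)
        if rows:
            WATERMARK = max(r[1] for r in rows)  # advance the high-water mark
            # ... hand the rows over to the decision support database here ...
        time.sleep(300)  # independently scheduled: poll every five minutes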

When data migration or database replication processes are simple enough, they tend to replace pure ETL processes, so Business Analysts can focus mostly on daily business management processes and stop worrying about future reporting and data analysis needs, since those can be served almost in real time.

Either way, this is only feasible for small and not highly complex business rules, as this kind of approach has limitations mostly related to the size of the data sets being analysed. There is always a middle ground: reports or analyses based on small data sets can be executed in real time, while more complex ones can be handled by proper ETL processes. Complex reports that analyse long data series may require off-peak scheduled processes to run and transform the data properly.

Pure data migration processes (call them ELT or whatever you prefer)

  • Faster to develop; with powerful hardware it is possible to replicate terabytes of data within a few minutes;
  • Easy to maintain;
  • They run faster, since they perform minimal data transformations;
  • They are suited to simpler data analysis;
  • The overhead is shifted to reporting time, as data transformation is executed when the report refreshes (see the sketch after this list);
  • Some reports can become more complex, due to the lack of a proper dimensional model behind them;
  • Some reports may run slower;
  • Traditional exploration tools may not adapt easily to data sets that are not dimensionally modeled.
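To make that trade-off concrete, here is a minimal ELT sketch in Python with sqlite3 (a stand-in for a real reporting database; table and view names are illustrative): raw rows are landed untouched, and the transformation lives in a view that is re-evaluated at every report refresh.

    import sqlite3

    db = sqlite3.connect(":memory:")  # stand-in for the reporting database

    # "Load" comes before "Transform": land the raw rows exactly as extracted.
    db.execute("CREATE TABLE raw_sales (id INTEGER, sold_at TEXT, amount REAL, vat REAL)")
    db.executemany(
        "INSERT INTO raw_sales VALUES (?, ?, ?, ?)",
        [(1, "2017-12-01", 100.0, 23.0), (2, "2017-12-05", 50.0, 11.5)],
    )

    # The transformation is a view, evaluated at every report refresh:
    # exactly the overhead the list above describes.
    db.execute("""
        CREATE VIEW monthly_sales AS
        SELECT substr(sold_at, 1, 7) AS month, SUM(amount + vat) AS gross
        FROM raw_sales GROUP BY month
    """)

    print(db.execute("SELECT * FROM monthly_sales").fetchall())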

The traditional ETL approach

  • Allows the development of proper data quality and transformation rules (a sketch follows this list)
  • May require specialized tools to develop those processes
  • ETL chains can take some time to run, so they usually cannot run while the system is online
  • Requires good business rule and data quality definitions
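By contrast, here is a minimal sketch of the traditional ETL pattern with an explicit data quality rule, again in Python with illustrative names: bad rows are filtered into a reject pile before loading, so reports read clean, pre-transformed data.

    def transform(raw_rows, rejects):
        # Data quality rule: amounts must be present and non-negative;
        # anything else goes to a reject pile for later inspection.
        for row in raw_rows:
            if row.get("amount") is None or row["amount"] < 0:
                rejects.append(row)
                continue
            yield (row["id"], row["sold_at"][:7], round(row["amount"], 2))

    raw = [
        {"id": 1, "sold_at": "2017-12-01", "amount": 100.0},
        {"id": 2, "sold_at": "2017-12-02", "amount": -5.0},  # fails the rule
    ]
    rejects = []
    clean = list(transform(raw, rejects))
    print(clean)    # rows ready to load into the dimensional model
    print(rejects)  # rows that violated the data quality rules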

So, in conclusion, and as usual, each project has its own nuances. Either way, it is always possible to mix approaches: use plain ETL where it makes sense and simpler online data migration techniques in other parts of the project. Even simple VBA macros can work miracles by pulling data from operational databases and automating dashboard creation, with no need for any kind of Data Warehouse environment.

Before spending eight months developing a Data Warehouse solution, stop and think: will I really need it?
