Some thoughts on Data warehousing Architectures + BI
I see analytics in 3 abstract parts: descriptive, predictive, and prescriptive analytics. It’s good to think about these before choosing the right DW. I’m more interested in exploring and comparing traditional data warehousing architectures with the more recent Hadoop-based DW (SQL on Hadoop). Many articles online seem to conclude that comparing a Hadoop-based DW with a traditional DW is like comparing oranges and apples. I also want to see whether schema-on-read (ELT) vs. schema-on-write (ETL) really affects the decision making.
First, the traditional DW. There are three major aspects of the data warehouse architecture:
- The source systems (data from a pre-defined subset of sources thought to be important for the business, which demands significant upfront analysis; it also means adding a new data source later can be a pain)
- The data feeds (the classic ETL: the process by which the data makes its way into the DW. The most important thing to note here is that with this ETL, only CLEAN data is fed into the DW, and it is mostly batch oriented)
- The data warehouse(s) (large organizations can have issues where sections of the company bypass the main data warehouse and feed directly from the main source systems, which is not good)
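To make the ETL/schema-on-write idea concrete, here is a minimal sketch of a batch ETL step in Python, using an in-memory SQLite table as a stand-in for the warehouse. The source rows, field names, and cleansing rules are all invented for illustration; the point is that rows are validated and normalized *before* they are allowed into the DW.

```python
import sqlite3

# Hypothetical raw rows extracted from an operational source system.
source_rows = [
    {"order_id": "1001", "amount": "250.00", "region": "EMEA"},
    {"order_id": "1002", "amount": "bad-data", "region": "APAC"},  # fails cleansing
    {"order_id": "1003", "amount": "99.50", "region": None},       # fails cleansing
]

def transform(row):
    """Schema-on-write: validate and normalize BEFORE the row may enter the DW.
    Returns a clean tuple, or None if the row must be rejected."""
    try:
        amount = float(row["amount"])
    except (TypeError, ValueError):
        return None
    if not row.get("region"):
        return None
    return (int(row["order_id"]), amount, row["region"].upper())

# Load phase: only rows that survived the transform reach the warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, region TEXT)")
clean = [t for t in (transform(r) for r in source_rows) if t is not None]
dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)

print(dw.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # only 1 clean row loaded
```

Notice that two of the three source rows never reach the warehouse at all: that is exactly the "only CLEAN data" property of the traditional feed, and also why adding a new, less well-understood source is slow.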
In the classic architecture, the data marts sometimes read directly from the source data feeds instead of from the DW: because adding data to an Enterprise Data Warehouse, as discussed, is slow and cumbersome, some departments simply bypass it and create their own mini DW. Eventually the ‘Enterprise Data Warehouse’ in the middle completely disappears, leading to a lack of standardization of data and ‘DATA MART FRAGMENTATION’.
The move is more towards real-time DW. For example, operational BI (reporting results as they happen: production performance vs. quota, fraud detection, etc.) needs data as quickly as possible. But DWs were aimed more at descriptive and historic analysis. This issue has been addressed in multiple ways, such as running more frequent ETLs or spinning off real-time data marts. AND this leads to the ‘analytics gap’. Traditional DW/BI are challenged when it comes to:
- Unlimited source data
- Quickly adding new data sets
- Real time analysis
- Semi-structured and unstructured data (+1)
- More predictive and prescriptive analytics
DECREASE THE IDEA-TO-INSIGHT LIFECYCLE
AND this leads to Hadoop. Let’s look at 2 Hadoop DW approaches:
- One approach would be to have a super-sized Hadoop data staging area sitting beside our traditional DW solution. You can then add as many datasets and data sources as you like, with some support for real-time analysis, while your business reports/dashboards from the traditional DW remain the same. (The blue squares in the original diagram represent the multiple data sources.)
- The second approach: shift out the traditional DW and replace it completely with non-relational Hadoop ONLY; we don’t copy anything from Hadoop into a traditional DW. With this you can address all 5 concerns described above, and the architecture is also much simpler. You can use Hive/HQL to run SQL on Hadoop, which in turn uses MapReduce as its core execution engine. This is definitely a viable option with the newer Hadoop 2.x releases (although I’m not sure why Tez or Spark over HDFS has not taken over completely).
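To see why Hive can run SQL over HDFS at all, it helps to picture how a query compiles down to map/shuffle/reduce phases. Here is a toy pure-Python sketch (invented data and field names) of how something like `SELECT region, SUM(amount) FROM sales GROUP BY region` maps onto the three phases; real Hive generates actual MapReduce (or Tez/Spark) jobs, of course.

```python
from collections import defaultdict

# Invented input: (region, amount) records stored as a flat file would be in HDFS.
sales = [("EMEA", 250.0), ("APAC", 99.5), ("EMEA", 50.0)]

# Map phase: emit one (key, value) pair per input record.
mapped = [(region, amount) for region, amount in sales]

# Shuffle phase: group all values by key (Hadoop does this between map and reduce).
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce phase: aggregate each key's values, one reducer call per key.
result = {key: sum(values) for key, values in shuffled.items()}
print(result)  # {'EMEA': 300.0, 'APAC': 99.5}
```

The slow part in Hadoop 1.x was that every such query paid the full job-launch and disk-spill cost of this pipeline, which is exactly what the newer engines bypass.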
ETL (traditional BI/DW): import clean, normalized data that is known and understood beforehand. Schema-on-write.
ELT (big data/Hadoop): bring ALL the data in first, loading via batch or streaming; cleansing and standardization are deferred to later stages. Schema-on-read. We can run our data mining algorithms for predictive and discovery analytics against “dirty” data that hasn’t been cleansed, as long as we realize that we are forming hypotheses rather than producing hard facts.
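A minimal sketch of the schema-on-read side of this contrast, with invented raw records: everything lands untouched, and a schema is only imposed when a query actually reads the data, skipping rows that don’t parse.

```python
# Schema-on-read sketch: all raw records land untouched in the "lake"; nothing
# rejects the dirty row at load time, unlike the ETL/schema-on-write path.
raw_landing_zone = [
    "1001,250.00,EMEA",
    "1002,bad-data,APAC",   # dirty row, kept as-is
    "1003,99.50,",          # missing region, also kept
]

def read_with_schema(lines):
    """Apply an (order_id, amount, region) schema at READ time, per query."""
    for line in lines:
        order_id, amount, region = line.split(",")
        try:
            yield int(order_id), float(amount), region or None
        except ValueError:
            continue  # hypothesis-forming analytics can tolerate skipping dirty rows

# A "query" over the raw data: total amount across everything that parses.
total = sum(amount for _, amount, _ in read_with_schema(raw_landing_zone))
print(total)  # 349.5
```

The trade-off is visible in the code: loading is trivial and nothing is ever lost, but every reader pays the parsing/cleansing cost, and different readers may interpret the same bytes differently.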
As far as I know, Hadoop has not yet replaced a data warehouse outright. Instead, Hadoop has been able to peel off a few workloads from an IDW. Migrating low-value data and workloads to Hadoop is not widespread, but neither is it rare. One workload often offloaded is extract-transform-load (ETL). Technically, Hadoop is not an ETL solution; it’s a middleware infrastructure for parallelism.
Hadoop requires hand-coding of ETL transformations, which is expensive, especially as maintenance costs pile up in the years to come. Simple RDBMS features like referential integrity checks and match-key lookups don’t exist in Hadoop or Hive.
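As an illustration of what “hand-coding” means in practice, here is a small sketch (invented table contents and key names) of a referential integrity check that an RDBMS would enforce declaratively with a foreign key, written out as the kind of map-side lookup you would have had to code yourself in an early Hadoop/Hive pipeline.

```python
# Dimension keys, as would be broadcast to each mapper for a map-side lookup.
dim_region = {"EMEA", "APAC", "AMER"}

# Fact rows arriving from the feed; one has an orphan foreign key.
fact_rows = [
    {"order_id": 1001, "region": "EMEA"},
    {"order_id": 1002, "region": "LATAM"},  # no matching dimension key
]

def ri_violations(facts, dim_keys):
    """Hand-coded referential integrity check: return fact rows whose
    foreign key has no match in the dimension table."""
    return [row for row in facts if row["region"] not in dim_keys]

orphans = ri_violations(fact_rows, dim_region)
print([row["order_id"] for row in orphans])  # [1002]
```

Ten lines here, but multiplied across every foreign key, every feed, and every schema change, this is exactly where the maintenance cost accumulates.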
Evolution of the DW with Hadoop
So newer vendors trying to get SQL and BI working on Hadoop have bypassed MapReduce and built their own proprietary engines over the data in HDFS. This means that many of the performance constraints of the earlier years of relational access to underlying Hadoop data (slow MapReduce) no longer apply, and we’re starting to see some excellent performance from business intelligence tools going against Hadoop data. These vendors include Pivotal, IBM, and Cloudera.
Again, Hadoop 2.0 has significant improvements over Hadoop 1.x and is very much a feasible option for building a reliable DW.
Building the Roadmap
How can you move from today’s traditional warehousing to a big data warehouse?
- Assess your current requirements and environment (skills, costs, etc.)
- Discover and counter misconceptions about big data
- Plan: budget, schedules, contingency plans, migration and deployment. Experiment with ELTs. Add resources (Apache Sqoop, Flume, etc.)
Some Reasons not to choose Hadoop (just yet)
- If analytical needs are overwhelmingly strategic and descriptive, looking at historic data with little data mining or predictive analytics.
- If data usage is fragmented and compartmentalized, not cross-functional.
- If the culture is not adaptive to change.