The Death of Data Warehouse?

Dimas Hadiyan Adiluhung
4 min readJun 24, 2020

--

With the price of compute engine is getting cheaper, massive parallel processing advertised everywhere and “big data” term becomes more more common nowadays, i heard many people say this is the end of data warehouse. Is it true? Let’s talk what is “data warehouse” first, what is data warehouse? Quote from oracle website:

A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics”

Accuracy is very important for business intelligence and data analysis in order to make impactful decision and action, that’s why using high quality data as input is a must. Where we can find that high quality data? yep, in data warehouse (or if you don’t like “data warehouse” term you can just simply change the term with data vault, data lake, data.. fine whatever). So when we talk bout data warehousing, we talk about these activity:

  1. Gathering data from multiple source and standardize the schema to avoid data multi-interpretation and for the sake of development and maintenance also
  2. Standardize metrics definition and formula, grabbing data from multiple source making high probability of redundancy of data, some information can be retrieved with multiple way and resulting different outcome, which one is true? can be only 1 is correct or maybe every version is correct! only matter of point of view. The point is need to standardize the things to avoid misunderstanding and reinventing the wheel, which quite often happen when you work on so many parties and sometimes they need same thing, end up they working on same thing with their own way and resulting different outcome.

For me data warehousing is about “governance”, not related with specific technology or tool. So back to the question “is data warehouse dead?” for me absolutely NO, even more important in this big data era which information you gathered is huge and varied. But if the question “is traditional RDBMS data warehouse dead?” the answer is vary can be yes or no, it will be different for answer for every people and organization like my boss said:

Stay humble, you came here with a lot of experiences but each organization always has its unique problem which sometimes requires unique solution”

For some organization RDBMS is the best solution but maybe not the case for other organization, maybe other organization need DBMS (without R) with MPP capability or some organization doesn’t want to adopt DBMS at all and fully adopt MPP by decoupling storage and compute engine and manage everything by themself, some organization picks serverless architecture which actually MPP but infra arrangement is managed by vendor. Every organization has freedom to pick their tech stack, depend on situation they face, technology maybe different but the ultimate goal to govern the data will never change.

How about well known method in building data warehouse like dimensional modeling, building star or snow flake schema, always denormalized, ETL/ELT, Slowly Changing Dimension etc? Are they dead in non-RDBMS environment? can be yes or no, method can be different but goal is still same. Let’s take example on SCD which the goal is to keep historical data changes in data warehouse and matching historical data with fact record easily by implementing surrogate key, start and end date of the record and sometimes record version columns on dimension side, it’s suitable RDBMS which has DML feature, also usually built-in feature in ETL tool like PDI. how about SCD in Spark, Beam or in BigQuery? is SCD dead in those framework/technology? it’s hard to answer that, in BigQuery my team try to implement what is similar with SCD but without start and date columns which SCD type 2 signature, we use different method but the end goal is same! to keep historical changes in data warehouse.

Star schema illustration
common SCD type 2 in RDBMS illustration

Back to question “is data warehouse dead”? the answer is depend on data warehouse definition. If data warehouse is RDBMS system to support data analysis activity and BI then the answer can be YES or NO, depend on organization situation and necessity. If data warehouse is method to govern the data, then the answer is NO.

Technology change overtime, but the ultimate goal to “govern” the data in order to produce meaningful and impactful insight will never change

Reference
[1] What is a data warehouse? https://www.oracle.com/database/what-is-a-data-warehouse/#link3

--

--