Data Warehouse with a Lake view

yannick misteli
7 min read · Apr 25, 2020


Having had various discussions around data warehousing, data lakes and big data technologies, I felt the urge to share some personal views.

https://www.zuerich.com/en/visit/around-lake-zurich

Let me start with the non-technological aspects: people and processes. To succeed with any data strategy, people and processes should be at the core of every concept. Business strategies are typically determined by the C-suite, and a company's culture hence reflects the framework set by its top leaders combined with the knowledge and experience of its employees. As a result, most companies will not benefit much from simply copying the data strategies of the big tech companies.

data governance and the mindset of a data-driven business need to be established as part of the culture first

A much better approach is to first assess the internal capabilities and understand the cultural aspects. As someone once said, culture eats strategy for breakfast. Therefore, data governance (and data ownership) and the mindset of a data-driven business need to be established as part of the culture. At the same time, you need to hire the right people who can drive that cultural change and support decision making with technical expertise, especially in big data technologies. That is, they should know the capabilities and the limitations of cloud technologies, providers and vendors. The big data bubble is still a reality, and many companies jump on the bandwagon without knowing what they are doing (with consulting companies telling them what they want to hear, which is mostly buzzwords). Therefore, let's look at some of the key aspects of the ongoing heated discussions about data lakes and data warehousing.

For a broader review of the warehousing and data lake topics, please see here, here, here, here and here. In this article we will focus on a more limited scope and try to define some terminology first.

Let's start with data lakes and relational databases: we focus on "cloud" data lakes, which are built on top of object stores such as S3 (distributed storage). Usually they are paired with general-purpose computational engines such as Spark (distributed compute). As you can see, this is a very basic, bare-bones concept of distributed data processing (storage and compute).
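To make the storage-and-compute split concrete, here is a minimal PySpark sketch of schema-on-read against an object store; the bucket and paths are purely illustrative, and the cluster is assumed to be configured for S3 access.

```python
# Minimal sketch of "distributed storage + distributed compute":
# Spark reads raw JSON straight from an S3 bucket and infers the schema
# at read time (schema-on-read). Bucket and path are made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lake-schema-on-read")
    .getOrCreate()
)

# No physical data model is required up front: the schema is derived
# from whatever files happen to sit in the landing zone.
orders = spark.read.json("s3a://example-landing-zone/orders/2020/04/*.json")

orders.printSchema()                     # schema inferred from the data itself
orders.createOrReplaceTempView("orders_raw")
spark.sql("SELECT count(*) FROM orders_raw").show()
```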

http://saurabhsinhainblogs.blogspot.com/2014/08/what-is-mpp-massive-parallel-processor.html

On the other hand, "modern" distributed (MPP) databases (such as Redshift, BigQuery, Azure SQL Data Warehouse, Snowflake or Teradata) also offer distributed storage and distributed compute. Their computational capabilities are less flexible than Spark's but highly optimized, and they bring inherent features such as ACID transactions and advanced security concepts (e.g. row-level security), amongst others. They do have some disadvantages, though, namely that they cannot easily handle unstructured data.

the major differences between a file-based data lake and a relational database-based data warehouse implementation usually come down to the schema-on-read approach, singleton transactions, and referential integrity

In short, the major differences between a file-based data lake and a relational database-based data warehouse implementation usually come down to the schema-on-read approach, singleton transactions, and referential integrity. File-based data lakes are easier and more flexible in terms of ingesting and landing the data; they do not require a physical data model, data casting or manipulation to land the incoming data sets. They provide low-cost storage, flexibility, and agility. However, that same ingestion flexibility, combined with the lack of proper integrity checks and validation of what has been loaded, means they tend to become poorly maintained data swamps, and they may lose original granularity or historical value if you over-clean or over-transform the data.

However, circling back to singleton transactions and referential integrity: why are ACID transactions important? If you only consider business processes that consume data, then you are fine, but this is seldom the case. Usually you will need some write-back mechanism (from an end user or an analytical process), or updates of records, that either ends up directly in your analytical system (data lake or database) or arrives through change data capture (CDC). Now there are processes that write to, and simultaneously read from, the very same system, and that has to be managed. Again, most modern databases have implicit mechanisms to deal with this situation. For the data lake, an extra layer is needed to compensate for some of these shortcomings: Delta Lake (the "Data Lakehouse") is one such concept, Iceberg is another that was recently open-sourced, and there are more (Hive, Hudi etc.; find a comparison here and here). It is apparent that for a data lake to compete with a database, it needs to be able to guarantee a consistent view of the data, among other things (such as easily updating single records).
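As a rough illustration of why that extra layer matters, the sketch below uses Delta Lake's merge API to upsert CDC records on top of an object store. The table path, source path and column names are invented, and the Spark session is assumed to be configured with the Delta Lake extensions.

```python
# With plain Parquet files on S3 there is no safe way to update a single
# record while readers are active; Delta Lake adds an ACID transaction log
# on top of the object store so upserts become atomic.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("cdc-upsert").getOrCreate()

target = DeltaTable.forPath(spark, "s3a://example-lake/curated/customers")
updates = spark.read.json("s3a://example-landing-zone/cdc/customers/*.json")

# Upsert change-data-capture records: concurrent readers keep seeing a
# consistent snapshot until the transaction commits.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```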

That is also the reason why the big tech companies are leaders. It is not because they collected a lot of data of various forms. They are leaders because they can integrate this data and provide insights to the business that were not possible before

Now, a consistent view of the data alone is not enough to make it useful. Only integrated and connected data brings added value from an analytical perspective. That is also the reason why the big tech companies are leaders. It is not because they have collected a lot of data in various forms; they are leaders because they can integrate this data and provide insights to the business that were not possible before. In order to integrate data, you need a data model. Historically, databases were closely linked with that aspect, given the relational nature of data, hence the term relational databases. But data modeling has nothing to do with technology per se. The widely used techniques in that area are 3NF and dimensional modeling, and recently data vault modeling has gained some momentum (here).

building an (enterprise) data model is extremely tedious and difficult but the only way to integrate data in a connected way

How these modeling techniques are carried out has also changed drastically over the last few years. Where in the past ETL tools had to be used, MPP databases (or, in the lake world, Spark) are now powerful enough to carry out these transformations directly, which brought the term ELT to life. As just mentioned, it is indeed possible to build a data model in a data lake architecture; as a matter of fact, if we go with a pure data lake strategy, we actually must. Now, the process of building an (enterprise) data model is extremely tedious and difficult, but, as mentioned before, it is the only way to integrate data in a connected way. Just dumping files in S3 and hoping some data model will magically emerge is an illusion, although it might very well be promised by some vendors out there.
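To give a flavour of what ELT on a lake can look like, here is a small, hypothetical sketch that derives a data-vault-style hub table from raw files already landed in the lake; all paths, table and column names are invented for illustration.

```python
# ELT sketch: the raw data is already loaded ("EL"), and the transform ("T")
# runs directly on the distributed engine instead of a separate ETL tool.
# The target here is a tiny data-vault-style hub keyed by a hashed business key.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-hub-customer").getOrCreate()

raw = spark.read.parquet("s3a://example-lake/psa/crm/customers/")

hub_customer = (
    raw.select("customer_number")
    .dropDuplicates(["customer_number"])
    .withColumn("hub_customer_key",
                F.sha2(F.col("customer_number").cast("string"), 256))
    .withColumn("load_ts", F.current_timestamp())
    .withColumn("record_source", F.lit("crm"))
)

# Append the hub rows to a Delta table in the modeled ("vault") zone.
hub_customer.write.format("delta").mode("append") \
    .save("s3a://example-lake/vault/hub_customer")
```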

Just dumping files in S3 and hoping some data model will magically emerge is just an illusion

https://www.wherescape.com/solutions/project-types/data-vault-automation/

Hence, building a data vault 2.0 model in a Delta Lake architecture seems like a nice take on the data warehousing topic. Of course, the very same approach can be carried out in a more traditional architecture such as a relational database, and in order to decide on the best solution, other aspects come into play, such as usage patterns and cost. Estimating the total cost of ownership (TCO) is extremely difficult, but technologies like BigQuery or Redshift Spectrum (which offload data to cheaper object storage) can help you optimize your cost and hence could be as cost effective as a solution based on Spark and S3. The TCO also brings me back to the topic of organizational transition: maybe an intermediate step would make the most sense for companies. This is also what can be observed; many enterprises go with a hybrid approach of data lake and relational database. In this concept, the persistent storage area (PSA) is moved to a cost-effective object store and the analytical system is built around databases like Redshift or BigQuery. That is also pretty much Microsoft's vision with their Synapse reference architecture. The organizational benefit of such a hybrid (data lake with a relational database) approach is that first experiences with the cloud can be gained, while people with strong skills in on-premise systems such as Oracle and SQL Server can transfer their knowledge to the cloud databases.
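As a hedged sketch of that hybrid pattern, the snippet below assumes curated Parquet files already sit in a Cloud Storage PSA and loads them into BigQuery with the official Python client; the bucket, dataset and table names are made up.

```python
# Hybrid pattern: the persistent storage area (PSA) lives on cheap object
# storage, and the analytical layer is an MPP database (here BigQuery).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load curated Parquet files from the PSA into the warehouse; further
# modeling/ELT then happens inside the database itself.
load_job = client.load_table_from_uri(
    "gs://example-psa/curated/orders/*.parquet",
    "example_project.analytics.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```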

Summary:

  • People and processes, and hence cultural change, should always be reflected in a modern data strategy (this holds implicitly for any data warehouse project, whether in the context of a data lake or relational databases)
  • Modern database and data lake (with big data processing) architectures both offer distributed storage and compute and hence are both suited for big data
  • Data modeling is an important aspect of both pure data lake and relational database architectures
  • If you are going for a data lake-only strategy, make sure you respect singleton transactions and referential integrity where needed, and consider ACID transactions amongst the other shortcomings
  • A hybrid approach (data lake & relational database) can ease the cloud transition
  • There is no silver bullet, especially not in fuzzy buzzwords
