Data Lakes, Data Warehouses & Data Hubs
How to combine them?
Data Lakes, Data Warehouses and Data Hubs are key components of every data landscape. Knowing the strengths and weaknesses of each is important, and combining them in an agile way can be a challenge. In this article, you’ll find a proposal for combining them in a simple way.
First, let’s briefly agree on the definition of each component.
The Data Lake
The Data Lake is a central repository that contains both structured and unstructured data. Data quality is usually low because the data is stored raw, without any transformation.
“A gigantic shared folder with millions of files, more or less organized”
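To make the "shared folder" image concrete, here is a minimal sketch of landing raw files in a lake. It assumes a local temporary directory stands in for the lake storage; the `land_raw` helper, the `<source>/<day>/` folder layout and the sample payloads are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root, source, day, name, payload):
    """Write a payload as-is under <root>/<source>/<day>/ (no cleaning, no schema)."""
    target = Path(lake_root) / source / day
    target.mkdir(parents=True, exist_ok=True)
    path = target / name
    path.write_text(payload, encoding="utf-8")
    return path


# A temporary directory stands in for the lake (assumption for this sketch).
lake = tempfile.mkdtemp()

# Structured and unstructured data land side by side, untouched.
land_raw(lake, "crm", "2024-01-15", "customers.json", json.dumps([{"name": " alice "}]))
land_raw(lake, "web", "2024-01-15", "clicks.log", "GET /home 200\nGET /cart 500\n")

# List what the lake now contains, relative to its root.
files = sorted(
    p.relative_to(lake).as_posix() for p in Path(lake).rglob("*") if p.is_file()
)
print(files)
```

Nothing is validated or reshaped on the way in, which is why the quality of what sits in the lake is only as good as its sources.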
The Data Warehouse
The Data Warehouse is a central repository of structured data from multiple sources. Data quality is high, and it is used for reporting and dashboarding. Data is typically loaded into the Data Warehouse using ETL (extract, transform, load).
“A big database to power reports and dashboards”
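As an illustration of the ETL idea, the sketch below cleans a few raw records and loads them into a table that a report can query. It assumes an in-memory SQLite database stands in for the warehouse; the `orders` table, the sample rows and the function names are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might be extracted from a source system.
RAW_ORDERS = [
    {"id": "1", "customer": " alice ", "amount": "20.00"},
    {"id": "2", "customer": "Bob", "amount": "not-a-number"},  # bad row
    {"id": "3", "customer": "alice", "amount": "5.00"},
]


def transform(rows):
    """Normalize names, parse amounts, and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # rejected: the warehouse only keeps usable rows
        clean.append((int(row["id"]), row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Create the target table if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_customer(conn):
    """The kind of query a report or dashboard would run."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stands in for the warehouse (assumption)
load(conn, transform(RAW_ORDERS))
report = revenue_by_customer(conn)
print(report)
```

The transformation step is where the quality gap between lake and warehouse is closed: the unparseable row never reaches the reporting table.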
The Data Hub
The Data Hub is a central place to share data and facilitate exchanges between applications. Data quality is high. Applications are typically connected to the Data Hub through APIs.
“A database surrounded by APIs to facilitate exchanges with applications”
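The "database surrounded by APIs" idea can be sketched as a store that applications reach only through a small interface. This is an in-process stand-in, not a real network API; the `DataHub` class, its `publish` and `get` methods and the required-field check are assumptions made for illustration.

```python
class DataHub:
    """In-memory stand-in for a hub: one store, reached only through methods."""

    def __init__(self, required_fields):
        self._required = set(required_fields)
        self._records = {}

    def publish(self, key, record):
        """Accept a record only if it carries every required field."""
        missing = self._required - set(record)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self._records[key] = dict(record)

    def get(self, key):
        """Return a copy so consumers cannot alter the shared record."""
        return dict(self._records[key])


hub = DataHub(required_fields=["name", "email"])

# One application publishes a customer (for example, a CRM)...
hub.publish("c-1", {"name": "Alice", "email": "alice@example.com"})

# ...and another application reads the same record (for example, billing).
customer = hub.get("c-1")
print(customer)
```

Rejecting incomplete records at the entry point is one simple way to keep the shared data at the quality level that consuming applications expect.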
Comparison
A basic summary of each component:
Combining them together
Since they all serve different purposes, you will quickly end up deploying all of them. Here is a proposal for how to position them in your landscape:
Conclusion
You now have the basics to position your Data Warehouse, Data Lake and Data Hub in your landscape.
Sources of inspiration
Here is a non-exhaustive list of articles that inspired this post.