The 5 Data Consolidation Patterns — Data Lakes, Data Hubs, Data Virtualization/Data Federation, Data Warehouse, and Operational Data Stores
How to choose the right one, and why you may need more than one
Data Architects and Enterprise Architects are often asked about what kind of data store would best suit the business.
Kimball refers to the integrated approach of delivery of data to consumers (other systems, analytics, BI, DW) as “Data Warehouse Bus Architecture”. A combination of these data stores are sometimes necessary to create this architecture.
Let’s look at the options available, and also how the augmented warehouse approach has evolved.
Introduction to each Data Storage and Consolidation pattern
Data Lake
- Data is ingested into a storage layer with minimal transformation, retaining the input format, structure and granularity. Contains structured and unstructured data.
- Multiple sources of data — bulk, external, vendor supplied, change-data-capture, operational — are captured and hosted.
- Business use-case driven adoption, providing value to users from inception.
- Control on data ingested, and emphasis on documenting structure of data.
- Generally useful for analytical reports, and data science; less useful for management reporting.
- Data Lake is a data store pattern that prioritizes availability over all else, across the organization, departments, and users of the data. Easiest to onboard a new data source.
Data Hub
- Data ingested into a storage layer, with some transformation/harmonization. The input formats and structures are altered, but granularity of source is maintained. Contains structured and unstructured data.
- Multiple sources of data are hosted, including operational, change-data and decision serving.
- Hub and Spoke model
- More control, formatting, and gate-keeping, as compared to Data Lake
- Like Data Lake, can also be effectively used for data science
- Many consultants are now advocating Data Hubs over weakly integrated and governed Data Lakes (see article link in references by Dave Wells, Eckerson Group)
Data Virtualization / Data Federation
- Data is not ingested, but referenced from other data sources. Remote connections are established, and use a clever combination of technologies like caching, and push-down query optimizations.
- Affected by downtimes of source systems, and retention policies of source systems
- Run-time data harmonization using views and transform-during-query
- Great launchpad for an integration initiative, but with maturity, an organization could outgrow data virtualization within 5 years or so.
Data Warehouse
- Data ingested after extensive transformations of structures and granularity
- Most trustworthy source of management reports
- Tracks change to reference data over time (Slowly changing dimensions)
- Inflexibility, and preparation time in onboarding new subject areas. The transformation logic and modeling both require extensive design, planning and development.
- Unable to service queries related to new subject areas, without necessary data preparation.
- Information Lifecycle Management (ILM) is often best implemented consistently within a Data Warehouse with clearly defined archival and retention policies.
ODS (Operational Data Store)
- In use for many years. Typical use cases are mainframe databases mirrored to provide other systems access to data.
- Mirror copy of the source transaction system. The system is mirrored to isolate and insulate the source system from the target system usage pattern and query workload.
Comparison
Here is the table of comparison. Each parameter is ranked (not scored) by desirability (4 = highly desirable descending to 1 = least desirable).
*The governance is the default governance level. Tools like Apache Atlas enhance governance of Data Lakes and Hubs. The governance of Virtualized databases and ODSs are relegated to source systems.
+The ILM(Information Lifecycle Management) ranking is the default/commonly occuring ILM level. Possibilities exist to enhance it for Data Lakes, Data Hubs and Data Warehouses. The ILM controls of Virtualized databases and ODSs are set by the source systems.
This ranking sheet is meant to give you the choice based on your requirements, and the parameters that matter to you. Each parameter can be assigned a weight and then you can select the right Data Storage pattern appropriate for you.
Again, I will re-iterate that parameters in this sheet are ranked, not scored. Scoring will depend on specific technology choices and considerations like use-case, suitability, and so on.
Choices & combinations
The Data Warehouse is a permanent anchor fixture, and the others serve as source layers or augmentation layers — related or linked information.
Augmentation of the Data Warehouse can be done using either Data Lake, Data Hub or Data Virtualization.
The data science team can effectively use Data Lakes and Hubs for AI and ML. The data engineering and ETL teams have already populated the Data Warehouse with conformed and cleaned data. Feature engineering on these dimensions can be readily performed. The reports created by data science team provide context and supplement management reports.
The Data Hub provides an analytics sandbox that can provide very valuable usage information. Repeated analysis can be slowly built into the Data Warehouse, while ad hoc or less frequently used analysis need not be. The ETL/data engineering teams sometimes spend too much time transforming data for a report that rarely gets used. The commonality of usage and requirements can be assessed using this usage data, and drives dimension conformance across business processes and master data domains.
Uptake of self-service BI tools is quicker if data is readily available, thus making Data Lake or Data Hub important cogs in the wheel. Without the data or the self-service tools, business users lose patience and cannot wait indefinitely for the data to be served from the warehouse. Then we end up with data puddles in the form of spreadsheets :-)
Cloud data-warehouse vendors have now added additional capabilities that allow for Data Lake or Data Hub like storage and processing, and provide an augmented warehouse or warehouse+ architecture. Examples are RedShift + Redshift Spectrum, Snowflake, BigQuery + DataProc:Presto, or Data Warehouse and Virtualization in SQL Server 2019.
Conclusion
To service the business needs, we need the right data. The right data should be in the right usable structure, effective governance and the right architecture components. The products and the capabilities provided should be selected based on the business needs for the data. Such a data analytics environment will have multiple data store and consolidation patterns. Each store will service specific needs and requirements.
The discussion and comparison in this article will be important to decide on the most suitable data storage and consolidation pattern. It can also be useful when performing an Enterprise Data Architecture review.
References
Feldman, D. (2020). Data Lakes vs Data Hubs vs Federation: Which One Is Best?. MarkLogic. Retrieved 2 March 2020, from https://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/.
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The data warehouse lifecycle toolkit. John Wiley & Sons.
Agrawal, M., Joshi, S., & Velez, F. (2017). Best Practices in Data Management for Analytics Projects. https://www.persistent.com/whitepaper-data-management-best-practices/
Wells, D. (2019, February 7). Data Hubs — What’s Next in Data Architecture? Retrieved March 17, 2020, from https://www.eckerson.com/articles/data-hubs-what-s-next-in-data-architecture