Lakes? Warehouses? Lakehouses? A short history of Data Architecture
The concept of a data lakehouse is a merger of the data lake and data warehouse. Learn the history, evolution and current trends shaping the industry.
At QuantumBlack, AI by McKinsey, we focus on providing advanced analytics solutions for clients, which often requires building or upgrading the client’s existing data platforms. Over time we’ve collectively pored over hundreds of corporate technology stacks, and eventually the trends resemble the blur of F1 cars speeding around a circuit, all with familiar patterns and colours.
The process of modernising these data platforms has become an increasingly complex endeavour in recent years thanks to the rapidly growing number of architecture solutions emerging onto the market. Engineers are faced with almost too much choice and often find themselves spending an inordinate amount of time deciding on which approach would be best for their project.
With this in mind, we are highlighting the various available platform archetypes and exploring two trends currently stirring industry discussion: Cloud Data Warehouses (CDW) and Data Lakehouses.
A subset of the complexity in the data world
Designing and leading implementation of scalable data and analytics systems has led us to accumulate a significant knowledge base of successful, modern patterns and best practices when working with clients.
This knowledge allows our team of industry-leading data practitioners to implement QuantumBlack’s tried and tested approach. To begin, we organise a client’s data and analytics systems into a common architecture blueprint that includes the major layers: ingress, batch/stream processing, storage, egress, and governance. We then apply the same framework to the major modern data platform archetypes: Data Lake, Lakehouse, Cloud-native DW and Data Mesh. Finally, we translate these archetypes into best-of-breed solution architectures customised to specific public cloud platforms like Azure, AWS or GCP.
From here we work to codify each solution architecture with suitable Terraform scripts for automated deployment, leveraging the Matter Provisioning asset of Cloud by McKinsey for its strong focus on best practice cloud architecture, security, and governance.
This framework, which we use on large transformational projects, helps clients scale at speed and keep pace with an ever-changing industry. In the article below we explore these platform architectures, highlighting the history, industry trends, benefits and some limitations of each.
Data Lake
History and Evolution
The term Data Lake (DL) originated in 2011 from data vendor Pentaho (now Hitachi) as a way to reduce data silos that were forming in Data Warehouse-based ecosystems. The technology started to gain momentum with Hadoop around 2015 and became more of a standard approach with the rise of cheap and scalable cloud storage that underpins the technology.
The DL allowed unstructured object storage and differed from the classic Data Warehouse extract-transform-load (ETL) methodology, which required schema modelling up-front. In the DL paradigm, data is loaded first and then transformed at read time (‘schema on read’), with emphasis on storing data in raw, un-modelled form.
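As a rough illustration of applying a schema at read time rather than at load time, the minimal Python sketch below stores raw JSON records untouched and only projects them onto a schema when they are read. The sample records, the `ORDER_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical and exist only for this example.

```python
import json

# Raw events land in the lake exactly as received: no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "a1", "amount": "12.50", "ts": "2023-01-05"}',
    '{"user": "b2", "amount": 7, "country": "FR"}',
    '{"user": "c3"}',
]

# A hypothetical schema chosen by the reader of the data, not by the writer.
ORDER_SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto `schema` at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the load.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
```

A different consumer could read the same raw records with a different schema, which is the flexibility this paradigm trades against up-front modelling.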
Industry Trends in Data Lakes
In the case of Data Lakes, the process is ELT (extract, load, then transform), inverted from traditional data warehouse approaches. Data is loaded in raw masses, but this can leave teams without a clear understanding of what data has been acquired or of its quality.
Further trends we saw across the industry included clients:
- Building or refining data catalogues, discovery tools, and governance to improve awareness of what is available and reduce duplication while enabling data use at scale
- Building a standard layered approach to storage and processing — typically a three layer approach: landing (raw), normalised (trusted), production-ready (refined)
- Moving off Hadoop and onto Cloud-native implementations including the move to Spark-based compute
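The layered approach in the list above can be sketched in a few lines of Python. This is a simplified illustration under assumed data: the sample rows and the `to_trusted` and `to_refined` helpers are hypothetical, and a real platform would typically run these steps on Spark or similar compute over cloud storage.

```python
from collections import defaultdict

# Landing (raw): records kept as received, including bad rows.
landing = [
    {"store": " London ", "sales": "100"},
    {"store": "paris", "sales": "n/a"},
    {"store": "London", "sales": "50"},
]


def to_trusted(raw_rows):
    """Normalise names and types; skip rows that cannot be parsed."""
    trusted_rows = []
    for row in raw_rows:
        try:
            sales = float(row["sales"])
        except ValueError:
            continue  # in practice, bad rows would be quarantined and reported
        trusted_rows.append({"store": row["store"].strip().title(), "sales": sales})
    return trusted_rows


def to_refined(trusted_rows):
    """Aggregate trusted rows into a production-ready, per-store summary."""
    totals = defaultdict(float)
    for row in trusted_rows:
        totals[row["store"]] += row["sales"]
    return dict(totals)


trusted = to_trusted(landing)
refined = to_refined(trusted)
```

Keeping the landing layer untouched means the trusted and refined layers can be rebuilt whenever the cleaning or aggregation rules change.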
Much of what we see with DL is similar to the DW, but with more emphasis on refining the operating models and adding a governance structure.
Data Lakes are a cost-effective architecture when used as a simple starting point if enterprise aggregation is required. If a client wishes to move slowly through implementation, the build can also be done gradually, with components added over time. Data Lakes are also highly effective when data sources and use cases fit well with both Batch and Lambda architectures.
With any architecture there are some limitations. Implementation can involve higher data integration complexity, which places more emphasis on using the right tooling. For very large organisations, a fully centralised architecture can also create bottlenecks for value creation and agility, which can prove costly. Finally, BI and analytics performance is relatively poor; however, a Data Warehouse or Data Mart serving layer can be added to provide a SQL endpoint should a business need it.
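To make the serving-layer idea concrete, here is a minimal sketch in which an in-memory SQLite database stands in for the Data Warehouse or Data Mart. The table name, column names and sample rows are assumptions made for this example; a real deployment would use a proper warehouse engine rather than SQLite.

```python
import sqlite3

# Hypothetical refined output from the lake, e.g. a per-store summary.
refined_rows = [("London", 150.0), ("Paris", 80.0), ("Berlin", 120.0)]

# An in-memory SQLite database stands in for the warehouse or data mart.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (store TEXT PRIMARY KEY, sales REAL)")
conn.executemany("INSERT INTO store_sales VALUES (?, ?)", refined_rows)
conn.commit()

# BI tools can now query the serving layer with plain SQL.
top_stores = conn.execute(
    "SELECT store, sales FROM store_sales ORDER BY sales DESC LIMIT 2"
).fetchall()
conn.close()
```

Only curated, query-ready data is copied into the serving layer, so the lake remains the system of record while analysts get a familiar SQL interface.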
Cloud-native Data Warehouse
History and Evolution
The largely unchanged Data Warehouse (DW) has stood the test of time for over thirty years, quite a feat in today’s rapidly evolving landscape. The original concepts date back as far as the 1960s, but the first dedicated decision support system was created by Teradata in 1983. The field grew quickly in the 1990s with pioneers Inmon and Kimball, backed by both publications and technology firms. By the year 2000, all major database vendors had a huge focus on DW and on other incumbent technologies, like OLAP and columnar and in-memory databases, that had changed the industry. However, it wasn’t until this past decade, when Hadoop and Data Lakes entered the industry, that DWs had any real challengers.
Industry Trends in Data Warehouses
Over the past decade, the common trends we have seen with clients are often the result of massive growth in data demands that required:
- Modernising: moving to pure cloud-based solutions while improving performance and total cost of ownership
- A focus on data governance: many clients no longer have up-to-date catalogues and suffer from serious data quality, security and metadata management issues, as well as regulatory or compliance risks
- Organisational restructuring between centralised and decentralised data teams, often settling on a hybrid model with decentralised data domains
Much of what we experience with DWs is the need to migrate to cloud and improve governance given their long on-prem legacy in most companies.
Knowing when to use Data Warehouses is important. The benefits are most felt when skills and infrastructure are biased towards SQL and there is a desire to drive SQL-based data democratisation within an organisation. DWs are also good when a client’s use case favours BI, reporting and analytical processes rather than advanced analytics. If a client is migrating from on-prem SQL systems to the cloud and modernising the infrastructure, for example, DWs allow them to do this without adding complexity.
Drawbacks of the architecture can include cost, as storing all data and running all data transforms and end-user analytical BI compute in the warehouse can be expensive. Governance is also a key area to consider: while a DW can be easier to manage than a DL, it can easily lead to disorganised data without the correct framework. Finally, integrating with languages like Python can be awkward, as DWs can lack an integrated workflow and scale-out method.
Data Lakehouse
Brief History and Evolution
The concept of a Data Lakehouse (DLH) is a simple merger of the Data Lake and Data Warehouse. But due to its roots, it is more like a hybrid of Data Lake with RDBMS/Warehouse-like features, including ACID transactions and stronger schema enforcement and data governance.
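Schema enforcement, one of the warehouse-like features mentioned above, can be illustrated with a toy table that rejects non-conforming rows at write time. The `Table` and `SchemaError` classes are hypothetical and written only for this sketch; they do not reflect the API of any real lakehouse product.

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


class Table:
    """A toy table that validates every row against a declared schema."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self.rows = []

    def append(self, row):
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"{column!r} must be {expected.__name__}")
        self.rows.append(row)


sales = Table({"store": str, "sales": float})
sales.append({"store": "London", "sales": 100.0})

try:
    # A string where a float is expected is rejected before it is stored.
    sales.append({"store": "Paris", "sales": "n/a"})
    rejected = False
except SchemaError:
    rejected = True
```

In a plain Data Lake the malformed row would simply land in storage and surface later as a data quality problem for whoever reads it.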
Industry Trends in Data Lakehouse
DLH is a popular choice for clients accelerating their advanced analytics and machine learning capabilities.
The key trends we experience when working with clients include:
- Modern analytics-optimised, efficient data formats (for instance Delta Lake, Iceberg and Hudi) capable of ACID transactions and other sophisticated features like data versioning, auto-partitioning and real-time stream optimisations.
- Simplified, on-demand, elastic Spark for general purpose compute via Python and/or SQL
- Merging of DW and data science capabilities in a single platform to reduce operational/management complexity
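The atomic commits and data versioning mentioned in the list above can be sketched with a toy in-memory commit log. This is purely illustrative and makes no claim about how the table formats named above are implemented; `VersionedTable` and `require_positive` are hypothetical names invented for the example.

```python
import copy


class VersionedTable:
    """A toy append-only commit log; each commit produces a new table version."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    @property
    def latest(self):
        return len(self._versions) - 1

    def commit(self, new_rows, validate):
        """Apply all rows or none: a failed validation leaves the table unchanged."""
        candidate = copy.deepcopy(self._versions[-1])
        for row in new_rows:
            validate(row)  # raises before anything becomes visible
            candidate.append(row)
        self._versions.append(candidate)
        return self.latest

    def read(self, version=None):
        """Read the latest version, or an earlier one ("time travel")."""
        index = self.latest if version is None else version
        return copy.deepcopy(self._versions[index])


def require_positive(row):
    if row["sales"] <= 0:
        raise ValueError(f"invalid sales value: {row}")


table = VersionedTable()
v1 = table.commit([{"store": "London", "sales": 100.0}], require_positive)

try:
    # The second row is invalid, so the first must not be committed either.
    table.commit(
        [{"store": "Paris", "sales": 80.0}, {"store": "Rome", "sales": -1.0}],
        require_positive,
    )
except ValueError:
    pass

v2 = table.commit([{"store": "Berlin", "sales": 120.0}], require_positive)
```

Readers only ever see fully committed versions, and `table.read(version=1)` returns the table as it stood after the first commit, which is the behaviour that makes audits and reproducible model training easier.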
DLH has created an alternative solution for heavy Data Lake users who need more data controls but do not want the full overhead of a separate DW.
Utilising this merger of Data Lakes and Warehouses provides unique benefits and can prove cost effective, particularly with large data sets and heavy analytical compute patterns. DLHs are also valuable if a client does not want to maintain both a Data Lake and a Data Warehouse and wants to consolidate onto one platform that combines DL features, like flexible storage, with DW features, like ACID transactions. Furthermore, if a client places emphasis on Machine Learning, DLHs are particularly good for use cases that require and prioritise raw data and compute.
Limitations of the technology include a lack of vendor choice, as currently it’s only supported by one commercial provider, Databricks. However, it is backed by open-source offerings like Delta Lake. The architecture also requires expertise and programming skills beyond SQL, which not every client will have. Finally, due to its infancy, some functionality still needs to be built in; however, it’s rapidly maturing to solve these issues, with interest from all major vendors across the industry.
Are you interested in applying these techniques to solve real-world problems? Want to rapidly launch and iterate products across a vast array of industries, or even join us in implementing these use cases with leading organisations? At QuantumBlack, AI by McKinsey, we’re expanding our team of Technical Product Designers. Check out our Careers page for more information.
Authored by: Doug Cha, Senior Expert; Mathieu Dumoulin, Senior Expert; Benjaluck Yuttisee, Senior Digital Consultant; and Bruce Philp, Partner, QuantumBlack, AI by McKinsey