Re-thinking data lake architectures with Google Cloud — part I

Parviz Deyhim
Google Cloud - Community
5 min read · Apr 29, 2021

The data ecosystem has come a long way since the days of using databases and expensive traditional data warehouses to make sense of large amounts of data. Organizations commonly store and analyze data using open-source tools hosted on-prem or in the cloud, on platforms capable of processing more data than ever. Nowadays, data lakes and data marts are the most common platform architectures. Data lakes allow organizations to consolidate data from various operational data sources into a single storage system. Data marts are extensions of a data lake that serve as more specialized views of the data, tailored for the performance needs of data consumers.

While the data-lake architecture has been instrumental in pushing the boundaries of data processing, it has diverged from common infrastructure and software engineering best practices. Architectural best practices advise against building large and hard-to-test monolithic systems, yet the majority of today's data lake architectures are exactly that: large monolithic systems with a high degree of complexity and dependencies that are hard to test, deploy, and iterate on. The main reason organizations have been stuck with a monolithic architecture is the lack of technological capabilities that would allow the adoption of simpler architectures.

In this post, we'll explore the challenges of the current data processing landscape. In the next post, we'll discuss how some of the most recent innovations in the Google Cloud ecosystem help solve those challenges.

Current data architecture landscape

Current data architectures often consist of a single data lake surrounded by data marts, which are more domain-oriented, curated versions of the data lake. While the definitions vary, a data lake generally refers to a platform that allows organizations to store data in a single place rather than in isolated storage systems.

A data mart, on the other hand, is an extension of an existing data lake that hosts an optimized view of organizational data on a more specialized and performant storage system. For instance, organizations commonly implement the data lake on a general-purpose storage system such as Google Cloud Storage (GCS) or, for on-prem data centers, HDFS. General-purpose storage platforms give data producers (operational datastores) more flexibility to store data in raw format but are less optimized for data consumption. For that reason, organizations leverage distributed systems such as Apache Spark, Hive, and Presto to copy and transform data from general-purpose storage to more specialized storage that meets the performance requirements of data consumers. For example, data gets stored in GCS in raw format; the system then copies the data into BigQuery for BI analytics and potentially into another GCS location for machine learning consumption.
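To illustrate that last example, here is a minimal sketch (not from the original architecture, just one common way to implement it) of the copy-and-transform pattern: a Spark job reads raw JSON from GCS, reshapes it for consumption, and writes a curated table to BigQuery via the spark-bigquery connector. All bucket, dataset, and column names are hypothetical placeholders.

```python
# A minimal sketch of the raw-lake-to-data-mart copy-and-transform pattern.
# Assumes a Spark cluster (e.g. Dataproc) with the GCS and spark-bigquery
# connectors available. Bucket, dataset, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-datamart").getOrCreate()

# Read raw, schema-on-read data from the general-purpose data lake storage.
raw_events = spark.read.json("gs://example-data-lake/raw/events/dt=2021-04-29/")

# Apply consumer-oriented transformations: typing, grouping, aggregation.
daily_summary = (
    raw_events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "customer_id")
    .agg(F.count("*").alias("event_count"))
)

# Copy the curated view into the specialized data mart (BigQuery);
# the connector stages the load through a temporary GCS bucket.
(daily_summary.write
    .format("bigquery")
    .option("table", "example_dataset.daily_event_summary")
    .option("temporaryGcsBucket", "example-temp-bucket")
    .mode("overwrite")
    .save())
```

Each additional consumer (a BI dashboard, an ML feature store, ad-hoc SQL users) typically means another variant of this job, which is where the challenges described below begin.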

Challenges of data-lake architectures

The traditional data-lake architecture has several challenges:

Operational burden of managing multiple storage systems: Data marts are by definition a subset of the overall data architecture and often leverage an optimized storage system. While this specialized storage improves consumption performance, each additional storage system increases the operational burden of maintaining multiple systems that almost never have anything in common. Organizations using this model are typically unable to scale due to the prohibitive cost of these specialized systems.

Operational burden of managing transformational workflows: While having multiple storage systems increases the architectural complexity, the true complexity arises from connecting and orchestrating the various parts of the platform. With multiple storage systems, the need for complex data movement introduces the need to operationalize data workflows so that they run reliably and consistently (a sketch of one such workflow appears after this list). Workflows commonly fail or slow down, which can result in stale data and inconsistencies. This is especially cumbersome when organizations take an ungoverned approach and use multiple data-transformation frameworks. For example, larger organizations often use Apache Spark, Hive, and a variety of other open-source tools simultaneously to transform and copy data. With a limited number of data sources and transformation jobs, the complexity and the time needed to keep the source and destination systems consistent are manageable. However, as the number of data sources increases, so do the workflow complexity and the system reliability challenges.

Inability to improve data freshness: Traditionally, the data movement from source to data lake has been implemented using batch frameworks such as Hadoop (Hive, MapReduce, etc.) or Spark. But increasingly, data consumers and lines of business want access to the most recent view of the data, and for that reason the adoption of streaming frameworks such as Apache Flink and Apache Beam has been on the rise (a minimal streaming sketch follows this list). However, due to challenges such as the inherent risk of changing the existing complex batch transformation workflows, the streaming nature of the data ends once it lands in the data lake. In this model, subsequent workflows, often again implemented in batch, are responsible for transforming and copying data to data marts. While this approach accelerates the arrival of data at the data lake, it rarely improves the end-to-end data freshness.

Data domain decay and expensive specialized data teams: Specialized storage systems and complex transformational workflows create the need for organizations to build specialized teams of engineers (i.e., data engineers) to build and maintain reliable data pipelines. Since engineers skilled in dealing with large amounts of data are not abundant, this specialized engineering team is generally a shared resource between lines of business. As data gets stored in the data lake, this specialized team is responsible for nurturing the data until the point of delivery to the consumers. The shared organizational structure often means that the team has a minimal understanding of what the data represents, which often results in delivering a version of the data that is disconnected from the original business domain. Organizations should investigate the concept of Data Mesh, brilliantly defined by Zhamak Dehghani, to better understand this problem and its recommended solutions. In the next blog posts, we'll explore different approaches to building data lakes that can alleviate some of these issues.

Security: In most data platforms, data lakes and data marts are different systems with minimal common features and functionality. This makes it challenging to apply a standard set of security and compliance requirements across all systems. The result is a fragmented approach that inherently reduces the overall security hygiene.

Monolithic architecture impedes innovation: The data-lake and data-mart architecture, connected via webs of complex data-transformation workflows, is comparable to a monolithic architecture. Making a change, such as adding a data source or modifying transformation logic, introduces a long list of changes to the entire system and its dependencies. It introduces risk and becomes a time- and resource-consuming process. As a result, organizations whose core business relies heavily on the existing data pipelines shy away from making forward-looking changes, even when the change would benefit the overall business. One example is the failure to migrate from an existing batch-oriented architecture to a real-time architecture, because it means re-architecting the entire system.
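To make the workflow-orchestration burden concrete, the snippet below is a minimal sketch, assuming Cloud Composer / Apache Airflow as the orchestrator (other schedulers are equally common), of a single daily workflow that copies raw files from the lake into the mart. Bucket, dataset, and table names are hypothetical; in practice every new data source adds another DAG, or more tasks and cross-dependencies to existing ones, and keeping them all consistent is where the reliability challenges described above come from.

```python
# A minimal sketch of one lake-to-mart orchestration workflow, assuming
# Apache Airflow (e.g. on Cloud Composer). All names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="orders_raw_to_mart",
    start_date=datetime(2021, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Copy the day's raw files from the data lake into the data mart.
    load_orders = GCSToBigQueryOperator(
        task_id="load_orders_to_bigquery",
        bucket="example-data-lake",
        source_objects=["raw/orders/dt={{ ds }}/*.json"],
        source_format="NEWLINE_DELIMITED_JSON",
        destination_project_dataset_table="example_dataset.orders",
        write_disposition="WRITE_TRUNCATE",
    )
```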
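For the data-freshness challenge, the following is a minimal sketch, assuming Apache Beam running in streaming mode (e.g. on Dataflow), of ingesting events from Pub/Sub directly into BigQuery. Topic, table, and field names are hypothetical. Even with streaming ingestion like this, the freshness gains are lost if the downstream lake-to-mart workflows remain batch, which is the point made above.

```python
# A minimal sketch of streaming ingestion with Apache Beam: events arrive via
# Pub/Sub and are appended continuously to a BigQuery table instead of in
# nightly batches. Subscription, project, and table names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:example_dataset.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```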

In part II of this blog post, we'll look at how to solve the challenges mentioned above.

Parviz Deyhim
Data lover and cloud architect @databricks (ex-google, ex-aws)