Open Data Architecture at scale on Cloud

Himanshu Gaurav
7 min read · May 9, 2022


In this article, we will talk in detail about enabling an “Open Data Architecture” for your data ecosystem. Its goal is to prevent vendor lock-in, and it permeates every aspect of the data ecosystem: data storage, data management, data processing, the data foundation, data access, and consumption.

High-Level Open Data Architecture

Hadoop: Birth of the Open Data Architecture

For decades, companies relied on traditional databases or warehouses for their BI needs, but these came with certain challenges. A traditional data warehouse required buying pricey on-premises hardware, maintaining structured data in proprietary formats, and relying on a centralized data and IT department to deliver analysis. There were further challenges in terms of technological interoperability, system orchestration, and, most importantly, scalability.

Hadoop came into existence in 2006, built on the MapReduce paradigm, which could process and generate huge data sets in parallel over large clusters of commodity hardware. This framework supported the processing of massive datasets distributed across computer clusters, making it a hugely attractive option for enterprises, which were collecting more data by the day.

Rise and Fall of Hadoop

The data management world celebrated the advent of Hadoop and the ecosystem of tools that became available to harness big data. One striking aspect of this wave was that many organizations started contributing to the Hadoop ecosystem, and a lot of tools (Hive, Pig) were developed and donated to Apache. This essentially lowered the entry barrier for any organization or team to try Hadoop and its ecosystem tools for their data management needs.

With this euphoria came a lot of challenges in maintaining the Hadoop ecosystem: the on-premises setup, its cost, and its upkeep. Companies raced to collect more and more data, but they weren’t considering architecture design around access, analytics, or sustainability, resulting in data swamps.

Another major challenge with the MapReduce framework was its complexity: it was notoriously difficult for end users to understand and operate. One of the prominent inventions of this big data movement is Spark, an in-memory distributed processing framework that entirely changed the way of interacting with data in terms of volume, veracity, velocity, and variety. But Spark had its own fair share of complexity in terms of scalability, maintenance, and cost in an on-premises setup.

From a storage standpoint, transactions (ACID guarantees) were the most anticipated feature for the Hadoop file system, along with time travel, concurrent reads and writes, schema evolution, schema enforcement, compaction, etc., none of which were available out of the box. There was also a need for inexpensive storage that lets organizations scale data volume without the associated management overhead.

Cloud to the Rescue

One of the promises of the big data movement was to unlock the potential of data and enable new use cases and insights that weren’t possible in the past. With all the Hadoop tools available, the cloud came next and drastically changed IT infrastructure, especially for data management needs. Object storage, compute on demand, backup and recovery, the availability of different machine types, and pay-per-use pricing all made the cloud ecosystem irresistible for any organization, small or large.

Major cloud providers started to offer a multitude of IaaS, PaaS, and SaaS services around data and analytics, taking away the major pain points of infrastructure maintenance, support, and scalability in a cost-effective way. Distributed compute offerings powered by Apache Spark and Hive-like frameworks have been game-changers in this space, alongside distributed storage in the form of object stores. Thanks to services like Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), companies can house structured and unstructured data at scale in cloud-native data lakes. Table formats such as Delta Lake, Apache Iceberg, and Apache Hudi, built on top of these object stores, then solved the typical filesystem problems: time travel, concurrent reads and writes, schema evolution, etc.

Future of Open Data Architecture

Although Hadoop fell short of its promise, the open data ecosystem movement it started (along with a number of related open-source projects such as Apache Spark, Beam, Hive, and Flink) is alive and enduring.

Its quintessence is defined by the following principles:

Openness: A shift toward open technologies and data standards, as well as interoperability, instead of being locked in with a single proprietary vendor.

Composable: Define your data architecture so that it decouples storage and compute and supports scalability, isolation, concurrency, extensibility, transiency, and automation.

Heterogeneity: Support a wide range of tools and distributed frameworks for various use cases.

Below, we explain how the different stages of the data lifecycle translate into these three principles of open data architecture.

Data Acquisition

Data acquisition lays the foundation for data extraction from source systems and the orchestration of different ingestion strategies in a data lake. It requires dealing with a multitude of data sources, formats, and frequencies. Traditionally, ETL tools (IBM DataStage, Talend, Pentaho) were utilized, but over time, as we had to deal with a load of new formats and integrations, they hit a roadblock in terms of scalability, interoperability, and integration.

The solution was to start utilizing open-source, distributed processing frameworks (Spark, Flink, Beam, etc.) to build ingestion frameworks. This approach gives us good flexibility and an unopinionated way to define data pipelines, with which we can build powerful data ingestion pipelines that truly scale. With this flexibility, we should be prudent in how we design the ingestion processes: utilizing software design practices and patterns to drive design decisions helps achieve modular components and a wide range of integration options. This makes composing data pipelines a breeze and promotes code reuse. There can be exceptions, as a few organizations still rely on legacy systems.

In the data lake or lakehouse paradigm, this stage is represented by the Raw layer, where data is ingested and stored in its native format.
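
As a minimal sketch of such an ingestion job (assuming a PySpark-based framework; the bucket paths, source system, and column names are hypothetical), landing a source extract in the raw layer with a bit of lineage metadata could look like this:

```python
# Minimal PySpark ingestion sketch: land a source extract in the raw layer.
# All paths and names are hypothetical; real frameworks would drive these
# from configuration rather than hard-coded constants.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

SOURCE_PATH = "s3://source-bucket/orders/2022-05-09/"            # source drop
RAW_PATH = "s3://datalake-raw/orders/ingest_date=2022-05-09/"    # raw layer

# Read the extract as-is (CSV here; the same pattern applies to JSON, Avro, etc.)
df = spark.read.option("header", "true").csv(SOURCE_PATH)

# Keep the data in its source shape; only append ingestion metadata for lineage.
# (Some teams keep the original files byte-for-byte; others persist the raw
# layer as Parquet for cheaper downstream reads, as done here.)
df = (
    df.withColumn("_ingested_at", F.current_timestamp())
      .withColumn("_source_system", F.lit("orders-api"))
)

df.write.mode("append").parquet(RAW_PATH)
```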

Data Processing

Data processing is the massaging and churning phase of the data architecture, covering data enhancement, augmentation, classification, and standardization layers. It includes processes for automated business-rule processing, processes to derive or append new attributes to existing records from internal and external sources, aggregation, metrics and feature computation, etc.

In the data lake or lakehouse paradigm, this phase is represented by the Silver (Refined) and Gold (Trusted) layers respectively. Some of the key capabilities for these layers are ACID compliance, schema evolution, support for merge operations, time travel, open-source formats, data versioning, and concurrency. Table formats loaded with all these capabilities are Delta Lake, Hudi, and Iceberg (with Parquet, ORC, or Avro as the file format under the hood).
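
As an illustration, here is a sketch of how these capabilities surface with Delta Lake on an object store (the table paths, key column, and the assumption that the Delta package is installed on the cluster are all ours, not prescriptive):

```python
# Sketch: ACID MERGE (upsert) and time travel with Delta Lake.
# Paths and the join key are hypothetical; assumes the delta-spark package
# is available on the cluster.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("silver-processing")
    # Delta Lake needs its SQL extension and catalog registered on the session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

SILVER_PATH = "s3://datalake-silver/orders/"

# Incremental updates read from the raw layer.
updates = spark.read.parquet("s3://datalake-raw/orders/ingest_date=2022-05-09/")

# Upsert (MERGE) into the refined table: ACID, safe for concurrent readers.
target = DeltaTable.forPath(spark, SILVER_PATH)
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read an earlier version of the table for audit or reprocessing.
previous = spark.read.format("delta").option("versionAsOf", 0).load(SILVER_PATH)
```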

These table formats are supported by a wide variety of distributed frameworks and programming languages, with all the main cloud platforms and services supporting these open data formats interoperably on object stores. The composable architecture with respect to processing compute can be realized by submitting work to an ephemeral cluster (a job), meaning the cluster gets terminated after the process completes. You can create and run a job using the UI, the CLI, or by invoking an API. The easiest way is to assess your compute workloads, define different t-shirt sizes of cluster configurations, and submit jobs against them. A job can consist of a single task or be a large, multi-task workflow with complex dependencies.
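
For example, here is a sketch of submitting such an ephemeral job using the AWS EMR API via boto3 (just one possible implementation; the cluster names, IAM roles, script path, and t-shirt sizes below are illustrative assumptions):

```python
# Sketch: submit a Spark job to an ephemeral cluster that auto-terminates when
# its steps complete. Uses AWS EMR via boto3 as one example; names, roles, and
# t-shirt sizes are illustrative, not prescriptive.
import boto3

# Pre-assessed "t-shirt sizes" mapped to cluster configurations.
TSHIRT_SIZES = {
    "S": {"instance_type": "m5.xlarge", "count": 3},
    "M": {"instance_type": "m5.2xlarge", "count": 5},
    "L": {"instance_type": "m5.4xlarge", "count": 10},
}

def submit_ephemeral_job(size: str, job_script: str) -> str:
    cfg = TSHIRT_SIZES[size]
    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name=f"orders-processing-{size}",
        ReleaseLabel="emr-6.6.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": cfg["instance_type"],
            "SlaveInstanceType": cfg["instance_type"],
            "InstanceCount": cfg["count"],
            # Ephemeral: the cluster terminates once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "silver-processing",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", job_script],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]

# cluster_id = submit_ephemeral_job("M", "s3://datalake-code/jobs/silver_orders.py")
```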

Workflow with Composable architecture

In addition to refining the data, one important aspect is to have data quality measures defined for each dataset, with sanity checks run against them to build trust in the data assets. Automate the collection of metadata that monitors various aspects of the data assets, which helps with cataloging, lineage generation, etc.
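
A minimal sketch of such sanity checks (assuming PySpark, a hypothetical orders dataset, and order_id as the business key) that produces metrics you could publish to your metadata or monitoring store:

```python
# Sketch: basic data quality checks emitted as metadata for a dataset.
# Dataset path, key column, and thresholds are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.format("delta").load("s3://datalake-silver/orders/")

row_count = df.count()
metrics = {
    "row_count": row_count,
    # Completeness: nulls in a business-critical column.
    "null_order_id": df.filter(F.col("order_id").isNull()).count(),
    # Uniqueness: duplicate primary keys.
    "duplicate_order_id": row_count - df.select("order_id").distinct().count(),
}

# Fail fast (or just flag) when checks breach thresholds, then publish the
# metrics to the metadata/monitoring store for cataloging and trust-building.
failed = {k: v for k, v in metrics.items() if k != "row_count" and v > 0}
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
```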

Data Consumption

This is the most critical phase of the data architecture: it consumes and serves the output of the processing layer. A data architecture is successful only when it allows rapid access and consumption, with the right governance for onboarding users, irrespective of their location and the tools they use. These consumers should be able to point their tools at the data and access it without any development or operational skills.

Consumers of data in a data ecosystem fall into two types:

Processes and workloads within your data ecosystem, like data processing and analysis tools, AI/ML solutions, data exploration via a queryable interface, and an API layer.

Processes and workloads outside the data ecosystem, such as downstream systems, data exploration via a queryable interface, data as a product, data syndication, etc.

The open data architecture supports common ways of accessing data from the data lake, including SQL, APIs, search, exports, and bulk access. It allows different personas (data analysts, data scientists, etc.) to work at the right level of abstraction, and it supports interoperability, integration, and scalability based on the user and use-case requirements.
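
For instance, a consumer can point standard SQL at the Gold layer without knowing how it was produced. This sketch assumes the trusted tables are registered in a metastore; the database, table, and column names are hypothetical, and the same query could equally be issued from Trino/Presto, Athena, or a BI tool:

```python
# Sketch: SQL access to the trusted (gold) layer via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("consumption").getOrCreate()

daily_revenue = spark.sql("""
    SELECT order_date, SUM(order_total) AS revenue
    FROM gold.orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```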

Data access can be controlled by a combination of RBAC mechanisms and the fine-grained access control offered by services like AWS Lake Formation, Azure Data Share, and Google BigLake.
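
As one hedged example, using the AWS Lake Formation API via boto3 (the principal ARN, database, table, and column names are hypothetical), fine-grained access can be granted down to specific columns of a Gold table:

```python
# Sketch: grant column-level SELECT on a gold table to an analyst role using
# AWS Lake Formation (boto3). ARN, database, table, and columns are hypothetical.
import boto3

lf = boto3.client("lakeformation")
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "gold",
            "Name": "orders",
            "ColumnNames": ["order_date", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```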

To summarize, the benefits offered by open data architecture include interoperability, a multitude of integration options, scalability, decoupled storage and compute, reduced data redundancy, cost-effectiveness, faster response times, and the assurance that you are neither limited to a particular tool nor locked into a particular vendor, leaving you open to future technology. The recommendation is to understand your needs, then plan, design, and execute your data ecosystem accordingly.

Each of the pillars of the data lifecycle deserves a deeper dive in terms of storage, compute, orchestration, monitoring, observability, cataloging, metadata collection, security, data access, etc., which will be covered in subsequent blogs…

Hope you found it helpful! Thanks for reading!

Let’s connect on LinkedIn!

Subsequent Blogs/Stories

https://medium.com/@DataEnthusiast/designing-compute-storage-for-composable-open-data-architecture-on-cloud-61228d0e31

Authors

Himanshu Gaurav — www.linkedin.com/in/himanshugaurav21

Bala Vignesh S — www.linkedin.com/in/bala-vignesh-s-31101b29
