Designing Compute & Storage for Composable Open Data Architecture on Cloud

Himanshu Gaurav
5 min read · Aug 7, 2022

--

This article discusses designing compute and storage for a composable open data architecture, an approach that brings scalability, flexibility in deployment and maintenance, reliability, compliance, and cost-effectiveness to your data ecosystem.

Before you design a composable open data architecture, it is imperative to understand the stages of the data life cycle (acquisition, processing, and consumption) with respect to your organization's needs. All three stages should inform the design; they are distinct concerns, yet interwoven rather than independent.

Today, we have many technology choices for distributed data processing (e.g., Hadoop) and storage across cloud platforms. How do you develop a design that handles data volume, velocity, and variety optimally with respect to both compute and storage?

Distributed Cloud Compute

Distributed computing (distributed data processing) is a way to solve large computational tasks using two or more networked computers. The computational task is divided into subtasks that can be computed in parallel.
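To make the idea concrete, here is a minimal PySpark sketch that splits one computation into parallel subtasks. It assumes a local Spark installation; on a real cluster the partitions would be processed on different worker machines, and the numbers themselves are only illustrative.

```python
# Minimal sketch: divide a computation into parallel subtasks with PySpark.
# Assumes pyspark is installed; runs locally, but the same code distributes
# across worker nodes when submitted to a real cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-sum").getOrCreate()
sc = spark.sparkContext

# Split one million numbers into 8 partitions (the subtasks) ...
numbers = sc.parallelize(range(1_000_000), numSlices=8)

# ... each partition computes its partial result in parallel,
# then the partial results are combined into the final answer.
total = numbers.map(lambda x: x * x).sum()
print(total)

spark.stop()
```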

Distributed Compute Simply Explained

Traditionally, we had monolithic Hadoop architectures that were convoluted to implement and maintain and required manual workload staggering and resource-allocation handling. They were not lightweight enough for end users, and expanding meant buying more hardware, which required up-front investment and months of planning and deployment.

Distributed cloud computing came to the rescue by expanding the traditional, large data-center-based model into a set of distributed cloud infrastructure components that can be geographically dispersed (for compliance and privacy). It also adds capabilities such as auto-scaling, ephemeral clusters, security, high availability, cost savings, and in-situ processing of data: instead of bringing the data to the engine, open data architecture brings the engines to the data.

A cloud workspace is a cloud-based digital interface that assembles an organization's content and tools into a single, secure, managed solution accessible over the web. Similarly, a distributed cloud compute workspace organizes objects (notebooks, libraries, and experiments) into folders and provides access to data and computational resources such as clusters and jobs.

You should spin up workspaces (the definition of a workspace differs across cloud platforms) driven by your organization's needs and structure, for example by line of business (LOB). We still recommend separating development, pre-production, and production workspaces for development, validation, and Q.A. purposes. This promotes maximum effectiveness and creates an environment ideal for data teams that value agility over complexity. You can also bring regional considerations into your workspace design from a compliance and D.R. (disaster recovery) strategy perspective.
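As an illustration only, the snippet below enumerates hypothetical workspaces per LOB and environment, for example as input to an infrastructure-as-code pipeline. The LOB names, regions, and naming convention are assumptions, not a prescribed layout.

```python
# Illustrative only: one way to enumerate workspaces per line of business (LOB)
# and environment. All names and regions are placeholders.
ENVIRONMENTS = ["dev", "preprod", "prod"]
LOBS = ["marketing", "finance", "supply-chain"]

workspaces = [
    {
        "name": f"{lob}-{env}",
        # Hypothetical regional split for compliance / D.R. considerations.
        "region": "eastus" if env == "prod" else "centralus",
    }
    for lob in LOBS
    for env in ENVIRONMENTS
]

for ws in workspaces:
    print(ws)
```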

Distributed Cloud Compute Deployment Architecture

You can further categorize workloads within each workspace, which helps achieve better performance and cost savings. Composable data architecture comes into action here: it decouples storage and compute and supports scalability, isolation, concurrency, extensibility, transiency, and automation. The easiest approach is to assess your compute workloads, define a few "t-shirt sizes" of cluster configurations for different workload types (batch, real-time, or interactive), and submit jobs using the UI, the CLI, or the API (depicted in the "Distributed Cloud Compute Deployment Architecture" above). A job can consist of a single task or a large, multi-task workflow with complex dependencies. Another benefit is creating pools, which reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances; this can help meet SLAs.
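As a sketch of what this can look like in practice, the snippet below maps t-shirt sizes to cluster configurations and submits a single-task batch job over a Databricks-style Jobs REST API. The endpoint, node types, Spark version, and payload fields are illustrative assumptions and will differ on your platform; treat it as a pattern, not a definitive implementation.

```python
# Illustrative sketch: "t-shirt" cluster sizes per workload type, submitted as a
# one-time run over a Databricks-style Jobs REST API. Endpoint and field names
# are assumptions -- adapt them to your cloud platform.
import requests

TSHIRT_SIZES = {
    "S": {"node_type_id": "Standard_DS3_v2", "num_workers": 2},   # interactive / ad hoc
    "M": {"node_type_id": "Standard_DS4_v2", "num_workers": 8},   # routine batch
    "L": {"node_type_id": "Standard_DS5_v2", "num_workers": 32},  # heavy batch / backfills
}

def submit_batch_job(workspace_url: str, token: str, notebook_path: str, size: str = "M") -> int:
    """Submit a single-task batch job on an ephemeral cluster of the given size."""
    payload = {
        "run_name": f"batch-{notebook_path.rsplit('/', 1)[-1]}",
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": notebook_path},
            "new_cluster": {"spark_version": "10.4.x-scala2.12", **TSHIRT_SIZES[size]},
        }],
    }
    resp = requests.post(
        f"{workspace_url}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]
```

A multi-task workflow would follow the same pattern with additional entries (and dependencies) in the `tasks` list.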

Distributed Storage

While the Hadoop Distributed File System (HDFS) was great for analyzing and storing massive datasets, it lacked the reliability and compliance attributes needed for long-term data storage: the NameNode was a single point of failure, storage and compute could not scale independently, three-way replication was costly, and it carried significant management overhead.

Although the internal architecture of object stores is entirely different from HDFS, Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS) meet the same requirements, so companies can house structured and unstructured data at scale in cloud-native data lakes.

Today, organizations are generating some of the largest, fastest-growing data sets, which various applications dump into cloud storage commonly known as a data lake. A data lake serves as a central repository where an organization stores data files of various types (Parquet, ORC, JSON, etc.).
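For example, Spark can read such files straight from the lake. In this small sketch the bucket names, paths, and column names are placeholders; the same code works against `s3a://` (S3), `abfss://` (ADLS), or `gs://` (GCS) paths once the relevant connector and credentials are configured.

```python
# Sketch: reading raw files directly from a cloud data lake with Spark.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-read").getOrCreate()

orders = spark.read.parquet("s3a://my-data-lake/raw/orders/")    # columnar files
events = spark.read.json("s3a://my-data-lake/raw/clickstream/")  # semi-structured files

orders.printSchema()
events.groupBy("event_type").count().show()  # "event_type" is a placeholder column
```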

However, merely gathering or dumping files adds no value unless we have a way to track and organize them and make sense of them (especially since the files are immutable). How do we track and organize these files in a more accessible form that supports the data needs of various personas (data engineers, data analysts, data scientists, etc.) across the organization?

Table Format to the Rescue

Yippee! Table format to the rescue.

A table format can be thought of as an abstraction layer on top of the files that tracks every data file landing in data lake storage, providing transactional support (ACID), row-level upserts and deletes, schema evolution, time travel, compaction, and more. Examples of table formats include Delta Lake, Apache Iceberg, and Apache Hudi, which are game changers in open data architecture.
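As one concrete example (using Delta Lake here; Apache Iceberg and Apache Hudi offer similar capabilities through their own APIs), the sketch below performs an ACID upsert and a time-travel read. It assumes a Spark session already configured with the delta-spark package, and the table path and schema are placeholders.

```python
# Hedged sketch using Delta Lake (delta-spark); Iceberg and Hudi have analogous APIs.
# Assumes Spark was started with the Delta Lake package and SQL extensions enabled.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3a://my-data-lake/silver/customers"  # placeholder table location

# Initial load: write a versioned, ACID table on top of plain files.
spark.createDataFrame(
    [(1, "alice", "NY"), (2, "bob", "CA")], ["id", "name", "state"]
).write.format("delta").mode("overwrite").save(path)

# Row-level upsert (MERGE): update matching rows, insert new ones.
updates = spark.createDataFrame(
    [(2, "bob", "WA"), (3, "carol", "TX")], ["id", "name", "state"]
)
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```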

The beauty of a table format is the ability to query the data directly. It supports SQL query engines, the Spark engine, streaming engines, and more via open standards and formats, so there is no need to push data into a separate database or data warehouse for analysis.

A table format also provides an open way to securely share live data from your lakehouse with any computing platform.
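One open protocol that implements this is Delta Sharing; the minimal sketch below uses its Python connector, with the profile file and share/schema/table coordinates as placeholders for whatever a data provider has shared with you.

```python
# Hedged example using the open Delta Sharing Python connector
# (pip install delta-sharing). The profile file and table coordinates
# are placeholders issued by the data provider.
import delta_sharing

profile = "config.share"  # credentials file from the data provider

# Discover what has been shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Read a shared, live lakehouse table directly into pandas -- no copy pipeline.
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.orders")
print(df.head())
```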

Overall, table formats truly deliver on the quintessence of open data architecture:

  1. openness (avoiding vendor lock-in)
  2. heterogeneity (supporting a wide range of processing engines)

Please understand your organization's needs, as well as the working intricacies of each table format, before settling on one.

To summarize, this article covered implementing open data architecture with distributed compute and storage on the cloud at scale, a combination nowadays referred to as the lakehouse.

P.S.: 1. Keep an eye on your usage and know the workspace limits of your cloud platform/provider; if your workspace usage or user count starts to grow, you may need to adopt a more involved workspace organization strategy to avoid per-workspace limits.

2. Plan on an isolation strategy that will provide you with long-term flexibility without undue complexity.

3. Please consult with data architects or SMEs to define logical data lake layers (Bronze, Silver, Gold) and, most importantly, robust data models across the various functions.

4. For more details on “Composable Data Architecture,” please refer to our earlier blog. (https://medium.com/@DataEnthusiast/open-data-architecture-at-scale-on-cloud-part-1-3381b411533f)

5. For more details on “Airflow on Kubernetes at Scale for Data Engineering (Dependencies Simplified),” please refer to our earlier blog. https://medium.com/@DataEnthusiast/airflow-on-kubernetes-at-scale-for-data-engineering-space-dependencies-simplified-f7646669739c

Hope you found it helpful! Thanks for reading!

Let’s connect on LinkedIn!

Authors

Himanshu Gaurav — www.linkedin.com/in/himanshugaurav21

Bala Vignesh S — www.linkedin.com/in/bala-vignesh-s-31101b29


Himanshu Gaurav

Himanshu is a thought leader in data space who has led, designed, implemented, and maintained highly scalable, resilient, and secure cloud data solutions.