Stories by Oscar Pulido on Medium

Iceberg Ahead! Navigating the Three Flavors of Iceberg on BigQuery

Oscar Pulido — Mon, 09 Jun 2025 21:51:40 GMT

Navigating the world of modern data architectures can often feel like a complex journey, and a recent customer evaluation I undertook really brought this to light. The goal was to build a flexible and future-proof data lakehouse on Google Cloud, and Apache Iceberg was the clear choice for the open table format. However, the moment we turned to the BigQuery ecosystem, we were faced with a dizzying array of options: BigLake Metastore, BigQuery tables for Apache Iceberg, and read-only Iceberg external tables. I know it can be painful to keep track of so many names and offerings that all seem to refer to the same thing. What’s the difference, and when should you use each one? In this post, I’ll break down the analysis, sharing a clear side-by-side comparison to help you choose the right path for your specific use case.

Side-by-Side Comparison

iceberg_in_BQ/iceberg_on_BQ-side_by.csv at main · oscarpulido55/iceberg_in_BQ

Preferred Use Cases

1. BigLake Metastore (BLMS)

BigLake Metastore is the recommended metastore on Google Cloud and is ideal for organizations building a true open-format data lakehouse where multiple analytics engines need to operate on the same data with a consistent view of the tables.

Interoperability: The primary use case is when you need to use both BigQuery and open-source engines like Apache Spark to read and write to the same Iceberg tables.[1] For example, a data engineering team can use Spark for complex ETL transformations to create or modify an Iceberg table, and a data analysis team can immediately query that same table in BigQuery without any synchronization steps.[1]
Centralized Metadata Management: It is best for scenarios that require a single, unified catalog for all open-format tables, eliminating metadata silos and simplifying data governance across different platforms like Dataproc, Spark stored procedures, and BigQuery.[1]
Migrating from Hive Metastore: BLMS is the target when migrating from a Hive Metastore, whether self-managed or hosted on Dataproc Metastore, to a serverless, fully managed solution on Google Cloud.[1]

2. BigQuery Tables for Apache Iceberg

This option is best suited for users who want the benefits of the open Iceberg format while retaining the simplicity and fully managed experience of native BigQuery tables.

BigQuery-Centric Workloads: Ideal for teams that primarily use BigQuery for their analytics, DML operations, and streaming ingestion but want their data stored in an open format for long-term flexibility or for occasional reads from other engines.[2]
Simplified Management: Use this when you want Google to handle all the complex table maintenance tasks, such as file compaction, garbage collection, and performance optimization, just like it does for standard BigQuery tables.[2]
Open Format with BigQuery Features: This is the preferred choice when you need to combine the Iceberg format with BigQuery-native features like time travel, seamless streaming with the Storage Write API, and automatic storage optimization.[2]

3. Apache Iceberg Read-Only External Tables

These tables are designed for querying existing Iceberg datasets that are managed and written to by external systems.

Querying Existing Iceberg Tables: The main use case is to provide BigQuery users with read-only access to Iceberg tables that are already created and maintained by other processes or engines (e.g., a Spark cluster running on-premises or in another cloud).[3]
Multi-Cloud Analytics: Perfect for when your Iceberg data resides in AWS S3 or Azure Blob Storage and you want to query it from BigQuery without moving the data.[3]
Fine-Grained Access Control: Use this to apply and enforce BigQuery’s granular security policies — such as row-level security, column-level security, and data masking — on your existing Iceberg data for consumption by BigQuery users.[3] It is the recommended approach, in conjunction with BigLake Metastore, to provide governed, read-only access to your open data lake.

So, while the number of options for Iceberg on Google Cloud might seem daunting at first, it’s actually a reflection of a mature ecosystem designed to fit different needs. There isn’t one “best” way — only the way that’s best for you. If you’re building a true multi-engine lakehouse, the BigLake Metastore is your central nervous system. If you love the simplicity of BigQuery but need an open format, the native Iceberg tables are a perfect fit. And if you just need to grant governed, read-only access to your existing Iceberg data, external tables have you covered. The key is to understand that these aren’t competing products but rather complementary tools in your data strategy. By aligning your architectural needs with the right offering, you can unlock the full power of an open and flexible data platform on Google Cloud.
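The decision rule in this summary can be written down as a small helper. This is only a sketch: the `Requirements` flags and the `suggest_option` function are hypothetical names invented for illustration, not part of any Google Cloud API, and a real evaluation would weigh more factors than these two.

```python
from dataclasses import dataclass
from enum import Enum


class IcebergOption(Enum):
    BIGLAKE_METASTORE = "BigLake Metastore"
    BIGQUERY_ICEBERG_TABLES = "BigQuery tables for Apache Iceberg"
    READ_ONLY_EXTERNAL = "Apache Iceberg read-only external tables"


@dataclass
class Requirements:
    # Hypothetical flags summarizing a workload (assumptions for this sketch).
    multi_engine_read_write: bool = False    # e.g. Spark and BigQuery both write
    tables_written_externally: bool = False  # another system owns the tables


def suggest_option(req: Requirements) -> IcebergOption:
    """Map the post's three rules of thumb to one suggestion."""
    if req.multi_engine_read_write:
        # Several engines need a consistent view of the same tables.
        return IcebergOption.BIGLAKE_METASTORE
    if req.tables_written_externally:
        # BigQuery users only need governed, read-only access.
        return IcebergOption.READ_ONLY_EXTERNAL
    # Default: BigQuery-centric workload that wants an open format.
    return IcebergOption.BIGQUERY_ICEBERG_TABLES
```

For example, `suggest_option(Requirements(multi_engine_read_write=True))` returns `IcebergOption.BIGLAKE_METASTORE`. Since the options are complementary, a real platform may well combine more than one of them.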

Sources

Iceberg Ahead! Navigating the Three Flavors of Iceberg on BigQuery was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

AI-Powered Legacy Migration: Modularity, Human-in-the-Loop, and LLM Last

Oscar Pulido — Tue, 25 Mar 2025 15:51:48 GMT

AI is rapidly transforming application development. Tools like Google Gemini Code Assist and GitHub Copilot accelerate development cycles by assisting with code writing, providing real-time suggestions, ensuring consistency and clarity, aiding in code review to identify bugs and performance bottlenecks, and automatically generating baseline unit tests. In some instances, they can even write code from unit tests, speeding up feature implementation and application creation.

A different challenge arises when modernizing legacy applications for the cloud. Migrating code from outdated languages requires precise translation while preserving original functionality. Here, AI shifts from real-time assistance to a batch refactoring tool for large-scale code transformation.

One major hurdle of this approach is the output limit inherent in Large Language Models (LLMs). Generating large volumes of translated code is costly and increases the risk of the LLM deviating from the original logic, potentially introducing errors. Furthermore, when processing extensive codebases, the LLMs may struggle to maintain a comprehensive understanding of the relationships between different modules and functions, leading to inconsistencies in the translation.

Beyond output size limits and context, other issues arise, including the need for specialized training data for specific legacy languages and the difficulty of automatically verifying the functional equivalence of the translated code. Addressing these challenges is crucial to unlocking the full potential of AI in modernizing legacy applications.

Chunking, iterative processing, and prompt engineering

Overcoming AI’s limitations in modernizing legacy applications requires innovative strategies. The main solution to LLM output token length limitations is combining chunking, iterative processing, and refined prompt engineering. LLMs rely on attention mechanisms that track relationships between tokens in their context window. This memory requirement, growing with input and output length, necessitates output limits and affects the model’s cost and capabilities. Longer outputs also risk coherence, as LLMs can deviate from the topic or generate nonsense, which output limits help prevent.

While maintaining coherence over long outputs remains a challenge, chunking and iterative processing, coupled with specific and clear prompts, can significantly improve results. Key strategies include focusing on functional equivalence for conversions, limiting the output size per chunk, and implementing mechanisms for handling outliers.

Divide and Conquer: Break down large code files into smaller, manageable chunks that fit within the LLM’s token limits. Chunking should be performed deterministically at the end of functional code blocks. Process each chunk independently, minimizing the risk of exceeding the context window or output token limit.
Iterative Refinement: Handle files of arbitrary size by processing them iteratively, ensuring each chunk remains within the LLM’s capabilities.
Prompt Control: Use well-crafted prompts to ensure the LLM returns only the requested converted and formatted code, eliminating extraneous information.
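The three strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the project: the `END` block terminator, the character budget used as a rough stand-in for token limits, and the `translate` callable (which would wrap the LLM call and its prompt) are all hypothetical.

```python
import re
from typing import Callable, Iterator, List


def split_into_blocks(source: str, block_end: str) -> List[str]:
    """Split source after every line matching block_end (a regex).

    The split is deterministic: the same file always yields the same blocks.
    """
    pattern = re.compile(block_end)
    blocks: List[str] = []
    current: List[str] = []
    for line in source.splitlines(keepends=True):
        current.append(line)
        if pattern.search(line):
            blocks.append("".join(current))
            current = []
    if current:  # trailing lines after the last block terminator
        blocks.append("".join(current))
    return blocks


def pack_chunks(blocks: List[str], max_chars: int) -> Iterator[str]:
    """Greedily group whole blocks into chunks of at most max_chars.

    A single block larger than max_chars is emitted alone, as an outlier
    to be handled separately (for example, by human review).
    """
    chunk = ""
    for block in blocks:
        if chunk and len(chunk) + len(block) > max_chars:
            yield chunk
            chunk = ""
        chunk += block
    if chunk:
        yield chunk


def translate_file(
    source: str,
    translate: Callable[[str], str],
    block_end: str = r"^\s*END\b",  # assumed terminator of a functional block
    max_chars: int = 4000,          # assumed budget, a proxy for token limits
) -> str:
    """Translate a file of arbitrary size, one chunk at a time."""
    blocks = split_into_blocks(source, block_end)
    return "".join(translate(chunk) for chunk in pack_chunks(blocks, max_chars))
```

Because chunks are cut only at block boundaries, each call to `translate` sees complete functional units, and the prompt inside it can ask for nothing but the converted code.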

Architectural principles

To truly boost the migration process, AI applications that leverage techniques like Retrieval-Augmented Generation (RAG) and AI agents should be developed as modular, agentic workloads. Each agent should be evaluated on its own and should be able to run independently or in a chain, where the output of one agent becomes the input of another. This is crucial because LLMs perform best when they have complete context and the problem to solve is narrow and well-defined.

An agentic workload built on these principles needs to be framed within a revised migration strategy, one guided by the following architectural principles:

Modular: To maximize impact and address diverse use cases, accelerator designs must be as generic as possible. This modularity also aligns with the fact that AI achieves better accuracy when the problem to solve is highly specific.
LLM Last: Prioritize cost-effective and deterministic alternatives, such as NLP and traditional ML techniques, whenever possible. Only resort to LLMs when other options are exhausted. This approach reduces risk and increases the overall quality of the replatforming baseline generation by reserving LLMs for the tasks where their unique capabilities are truly essential.
Human-in-the-Loop: Avoid aiming for 100% automation. Instead, leverage AI to provide a strong foundation and solve the cold start problem. Recognize that AI cannot handle the entire task autonomously, but it can excel at predictable, well-scoped pieces of work based on carefully crafted prompts.
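As a rough illustration of how "LLM Last" and "Human-in-the-Loop" can fit together, the sketch below tries deterministic rules first and only falls back to an LLM when none applies, tagging that output for review. Everything here is hypothetical: `move_rule` is a toy rule for a COBOL-like `MOVE x TO y` statement, and `llm_fallback` stands in for whatever model call a real accelerator would make.

```python
import re
from typing import Callable, List, Optional, Tuple

# A rule returns the converted snippet, or None when it does not apply.
Rule = Callable[[str], Optional[str]]


def move_rule(snippet: str) -> Optional[str]:
    """Toy deterministic rule: 'MOVE A TO B.' becomes 'B = A'."""
    match = re.fullmatch(r"\s*MOVE\s+(\w+)\s+TO\s+(\w+)\s*\.?\s*", snippet)
    if match is None:
        return None
    return f"{match.group(2)} = {match.group(1)}"


def convert_with_llm_last(
    snippet: str,
    rules: List[Rule],
    llm_fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (converted_code, origin).

    Deterministic rules are tried in order; the LLM is called only when
    every rule declines, and its output is tagged so a human reviews it.
    """
    for rule in rules:
        result = rule(snippet)
        if result is not None:
            return result, "deterministic"
    return llm_fallback(snippet), "llm-needs-review"
```

Keeping each rule small and independent also follows the modular principle: rules can be tested, reused, and replaced one at a time.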

Thanks to Pradeep Bhattiprolu, and Aadila Jasmin who collaborated on this story.

AI-Powered Legacy Migration: Modularity, Human-in-the-Loop, and LLM Last was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Orchestrating Your Data Pipelines on Google Cloud

Oscar Pulido — Thu, 31 Oct 2024 18:51:01 GMT

In the ever-evolving world of data engineering, orchestrating data pipelines efficiently is paramount. Google Cloud Platform offers a rich set of tools for this purpose, each with its own strengths.

Domain-Based Orchestration vs. Central Orchestration

Before diving into specific tools, let’s address the strategies for data orchestration:

Domain-Based Orchestration

This approach advocates for decentralized ownership, where each data domain team manages its own pipelines and tools. This structure, inherent to the data mesh paradigm, can simplify IAM, networking, and potentially offer better control and agility for individual teams. However, it can lead to increased operational overhead and complexity, especially when managing dependencies across domains.

The Analytics Engineering Framework (AEF) offers a tangible example of this by deploying one Composer environment for each data orchestration domain team. Each domain team should manage a copy of the data orchestration repository and, consequently, its own dedicated Composer environment. This setup, while providing isolation and control, can exemplify the potential for increased operational overhead as the number of independent environments and repositories multiplies.

Central Orchestration

This approach consolidates orchestration within a single project, account, or environment, centralizing DataOps and potentially simplifying management. It may be easier to understand and manage, particularly in single-domain environments or those with shared networks. However, it may require careful IAM adjustments to ensure proper access control. Challenges in multi-tenant environments include a single point of failure, limited scalability, the eventual need for logical groupings, opportunity cost, and insufficient isolation.

Additionally, it’s crucial to select a tool that supports cost-effective horizontal scalability for your central orchestration implementation. If your chosen tool cannot scale to meet your organization’s growing needs, you may be forced to distribute the workload across multiple instances. This can complicate pipeline dependencies and lead to difficult decisions about how to divide the workload (e.g., by line of business, data layer, or load type). Such a scenario necessitates careful business alignment, further emphasizing the importance of choosing a horizontally scalable tool from the outset.

The AEF exemplifies this approach by deploying orchestration workflows within a single project for all data domain teams, leveraging the fully serverless and horizontally scalable capabilities of Cloud Workflows. In this setup, a central Google Cloud project houses all the orchestration workflows. Data domain teams, whether using multiple or a single data orchestration repository, write their data pipelines and deploy them to this centralized project.

Both approaches are valid. The choice between them depends on your organization’s structure and technical expertise, and should be informed by factors such as the number of domains, data access patterns, DataOps strategy, and networking configurations.

Cloud Composer: The Powerhouse for Data Pipelines

Built on the popular Apache Airflow, Cloud Composer is a fully managed service designed for complex data pipelines. Its Python-based DAGs offer flexibility and a rich ecosystem of operators for various tasks. Cloud Composer shines in scenarios requiring:

Complex Workflows: Handling intricate dependencies, branching logic, and diverse data sources.
Hybrid and Multi-Cloud Environments: Orchestrating workflows spanning on-premises systems and multiple cloud providers.
Large-Scale Batch Processing: Executing long-running, resource-intensive tasks.

However, Airflow, and consequently Composer, has its limitations:

Scalability Limits: While Composer 3 introduces improvements, Airflow’s architecture inherently poses challenges for horizontal scalability and serverless operation. Optimizing DAGs for concurrency and scaling environments can be complex.
Operational Overhead: Managing Composer environments, including DR management and backup, security configurations, and dependency management, can add operational complexity.
Cost: Each Composer environment incurs fixed costs, and optimizing Cloud Composer for cost-effectiveness requires careful consideration and ongoing effort. Composer 3 pricing is based on Data Compute Units (DCUs) but it keeps a cluster running in a tenant project even when no tasks are being executed.

Despite these limitations, Composer, particularly when adopted in a domain-based approach with potentially smaller, dedicated environments, remains a powerful tool for organizations with complex data orchestration needs.

Cloud Workflows: Serverless Simplicity for Service Orchestration

Cloud Workflows offers a serverless, fully managed platform for orchestrating services through HTTP-based APIs. Defined using YAML or JSON, workflows excel in scenarios requiring:

Service Integration: Chaining together microservices, APIs, and serverless functions.
Event-Driven Automation: Triggering workflows based on events like file uploads, database changes, or messages in Pub/Sub.
Cost-Effective Scalability: Automatic scaling based on demand, with a pay-per-use billing model.

While Cloud Workflows excels in service orchestration, it currently lacks some features crucial for robust data pipeline management, such as advanced retry mechanisms and data-aware triggers.

Bridging the Gap: The Analytics Engineering Framework (AEF)

The AEF emerges as a potential solution. By abstracting data pipelines into levels, threads, and steps, AEF simplifies pipeline definition and promotes best practices like CI/CD and a multi-repository strategy.

Importantly, AEF supports both Cloud Composer and Cloud Workflows, allowing you to choose the best tool for your needs. This flexibility enables a hybrid approach, leveraging Composer’s power for complex pipelines and Workflows’ simplicity for service orchestration.

BigQuery Workflows: Streamlining BigQuery-Centric Orchestration

BigQuery Workflows simplifies the scheduling and orchestration of tasks within the BigQuery ecosystem, which is rapidly growing into a single unified serverless data platform. It allows you to schedule SQL scripts, notebooks, and data transfers, offering a centralized location for managing BigQuery-centric workflows.

This growing ecosystem includes new capabilities like running Apache Spark and Flink jobs directly within BigQuery, further solidifying its position as a comprehensive data platform. With BigQuery Workflows, you can seamlessly orchestrate these diverse tasks, including SQL analytics, Spark processing, and Apache Kafka and Flink streaming pipelines, all within a unified environment.

Choosing the Right Tool: A Strategic Decision

Selecting the optimal data orchestration tool involves carefully considering your organization’s specific needs and constraints. Here’s a quick recap:

Cloud Composer: Ideal for complex data pipelines, hybrid/multi-cloud environments, and large-scale batch processing. Best suited for domain-based orchestration.
Cloud Workflows: Perfect for service orchestration, event-driven automation, and cost-effective scaling. Well-suited for central orchestration. Will require engineering to manage data use cases properly.
BigQuery Workflows: Simplifies scheduling and orchestration within the fast-growing BigQuery ecosystem. Ideal for BigQuery-centric workflows.

The AEF provides a valuable framework for implementing either domain-based or central orchestration, offering flexibility and promoting best practices.

Ultimately, the key is to choose a solution that aligns with your data strategy, technical capabilities, and long-term vision for your data platform.

Orchestrating Your Data Pipelines on Google Cloud was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stop Thinking in Data Pipelines, Think in Data Platforms: Introducing the Analytics Engineering…

Oscar Pulido — Mon, 28 Oct 2024 22:04:34 GMT

Stop Thinking in Data Pipelines, Think in Data Platforms: Introducing the Analytics Engineering Framework

Imagine a world where you could deploy your entire enterprise-ready data platform in minutes and empower your data practitioners to independently write complex, end-to-end data pipelines in a standardized and scalable way, allowing them to focus on insights from day one.

This is the vision behind the Analytics Engineering Framework (AEF), a comprehensive, opinionated design and sample code for building and deploying robust, flexible, and cost-effective data platforms in Google Cloud Platform (GCP).

Consider using the AEF if you are looking for an enterprise-ready strategy and a declarative approach that enables end data practitioners to build robust, scalable, and cost-effective data pipelines in GCP, especially if you’re starting from scratch or looking to modernize an existing platform and you need:

To empower your data practitioners with a self-service model.
A practical, scalable data mesh implementation ready for deployment.
To simplify and standardize your data pipeline orchestration.
To reduce operational overhead and costs with a serverless-first approach.
To improve the maintainability and reliability of your data pipelines using a solid CI/CD and repository strategy.

The challenge with maintainable Data Pipelines

While BigQuery offers a comprehensive solution for modern data challenges by unifying data management, governance, and analysis within a single platform, building a robust DataOps strategy around it often feels like assembling a complex puzzle, requiring organizations to piece together various tools for IaC, CI/CD, data ingestion, orchestration, and governed cataloging.

This complexity leads to substantial engineering costs for setup, maintenance, and training, creating a high barrier to entry for organizations seeking to leverage the power of data analytics on Google Cloud Platform while simultaneously adhering to software engineering principles for a maintainable platform.

The situation is further complicated by the often unclear ROI of data initiatives. Investing heavily in a complex platform can be risky, especially when the value derived from the data is uncertain.

Fragmented Development: Individual pipelines are built in isolation, leading to inconsistencies and redundancy, and preventing organized growth and fast implementation of new use cases.
Centralized Bottleneck: Data practitioners often rely on data engineers to build and manage pipelines, creating a bottleneck and hindering agility.
Limited Scalability and Flexibility: Scaling and adapting pipelines to new use cases can be challenging and time-consuming.
Cost Inefficiency: Orchestration tools like Apache Airflow can be costly to operate at scale, and some organizations need cost-effective alternatives.

Data teams often end up with technical debt surrounding CI/CD, IaC, observability, and the least privilege principle. Establishing a foundational data platform that proactively addresses these potential gaps would empower teams to concentrate their efforts on building their data pipelines.

AEF: A Paradigm Shift in Data Platform CI/CD

The AEF addresses these challenges by introducing a paradigm shift in data platform development. Its core design principles are:

1. CI/CD and Declarative-Driven Approach:

AEF leverages a multi-repository strategy, with dedicated repositories for:

Orchestration Framework: Maintained by analytics engineers to provide seamless, extensible orchestration and execution infrastructure.
Data Model: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
Data Orchestration: Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
Data Transformation: Directly used by end data practitioners to define, store, and deploy data transformations.

This separation of concerns allows for independent deployment, scalability, and clear ownership of different platform components. Each repository should have its own CI/CD pipeline, enabling independent deployment and faster iteration cycles.

2. Embracing the Analytics Engineering Concept:

AEF is built on the principles of analytics engineering, empowering data practitioners to independently build, organize, transform, and document data using software engineering best practices. This fosters a self-service data platform where data practitioners can create their own data products while adhering to a federated computational governance model.

3. Agnostic to Orchestration and Processing Tools:

AEF is designed to be agnostic to the orchestration tool and data processing engine. While it provides sample orchestration code for Cloud Workflows and Cloud Composer, it can be integrated with other tools based on specific needs. This flexibility allows for seamless integration with existing systems and future-proofs the data platform.

4. Serverless-First Approach:

AEF prioritizes serverless technologies, leveraging the scalability, cost-effectiveness, and ease of use of services like Cloud Functions, BigQuery, and Cloud Workflows. This minimizes the need for long-term running servers, reducing operational overhead and costs.

5. Cost-Effectiveness:

By leveraging serverless technologies and providing a standardized framework for data pipeline development, AEF significantly reduces the overall cost of building and operating a data platform. This ensures cost-effectiveness and makes the platform accessible to a wider range of organizations.

Multi-Repository Strategy is Core to AEF

While data models are rarely changed due to their significant impact, they require robust schema evolution controls. Data pipelines, however, change more frequently and are developed by various personas. Reusable data transformations can also be shared across teams.

Additionally, core capabilities must be secured and standardized to ensure availability for multiple diverse teams while enabling a self-served data platform experience.

Therefore, segregation of responsibilities, independent repositories, and CI/CD pipelines are crucial for a scalable and robust data platform.

The multi-repository strategy and a robust CI/CD strategy are at the heart of AEF’s design, enabling:

Clear Ownership and Responsibility: Different teams can own and manage specific repositories, fostering a sense of ownership and accountability.
Independent Deployment and Scalability: Independent CI/CD pipelines for each repository allow for granular control over deployment and scaling of individual components that are released at different frequencies.
Isolation of Failures: Failures in one repository are less likely to impact other components, ensuring overall platform stability.
Easier Rollbacks: Changes can be rolled back more easily at a granular level, minimizing the impact of deployment issues.
Parallel Development: Multiple teams can work on different repositories simultaneously, accelerating development and fostering collaboration.

CI/CD plays a crucial role by automating deployment, which enables a self-serve approach and minimizes errors. Additionally, it enables version control and rollbacks, allowing for easy recovery in case of issues. Automated tests can be integrated into the CI/CD pipeline, ensuring the quality and integrity of the code. By streamlining these processes, CI/CD facilitates faster iteration cycles, accelerating the development and deployment of new features and bug fixes.

Domain-Based Orchestration vs. Central Orchestration

Data orchestration is crucial for data lakes and warehouses. While Composer offers benefits, it doesn’t remove Airflow’s operational challenges. A serverless, cost-effective alternative built on Cloud Workflows, with automatic scaling and a pay-per-use model, optimizes resource usage, reduces costs, and accelerates time to value for some use cases.

The AEF offers the flexibility to choose the orchestration framework that best fits each organization.

Domain-Based Orchestration: Isolating orchestration by domain potentially simplifies IAM and networking management but may lead to increased operational overhead and complexity. This approach is preferred in multi-domain environments with distinct data access and processing needs. This is demonstrated in the Data Orchestration repository, where one Composer environment is managed and deployed for each data domain team, with each team owning a copy of the repository.

Central Orchestration: Consolidating orchestration into a single project centralizes DataOps and potentially reduces management complexity. This approach may be simpler to understand and manage, particularly in single-domain environments or those with shared networks. However, it may necessitate IAM adjustments. This is easily managed when using Cloud Workflows, as its serverless nature enables deployment within a single centralized project for all data domain teams.

Both approaches are valid, and the “definitive” guide can be adapted based on specific organizational requirements and constraints. Factors such as the number of domains, data access patterns, and networking configurations should inform the decision.

Simplifying the development of complex data pipelines

Data pipeline abstractions

One of the central components of the AEF is its approach to data pipeline orchestration. Recognizing the need for a simplified yet powerful orchestration abstraction, AEF introduces three core abstractions:

Steps: Represent individual data transformations, such as executing a BigQuery saved query, running a Dataflow or serverless Dataproc job, or even triggering a Dataform repository run.
Threads: Group a sequence of steps to be executed one after the other. Different threads can run in parallel, enabling parallel execution of different sets of tasks.
Levels: Allow for a combination of sequential and parallel execution, with multiple threads running concurrently within a level and subsequent levels executing only after all tasks in the previous level are completed.

With these three simple concepts, declaratively defined as parameter files, end data practitioners can independently define complex data pipelines.
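To make the execution semantics concrete, here is a minimal sketch of how levels, threads, and steps relate. The nested structure below is a hypothetical stand-in for a parameter file, not the AEF's actual schema, and `run_step` is a placeholder for whatever launches a real job (a BigQuery query, a Dataflow job, and so on).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Step = str                 # name of an individual transformation
Thread = List[Step]        # steps that run one after the other
Level = Dict[str, Thread]  # threads (by name) that run concurrently


def run_thread(steps: Thread, run_step: Callable[[Step], None]) -> None:
    """Run the steps of one thread sequentially."""
    for step in steps:
        run_step(step)


def run_pipeline(levels: List[Level], run_step: Callable[[Step], None]) -> None:
    """Run levels in order; within a level, run all threads in parallel.

    The next level starts only after every thread of the current level
    has finished. A failing step re-raises here and stops the pipeline.
    """
    for level in levels:
        with ThreadPoolExecutor(max_workers=max(1, len(level))) as pool:
            futures = [
                pool.submit(run_thread, steps, run_step)
                for steps in level.values()
            ]
            for future in futures:
                future.result()


# Hypothetical pipeline: two threads in level 1, one thread in level 2.
example_pipeline: List[Level] = [
    {"ingest_sales": ["extract_sales", "load_sales"], "ingest_crm": ["load_crm"]},
    {"build_product": ["join_sales_crm"]},
]
```

Calling `run_pipeline(example_pipeline, print)` prints the three level-1 step names in some interleaving, with `extract_sales` always before `load_sales`, and prints `join_sales_crm` last.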

These high-level orchestration definitions are translated into actual Cloud Workflows definitions or Airflow DAGs, and then deployed, as CI/CD pipeline steps.

Similarly, data practitioners will write simple parameter files to define their data transformations in the corresponding repository.

In the same way, the data model repository will store Dataplex metadata definitions to enable data governance, discoverability, and access control. Additionally, the data model repository will keep track of DDLs, schemas and datasets, including their deployment and version control.

Hands on

This demo lets you easily deploy and test the AEF. It includes everything you need to get started, including sample data and sources. You’ll see how to:

Populate sample data:
- Deploy a mock on-prem JDBC data source database.
- Populate sample GCS mainframe files.
- Populate sample GCS CSV files.
Deploy the AEF: Clone the repositories and run the initial Terraform deployments, including sample declarative parameter files for:
- Data model DDLs and metadata.
- Data pipeline definitions (examples for both Cloud Composer and Cloud Workflows).
- Data transformations.
Run end-to-end data pipelines in the AEF, using both Cloud Composer and Cloud Workflows:
- Extract data from different places: This includes mainframe files using Dataproc, and databases using Dataflow flex templates.
- Process the data: Use BigLake to read GCS files, then join and transform data using Dataform.
- Create a data product: Make the transformed data available in BigQuery.

This demo deploys the entire AEF in a single Google Cloud project and shows the complete lifecycle of a data product, from raw data to a final, usable table.

Conclusion

The AEF provides a blueprint for building modern, scalable, and cost-effective data platforms on GCP. Its multi-repository strategy, coupled with a robust CI/CD implementation, empowers data practitioners to independently build and manage complex data pipelines, accelerating time-to-insight and enabling organizations to focus on extracting value from their data.

Stop Thinking in Data Pipelines, Think in Data Platforms: Introducing the Analytics Engineering… was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ensuring data localization compliance on data movement between BigQuery regions

Oscar Pulido — Tue, 08 Aug 2023 13:37:34 GMT

Global organizations need to analyze data coming from the different jurisdictions they operate in to generate important insights for their decision-making processes. On the other hand, data protection laws like the EU’s GDPR (General Data Protection Regulation), or India’s PDPB (Personal Data Protection Bill) include data sovereignty regulations that need to be carefully considered.

While many regulations have data residency requirements that restrict moving PII data out of a given jurisdiction, this generally does not apply to anonymized or aggregated data, which can therefore be analyzed globally.

This post will introduce some opinionated best practices to facilitate cross-border transfers in a privacy-safe and compliant way.

For this approach we will leave aside important infrastructure and networking considerations to focus on the data movement process including resource hierarchy, the required data access controls, and PII detection to mitigate regulation breaching risks.

Projects and Datasets

A robust hierarchy using multiple folders and projects to organize the data facilitates organizational policy enforcement and operations (billing, logging, monitoring), and simplifies the implementation of data access controls and inspection for PII data. The next example considers only environment, region, and data products, but a production-ready deployment should also consider hierarchies such as Data Domains or Data Marts, based on the defined analytics strategy (e.g., DWH, Data Mesh, or even a Lakehouse).

Projects and Datasets for cross-region data sharing

Source Dataset: One dataset per jurisdiction. Its shape depends on the data warehouse model in place: it may hold denormalized data, a star model with fact and dimension tables, a data mart, or any other analytics schema. You can also think of the Source Dataset as the existing dataset holding the source data to be globally analyzed.
Isolated Dataset: One dataset per jurisdiction. It holds locally aggregated data, which could also be PII-free or anonymized data. Prefer new tables over views/authorized views to show clear evidence that no raw data is leaving perimeters, and to be able to use VPC Service Controls.
Destination Dataset: One global dataset in the selected location, in which analytical users will run global analytics. It is the destination for the data coming from all the Isolated Datasets in the different regions.

Source Dataset, Isolated Dataset, and Destination Dataset should each reside in different projects, providing the following advantages:

Additional IAM layer for security.
Can use VPC Service Controls to prevent the local analytic dataset from being API accessible from outside the region.
Prevent global analytical Users/Service Accounts from reading non sharable data.

Access Control Strategy

The end-to-end process of preparing global aggregated data should be achieved with two types of service accounts:

Source Service Account: One service account per jurisdiction that reads local analytical data and prepares local aggregate data.
Sharing Service Account: One common service account that can read the minimum necessary local aggregated data in all jurisdictions.

Service Accounts & required accesses

As the Source Dataset is part of the existing data warehouse model, existing users/processes will keep reading/writing to it. No other service accounts or user accounts should be given permissions to read or write data in the Source Isolated Dataset or in the Destination Dataset. Global analytics users should impersonate service accounts to prevent the end user from joining data with other local analytics data.

It is important to prevent the Sharing Service Account from reading non-aggregated/non-anonymized data (Source Dataset) in the local jurisdiction.
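The access rules above can be written down as a small matrix and checked automatically. The following is only a sketch: the principal labels (`source-sa`, `sharing-sa`), the dataset labels, and the assumption that the Sharing Service Account is the one writing to the Destination Dataset are illustrative and not taken from any real IAM export.

```python
from typing import Dict, List, Set

# principal -> dataset -> allowed permissions (all labels are hypothetical)
Grants = Dict[str, Dict[str, Set[str]]]

EXPECTED: Grants = {
    # Reads local analytical data and prepares the local aggregates.
    "source-sa": {"source": {"read"}, "isolated": {"write"}},
    # Reads only aggregated data; writing to "destination" is an assumption.
    "sharing-sa": {"isolated": {"read"}, "destination": {"write"}},
}


def find_violations(actual: Grants, expected: Grants = EXPECTED) -> List[str]:
    """List every granted permission that the expected matrix does not allow."""
    problems = []
    for principal, datasets in actual.items():
        for dataset, perms in datasets.items():
            allowed = expected.get(principal, {}).get(dataset, set())
            for perm in sorted(perms - allowed):
                problems.append(f"{principal} must not {perm} {dataset}")
    return problems
```

A check like this could run as a periodic audit: feeding it the grants actually observed in each project flags, for example, a Sharing Service Account that has gained read access to a Source Dataset.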

Cross-region Data Movement

Data is moved in two steps. Data ready to be moved should sit in a dedicated/isolated BigQuery dataset or in a dedicated/isolated GCS bucket. So as a first step the data should be moved from the source BigQuery dataset to a dedicated/isolated Cloud Storage bucket or BigQuery dataset. The second step is the actual cross region data movement.

Isolating data in the source region

Data should arrive de-identified or aggregated to the isolated dataset. It can be already de-identified in the Source Dataset or could be de-identified as it is moved to the Isolated Dataset.

Assuming your source data is already de-identified in the BigQuery Source Dataset, you can move it to an isolated BigQuery dataset or to an isolated GCS bucket before taking it out of the region. Here are some options to do that:

bq extract (the bq CLI command that runs export jobs) is the simplest way to extract data from BigQuery tables to Cloud Storage.
You can also use Composer/Airflow and the BigQueryToCloudStorageOperator Airflow operator to move from BigQuery to Cloud Storage.
Table snapshots or table clones could also be used to move data between the source dataset and the isolated dataset within BigQuery without physically duplicating data, which reduces cost.
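As a sketch of the first option, the helper below only assembles the argument list for the bq CLI, whose subcommand for export jobs is `extract`. The table, bucket, and location values in the usage note are placeholders, and the choice of AVRO as the default format is an assumption made for illustration.

```python
from typing import List


def build_extract_command(table: str, gcs_uri: str, location: str,
                          destination_format: str = "AVRO") -> List[str]:
    """Build (but do not run) a bq CLI call that exports a table to Cloud Storage."""
    return [
        "bq",
        f"--location={location}",  # global flag: where the export job runs
        "extract",
        f"--destination_format={destination_format}",
        table,    # e.g. "my-project:isolated_ds.sales_agg" (placeholder)
        gcs_uri,  # e.g. "gs://my-isolated-bucket/sales_agg-*.avro" (placeholder)
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the bq CLI is installed and authenticated as the Source Service Account.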

If data will be de-identified as it is moved to the Isolated Dataset, then an ETL approach (Dataflow/Dataproc) will be necessary to move data to the Isolated Dataset.

Landing the data in GCS is preferred in order to leverage the Storage Transfer Service or GCS rsync for the following step.

Moving Data between regions

As of now (Nov 2023) there is no GA option to read data from BigQuery in one region and write it back to BigQuery in a different region directly. GCS will need to be used for staging in at least one region. We can move the data using no-code options like Storage Transfer Service (STS) or rsync, or using an ETL approach (Dataflow/Dataproc):

Use gsutil rsync for data sizes under 1TB, and use the default CMEK key set on the GCS source and destination buckets.
Use STS for data sizes above 1TB; it scales to larger data sizes and supports transferring data to and from CMEK-protected buckets.
Both Dataproc and Dataflow ETL options support setting a temporary bucket when moving data between different BigQuery regions.

GCS temporary bucket location when moving data across regions; this applies to ETL options only. gsutil rsync and STS always transfer from and to GCS.
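The sizing guidance above can be captured in a tiny decision helper. This is a sketch only: the text does not say what to do at exactly 1TB, so sending that case to STS is an assumption, as is treating 1TB as 10^12 bytes.

```python
TB = 10**12  # decimal terabyte (assumption)


def choose_transfer_option(size_bytes: int, needs_transformation: bool = False) -> str:
    """Pick a cross-region transfer mechanism following the guidance above."""
    if needs_transformation:
        # Only an ETL job (Dataflow/Dataproc) can reshape data while moving it.
        return "etl"
    if size_bytes < TB:
        return "gsutil rsync"
    return "storage transfer service"
```

For example, `choose_transfer_option(200 * 10**9)` selects gsutil rsync, while a 5TB payload is routed to STS.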

There are two offerings in preview (Nov 2023) that allow you to move data between different regions directly in BigQuery:

BigQuery cross-region dataset copy allows you to copy an entire dataset across regions without an ETL job and without moving data out of BigQuery. However, there are several limitations: you cannot move views, UDFs, external tables, or CMEK-encrypted tables; appending data in the destination dataset is not supported; and the minimum frequency between copy jobs is 12 hours.
If the use case implies keeping a read-only replica in the destination region, then cross-region dataset replication could be considered, as it is simple to set up. However, it is important to look at the limitations, as this option is intended for additional geo-redundancy, not specifically for cross-region data sharing.

In these two cases the presented organization hierarchy and access control strategy should also be considered to guarantee jurisdiction data isolation and secure data movement.

Mitigating cross-region PII data exfiltration risk

The DLP (Data Loss Prevention) service is a fundamental tool in ensuring PII data is not transferred outside the local jurisdiction. Inspection jobs and inspection templates can be used to publish tags in Data Catalog at a table level, or at a column level using a Dataflow job.

DLP inspection jobs inspecting sampled data in the Isolated dataset could run in batch to improve cost effectiveness.

To avoid delaying data movement, inspection jobs can run in parallel to the cross-region data replication jobs, ensuring you identify the PII data at least at the same time it is being moved, so you can stop a data movement or delete already transferred data based on alerts.
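A minimal sketch of this parallel pattern, using only the Python standard library. The three callables are hypothetical stand-ins for the real transfer job, the DLP inspection job, and the cleanup of already transferred data. For simplicity the sketch waits for the transfer to finish and then deletes, rather than cancelling it midway.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def move_with_inspection(transfer: Callable[[], None],
                         inspect: Callable[[], List[str]],
                         rollback: Callable[[], None]) -> bool:
    """Run transfer and inspection side by side; undo the transfer if PII is found.

    Returns True when the moved data may stay in the destination.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        transfer_future = pool.submit(transfer)
        inspect_future = pool.submit(inspect)
        findings = inspect_future.result()  # list of detected info types
        transfer_future.result()            # wait for the copy (re-raises errors)
    if findings:
        rollback()  # delete what was already transferred
        return False
    return True
```

In an orchestrated pipeline the same shape would typically be expressed as two parallel tasks followed by a branching task, rather than threads inside one process.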

Once DLP identifies sensitive data, a policy tag can be automatically created to further restrict access depending on the content.

If the process is orchestrated using Composer, a Data Catalog Airflow operator could obtain entry details, including tags and their values, to be used as a control or validation step in the transfer pipeline.

PII validation step on Orchestrated data movement
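The validation step itself can be a very small function once the tags have been fetched. In this sketch the tag layout (a `has_pii` boolean per table) is a hypothetical convention, not the schema of any real tag template, and untagged tables are held back on purpose so that the check fails closed.

```python
from typing import List, Mapping, Optional


def tables_to_hold(tags_by_table: Mapping[str, Optional[Mapping[str, object]]]) -> List[str]:
    """Return the tables that must NOT be transferred.

    A table is held when its tag reports PII, or when it has no tag yet
    (meaning it has not been inspected).
    """
    held = []
    for table, tag in tags_by_table.items():
        if tag is None or tag.get("has_pii", True):
            held.append(table)
    return sorted(held)
```

A pipeline could fail or skip the transfer task whenever this list is not empty.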

Conclusion

At the time of writing this article, STS, gsutil rsync, and bq extract (export) are no-code solutions to isolate and replicate data across regions. In conjunction with the presented access control strategy, hierarchy design, and DLP-based PII identification process, they provide jurisdiction data isolation and ensure data is moved in a secure way.

Thanks to Himal Dwarakanath, Daryus Medora and Julianne Cuneo who collaborated on this story.

Single User Jupyter Notebooks at Google Cloud

Oscar Pulido — Tue, 18 Jul 2023 14:25:39 GMT

professional-services/examples/personal-workbench-notebooks-deployer at main · GoogleCloudPlatform/professional-services

Enterprises need analytical users and data scientists to use their own identity (rather than generic service accounts) when querying and processing data on their experiments to make data usage monitoring and cost allocation easier in a governed environment.

Notebook environment lifecycle automation is also necessary to scale and serve an enterprise-level number of users.

Data Scientists working with large amounts of data may need to run jobs in Spark or other distributed processing engines on Dataproc (managed Hadoop). For others, a Python kernel or single node Spark environment would be enough.

As a data platform central governance manager, you don’t need to provide analytical users with access to the GCP web Console, but to an on-demand self provisioned Jupyter environment.

The Terraform modules introduced here intend to help with the provisioning process of individual user analytical environments.

Google provides two Jupyter notebook-based options for your data science workflow:

Managed Notebooks

Managed notebooks are designed to manage the provisioning, submission, and decommissioning of resources via notebook instances running as Vertex AI managed VMs in a tenant project.

Identity impersonation: For Managed Notebooks to impersonate the end-user identity when querying data across other GCP services (such as GCS and BigQuery), you can set the Single User access mode to grant access to a specific user only, so they can log in to the Jupyter environment using their own credentials.
Kernels: Managed notebooks are instances that can run Python, Spark standalone single node, R, and shell kernels.

Single-user Vertex AI Workbench Managed Notebook

The personal-managed-notebook module is intended to provide automation via Terraform to create individual managed notebooks for each end-user.

User Managed Notebooks / Dataproc Hub

User Managed notebooks allow heavy customization and the use of custom images, and run as VMs in the customer project.

Identity impersonation: For User Managed Notebooks / Dataproc Hub to impersonate end-user credentials when querying data, Dataproc clusters must have ‘Personal Cluster Authentication’ enabled.
Kernels: User Managed notebooks / Dataproc Hub are naive instances that allow users to create Dataproc clusters that can run heavy Spark jobs as well as Python kernels.

A specific type of User Managed Notebook is the Dataproc Hub instance, which runs JupyterHub instead of JupyterLab, serving as a bridge for users to create Dataproc clusters on demand and run JupyterLab there using an administrator-predefined cluster template.

Dataproc Hub notebooks are administrator-curated notebooks running on a Dataproc JupyterLab cluster that sits in the user’s project. Dataproc Hub helps provide templated Dataproc notebook environments to users.

User-managed notebook / Dataproc Hub high level diagram

JupyterHub itself runs in a Notebooks instance that never hosts a Notebook server. The Notebooks instance only works to leverage the Inverting Proxy and provide a secure URL to JupyterHub to the users. When a user selects a template and creates a cluster, JupyterHub redirects the user to Dataproc Notebooks through the Component Gateway.

Main components:

JupyterHub: UI + Web Server for users to pick Jupyter notebooks server templates and start Notebooks server somewhere.
JupyterServer: Created as a Dataproc cluster by end-user from template defined by admin.
JupyterLab: Web-based user interface for project Jupyter.

Dataproc Clusters Templates

Dataproc cluster creation is triggered from Managed Notebook instance, by the end-user themselves, based on the YAML template that the Administrator makes available for them in a GCS bucket.

Using URL received from Admin, end-user will access JupyterLab

end-user will trigger cluster creation without accessing GCP console

Here we have a challenge: because we want to use Dataproc Personal Cluster Authentication (so that data is accessed using end-user credentials), the user email needs to be referenced in the YAML template file. This means we cannot have a single template for all users; instead, we need a template for each user.

For User-Managed Notebooks, the sample code includes automation to generate cluster template files for each user, either from a given list or by simply adding new module invocations.
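The per-user generation can be pictured with a few lines of Python. This is not the repository's automation, only a sketch: the YAML fragment is a shortened placeholder, and the `dataproc:dataproc.personal-auth.user` property is assumed here to be how the user is bound to the cluster, so check it against the Dataproc documentation before relying on it.

```python
import re
from string import Template
from typing import Dict, Iterable

# Placeholder fragment, NOT the full template shipped with the sample code.
CLUSTER_TEMPLATE = Template("""\
config:
  softwareConfig:
    properties:
      dataproc:dataproc.personal-auth.user: ${user_email}
""")


def render_user_templates(user_emails: Iterable[str]) -> Dict[str, str]:
    """Return one rendered cluster template per user, keyed by file name."""
    rendered = {}
    for email in user_emails:
        # Turn the email into a file-name-safe slug, e.g. ana-example-com.
        slug = re.sub(r"[^a-z0-9]+", "-", email.lower()).strip("-")
        rendered[f"cluster-{slug}.yaml"] = CLUSTER_TEMPLATE.substitute(user_email=email)
    return rendered
```

Each rendered file would then be uploaded to the GCS location that the Dataproc Hub instance reads its templates from.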

Self-service notebooks provisioning flow

To make notebook environment lifecycle management a self-service process, you can develop a Web UI that allows users to request the creation of an environment. This could include choosing between a single instance, which will translate into a Managed Notebook, or a distributed environment, which will translate into a User Managed Notebook/Dataproc Hub instance.

Notebooks lifecycle automation pipeline

Once the user request is captured by the Web UI, a backend could generate a module invocation file and place it in a Terraform code repository.

Having a new Terraform file in the repository could trigger an automated CI/CD pipeline in charge of applying the Terraform changes, in this case a new Managed Notebook or User Managed Notebook instance creation.
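A backend of this kind could render the module invocation from the request data. The sketch below is illustrative only: the module input name `user_email` and the module source path in the usage note are hypothetical, so the real inputs must be taken from the module's own variables.

```python
import json
import re


def render_module_invocation(user_email: str, module_source: str) -> str:
    """Render a Terraform module block for one user's notebook."""
    # Terraform block labels must be identifier-like, so slugify the email.
    slug = re.sub(r"[^a-z0-9]+", "_", user_email.lower()).strip("_")
    return (
        f'module "notebook_{slug}" {{\n'
        f"  source     = {json.dumps(module_source)}\n"
        f"  user_email = {json.dumps(user_email)}  # hypothetical input name\n"
        "}\n"
    )
```

For example, `render_module_invocation("ana@example.com", "./modules/personal-managed-notebook")` yields a block that the backend can commit as a new `.tf` file, which in turn triggers the pipeline.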

Notebook instance deletion can be automated for Managed Notebooks using the idle shutdown parameter. For User Managed Notebooks it is a little more complex, as it implies deleting both the notebook instance and the Dataproc cluster.

Conclusion

For Data Scientists running single-node Python code, data platform administrators can rely on Managed Notebooks to guarantee end-user identity usage.

For Hadoop-related workflows, Managed Notebooks are not the optimal solution, because using Dataproc as an external kernel makes cluster lifecycle management difficult and Dataproc/Spark Serverless doesn’t support Personal Authentication yet.

Dataproc Hub via User Managed Notebook provides an alternative for Data Scientists to create their own Dataproc Clusters, using admin-curated templates that ensure end-user identity usage via Personal Cluster Authentication.

Thanks to Daryus Medora who collaborated on this story.

Single User Jupyter Notebooks at Google Cloud was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dataproc lifecycle management orchestrated by Composer

Oscar Pulido — Wed, 19 Apr 2023 17:04:16 GMT

Ephemeral Dataproc or EMR Hadoop clusters are infrastructure in the Cloud. However, we usually do not want to manage them as part of the infrastructure provisioning CI/CD pipelines via a mix of tools like Terraform, Jenkins, or GitHub Actions, but as part of the data pipeline orchestration via Airflow or Composer.

Keeping the cluster lifecycle management as part of the data pipeline orchestration allows the data pipelines to take advantage of the elasticity and scalability of the Cloud, and to use resources cost-efficiently by sharing clusters between multiple jobs and by scaling already provisioned clusters instead of creating new ones. Sharing and scaling ephemeral clusters also reduces overall job processing time, as the cluster provisioning time is zero for the second and subsequent jobs running in a cluster.

Imagine you have multiple Spark jobs to be scheduled and run in Dataproc clusters. Internals of each Spark job are unique and related to a specific use case, but it is common that execution parameters are similar, changing only the values between them. Here it makes sense to dynamically generate DAGs based on a template.

I have put together some very simple code to illustrate those concepts:

professional-services/examples/dataproc-lifecycle-via-composer at main · GoogleCloudPlatform/professional-services

It is a Terraform template that creates a Composer environment, together with a folder structure whose code reads JSON parameter files to generate Airflow DAGs and deploy them to the Composer environment.

With this project you can deploy multiple Airflow DAGs: you create one configuration file per DAG, and the DAGs are generated automatically during deployment.

main.tf
...
dags/        (Autogenerated on Terraform Plan/Apply from /dag_config/ files)
├── ephemeral_cluster_job_1.py
└── ephemeral_cluster_job_2.py
jobs/
├── hello_world_spark.py
└── ...      (Add your dataproc jobs here)
include/
└── dag_config
   ├── dag1_config.json
   ├── dag2_config.json
   └── ...   (Add your Composer/Airflow DAGs configuration here)
...

Each DAG will have a task step to run a Dataproc Job referenced in the parameters file, and that Job will be executed in a Dataproc Cluster.

{
    "DagId": "ephemeral_cluster_job_1",
    ...
    "SparkJob":"hello_world_spark.py"
}

The Dataproc Clusters can be reused for multiple Jobs/DAGs, and you can think of it as a queue. If you want two DAGs sharing a cluster, you only need to set the same cluster name parameter in both configuration files.

{
    "DagId": "ephemeral_cluster_job_1",
    ...
    "ClusterName":"ephemeral-cluster-test",
    ...
    "SparkJob":"hello_world_spark.py"
}
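To see which DAGs end up queued on the same cluster, the configurations can be grouped by their `ClusterName` value. A minimal sketch, assuming each configuration has already been loaded into a dictionary with the keys shown above:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def dags_by_cluster(configs: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group DAG ids by the Dataproc cluster name they are configured to use."""
    groups = defaultdict(list)
    for cfg in configs:
        groups[cfg["ClusterName"]].append(cfg["DagId"])
    return dict(groups)
```

Two configuration files that both set `"ClusterName": "ephemeral-cluster-test"` land in the same group, which is exactly the sharing behaviour described above.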

The Dataproc cluster lifecycle management is done by the automatically generated Airflow DAGs, which reuse or create clusters accordingly. The proposed cluster configuration includes an autoscaling policy that allows the cluster to scale out if multiple Jobs are running in it at the same time.

resource "google_dataproc_autoscaling_policy" "dataproc_autoscaling_policy_test" {
  project = var.project_id
  policy_id = var.dataproc_config.autoscaling_policy_id
  location  = var.region
  worker_config {
    max_instances = 5
  }
  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor   = 0.5
      scale_down_factor = 0.5
    }
  }
}

This approach aims to use resources efficiently while minimizing provisioning and execution time.

Prerequisites

This blueprint will deploy all its resources into the project defined by the project_id variable. Note that we assume this project already exists.
The user deploying the project (executing terraform plan/apply) should have admin permissions in the selected project, or permissions to create all the resources defined in the Terraform scripts.

Project Folder Structure

main.tf
...
dags/          (Autogenerated on Terraform Plan/Apply from /dag_config/ files)
├── ephemeral_cluster_job_1.py
└── ephemeral_cluster_job_2.py
jobs/
├── hello_world_spark.py
└── ...        (Add your dataproc jobs here)
include/
├── dag_template.py
├── generate_dag_files.py
└── dag_config
   ├── dag1_config.json
   ├── dag2_config.json
   └── ...     (Add your Composer/Airflow DAGs configuration here)
...

Adding Jobs

Prepare Dataproc Jobs to be executed

1. Clone this repository
2. Locate your Dataproc jobs in the /jobs/ folder in your local environment

Prepare Composer DAGs to be deployed

3. Locate your DAG configuration files in the /include/dag_config/ folder in your local environment. DAG configuration files have the following variables:

{
    "DagId": "ephemeral_cluster_job_1",     --DAG name you will see in Airflow environment
    "Schedule": "'@daily'",                 --DAG Schedule
    "ClusterName":"ephemeral-cluster-test", --Dataproc Cluster to be Used/created for this DAG/Job to be executed in
    "StartYear":"2022",                     --DAG start year
    "StartMonth":"9",                       --DAG start month
    "StartDay":"13",                        --DAG start day
    "Catchup":"False",                      --DAG backfill catchup
    "ClusterMachineType":"n1-standard-4",   --Dataproc machine type to be used by master and worker cluster nodes
    "ClusterIdleDeleteTtl":"300",           --Time in seconds to delete unused Dataproc cluster
    "SparkJob":"hello_world_spark.py"       --Spark Job to be executed by DAG, should be placed in /jobs/ folder of this project. (if other type of Dataproc jobs modify dag_template.py)
}

4. (Optional) You can run python3 include/generate_dag_files.py in your local environment if you want to review the generated DAGs before deploying them (terraform plan/apply).
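The generation step can be pictured as plain template substitution. The sketch below is not the repository's generate_dag_files.py: the template body is a heavily shortened, hypothetical stand-in for dag_template.py, and naming each output file after its DagId is inferred from the folder structure shown earlier.

```python
import json
from string import Template
from typing import Dict, Iterable

# Hypothetical stand-in for include/dag_template.py (the real one defines a DAG).
DAG_TEMPLATE = Template('''\
DAG_ID = "$DagId"
CLUSTER_NAME = "$ClusterName"
SPARK_JOB = "$SparkJob"
SCHEDULE = $Schedule
''')


def render_dags(config_texts: Iterable[str]) -> Dict[str, str]:
    """Render one DAG source file per JSON configuration document."""
    rendered = {}
    for text in config_texts:
        cfg = json.loads(text)
        # Extra keys in cfg are ignored; missing placeholders raise KeyError.
        rendered[f"{cfg['DagId']}.py"] = DAG_TEMPLATE.substitute(cfg)
    return rendered
```

This also suggests why the Schedule value carries its own quotes (`"'@daily'"`): when a value is pasted into Python source as-is, the inner quotes turn it into a string literal.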

Deployment

Set Google Cloud Platform credentials in your local environment: https://cloud.google.com/source-repositories/docs/authentication
You must supply at least the project_id variable in order to deploy the project. Default Terraform variables and example values are in the variables.tf file.
Run Terraform Plan/Apply

 $ cd terraform/
 $ terraform init
 $ terraform plan
 $ terraform apply
##Optionally variables could be used
 $ terraform apply -var 'project_id=' \
-var 'region='

Once you have applied the Terraform plan for the first time and the Composer environment is running, you can run terraform plan/apply again after adding new DAG configuration files to generate and deploy DAGs to the existing environment.

The first time it is deployed, resource creation will take several minutes (up to 40) because of Composer environment provisioning. You should expect successful completion along with a list of the created resources.

Running DAGs

DAGs will run according to the Schedule, StartDate, and Catchup settings in the DAG configuration file, or they can be triggered manually through the Airflow web console after the deployment.
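Because every value in the configuration file is a string, the scheduling fields need converting before they mean anything to a scheduler. The following sketch shows one way to do that; the output key names are hypothetical, and this is not necessarily how the project's template consumes the values.

```python
from datetime import datetime
from typing import Any, Dict, Mapping


def parse_schedule_settings(cfg: Mapping[str, str]) -> Dict[str, Any]:
    """Convert the string-typed scheduling fields of a DAG config into typed values."""
    return {
        "schedule": cfg["Schedule"].strip("'"),  # "'@daily'" -> "@daily"
        "start_date": datetime(
            int(cfg["StartYear"]), int(cfg["StartMonth"]), int(cfg["StartDay"])
        ),
        "catchup": cfg["Catchup"].strip().lower() == "true",
        "idle_delete_ttl_seconds": int(cfg["ClusterIdleDeleteTtl"]),
    }
```

With the sample configuration shown earlier, the start date becomes 13 September 2022 and catchup is disabled, so only runs from deployment onward are scheduled.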

Dataproc lifecycle management orchestrated by Composer was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.