Run data science workloads without creating more data silos

Prawin Selvan
SquareShift
2 min read · Jan 20, 2023

Google Cloud’s BigLake and Dataproc support organizations in their data lake modernization journey. Together they unify data warehouses and data lakes, letting distributed data science teams run Apache Spark and other engines directly on the data lake while preserving governance, access policies, and security rules. The result: fewer data silos and less data duplication.


Organizations are complex, but your data architecture doesn’t need to be

BigLake unifies data warehouses and data lakes to create a centralized repository for storing, processing and securing large amounts of structured, semi-structured, and unstructured data.

Dataproc, in turn, enables distributed teams to run data science and data engineering workloads on the data lake, with access policies and governance defined in Dataplex. Each team gets access only to the data relevant to it, and sensitive information stays protected. For example, a global consumer goods company uses BigLake to map file-based sales data to tables, applies row- and column-level security, and manages data governance at scale through Dataplex.
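As a minimal sketch of what that mapping can look like, the BigQuery DDL below defines a BigLake table over sales files in Cloud Storage and adds a row-level access policy. The project, connection, dataset, bucket, group, and column names are all illustrative placeholders, not from the article.

```shell
# Map file-based sales data in Cloud Storage to a BigLake table
# (project, connection, dataset, and bucket names are placeholders).
bq query --use_legacy_sql=false '
CREATE EXTERNAL TABLE sales_ds.sales
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  format = "PARQUET",
  uris = ["gs://my-sales-bucket/sales/*.parquet"]
);'

# Apply row-level security: EMEA analysts see only EMEA rows
# (group address and "region" column are hypothetical).
bq query --use_legacy_sql=false '
CREATE ROW ACCESS POLICY emea_only
ON sales_ds.sales
GRANT TO ("group:emea-analysts@example.com")
FILTER USING (region = "EMEA");'
```

Column-level security would be layered on top by tagging sensitive columns with policy tags from a Data Catalog taxonomy.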

Data Science with Dataproc on BigLake data

Dataproc lets customers run data science workloads in Jupyter notebooks directly on the data lake using personal authentication, so jobs inherit the governance and security features provided by BigLake and Dataplex. Different teams can operate independently on their regional data in a unified data lake, with policies and access controls defined by Dataplex on BigLake tables.
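A hedged sketch of that setup, assuming Dataproc personal cluster authentication: the commands below create a single-user cluster with the Jupyter component, then start a personal auth session so jobs run with that user's own credentials. Cluster name, region, and user email are illustrative.

```shell
# Create a personal-auth Dataproc cluster with Jupyter enabled
# (cluster name, region, and user are placeholders).
gcloud dataproc clusters create team-emea-cluster \
  --region=europe-west1 \
  --optional-components=JUPYTER \
  --enable-component-gateway \
  --properties="dataproc:dataproc.personal-auth.user=alice@example.com"

# Propagate the user's credentials to the cluster so BigLake and
# Dataplex policies are enforced per user, not per service account.
gcloud dataproc clusters enable-personal-auth-session team-emea-cluster \
  --region=europe-west1
```

From a notebook on that cluster, Spark can then read a BigLake table through the spark-bigquery-connector, e.g. `spark.read.format("bigquery").option("table", "my-project.sales_ds.sales").load()`, and row- and column-level policies apply to the signed-in user.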

Read more about how you can run data science workloads without creating more data silos.

We’re a proud GCP data engineering partner. Read all about our GCP data engineering practice.
