An overview of GCP Dataplex
The aim of this article is to give readers an objective overview of GCP's Dataplex solution. I hope it will help you make an informed decision when considering it as a tool to govern your data mesh.
Before introducing Dataplex, I’d like to provide some context around the problem that it solves.
In an ideal world, every user in the company has timely access to valuable insights derived from high-quality data. The data comes in the format they need, delivered via the tools they are familiar with. Security and access to data are tightly controlled to prevent unauthorised access or leakage and to reduce the risks associated with regulations such as GDPR. Data engineering teams don't spend countless hours building data pipelines and managing data silos. Instead, they spend that time delivering insights that inform decisions and move the business forward.
However, the reality is not so great. Many enterprises struggle to maximise the value of their data: according to the Dataplex team, two-thirds of the data produced is never analysed. This is because data is not easily accessible across multiple silos, a problem compounded by the growing number of people and tools across the organisation. Companies often resort to moving and duplicating data out of these silos to enable analytics, reporting, and ML use cases, but this movement and duplication can degrade data quality. On the other hand, some companies leave the data distributed, limiting the agility of decision-making.
This is where the concept of a data mesh comes in. It's a form of data architecture that acts as a middle ground between complete data democratisation and siloed ownership. The concept, first proposed by Zhamak Dehghani in 2019, is technology-agnostic and scalable across the organisation. It assumes the democratisation (or decentralisation) of data, with ownership by domain. Examples of such domains could be Orders, Shipments, Transactions, Inventory, Customers, etc. The data is owned by specific teams and treated as a product. Each domain contains the data sources relevant to a particular business area, as illustrated below:
You can learn more about the data mesh architecture from the following blog posts written by our experts at Credera:
How to implement a data mesh architecture
Pocket guide: Realising a data mesh architecture
What is Dataplex?
So, how do we allow the use of siloed data without the need for movement or duplication, whilst maintaining ownership and the correct permissions across datasets and domains? Google's answer to this is Dataplex. Described by GCP as an "intelligent data fabric", it allows you to organise your data lakes, marts, and warehouses by domain, enabling a data mesh architecture. In addition, it provides monitoring, governance, and data management. In terms of governance features, Dataplex has some similarities to AWS's Lake Formation.
Dataplex provides a useful layer of abstraction of the data storage sources by using the following constructs:
- Lake: A logical construct representing a data domain or business unit. For example, to organise data based on group usage, you can set up a lake per department (such as Retail, Sales, or Finance).
A new lake can easily be created via the Dataplex console, or programmatically.
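For instance, here is a minimal sketch using the google-cloud-dataplex Python client; the project ID, region, and lake ID are hypothetical placeholders rather than values from the original set-up:

```python
from google.cloud import dataplex_v1

# Hypothetical placeholders; substitute your own project and region.
PROJECT_ID = "my-gcp-project"
LOCATION = "europe-west2"

client = dataplex_v1.DataplexServiceClient()

# Creating a lake is a long-running operation, so we wait on the result.
operation = client.create_lake(
    parent=f"projects/{PROJECT_ID}/locations/{LOCATION}",
    lake_id="sales",
    lake=dataplex_v1.Lake(
        display_name="Sales",
        description="Data domain for the Sales business unit",
    ),
)
lake = operation.result()
print(f"Created lake: {lake.name}")
```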
- Zone: A sub-domain within a lake, useful for categorising data by stage (e.g. landing, raw, curated_data_analytics, curated_data_science), usage (e.g. data contract), or restrictions (e.g. security controls, user access levels). Zones are of two types: raw and curated.
We can create multiple zones within each lake, based either on domain or on data readiness. The example below shows zones organised by domain (e.g. fooddata).
- Raw zone: Data that is in its raw format and not subject to strict type-checking.
- Curated zone: Data that is cleaned, formatted, and ready for analytics. The data is columnar and Hive-partitioned, stored in Parquet, Avro, or ORC files, or in BigQuery tables. It undergoes type-checking; CSV files, for example, are prohibited because they do not perform as well for SQL access.
The example below illustrates zones for different stages of data readiness.
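As a hedged sketch of that pattern in code, reusing the hypothetical project, region, and lake from the previous snippet, a raw landing zone and a curated analytics zone could be created as follows:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
lake_name = "projects/my-gcp-project/locations/europe-west2/lakes/sales"

# Each zone needs a type (RAW or CURATED) and a resource location type.
for zone_id, zone_type in [
    ("landing", dataplex_v1.Zone.Type.RAW),
    ("curated-data-analytics", dataplex_v1.Zone.Type.CURATED),
]:
    operation = client.create_zone(
        parent=lake_name,
        zone_id=zone_id,
        zone=dataplex_v1.Zone(
            type_=zone_type,
            resource_spec=dataplex_v1.Zone.ResourceSpec(
                location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION,
            ),
        ),
    )
    print(f"Created zone: {operation.result().name}")
```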
- Asset: An asset maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone.
Within each zone, we can add multiple assets that link to specific data sources. In the example below, we have added a BigQuery dataset as an asset.
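A minimal sketch of that step, assuming the same hypothetical client and zone as above; the dataset and project names are placeholders:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
zone_name = (
    "projects/my-gcp-project/locations/europe-west2"
    "/lakes/sales/zones/curated-data-analytics"
)

# The referenced BigQuery dataset may live in a different project from
# the lake; the API expects the dataset's full resource name here.
operation = client.create_asset(
    parent=zone_name,
    asset_id="orders-dataset",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            name="projects/another-project/datasets/orders",
            type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
        ),
    ),
)
print(f"Created asset: {operation.result().name}")
```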
- Entity: An entity represents the metadata for structured and semi-structured data (tables) and unstructured data (filesets).
The visualisation below shows an example Dataplex set-up consisting of a single lake within the Sales domain. It contains three zones: raw data, offline sales, and online sales. Each zone contains data assets linked to specific data sources, e.g. Cloud Storage, BigQuery, or other GCP databases. Although those sources are scattered across different storage systems, or even different GCP projects, team members can easily access them via SQL scripts, Jupyter notebooks, or Spark jobs. In addition, automated data quality checks, data pipelines, and tasks are included within this data domain.
The advantages of Dataplex
Dataplex provides a single pane for data management across data silos and allows us to map data to different business domains without any data movement. Here are some additional advantages of the tool:
- Because the data doesn’t need to be moved and duplicated, you can store it in different data sources within GCP in a cost-efficient manner. This includes data being stored within different GCP projects. The data can be logically organised into business-specific domains.
- The tool allows you to enforce centralised, consistent data controls at scale across the data sources within a Dataplex lake. It also provides the ability to manage reader/writer permissions on the domains and the underlying physical storage resources, while enabling standardisation and unification of metadata, security policies, governance, and data classification. You can apply security and governance policies to your entire lake, a specific zone, or an individual asset.
- Dataplex integrates with Google Cloud's operations suite (formerly Stackdriver) to provide observability, including audit logs, data metrics, and logs.
- You are able to use GCP’s built-in AI/ML capabilities to automate data management, quality checks, data discovery, metadata harvesting, data lifecycle management, and lineage.
- Metadata management and cataloguing allow the members of a domain to browse, search, and discover relevant data sources. The metadata is made available via integration with GCP's Data Catalog. The illustration below shows the search tab within Dataplex; a programmatic search sketch follows this list.
- Integration with open-source tools such as Apache Spark, HiveQL, or Presto.
- Integration with GCP tools such as Cloud Storage, BigQuery, Dataproc, Dataflow, Data Fusion, Data Catalog, or GCP's Notebooks. The sketches below show how we can search the catalogued metadata and query data within a Dataplex lake using BigQuery, loading it into a notebook.
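To make the last two points concrete, here are two hedged sketches. First, because Dataplex metadata is surfaced through Data Catalog, domain members can search for data sources programmatically; the project ID and search term are hypothetical placeholders:

```python
from google.cloud import datacatalog_v1

catalog = datacatalog_v1.DataCatalogClient()

# Search the catalogued metadata for entries matching "orders",
# scoped to a hypothetical project.
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-gcp-project"]
)
for result in catalog.search_catalog(scope=scope, query="orders"):
    print(result.relative_resource_name)
```

Second, tables discovered by Dataplex are exposed as ordinary BigQuery tables, so they can be queried and pulled into a notebook in the usual way; the dataset and table names below are placeholders:

```python
from google.cloud import bigquery

bq = bigquery.Client(project="my-gcp-project")

# Query a curated table and load the result into a DataFrame,
# e.g. from a Jupyter notebook. Requires pandas and db-dtypes.
query = """
    SELECT order_id, order_total
    FROM `my-gcp-project.sales.orders_curated`
    LIMIT 10
"""
df = bq.query(query).to_dataframe()
print(df.head())
```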
The disadvantages of Dataplex
Whilst there are many advantages associated with Dataplex, it also carries some disadvantages:
- GCP-only solution: Dataplex does not integrate with other large cloud vendors such as AWS or Azure in the current release. This means that we cannot include data sources outside of GCP in our data mesh, which is an inconvenience for large organisations taking a multi-cloud approach.
- Lack of maturity: Dataplex currently supports only Google Cloud Storage and BigQuery, although this list is expected to grow as the tool matures. Furthermore, assets cannot yet be added from all GCP regions.
- No on-prem integration: Many large organisations pursue a hybrid cloud model, meaning that some of their data is stored on-prem. Dataplex currently cannot bring those data sources into the data mesh; it is limited to data stored in GCP.
Conclusion
Although Dataplex doesn't support non-GCP cloud providers, on-prem data sources, or every GCP data source, it is still very valuable for organisations using the GCP tech stack. Its main benefits of centralised governance, integration with GCP tools, and built-in AI/ML capabilities make it a great tool for enabling a data mesh architecture.
Helpful resources
You can learn more about Dataplex via the following links:
Interested in joining us?
Credera is currently hiring! View our open positions and apply here.
Got a question?
Please get in touch to speak to a member of our team.