Core Concept Deep Dive: Governing Constellations

James Anderson
Data Cloud Architecture
9 min read · Jul 16, 2024

This post is part 3 of our 4-part series of deep dives into each of the Data Cloud Architecture core concepts. To read the first and second parts in the series, click here or here.

As the AI revolution has hit the technology industry, Data Governance has had a renaissance in how it’s viewed by IT organizations. As a former consultant, I can tell you that even as recently as five years ago, Data Governance was the third rail of data consulting, never something any client wanted to pay for or even invest in internally. Why invest in Data Governance when you can just lock the data down and only surface relevant insights to your downstream data consumers? A central IT function loves having that level of control over the data, and data discoverability is a dirty word to a traditional IT function. But as AI has brought data discovery to the forefront of everyday business functions, IT has had to rethink how it governs its data assets and give the business better discoverability of, and access to, the data required to innovate. IT can no longer be the gatekeeper to all data assets; it must pivot or be accused of stifling innovation within the organization. So how can an organization spin up a governance framework that is flexible enough to support the ever-changing needs of the business while still meeting privacy and legal requirements? By focusing on methods of collaboration and providing a single front door to all data discovery.

The Data Cloud Architecture was designed to support a flexible governance and deployment framework, allowing an organization to focus on developing and deploying high-value assets on top of a collaborative, scalable infrastructure for each defined business entity. This lets an organization maximize the value of the assets under its control, divide development efforts, and let everyone profit from a shared working model, with IT focused on optimizing the overall infrastructure. To govern this Data Cloud, the framework defines the term “constellation” as a group of business entities that all collaborate under the governance definition of a single “trust relationship.” Let’s take a deeper dive into how we think about constellations and trust relationships, and how they impact technical decisions in your Data Cloud.

Defining Your Constellation

When building out your Data Cloud and thinking through your governance structure, the first thing to consider is which business entities are covered under a single trust relationship. Many companies will end up with at least two constellations: one for internal collaboration and one for external collaboration. But there is theoretically no limit to the number of constellations an organization can have or participate in, though the more constellations an organization participates in, the less streamlined its collaboration will be. One of the key differences between the Data Cloud Architecture and a Data Mesh is that rather than maintaining hundreds of data contracts for collaboration on specific data products between individual nodes in the mesh, a Data Cloud only requires that you follow the rules of the trust relationship to collaborate on an asset. So defining as broad a constellation as possible maximizes the free flow of data and applications among business entities.
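
To make this concrete, here is a minimal sketch of how a constellation and its single trust relationship might be modeled. This is purely illustrative; the Data Cloud Architecture doesn’t prescribe an implementation, and every class and field name here is a hypothetical of my own.

# Illustrative sketch only; all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class BusinessEntity:
    name: str
    platform: str  # e.g. "snowflake", "aws", or "azure"
    region: str

@dataclass
class TrustRelationship:
    # One set of rules governs every asset shared inside the constellation,
    # instead of a separate contract per producer/consumer pair.
    catalog: str  # the constellation's single shared catalog
    physical_rules: dict = field(default_factory=dict)
    programmatic_rules: dict = field(default_factory=dict)

@dataclass
class Constellation:
    name: str
    members: list[BusinessEntity]
    trust: TrustRelationship  # exactly one per constellation

internal = Constellation(
    name="internal",
    members=[
        BusinessEntity("central-it", "snowflake", "us-east-2"),
        BusinessEntity("manufacturing", "aws", "us-east-2"),
    ],
    trust=TrustRelationship(catalog="shared-data-dictionary"),
)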

The governance model must extend to every entity within the constellation, so everyone has to be on the same page about how best to secure every asset within the constellation; otherwise there is no implicit trust. One example of where this trust could break down is the catalog. Every constellation should have a single data dictionary or catalog to which all assets are published. If there is no trust, some business entities may refuse to publish their assets for fear that the information won’t be handled in a way everyone is comfortable with. This type of behavior is counterproductive to the main premise of the Data Cloud Architecture, which is to increase the flow of data and applications and maximize the ROI of each asset. So, how do you make sure there’s trust across a constellation? By defining a robust and scalable trust relationship for your constellation.
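
As a small illustration of that “single front door” idea, publishing could be as simple as registering each asset’s metadata in the constellation’s one shared catalog. Again, this is a hypothetical sketch of my own, not tooling the framework mandates:

def publish_asset(catalog: dict, owner: str, asset_name: str, metadata: dict) -> None:
    """Register an asset in the constellation's single data dictionary."""
    key = f"{owner}.{asset_name}"
    if key in catalog:
        raise ValueError(f"{key} is already published")
    catalog[key] = {"owner": owner, **metadata}

# Every entity publishes to the same catalog, so every asset is discoverable.
shared_catalog: dict = {}
publish_asset(shared_catalog, "central-it", "customer_master",
              {"description": "Golden customer records", "pii": "minimal"})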

Defining the Trust Relationship

There are a few different areas to consider when defining a trust relationship for your constellation. First and foremost, it cannot be stressed enough how important control of infrastructure is. For every asset, there will ultimately be a decision about how best to deploy it into a consuming Business Entity’s environment. So an understanding of how each Business Entity runs its own infrastructure will help the broader constellation define a trust relationship that supports the level of technology-agnosticism established across the organization. This can be even more challenging for external constellations, since any number of customers you might target will have made different infrastructure choices. Designing asset deployment and access patterns that are open and accessible to a wide variety of infrastructure patterns is therefore critical when defining your trust relationship. Thankfully, there are really only two core deployment patterns to design around: programmatic access and physical instantiation. Let’s walk through an example of how these two patterns work with a physical asset.

Let’s say an organization has built a Customer Master through its MDM team (part of its central IT Business Entity). Six other Business Entities in the internal constellation all need access to this Customer Master asset in their own infrastructure, to combine with transactional, product, or manufacturing data and create new internal assets for their core businesses. As the owning Business Entity, it’s on the central IT Business Entity to communicate to the consuming Business Entities how it will distribute the asset. For programmatic access, it would make the asset’s metadata available to the consuming Business Entities, along with either credentials or an API for them to query the data themselves. For physical instantiation, it would work with the consuming entities to either tell them where it will drop the physical data for pickup, or ask each of them for a landing zone to replicate into.

It’s on the owning Business Entity to make this decision based on how best to maximize the ROI of the asset it’s making available. If the asset’s value is directly tied to how real-time the data needs to be, it might make sense to allow only programmatic access to the asset and work with the consuming entities to make sure the access patterns work for them. If the asset’s value is more tied to a consuming Business Entity’s ability to manipulate the data for its own purposes, physical instantiation might make more sense, and the owning Entity can have the same conversation about what works best for the consuming Entity.
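
Here’s a minimal sketch of that decision logic, with hypothetical criteria distilled from the reasoning above (a real trust relationship would weigh far more factors):

from enum import Enum

class Distribution(Enum):
    PROGRAMMATIC = "programmatic access"
    PHYSICAL = "physical instantiation"

def choose_distribution(needs_real_time: bool, consumers_reshape_data: bool) -> Distribution:
    # Freshness-driven value: keep one live copy and grant query access.
    if needs_real_time:
        return Distribution.PROGRAMMATIC
    # Manipulation-driven value: give consumers their own physical copy.
    if consumers_reshape_data:
        return Distribution.PHYSICAL
    return Distribution.PROGRAMMATIC  # default: fewer copies to govern

# The Customer Master: consumers combine it with their own data.
print(choose_distribution(needs_real_time=False, consumers_reshape_data=True))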

Example Trust Relationship

Keeping the above concepts in mind, let’s create an example of a trust relationship. Say we have a company that wants to create an internal constellation with six different business entities: three are on Snowflake in AWS, two are on AWS-native services in us-east-2, and one is on Azure-native services in East US. Each entity is capable of generating its own logical and physical assets, and the assets being generated contain minimal PII. So, for physical assets that require physical instantiation, the trust relationship should be defined as below:

IF Owning Business Entity (OBE) = Snowflake THEN
    IF Consuming Business Entity (CBE) = Snowflake THEN Replication to CBE Snowflake account
    ELSEIF CBE = AWS THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = Azure THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog AND Azure connectivity for loading
ELSEIF OBE = AWS THEN
    IF CBE = Snowflake THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = AWS THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = Azure THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog AND Azure connectivity for loading
ELSEIF OBE = Azure THEN
    IF CBE = Snowflake THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog
    ELSEIF CBE = Azure THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog
    ELSEIF CBE = AWS THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog AND AWS connectivity for loading
END
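
If you wanted to encode that decision table directly, it collapses nicely into a lookup keyed on the owning and consuming platforms. A minimal sketch, assuming each entity declares its platform as a simple string (all names here are hypothetical):

# The physical-instantiation rules above as a lookup table keyed on
# (owning platform, consuming platform). Illustrative only.
ICEBERG_S3 = "Iceberg Table on AWS S3, Metadata in Polaris Catalog"
ICEBERG_ADLS = "Iceberg Table on ADLS Gen2, Metadata in Polaris Catalog"

PHYSICAL_RULES = {
    ("snowflake", "snowflake"): "Replication to CBE Snowflake account",
    ("snowflake", "aws"):       ICEBERG_S3,
    ("snowflake", "azure"):     ICEBERG_S3 + " + Azure connectivity for loading",
    ("aws", "snowflake"):       ICEBERG_S3,
    ("aws", "aws"):             ICEBERG_S3,
    ("aws", "azure"):           ICEBERG_S3 + " + Azure connectivity for loading",
    ("azure", "snowflake"):     ICEBERG_ADLS,
    ("azure", "azure"):         ICEBERG_ADLS,
    ("azure", "aws"):           ICEBERG_ADLS + " + AWS connectivity for loading",
}

def physical_pattern(obe_platform: str, cbe_platform: str) -> str:
    return PHYSICAL_RULES[(obe_platform.lower(), cbe_platform.lower())]

print(physical_pattern("snowflake", "azure"))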

As you can see, by creating Iceberg Tables in a central location, the physical data can easily be copied into the consuming Business Entity’s environment regardless of the technology it runs on. This puts the responsibility on the consuming entity to manage its own ingestion, rather than on the owning entity to build complex ETL pipelines that push data into bespoke technology environments, which simplifies the publishing of new assets and maintains interoperability with many different types of solutions. For physical assets that require programmatic access to the underlying data, the trust relationship should be defined as below:

IF OBE = Snowflake THEN
    IF CBE = Snowflake AND OBE Region = CBE Region THEN Private Listing
    ELSEIF CBE = Snowflake AND OBE Region ≠ CBE Region THEN Private Listing with Listing Auto-Fulfillment
    ELSEIF CBE = AWS THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = Azure THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog AND Azure connectivity to AWS for querying
ELSEIF OBE = AWS THEN
    IF CBE = Snowflake THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = AWS THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog
    ELSEIF CBE = Azure THEN Iceberg Table published to AWS S3 with Metadata in Polaris Catalog AND Azure connectivity to AWS for querying
ELSEIF OBE = Azure THEN
    IF CBE = Snowflake THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog
    ELSEIF CBE = AWS THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog AND AWS connectivity to Azure for querying
    ELSEIF CBE = Azure THEN Iceberg Table published to ADLS Gen2 with Metadata in Polaris Catalog
END

In this example, you can start to see how Iceberg Tables provide the ultimate flexibility and interoperability for a constellation when collaborating on physical assets. Whether you need programmatic access to the data or need it physically instantiated in your environment, Iceberg Tables support either option. Snowflake-to-Snowflake collaboration on physical assets is even simpler: replication or a data share, and Snowflake handles the rest of the process for you.
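
The programmatic rules can be encoded the same way; the only wrinkle is the Snowflake-to-Snowflake branch, which also depends on whether the two accounts share a region. Another hypothetical sketch:

def programmatic_pattern(obe: str, cbe: str, obe_region: str, cbe_region: str) -> str:
    obe, cbe = obe.lower(), cbe.lower()
    if obe == "snowflake" and cbe == "snowflake":
        # Same region: a direct Private Listing; across regions, let
        # Listing Auto-Fulfillment handle delivery.
        if obe_region == cbe_region:
            return "Private Listing"
        return "Private Listing with Listing Auto-Fulfillment"
    # Everyone else queries an Iceberg Table in the owner's storage.
    base = ("Iceberg Table on ADLS Gen2" if obe == "azure"
            else "Iceberg Table on AWS S3") + ", Metadata in Polaris Catalog"
    if obe != "azure" and cbe == "azure":
        return base + " + Azure connectivity to AWS for querying"
    if obe == "azure" and cbe == "aws":
        return base + " + AWS connectivity to Azure for querying"
    return base

print(programmatic_pattern("snowflake", "snowflake", "us-east-2", "eastus"))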

For logical assets, the same basic principles apply: do you need programmatic access to the code, or do you need to physically instantiate it? This really boils down to whether you provide some level of API access to the logical asset, or containerize the asset and distribute it to the other entities in your constellation to run on their own infrastructure.
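
A final hypothetical sketch of those two logical-asset patterns, with made-up image and endpoint values standing in for whatever your entities actually publish:

from dataclasses import dataclass

@dataclass
class LogicalAsset:
    name: str
    image: str     # container image for physical instantiation (hypothetical)
    endpoint: str  # hosted API for programmatic access (hypothetical)

def access_instruction(asset: LogicalAsset, pattern: str) -> str:
    if pattern == "programmatic access":
        # The owning entity hosts and runs the code; consumers just call it.
        return f"Call {asset.endpoint}; the owning entity runs the code."
    # Physical instantiation: consumers run the packaged code themselves.
    return f"Pull {asset.image} and run it on your own infrastructure."

scoring = LogicalAsset("churn-scoring", "registry.internal/churn:1.4",
                       "https://assets.internal/churn/score")
print(access_instruction(scoring, "programmatic access"))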

Trust and Governance

For many organizations, the biggest barrier to maximizing the value of their data assets is trust. Does the business trust the data being served up to it? Do analysts across organizational boundaries trust the work being done by other analysts? Are the assets being delivered actually adding value to the organization? The Data Cloud Architecture (DCA) defined the governance structure of this framework around the concept of a “trust relationship” on purpose. The goal of the DCA as a whole is to drive better collaboration on the assets that bring the most value to an organization, and the basis of that is trust in the data and in each other. If there is no trust, the value of an asset cannot be fully realized. By defining and governing your data estate through trust, an organization can truly break down the silos that restrict the free flow of data and applications across an ecosystem. By adopting the DCA framework as its core data strategy, an organization can drive better trust, better governance, and ultimately better value out of its data assets.

James Anderson

Sales Engineering Leader @ Snowflake. All opinions expressed are my own.