Cloud-Native Data Governance by Design

Willem Koenders
15 min read · Oct 23, 2022


Retracing Steps: Historic Pain Points and High Costs

Data governance programs generally come with a connotation of regulatory pressure, high costs, and an unclear return on investment. Programs to identify critical data, manage metadata, control data quality, and evidence data provenance have typically been long and expensive. The related costs easily exceed $10M per year at leading US banks, at times topping $100M. Execution is painful and slow, as thousands of data elements must be manually identified across hundreds of systems and applications, many of them created in preceding decades.

Perhaps the most elusive of all is data lineage. Some vendors have managed to create tooling that can scan systems and harvest metadata, but it typically fails to connect to the majority of the existing systems landscape. As data flows have typically not been structurally and consistently documented across the enterprise, they must be compiled through tribal knowledge and manual mapping efforts. The situation becomes even graver where this tribal knowledge has left the enterprise, as business owners and contractors alike have moved on over the years.

Moreover, even where business and technical metadata have been documented as part of concerted remediation programs, most of it rapidly becomes outdated because metadata capture and documentation are not automated. Keeping it up to date requires sustained manual effort.

Finally, organizations struggle to democratize this metadata to the rest of the enterprise for business purposes other than data governance itself. Many larger organizations have a Data Strategy that articulates, in some form, that the data management foundations should also power business purposes such as data science — but very few succeed in realizing this in practice.

The advent of cloud-based technology brings promises of scalability, elasticity, lower costs, rapid deployments, and enhanced compatibility of data technologies. Over the last few years of working with a variety of organizations on their cloud migration and data modernization programs, patterns have emerged for how data can be managed smartly from the moment of design.

For example, interoperability standards can be defined for APIs so that their future logging and governance in data lineage visualizations are automated. Data quality controls can be created based on a consistent data model and embedded directly into the new infrastructure. The knowledge that is present during transformations is not lost or degraded, as critical information about data elements and their provenance is painlessly logged in a data catalogue.

The remainder of this article will further elaborate on the concept of Data Governance by Design and explain how it can be achieved through enabling capabilities such as Data Assets, Data Management Hubs, and an API-driven Architecture through a data layer called the Data Fabric.

The Framework

In a services-based architecture, microservices connect the organization’s business processes. In such an architecture, four foundational components can together enable Data Management and Data Governance by Design. As an illustration, consider an organization with five departments (Risk & Compliance, Finance, Marketing, Customer Mgmt., and Product Development). Each department creates and maintains several data products, with a subset of them classified as “data assets” because they are also used by consumers from other departments. Between the different departments, various APIs exchange data and information. All of this is orchestrated against the fabric of data foundations (the “data fabric”), with the Data Management Hub scanning and providing data mgmt. services. Let’s take a quick look at each of these components.

Data Assets — Each domain or department typically produces data or information that is consumed by other domains. For example, a Customer Management domain may gather data on customers through onboarding and customer relationship management processes to produce and maintain a central database with customer information. This database could be used by the Marketing team to execute sales campaigns and by the Risk department to confirm compliance with data privacy legislation. Such data products have been given different labels, including Data Products, Data Assets, Trusted Sources, Authoritative Data Sources, and Systems of Record.

API-driven Architecture — Different teams are connected through APIs as the preferred (or only) method of data integration. Maintaining an API-first mindset ensures that any critical data can be made available to consumers internal and external to the organization.

Data Management Hub — A set of data management capabilities is provided in a separately provisioned space or environment. These capabilities include Metadata Mgmt., Master and Reference Data Mgmt., and Data Quality. Instead of having to build these capabilities within each area of the organization, they can be deployed from the central Data Mgmt. Hub without big additional investments.

Data Fabric — The threads that provide connectivity across all systems and applications are called the Data Fabric. It is not a magic layer that can be instantiated out of nothing; rather, it consists of a set of enabling capabilities and carefully considered governance protocols that together ensure that information across the enterprise is discoverable, catalogued, classified, labeled, quality-controlled, and accessible through common interoperability standards and channels.

Using Data Assets to Govern and Rationalize the Systems Landscape

What is a Data Asset?

A Data Asset is a set of prepared data (and hence specifically not raw data) that is ready to be consumed by a wider set of consumers. It is governed, labeled, quality-controlled, and accessible. It is discoverable and described (therefore “self-explanatory”), so that self-service is enabled for consumers across the enterprise. Data Assets are typically reused across the enterprise and owned within a given data or business domain.
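To make this concrete, the sketch below shows what a minimal descriptor for such a Data Asset could look like in Python. It is an illustration only: the DataAssetDescriptor class, its field names, and the example values are assumptions, not a prescribed standard.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataAssetDescriptor:
        """Minimal metadata that makes a Data Asset discoverable and self-explanatory."""
        name: str                     # e.g., "customer_master"
        owning_domain: str            # domain accountable for the asset
        description: str              # plain-language summary enabling self-service
        classification: str           # e.g., "confidential", drives role-based access
        contains_pii: bool            # flags privacy-sensitive content
        quality_controlled: bool      # True once data quality controls are in place
        refresh_frequency: str        # e.g., "daily"
        access_endpoint: str          # the standard outlet where consumers read the data
        critical_data_elements: List[str] = field(default_factory=list)

    # Example: a curated customer asset owned by the Customer Mgmt. domain
    customer_master = DataAssetDescriptor(
        name="customer_master",
        owning_domain="Customer Mgmt.",
        description="Curated, deduplicated customer records for enterprise-wide reuse.",
        classification="confidential",
        contains_pii=True,
        quality_controlled=True,
        refresh_frequency="daily",
        access_endpoint="/api/v1/customers",
        critical_data_elements=["customer_id", "legal_name", "country_code"],
    )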

Why are they critical?

Given that Data Assets are used by a large group of consumers, they are a logical place to implement data quality and governance controls. In the governed asset, the content is labeled and data quality is tightly controlled, so that instead of identifying and measuring this data throughout the enterprise, which often results in inconsistent “versions of the truth”, there is a single trusted distribution point for a given dataset. For example, across the top 10 US banks, programs have been mobilized to identify these critical data sources and govern them. Typically, around 20 to 100 Data Assets will enable the organization to control all of its critical data — a much more efficient approach than trying to define data quality across thousands of individual systems.

Data Asset Adoption

Creating and defining Data Assets is not enough. An equally important step is to govern their usage: if consumers are not using them, they are not benefiting from the centrally controlled data quality. For this reason, many organizations have mobilized various versions of Data Asset adoption programs. Typically, these include, on the one hand, a socialization effort to spread awareness of the Data Assets and the benefits of their usage, and, on the other hand, a compliance standard and mechanism that mandates that data can only be consumed from Data Assets and not from anywhere else.

Business Users and Impact

It has been a historical struggle for data organizations to articulate the value they add to the enterprise beyond hard-to-measure (although very credible) claims such as the avoidance of regulatory fines. Data Assets are a game-changer here. A data source can only become a Data Asset if there is downstream critical consumption, so it is recommended to document these consumers and their use cases.

Building a simple overview of the Data Assets and the use cases that depend on them allows for a clear articulation of the impact generated by these assets. Impact assessments can be executed much more efficiently as well, by gathering the downstream requirements for the data and evaluating how the data can be controlled and enhanced within the trusted distribution point. In one marketplace example, a leading insurer was able to measure relatively precisely how an enhanced set of customer data enabled it to execute sales campaigns more easily and increase their effectiveness.

Without the identification and active management of Data Assets, the result is usually an incomprehensible “spaghetti” of data flows, with instances of data duplication and inconsistency. Using Data Assets strategically, groups of use cases can be identified that consume from unique, reusable, curated data sources.

Data Asset Governance

A governance model is required to embed Data Assets into the Data Fabric and ensure it is not flooded with “bad” data. The “Data Mesh” has emerged as the principal approach for embedding this governance into the organization.

Data Mesh

A relatively new term is the “Data Mesh”, an approach that enables business domains to manage their critical data at the point where (or very close to where) this data is captured and maintained, backed by a central self-service data infrastructure. This contrasts with past efforts, where organizations tried to centralize their critical data in data lakes and data warehouses. Such centralization efforts were typically troubled by undue expectations on central data teams, which did not have the business-specific context to understand the data and as a result could not keep up with the pace required by their consumers. “Dirty” data lakes were a common symptom.

Governance Topologies

Some business or functional domains might be ready to manage their critical Data Assets straight away, but others might not. In the banking industry, for example, domains that are typically relatively mature include Risk and Finance, as they have years of experience complying with regulatory data governance guidelines.

Allowing domain owners to produce Data Assets for consumption by others has several enabling requirements. First, the domains must have a minimum level of skills and experience in data management and engineering. Second, domain owners must have the required bandwidth within or outside the team, or the budget to delegate part of the responsibilities. Maintaining Data Assets is typically not a full-time job, but it does imply a significant responsibility.

Based on such considerations, the enterprise can opt for a particular Governance Topology, where each topology comes with its own set of pros and cons. Organizations can choose one topology and grow into another over time.

Governance Recommendations

The following recommendations can standardize and de-risk the implementation of any of the topologies:

  • Embed the concept of Data Assets into the enterprise change methodology
  • Formulate and adhere to a set of design principles that include governance of the Data Assets
  • Insist on a design principle that Data Assets must be derived directly from confirmed, authoritative sources
  • Define an authoritative Enterprise Data Model with clearly outlined domains
  • Maintain a central catalogue of Data Assets
  • Insist on a minimum set of required metadata (possibly made available centrally), including classification and other security-related metadata, to enable role-based access (a minimal validation sketch follows this list)
  • Define and enforce discoverability and interoperability standards
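As a minimal validation sketch of these recommendations, the snippet below checks a candidate catalogue entry for the minimum required metadata and for confirmation of its authoritative source before it is admitted to the central catalogue. The field names, classification scheme, and validation rules are illustrative assumptions.

    # Hypothetical minimum metadata required before a Data Asset enters the central catalogue
    REQUIRED_METADATA = {"name", "owning_domain", "description", "classification", "authoritative_source"}

    VALID_CLASSIFICATIONS = {"public", "internal", "confidential", "restricted"}

    def validate_asset_registration(entry: dict) -> list:
        """Return a list of governance violations; an empty list means the entry can be catalogued."""
        violations = [f"missing metadata field: {f}" for f in sorted(REQUIRED_METADATA) if not entry.get(f)]
        # Design principle: Data Assets must be derived directly from confirmed, authoritative sources
        source = entry.get("authoritative_source")
        if source and not source.get("confirmed", False):
            violations.append("source has not been confirmed as authoritative")
        # Security-related metadata enables role-based access downstream
        if entry.get("classification") not in VALID_CLASSIFICATIONS:
            violations.append("classification must follow the enterprise classification scheme")
        return violations

    candidate = {
        "name": "customer_master",
        "owning_domain": "Customer Mgmt.",
        "description": "Curated customer records for enterprise-wide reuse.",
        "classification": "confidential",
        "authoritative_source": {"system": "crm_core", "confirmed": True},
    }
    print(validate_asset_registration(candidate))  # [] -> eligible for the central catalogue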

API-Driven Architecture and Interoperability Standards

Embedding an API-first mentality alongside clearly defined interoperability standards is key to ensuring governance and control of future data flows and to driving automated data lineage capture, avoiding massive manual mapping efforts in the future.

API-driven Architecture

In an API-first infrastructure, teams are connected through APIs as the preferred (or only) method of data integration, so that any critical data can be made available to consumers both internal and external to the organization. If done correctly, this should also drive compliance with global standards such as the Open Banking Standards, opening opportunities to collaborate with strategic partners.

Interoperability Standards

Interoperability standards consist of a set of rules and protocols that drive the interaction and data exchange between different systems and applications. If we use the analogy of electricity, you can buy any sort of appliance, from refrigerators to Christmas lights or phone chargers, and typically expect that you can hook it up to an outlet within your house (within a given geographic region). It is similar for data — you want to make sure that your data (the electricity) is offered in agreed qualities and quantities through standard outlets that are available to anyone who has access to the different rooms in the house. For your enterprise, you want to agree on the types of outlets and the channels through which data is brought to them.

There is no one set of interoperability standards that is the right answer for every organization, but several dimensions or components are critical:

  • Adherence to a data model to ensure consistent use and interpretation of data, at least for a minimum set of critical data (for banks, BIAN can be a good starting point)
  • Standard messaging and payload formatting (a minimal envelope sketch follows this list)
  • Minimum business and technical metadata that is identified, maintained, and provided in standardized formats alongside systems and applications
  • An identified set of compatible technologies
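To illustrate the messaging dimension, the sketch below wraps a business payload in a standard envelope that carries routing, security, and data-model metadata. The envelope fields, the schema reference, and the system names are assumptions; the point is that every producer and consumer agrees on the same envelope.

    import json
    import uuid
    from datetime import datetime, timezone

    def build_message(payload: dict, source_system: str, destination_system: str,
                      classification: str, contains_pii: bool) -> str:
        """Wrap a business payload in a standard envelope so that any compatible
        consumer can interpret routing, security, and lineage metadata the same way."""
        envelope = {
            "header": {
                "message_id": str(uuid.uuid4()),
                "created_at": datetime.now(timezone.utc).isoformat(),
                "source_system": source_system,            # where the data originates
                "destination_system": destination_system,  # the intended consumer
                "schema": "customer.v1",                    # reference to the agreed data model
                "classification": classification,          # drives role-based access
                "contains_pii": contains_pii,
            },
            "payload": payload,
        }
        return json.dumps(envelope)

    message = build_message(
        payload={"customer_id": "C-1029", "country_code": "US"},
        source_system="crm_core",
        destination_system="marketing_campaigns",
        classification="confidential",
        contains_pii=True,
    )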

A Set of Interoperability Tooling

Having a consistent set of interoperability standards should ensure that any sort of (existing or future) compatible technology is able to exchange data with your infrastructure. For adoption purposes, it is recommended to identify at least one, but possibly a few, API technologies that data engineers can use for their respective purposes.

The choice of technology depends on the organization, the targeted business outcomes, and the already available tech stack and corresponding expertise. One regional retail organization decided to adopt MuleSoft as its go-to API platform for the build-out of its digital organization, whereas a leading US manufacturer opted to create its own, in-house built API capability.

Industrializing Data Lineage Through APIs

“Data Lineage by Design” through APIs

A huge opportunity exists for organizations to ensure that data management is incorporated by design in future infrastructure by adopting an API-first mindset:

  1. API patterns can be defined to accommodate future needs for data and information flows. Example patterns are asynchronous, synchronous, orchestration and data processing, and event-driven patterns.
  2. Across patterns, align on metadata scripts or files that are made available alongside the API. These scripts should be standardized and contain a minimum set of business and technical metadata, such as the source, the destination, the frequency of the feed, included (critical) data elements, and a selection of indicators (e.g., classification, PII indicator). Best practice is that these metadata files are updated (if possible, automatically) each time the API is updated, and maintained in an API catalog.
  3. Ensure that the API metadata files are pushed or pulled into the metadata mgmt. tooling (specifically, the data catalogue) in place, so that lineage diagrams can be created.

The combination of driving APIs as the principal means of data exchange between systems and insisting on minimum metadata standards delivers “data lineage by design.”

Note: Don’t overcomplicate the metadata files. Small sets of critical metadata are preferred over an exhaustive set with unclear business value. Exceptions aside, as a default there is no need to include detailed data element-level sourcing and transformation logic.
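As a minimal sketch of such a metadata file and its journey into the catalogue, the snippet below writes a small JSON file alongside the API definition and then posts it to a catalogue ingestion endpoint. The field names and the catalogue URL are assumptions; the mechanism (a small file, refreshed with every release, harvested into the catalogue) is the point.

    import json
    import requests  # assumes the data catalogue exposes a simple REST ingestion endpoint

    # Minimal API metadata file, kept next to the API definition and refreshed with every release
    api_metadata = {
        "api_name": "customer-profile-v1",
        "source_system": "crm_core",
        "destination_systems": ["marketing_campaigns", "risk_reporting"],
        "frequency": "event-driven",
        "critical_data_elements": ["customer_id", "legal_name", "country_code"],
        "classification": "confidential",
        "contains_pii": True,
    }

    with open("customer-profile-v1.metadata.json", "w") as f:
        json.dump(api_metadata, f, indent=2)

    def publish_to_catalogue(metadata: dict, catalogue_url: str) -> None:
        """Push the metadata file into the data catalogue so that lineage edges
        (source -> API -> destinations) can be drawn automatically."""
        response = requests.post(catalogue_url, json=metadata, timeout=10)
        response.raise_for_status()

    # Hypothetical catalogue endpoint; in practice this is the metadata mgmt. tooling's ingestion API
    publish_to_catalogue(api_metadata, "https://catalogue.example.com/api/lineage")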

Data Management Hub

Centrally provisioned but locally adopted data management capabilities enhance the consistent definition, governance, and protection of data at a lower cost compared to post-implementation, manually executed governance exercises.

The critical importance of the Data Mgmt. Hub

To drive Data Management by design, it is critical that a separately provisioned hub is created and made available that contains the minimally required data capabilities. This hub contains data mgmt. capabilities that should be referenced and embedded as part of any future build-out of cloud-native business or functional processes. Instead of having each transformation program or business/functional area reinvent the wheel in terms of how to ensure that master data is properly used, metadata is managed, and data quality is monitored, they should be able to self-serve these capabilities from the Data Management Hub.

So why is it “by design”?

The vast majority of traditional data management investments have historically been spent “after the fact”, that is, post-implementation. Business processes are discovered, data elements are identified, business requirements are inferred, and data quality is measured based on existing infrastructure implementations, requiring an enormous manual effort and sustained discipline.

In the approach outlined here, these data management considerations are embedded before and during the design and implementation phases. Moreover, the governance steps that would otherwise be executed manually after the fact are integrated as functional, non-functional, and technical requirements as part of the design. As the solution is implemented, data management is therefore built in “by design”.

Example “by Design” Capabilities

Data Catalogue — As outlined above for APIs but applied more generally, the discovery, documentation, and visualization of the systems landscape in terms of the applications and the data flows between them can be automated.

Data Quality — Controls that monitor and ensure data quality and integrity can be embedded in two principal ways. First, specific controls and restrictions can be applied at data creation, capture, and transport. Examples include restrictions on the accepted or valid values, and reconciliation checks in data flows. Second, in strategic locations such as Data Assets, quality controls can be implemented on data at rest to measure completeness, accuracy, and timeliness.
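As a minimal sketch of the at-rest variety, the checks below score a customer Data Asset on completeness, accuracy (against reference values), and timeliness. The column names, reference set, and threshold are illustrative assumptions.

    import pandas as pd

    # Illustrative extract of a customer Data Asset
    customers = pd.DataFrame({
        "customer_id": ["C-1", "C-2", "C-3", None],
        "country_code": ["US", "USA", "NL", "DE"],
        "updated_at": pd.to_datetime(["2022-10-20", "2022-10-21", "2022-09-01", "2022-10-22"]),
    })

    VALID_COUNTRY_CODES = {"US", "NL", "DE", "GB"}  # reference data, ideally served by the Data Mgmt. Hub

    results = {
        "completeness": customers["customer_id"].notna().mean(),                       # records with an identifier
        "accuracy": customers["country_code"].isin(VALID_COUNTRY_CODES).mean(),        # valid reference values
        "timeliness": (customers["updated_at"] >= pd.Timestamp("2022-10-01")).mean(),  # refreshed this month
    }

    # Flag any dimension that falls below an assumed 95% threshold
    breaches = {dimension: score for dimension, score in results.items() if score < 0.95}
    print(results)
    print(breaches)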

Master and Reference Data — As very specific examples of Data Assets, master and reference data are powerful levers to drive the consistent use of data that is used repeatedly at the transactional level. A common starting point is Customer MDM, to ensure that throughout the enterprise, the right customer data elements are used across processes such as onboarding, transactions, customer contact, marketing, and relationship management. Similarly, providing easily accessible reference data, such as postal address standards, will drive its adoption.
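A tiny sketch of the reference data idea: if a central mapping of country values is easily accessible, every process can resolve free-text entries to the same standard code. The mapping and function below are illustrative assumptions.

    # Centrally maintained reference mapping (illustrative values only)
    COUNTRY_CODE_REFERENCE = {
        "USA": "US", "UNITED STATES": "US",
        "NETHERLANDS": "NL", "HOLLAND": "NL",
        "GERMANY": "DE",
    }

    def standardize_country(raw_value: str) -> str:
        """Resolve a free-text country entry to the enterprise-standard code."""
        cleaned = raw_value.strip().upper()
        return COUNTRY_CODE_REFERENCE.get(cleaned, cleaned)

    assert standardize_country(" Holland ") == "NL"
    assert standardize_country("US") == "US"  # already-standard values pass through unchanged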

For a technology-specific application of this concept, see Microsoft’s Data Mgmt. Landing Zone, as part of their Cloud Adoption Framework.

The Data Fabric as the Cloud-Native Glue Between Systems

As introduced above, the threads that provide connectivity across all systems and applications are called the Data Fabric. It is not a magic layer that can be instantiated out of nothing; rather, it consists of a set of enabling capabilities and carefully considered governance protocols that together ensure that information across the enterprise is discoverable, catalogued, classified, labeled, quality-controlled, and accessible through common interoperability standards and channels.

To a large extent the fabric is enabled by the previously described components of Data Assets, APIs, and Data Management Hub. If used correctly and to their fullest extent, these should form the dominant fibers of the data fabric.

But there are several complementary, if not minimally required, data capabilities:

Data Pipelines, Ingestion, Preparation, Transport, Provisioning, and Storage — Where APIs cannot do the job, alternative or complementary data delivery and integration options can ensure that data is gathered, ingested, transformed, curated, and made available depending on the business or functional requirements. Storage needs to be provisioned to persist data.

Data Orchestration — Depending on the targeted use cases and business processes, additional data orchestration can be applied to take data from various sources, combine and integrate it, and expose it to data analysis tools. Data orchestration can be executed at the IaaS or PaaS level, or using technology that abstracts away the infrastructure-level activities, such as Apache Airflow, Prefect, and Snowflake.
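For example, a minimal orchestration sketch with Apache Airflow (one of the tools mentioned above) could refresh a Data Asset and then update its catalogue metadata. The DAG, task names, and placeholder functions below are assumptions, not a prescribed pipeline.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest_customer_asset():
        """Pull the latest customer data from its source via the standard API (placeholder)."""
        ...

    def publish_metadata():
        """Push refreshed technical metadata to the data catalogue (placeholder)."""
        ...

    with DAG(
        dag_id="customer_asset_refresh",
        start_date=datetime(2022, 10, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest_customer_asset", python_callable=ingest_customer_asset)
        catalogue = PythonOperator(task_id="publish_metadata", python_callable=publish_metadata)
        ingest >> catalogue  # combine and integrate first, then refresh downstream metadata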

Data Security and Protection — The process of monitoring and ensuring that sensitive data is not lost, misused, or accessed by unauthorized users, together with the enabling capabilities to proactively secure data assets. Policies and standards should dictate how data is to be protected and how it can be shared. Identity and Access Management (IAM) can facilitate role-based access, and a variety of network and authentication protection measures can protect the data from unauthorized access and manipulation.

Reporting, Analytics, and Data Science — One or several Data Platforms can be created to cater to reporting or analysis use cases. With Data Assets, APIs, and Data Mgmt. Hub in place, this becomes a straightforward exercise as the data is available, understood, and easily ingestible, and in a cloud-native environment the respective reporting or data science tooling can be activated on-demand without big upfront investments or contracting considerations.

Success Factors

Let’s close this article with a few thoughts on success factors. Organizations that have shown success have typically focused first on a few selected domains, engaged business stakeholders from the start, ensured that benefits in terms of compliance with regulations and policies are considered together with business outcomes, and insisted on a cloud-first design principle.

  1. Start Small — Start with 1 or 2 Data Assets in a domain that is already relatively well-organized, with Client and Product data typically being strong candidates. It is easier to manage success on a smaller scale and use the gathered momentum to drive implementation of lessons learned in other domains.
  2. Business Engagement — Include Business representatives from the start. Value creation depends on their adoption and consumption, which is why it is critical to ensure that relevant requirements are incorporated into the Data Assets and Data Fabric, in terms of what data is needed and how it can be accessed.
  3. Benefits Consolidation — Organizations with strong success stories were typically able to combine historic data governance responsibilities with more forward-looking data science-related use cases, clearly articulating how strong data foundations will serve stakeholders across the enterprise. The return on investment is more convincing if data mgmt. by design drives regulatory compliance as well as business-oriented, insight-driven use cases.
  4. Cloud-First — Insisting on a cloud-native design prevents vendor lock-in and allows for de-risked, no-regret experimentation, avoiding high upfront investments, with the ability to “fail quickly” and to scale up/out in case of success. Additionally, a cloud-first design principle will aid the discovery, nurturing, and recruitment of talent through proven training and certification.


