Rise of the Policy Catalog

May 30, 2017

by Seth Dobrin, Mandy Chessell, and Richard Hogg

As the volume and variety of data continue to explode, CDOs across industries are increasingly focused on data governance to enable the maximum value to be derived from the use of trusted data.

The term ‘data governance’ is of course deceptively simple. In reality, it refers to multiple, interlocking, co-dependent categories of data management, organizational collaboration, and policy — from data ownership to data supply chains, and from data protection and legal holds through to compliance with regulations — all of which must converge to enable business analytics and machine learning teams.

More and more, the survival of the organization hinges on whether CDOs and their teams can get those various categories of data use, data management, and data policy to align. Failing to do so could mean hobbling teams with untrusted data, falling foul of regulators, and needlessly replicating data across the organization through uninformed strategies. But alignment is only the first step. For data that is accurate, current, controlled, accessible, and truly capable of driving business outcomes, CDOs have to aim for automation.

Why automation is essential

Regardless of industry, data scientists, engineers and analysts are demanding self-service access to data. They need to locate the right data and then have fast and direct access to this data to spur initiatives and insights that will move their organizations forward.

Achieving self-service access to data for experts is already a challenge, but as we continue to commoditize artificial intelligence and wrap it in well-designed, cognitive tools, we’ve seen the same demand for self-service data permeate other roles in the organization. We can expect it to extend eventually to consumers as well.

Growing data and growing demand could crush any governance infrastructure that isn’t automated, accessible, and engineered to scale. This failure in the governance program could expose the organization to misuse of information, loss of data, and prosecution for data privacy breaches.

At the same time, the move toward automation has to include a careful, upfront data classification effort that links to entitlements by role. There's no room for error when it comes to safeguarding sensitive health, education, and financial data — and when it comes to proving compliance with regulations, including any preservation duties from legal e-discovery, audit, or regulatory investigations. That's hard enough for mission-critical data, but any classification effort must also account for third-party data, legacy data, data from mergers and acquisitions, and new influxes of web, log, and sensor information.
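As a concrete illustration, here is a minimal sketch of how classifications might link to entitlements by role. The labels, roles, and mapping below are hypothetical; a real scheme would come out of the governance program itself.

```python
# Illustrative sensitivity classes mapped to the roles entitled to them.
# These labels and roles are hypothetical, not a prescribed scheme.
ENTITLEMENTS = {
    "public":       {"analyst", "engineer", "data_scientist", "auditor"},
    "internal":     {"analyst", "engineer", "data_scientist", "auditor"},
    "confidential": {"data_scientist", "auditor"},
    "restricted":   {"auditor"},  # e.g., health or financial records under a legal hold
}

def can_access(role: str, classification: str) -> bool:
    """True if the role is entitled to data carrying this classification."""
    return role in ENTITLEMENTS.get(classification, set())

assert can_access("auditor", "restricted")
assert not can_access("analyst", "confidential")
```

The point of keeping the mapping this explicit is that it can be enforced automatically at access time, rather than negotiated case by case.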

To address that mesh of requirements, CDOs need a single policy catalog that can unify governance, enforcement, and verification across the entire organization, and offer data access via thoughtfully designed APIs.

The catalog defines the interactions with external governance repositories such as LDAP (Lightweight Directory Access Protocol) for security, CMDB (configuration management database) for governance, and services like Git and Jira for the software development life cycle. We also know that open standards and open source code are key to achieving agility, interoperability, scale, and pace. In particular, we're finding that Apache™ Atlas offers a great foundation for governance, since it allows scoped metadata to be created for particular communities of users and systems and then integrated through open interchange formats and APIs. (Stay tuned for a post where I'll dive deeper into Atlas and its advantages. I'll look at how policies translate into principles, standards, regulations, approaches, and rules for implementation.)
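To make that concrete, here is a rough sketch of registering a classification and tagging an entity through the Atlas v2 REST API. The host, credentials, and GUID are placeholders, and the exact payloads will vary by Atlas version; treat this as an orientation, not a reference implementation.

```python
import requests

# Placeholder endpoint and credentials; adjust for your deployment.
ATLAS = "http://atlas.example.com:21000/api/atlas/v2"
AUTH = ("admin", "admin")

# 1. Define a classification (tag) that policies can reference.
typedef = {
    "classificationDefs": [{
        "name": "PII",
        "description": "Personally identifiable information",
        "superTypes": [],
        "attributeDefs": [],
    }]
}
requests.post(f"{ATLAS}/types/typedefs", json=typedef, auth=AUTH).raise_for_status()

# 2. Attach the classification to an existing metadata entity
#    (e.g., a column), identified by its GUID in the repository.
guid = "REPLACE-WITH-ENTITY-GUID"
requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH,
).raise_for_status()
```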

A unified policy catalog provides a defensible, authentic source across all stakeholders regarding the following (see the sketch after this list):

  • Who: Owners and custodians of data and data sources
  • What: The data and data types in the system and what they mean
  • Where: The various sources storing the data
  • What Purpose: The duty and usage of the data — in other words, its business value and any usage decisions
  • Why: Sets of internal policy decisions as well as policy decisions for conforming with external regulations and mandates, including retention, privacy, residency, and so on
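To make those five facets concrete, here is a minimal sketch of what a single catalog record might capture. The field names and the insurance-flavored values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One policy catalog record covering who, what, where, purpose, and why."""
    owner: str                     # Who: accountable data owner
    custodian: str                 # Who: operational custodian
    data_type: str                 # What: the data and what it means
    sources: list                  # Where: systems storing the data
    purpose: str                   # What purpose: approved business usage
    policies: list = field(default_factory=list)  # Why: retention, privacy, residency

entry = CatalogEntry(
    owner="Claims VP",
    custodian="Data Platform Team",
    data_type="Auto insurance claim",
    sources=["claims_db", "data_lake/claims"],
    purpose="Fraud detection and claims analytics",
    policies=["retain 7 years", "EU residency", "mask PII for analysts"],
)
```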

Figure 1 depicts a specific policy catalog example from the insurance industry.

Figure 1. Policy catalog foundation example

With policy elements in place, linked through the classifications to the data source descriptions, the catalog can provide a launchpad for any stakeholder across the business. But that doesn't simply mean that stakeholders can access and use the data correctly. Access and correct use are the foundation, but the implications are broader…

Policy catalog as a launchpad

It's clear by now that designing and implementing an effective policy catalog takes a long, careful effort, albeit one that delivers incremental value at each iteration. So it's worth taking a moment to consider just how transformative it can be for organizations as the benefits begin to multiply. Data is increasingly seen as a shared asset. As data moves laterally through the organization along its data supply chains, responsibilities for managing it are clarified, and this transparency creates greater trust to share and use data.

A well-designed policy catalog serves as the foundation of an ecosystem by connecting to an open-source metadata backbone. Relying on open-source code and architecture means lower costs, improved flexibility, security, and freedom. Add in the machine learning capabilities that are increasingly woven into open-source projects and you begin to have a policy system that’s on the lookout for opportunities to improve its own reliability, efficiency, classification process, and automation.
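As a hypothetical example of that self-improvement, a catalog could train a simple model on steward-approved classifications and use it to suggest labels for newly discovered data. The training set, labels, and library choice below are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: column descriptions labeled by sensitivity.
descriptions = [
    "customer social security number", "patient diagnosis code",
    "account routing number", "product page view count",
    "marketing campaign name", "server response time in ms",
]
labels = ["restricted", "restricted", "confidential",
          "internal", "internal", "internal"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(descriptions, labels)

# Suggest a classification for a newly discovered column; a data steward
# confirms or corrects it, and the decision feeds back as training data.
print(model.predict(["policyholder date of birth"]))
```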

Once in place, machine learning and automation have the potential to free up resources that can be devoted to improving self-service. Data engineers and data scientists — just like the general public — are insisting on user experiences that are more intuitive and more delightful. It won’t be enough to create clear, unified rules for data access and correct use if the interactions with the system confuse or discourage stakeholders from making the most of the data they can reach.

Single and unified — but shared

Saying that the policy catalog needs to be single and unified shouldn’t imply that it needs to be based on a single repository that is controlled autocratically by a single team. Just the opposite. We can’t expect a single team to understand the entire stack of data issues across the organization, nor can we expect a single tool to handle all types of data and their management requirements. Instead, the policy catalog has to rise up out of collaborative consensus across teams — and has to be able to delegate cleanly between policy domains.
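One way to picture that delegation: a unified catalog that presents a single interface but routes each lookup to the domain team's own catalog fragment. The structure and names below are invented for illustration, not a reference design.

```python
class DomainCatalog:
    """Catalog fragment owned and maintained by one domain team."""
    def __init__(self, policies):
        self.policies = policies

    def lookup(self, data_type):
        return self.policies.get(data_type, [])

class UnifiedCatalog:
    """Single logical catalog that delegates to domain-owned fragments."""
    def __init__(self):
        self.domains = {}

    def register(self, domain, catalog):
        self.domains[domain] = catalog

    def lookup(self, domain, data_type):
        # Delegation: the unified view answers every query,
        # but each domain team stays authoritative for its policies.
        return self.domains[domain].lookup(data_type)

catalog = UnifiedCatalog()
catalog.register("claims", DomainCatalog({"claim": ["retain 7 years"]}))
catalog.register("hr", DomainCatalog({"payroll": ["restricted access"]}))
print(catalog.lookup("claims", "claim"))  # ['retain 7 years']
```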

Each organization will find its own way. But for a policy catalog to enable the automation that organizations need to survive, CDOs, COOs, CIOs, and the rest of the executive suite will have to bring their teams to the table to hammer out common approaches where needed. Typically, teams will want to focus on the top-priority data sources first, prove the value of implementing a policy catalog, and then expand the program. But eventually the effort will mean cataloging all data sources and understanding how new data comes on board. It will mean creating honest, ego-free data classification schemes that are consumable and built to scale. And it will mean identifying and defining the value of decisions that the organization needs to make — and aligning the decisions to the data. (See my previous post about creating an enterprise-wide data science system.)

As complex as the process can be, the real determiner of success is culture: Can executives urge teams toward policies that reduce bottlenecks, guarantee compliance, and serve the future of the organization at large? Inevitably, increasing the level of automation requires increasing the level of trust in the policies that make the automation possible. That’s no easy task, especially in large organizations.

Back to machine learning

Again, this work matters primarily because it enables data-driven decision making and analytics — and increasingly machine learning is the engine of analytics. As you build your policy catalog and the rules for delegation between domains, keep the focus on enabling your machine learning teams. Strive to give them the power to ask and answer big questions — and they can make your business thrive.

For more about IBM’s data governance innovations, visit our Data Governance and Information Lifecycle Governance portals.

For more about our efforts around machine learning, reach out to us at our Machine Learning Hubs or visit us at the Data Science Experience. We’d love to continue the conversation.
