Data Governance and AI

A symbiotic relationship

Ravikiran durbha
Data And Beyond
Nov 18, 2023


Source: Image generated by AI (MidJourney)

As AI continues to seep into our lives, I often wonder how it will impact the data warehousing and BI landscape, and in particular data governance, which is a big part of that landscape. While AI will certainly have an impact in this space, the reverse is also true: data is the raw material for AI, and how it is governed will directly shape how AI is adopted. There is a symbiotic relationship between the two.

The data governance market is estimated at USD 2.73 billion in 2023 and is expected to reach USD 6.71 billion by 2028. With the exponential growth in AI, it may climb even higher. Data usage is surging across organizations, and as the value of data grows, so does the need to govern it. As organizations try to take advantage of recent developments in AI, data governance becomes even more important: garbage in, garbage out rings truer than ever with generative AI.

Back in 2018, the Wall Street Journal declared that a “global reckoning on data governance” had arrived. Massive data breaches at organizations in numerous sectors resulted in serious reputational damage and declining market value for top brands such as Equifax, Facebook, Marriott and Yahoo. But what does data governance entail?

Clearly, data security is a major part of governance, which means limiting access to data; but done well, governance also means liberating data. Good data governance should lead to its democratization. After all, data is more valuable in the hands of people than locked away. To be precise, it is more valuable in the hands of the right people. Ensuring that the right people have easy access to data is therefore the goal of data governance. But what does it take to secure data and democratize it at the same time?

One of the central capabilities needed for governance is a catalog of all data assets. A data asset is not just what we store in our databases; those are the technical assets, and we don’t need another tool just to catalog them. The business definition of the data generated by a process is also an asset. In fact, most organizations that lack data maturity do not capture this information well. Mapping business definitions to technical assets yields a wealth of information that improves ease of access for decision makers.
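
To make this concrete, here is a minimal sketch of what such a catalog entry could look like. All of the class and field names are illustrative, not taken from any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class TechnicalAsset:
    system: str   # e.g. the warehouse or lake where the data lives
    table: str    # physical table name
    column: str   # physical column name

@dataclass
class BusinessTerm:
    name: str         # the term a decision maker would search for
    definition: str   # business definition owned by process experts
    mapped_assets: list[TechnicalAsset] = field(default_factory=list)

# Mapping the business definition to its physical locations is what lets
# a decision maker find data without knowing the schema.
clv = BusinessTerm(
    name="Customer Lifetime Value",
    definition="Projected net revenue from a customer over the relationship.",
    mapped_assets=[TechnicalAsset("warehouse", "dim_customer", "ltv_usd")],
)
```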

Having the ability to search technical assets using business terms liberates data and democratizes it. On the other hand, we need to limit this access based on roles for data security. The policies that define this access are also data assets and should be part of the catalog. The key to implementing good data security is taking these cataloged policies and applying them automatically at the time of data access. The implementation should be flexible enough to mask either certain attributes in a record or the entire record.
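
As a rough illustration, the sketch below applies cataloged policies at the time of access. The policy model, roles and attributes are assumptions invented for this example:

```python
MASK = "***"

# role -> per-attribute rule: "allow" (default), "mask", or "deny".
# A "deny" on any attribute withholds the entire record.
POLICIES = {
    "analyst":   {"ssn": "mask", "email": "mask"},
    "steward":   {},
    "marketing": {"ssn": "deny"},
}

def apply_policy(record: dict, role: str) -> dict | None:
    """Return the record with masked attributes, or None if it is withheld."""
    rules = POLICIES.get(role, {})
    out = {}
    for attr, value in record.items():
        rule = rules.get(attr, "allow")
        if rule == "deny":
            return None  # mask the entire record
        out[attr] = MASK if rule == "mask" else value
    return out

row = {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"}
print(apply_policy(row, "analyst"))
# {'name': 'Ada', 'ssn': '***', 'email': '***'}
```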

A well-implemented catalog is not just a great tool for decision makers but also a great source for AI, which can aid decision makers with deeper insights. The reverse is also true: AI can be a great resource in implementing a data catalog. In the future, large language models (LLMs) could be fine-tuned on the metadata (the business term mappings and the data definitions of all the data assets) along with the requisite access policies. This enables natural language search that returns results governed by the set policies. That is the key to secure data democratization, as it further reduces the technical know-how needed to access information.
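
Here is a deliberately speculative sketch of how that could fit together, reusing the apply_policy check from the previous example. The nl_to_term function is a stand-in for the fine-tuned LLM, faked here with a simple keyword match:

```python
# A toy catalog: business term -> sample records from the mapped assets.
CATALOG = {
    "customer lifetime value": [
        {"name": "Ada", "ltv_usd": 1200.0, "ssn": "123-45-6789"},
    ],
}

def nl_to_term(question: str) -> str | None:
    """Stand-in for the fine-tuned LLM: map a question to a business term."""
    q = question.lower()
    return next((term for term in CATALOG if term in q), None)

def governed_search(question: str, role: str) -> list[dict]:
    term = nl_to_term(question)
    if term is None:
        return []
    # Reuse the same policy check applied at data access time, so natural
    # language search never returns more than the role is allowed to see.
    return [r for r in (apply_policy(rec, role) for rec in CATALOG[term]) if r]

print(governed_search("How is customer lifetime value trending?", "analyst"))
# [{'name': 'Ada', 'ltv_usd': 1200.0, 'ssn': '***'}]
```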

Another important aspect of governance is the rules that define and govern data quality. While these rules are defined by business process experts, they have to be written in a specific language (usually structured query language, or SQL). This allows the rules to be applied automatically in data pipelines to ensure the quality of the data flowing through them. Remediation, of course, remains a very human process, involving data stewards who understand the business process and can guide an appropriate action.
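
For illustration, here is a minimal sketch of one such rule, expressed in SQL and enforced as a pipeline step against an in-memory SQLite table. The rule and table are invented for this example:

```python
import sqlite3

# A quality rule a business expert might state as "order amounts must be
# positive", translated into SQL so a pipeline can enforce it automatically.
RULE_NAME = "order_amount_must_be_positive"
RULE_SQL = "SELECT COUNT(*) FROM orders WHERE amount <= 0"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.5), (2, -3.0)])

violations = conn.execute(RULE_SQL).fetchone()[0]
if violations:
    # Remediation stays human: flag the batch for a data steward rather
    # than guessing at a fix automatically.
    print(f"{RULE_NAME}: {violations} violating record(s), routing to a steward")
```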

This increases the trust of the decision makers consuming data from these pipes. At the same time, it has a huge impact on the quality of output from AI consuming data from the same pipes. As with the catalog, the reverse is also true: AI can have a profound impact on how data quality is implemented. Many data quality tools already use AI in some capacity. Today, they are limited to inferring technical rules (data types, formats, empty values, and so on) by looking at historical data. It remains to be seen whether AI can infer business rules as well; that may require an understanding of causation, not merely correlation, in historical data. The real impact, though, will come from LLMs that can automatically translate quality rules from natural language to SQL.
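
The profiling step that today's tools perform can be sketched roughly like this. The rows and inferred rules are illustrative, and this covers only technical rules, not business rules:

```python
# Profile historical rows and propose simple technical rules per column.
historical = [
    {"amount": 10.0, "country": "US"},
    {"amount": 25.5, "country": "DE"},
    {"amount": 12.0, "country": None},
]

def infer_rules(rows: list[dict]) -> dict:
    rules = {}
    for col in rows[0]:
        non_null = [r[col] for r in rows if r[col] is not None]
        rules[col] = {
            "type": type(non_null[0]).__name__,    # inferred data type
            "nullable": len(non_null) < len(rows), # whether nulls were seen
        }
    return rules

print(infer_rules(historical))
# {'amount': {'type': 'float', 'nullable': False},
#  'country': {'type': 'str', 'nullable': True}}
```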

We have seen the different components of data governance and their symbiotic relationship with AI. There are some we have not covered, like data lineage, which some may argue is outside the scope of data governance but is extremely important nevertheless. What is very often missed, though, is that one of the main components of data governance is the data model itself. It is the first step toward governing data. With the advent of the cloud, data lakes and schema-on-read patterns, data modeling took a back seat. The need to govern data is ushering it back to the front.
