The Role of Data Governance in The Era of AI

Dios Kurniawan
Published in Life at Telkomsel
6 min read · Feb 5, 2024

Like the COVID-19 virus, Generative AI (“GenAI”) applications based on Natural Language Processing (NLP) have swept across the world over the past year. Businesses are joining the race to reap the benefits of GenAI, attempting to improve business processes or to incorporate advanced features into their products. Large Language Models (LLMs) are increasingly adopted to streamline tasks in various domains: sales, customer support, marketing, HR, legal, and many more. According to a McKinsey study¹, up to 50 percent of today’s work activities could be automated by 2060.

Although GenAI has only recently seen exponential adoption, the concept behind it is not new. Like traditional AI, it always needs three key elements:

  1. algorithm
  2. computing power, and
  3. data.

It Is About Data

AI relies heavily on data. AI models take in data, learn patterns from it, and adjust their internal parameters so that they can generate output consistent with the data they were trained on.

Most AI use cases built on machine learning (ML) techniques involve processes that sift through massive datasets, either structured or unstructured, to produce predictions, categorizations, or recommendations. Compared to traditional IT systems, AI systems need data that is more dynamic, keeps evolving in terms of variety, and is large in terms of volume.

Considering growing AI adoption, it is becoming clear that the demand for data in organizations will skyrocket; building strong enterprise data pipelines to support AI development is an obvious business imperative. As a matter of fact, AI is a data product.

But beware: you need to feed AI systems with high-quality, accurate data to train and test models, as well as to produce results. If you fail to implement strict data quality measures, your AI systems might be trained on invalid datasets. As a result, they can produce inconsistent, inaccurate, or biased outputs.
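As a rough illustration of what such data quality measures could look like before training, here is a minimal Python sketch. The `SalesRecord` fields and the three rules (missing field, negative amount, duplicate) are assumptions chosen for illustration, not rules taken from any particular standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SalesRecord:
    # Hypothetical record layout, for illustration only.
    customer_id: Optional[str]
    product: Optional[str]
    amount: Optional[float]


def find_quality_issues(records):
    """Return (index, reason) pairs for records that fail basic checks."""
    issues = []
    seen = set()
    for i, r in enumerate(records):
        # Completeness: every field must be present.
        if not r.customer_id or not r.product or r.amount is None:
            issues.append((i, "missing field"))
            continue
        # Validity: a sale amount should not be negative.
        if r.amount < 0:
            issues.append((i, "negative amount"))
        # Uniqueness: flag exact repeats of an earlier record.
        key = (r.customer_id, r.product, r.amount)
        if key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)
    return issues


records = [
    SalesRecord("C001", "data plan", 50.0),
    SalesRecord("C002", None, 20.0),
    SalesRecord("C003", "voucher", -5.0),
    SalesRecord("C001", "data plan", 50.0),
]
issues = find_quality_issues(records)
bad_rows = {i for i, _ in issues}
clean = [r for i, r in enumerate(records) if i not in bad_rows]
```

Only the records in `clean` would move on to training; the rest would be sent back to the data owner for correction.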

For example, to turn systems based on LLMs into customized expert systems for a specific topic, a technique called RAG (Retrieval-Augmented Generation) is commonly employed, which supplements the model with content retrieved from external data repositories. Because this is done on the fly, supplying the LLMs with low-quality data can directly result in disastrously wrong responses (a failure often grouped with hallucinations).
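To make the point concrete, here is a toy sketch of the retrieval step in RAG with a governance gate in front of it. Real systems usually retrieve with vector embeddings; the word-overlap scoring, the `approved` flag, and the sample documents below are simplifying assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank approved documents by word overlap with the question."""
    q = tokenize(question)
    # Governance gate: unapproved documents never reach the model.
    approved = [d for d in documents if d["approved"]]
    scored = [(len(q & tokenize(d["text"])), d) for d in approved]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def build_prompt(question, documents):
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n".join(f"- {d['text']}" for d in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


documents = [
    {"text": "The roaming package costs 10 dollars per day.", "approved": True},
    {"text": "The roaming package is free forever.", "approved": False},
    {"text": "Office hours are 9 to 5.", "approved": True},
]
prompt = build_prompt("How much does the roaming package cost?", documents)
```

Without the `approved` filter, the incorrect "free forever" document would score just as high as the correct one and could end up in the answer.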

For most businesses, AI systems are only as good as the data supplied to them; the algorithm matters much less. No AI application would work without good data.

Governing The Data

It is important not just to manage AI development projects but also to govern the data put into those AI systems.

This is where Data Governance comes into play.

Data Governance sits between the three key elements and AI applications

The DAMA Data Management Framework² specifies data quality as one of ten pillars of essential data management practices, with Data Governance at the center.

Unlike traditional applications such as dashboards and reporting applications, where problems in data can be seen immediately, poor data quality in AI and ML systems is harder to detect. Robust data governance helps companies set up policies, procedures, and standards for data quality checks, even at early stages of the process. Data engineers need to be well equipped with guidelines to design comprehensive data quality mechanisms along the data pipeline for AI systems.

Data is the key to AI success and, as such, implementing Data Governance should be a top priority for all technology leaders.

Standardization of data is crucial as well. Metadata management, another pillar of data management practices, is needed to understand the origin, sensitivity, and lifecycle of the data being used within AI applications. In some use cases, the law might mandate transparency of AI systems, which makes data lineage necessary to track where the training data came from.
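One possible way to picture this is a small metadata catalog in which every dataset records its origin, sensitivity, retention period, and upstream sources. The sketch below is only an illustration; the `DatasetMetadata` fields and the dataset names are hypothetical, and production catalogs are far richer.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    # Hypothetical metadata fields, for illustration only.
    name: str
    origin: str
    sensitivity: str          # e.g. "public", "internal", "personal"
    retention_days: int
    derived_from: list = field(default_factory=list)


def trace_lineage(name, catalog):
    """Return all upstream dataset names, nearest first."""
    lineage = []
    queue = list(catalog[name].derived_from)
    while queue:
        parent = queue.pop(0)
        if parent not in lineage:
            lineage.append(parent)
            queue.extend(catalog[parent].derived_from)
    return lineage


catalog = {
    "crm_export": DatasetMetadata("crm_export", "CRM system", "personal", 365),
    "sales_clean": DatasetMetadata(
        "sales_clean", "cleaning job", "personal", 180, ["crm_export"]
    ),
    "training_set": DatasetMetadata(
        "training_set", "feature job", "internal", 90, ["sales_clean"]
    ),
}
lineage = trace_lineage("training_set", catalog)
# True if any upstream source of the training set holds personal data.
has_personal_source = any(catalog[n].sensitivity == "personal" for n in lineage)
```

With lineage recorded this way, a question such as "does this training set descend from personal data?" becomes a simple lookup instead of an investigation.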

Using external data to train AI models or to include in the RAG process is also known to introduce intellectual property risks. Data Governance standards for tagging and filtering incoming data might help avoid legal issues.

Privacy and Security

Another common issue in AI implementation is the risk associated with sensitive data. Corporate data assets or personal data can easily slip into training datasets, and the model could leak sensitive information. For example, your company could subscribe to a service such as ChatGPT and build a custom model trained on your sales data, with the goal of helping employees get answers to questions like “What were the highest-selling products last month?”. If you are not careful in preparing the datasets, the AI system might unexpectedly reveal the identity of customers in its responses. The privacy of your customers will be at risk.

Exposing corporate data to cloud-based AI providers introduces another risk for companies. It is no secret that some AI providers may use customer data for their own benefit, for example to improve their products.

Because of this, Data Governance is considered a key attribute of GenAI evaluations and should, at a minimum, be tested prior to deployment of LLMs to ensure safety. Evaluation includes assessment of the tendency of the LLMs to regurgitate training data in their outputs³.

Regulatory Compliance

Data Governance security policies such as masking and anonymization of personal data will play a pivotal role in preventing an organization from running into situations where data protection and privacy laws might be breached.
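As a minimal sketch of what masking could look like in practice, the Python code below replaces email addresses and phone numbers in free text and swaps customer IDs for keyed hashes. The regular expressions are deliberately simple and will miss many real-world formats, and the secret key shown is a placeholder. Note also that a keyed hash is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac
import re

# Simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_text(text):
    """Replace email addresses and phone numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def pseudonymize(customer_id, secret_key):
    """Map a customer ID to a stable token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return "cust_" + digest.hexdigest()[:12]


secret_key = b"replace-with-a-managed-secret"  # placeholder, not a real key
masked = mask_text("Contact budi@example.com or +62 812-3456-7890 today")
token = pseudonymize("C001", secret_key)
```

Because the same ID always maps to the same token, analysts can still count and join records without ever seeing the real identifier.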

Likewise, there are times when data for AI systems must be destroyed because it no longer fits the model. Governance that includes written policies and procedures on data operations and storage is often needed for compliance with laws such as the General Data Protection Regulation (GDPR).

Article 10 of the European Union AI Act specifically names appropriate data governance as a key requirement for the development of high-risk AI systems. In the future, it might be illegal to run AI systems without auditable data governance.
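A written retention policy can be turned into a routine check. The sketch below splits records into those still within the retention period and those due for deletion. The 365-day period and the `collected_on` field are hypothetical; the actual period must come from your own policy and legal advice.

```python
from datetime import date, timedelta


def split_by_retention(records, today, retention_days):
    """Separate records still within retention from those due for deletion."""
    cutoff = today - timedelta(days=retention_days)
    keep = [r for r in records if r["collected_on"] >= cutoff]
    expired = [r for r in records if r["collected_on"] < cutoff]
    return keep, expired


# Hypothetical records with their collection dates.
records = [
    {"id": 1, "collected_on": date(2023, 12, 1)},
    {"id": 2, "collected_on": date(2022, 6, 15)},
]
keep, expired = split_by_retention(records, today=date(2024, 2, 5), retention_days=365)
```

In a real pipeline, the `expired` list would feed a deletion job whose runs are logged, so that the destruction of data is itself auditable.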

Bringing AI to Real World

Data Governance will play an even more important role when your organization begins to operationalize AI products and services. Policies and standards help organizations navigate the complexities of providing a stable stream of reliable, high-quality data. Without them, GenAI implementation would not get off the ground.

MLOps (Machine Learning Operations), a set of practices that aims to improve AI / ML deployment and operation processes, has lately gained a lot of traction in the industry and is worth considering for any organization adopting AI. It is closely aligned with Data Governance goals, as governance itself is the backbone of MLOps.

Data governance is essential for AI adoption in your organization. But unlike computing power, which can be easily bought, data governance is not something you can actually buy. It must be gradually grown and nurtured in your own organization.

Data Governance costs money, but the benefits in the long run simply outweigh the costs.

According to an IDC study⁴, there will be a dramatic increase of up to 40% in IT spending on AI-related initiatives within global companies in 2025. Going forward, it will make good sense for business leaders to invest more in data governance initiatives to ensure successful AI adoption.

Many companies build GenAI prototypes that may look nice during demos, but when it comes to building production-level GenAI applications that really work, I believe only a few (those who understand the importance of data governance) will be successful.

¹ McKinsey Report, The Economic Potential of Generative AI: The Next Productivity Frontier, 2023

² DAMA Data Management Body of Knowledge (DMBOK), 2017

³ Cataloguing LLM Evaluations, Infocomm Media Development Authority Singapore, 2023

⁴ IDC FutureScape: AI Will Reshape IT Industry and The Way Businesses Operate, October 2023

Dios Kurniawan
Life at Telkomsel

Big Data Analytics, Data Warehousing, Machine Learning, Software Development, Data Governance, Privacy and Protection