Data Governance Demystified: Explained in 5+1 Minutes for Immediate Clarity

Syed Fawad Ali
6 min read · Oct 28, 2021


What is Data Governance?

Data governance increases data protection, trustworthiness, accessibility, usability and understanding[1]. It refers to the practice of identifying important data across an organization, ensuring its quality, enhancing its value, and making it reusable by people in the organization in an effective and compliant way.

The need for data governance is easy to understand when you imagine working for a huge data-driven organization where data is produced, consumed, and shared across different teams and hundreds of thousands of employees. It can be very challenging to locate the right data, get access to it, understand its meaning and quality, and determine how it can be used in a compliant manner, especially if you need data that you have never used before. This is where data governance comes into play.

Key Elements of Data Governance

There are five[2] key elements of data governance (see the figure). The following subsections explain each element in turn:

Making data more findable and accessible

The first and foremost element of data governance is making sure that your data is easily searchable and accessible to authorized users in a compliant way. When multiple teams and departments in your organization ingest data from many disparate sources, simply finding the required data in the first place can be very difficult for a data consumer.

Data governance introduces the idea of the Data Catalog. As the name implies, a data catalog is an inventory of the data assets in an organization. It enables metadata to be collected, organized, accessed, and enriched to support data discovery and governance. An enterprise data catalog can be managed effectively with many software tools, which offer functionality for describing which data is available and what it contains. That is, a catalog holds essential information (metadata) about the data, which helps its users find and understand the data easily. For example, suppose you would like to ingest an ‘Employee’ table: a catalog lets you extract the structure (i.e., the column attributes) of the table. Within the catalog, you can then add further information about the characteristics and content of the data, such as column types and the business meaning behind the columns, to make the data more understandable. More on this in the following sections.
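To make this concrete, here is a minimal, purely illustrative sketch of a catalog entry and a keyword-based discovery lookup. All table names, owners, and helper functions are hypothetical; this is not the API of any real catalog tool, which would extract the schema automatically.

```python
# Toy in-memory data catalog (all names hypothetical).
catalog = {}

def register_table(name, columns, owner, description=""):
    """Register a table and its column metadata in the catalog."""
    catalog[name] = {
        "owner": owner,
        "description": description,
        # column name -> {"type": ..., "meaning": ...}
        "columns": columns,
    }

def find_tables(keyword):
    """Simple discovery: find tables whose name or description mentions a keyword."""
    keyword = keyword.lower()
    return [
        name for name, meta in catalog.items()
        if keyword in name.lower() or keyword in meta["description"].lower()
    ]

# A data owner registers the extracted schema plus curated descriptions:
register_table(
    "Employee",
    columns={
        "emp_id": {"type": "int", "meaning": "Unique employee identifier"},
        "dept": {"type": "str", "meaning": "Department code, see HR glossary"},
    },
    owner="hr-data-team",
    description="Master record of all employees",
)

print(find_tables("employee"))  # ['Employee']
```

A real catalog adds access controls, lineage, and automated schema harvesting on top of this basic structure.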

Making data more understandable

As mentioned above, a data catalog enables its users to define and explain characteristics and content of its data. When you are dealing with lots of datasets across an organization, it becomes very difficult to know the business meaning behind each and every field in the dataset.

For example, consider yourself a data analyst in a large customer-centric organization, working on a task to compute a Customer Value Index (CVI). To calculate the CVI, you need to import data from multiple tables associated with customers and customer behavior. However, some tables you have never used before contain complex numbers that you do not understand. In this scenario, the owners of the dataset, who know all the complex calculations behind the fields you are interested in, must document this valuable business information within the catalog. That way, when you use these tables, you already know what each data field contains, what its business meaning is, and how you can utilize certain fields in the calculation of your CVI.

If everyone contributes, it becomes natural to fill a data catalog with very useful information and bring it to life. In return, everyone benefits when they have to work with a dataset they have not worked with before.
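As a toy sketch of this idea (the field names, table, and formula below are all hypothetical), a catalog-style glossary lets a data owner document a derived field once, so that any analyst can look up its business meaning instead of guessing:

```python
# Hypothetical business glossary keyed by (table, column).
field_glossary = {}

def document_field(table, column, meaning, formula=None):
    """The data owner records the business meaning of a field."""
    field_glossary[(table, column)] = {"meaning": meaning, "formula": formula}

def explain(table, column):
    """An analyst looks up what a field means before using it."""
    entry = field_glossary.get((table, column))
    if entry is None:
        return f"{table}.{column}: undocumented -- ask the data owner"
    text = f"{table}.{column}: {entry['meaning']}"
    if entry["formula"]:
        text += f" (formula: {entry['formula']})"
    return text

# The owner of a customer-behavior table documents a derived metric:
document_field(
    "customer_behavior", "churn_score",
    meaning="Estimated probability that the customer leaves within 90 days",
    formula="logistic model over recency, frequency, monetary value",
)

print(explain("customer_behavior", "churn_score"))
print(explain("customer_behavior", "segment_code"))
```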

Making data more trustworthy and improving the quality of data

Now you can search and access the data using the catalog, and you understand the different characteristics of the data. The next question is: are you sure the data you are using is of high quality and trustworthy? If the data is of poor quality, then your calculated Key Performance Indicators (KPIs) will give you wrong (or at least misleading) information.

Thankfully, data governance practices define a clear role for a Data Quality Manager, whose main job is to ensure that the quality of the most important data assets is checked regularly and that problems are reported back to the data owners and data producers.

On the one hand, data quality can be assessed in straightforward ways, e.g., by checking whether there are missing or duplicate values in a dataset. Such anomalies can be resolved with well-known methods like imputation[3] and entity resolution[4]. On the other hand, there are logical and semantic errors in the data that cannot be found easily at first glance. Here, Exploratory Data Analysis[5] can surface outliers in your dataset, which can then be verified by subject-matter experts.
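The straightforward checks can be sketched in a few lines of plain Python. The table, column names, and outlier threshold below are illustrative assumptions; a real pipeline would use a dedicated data quality tool or a library such as pandas:

```python
from statistics import median

def quality_report(rows, key, value_col):
    """Report missing values, duplicate keys, and simple outlier candidates."""
    missing = sum(1 for r in rows if r.get(value_col) is None)
    keys = [r[key] for r in rows]
    duplicates = len(keys) - len(set(keys))
    values = [r[value_col] for r in rows if r.get(value_col) is not None]
    # Robust, EDA-style outlier check: flag values far from the median.
    # Flagged values are candidates for review by subject-matter experts,
    # not automatic errors.
    med = median(values)
    mad = median(abs(v - med) for v in values)
    outliers = [v for v in values if abs(v - med) > 10 * max(mad, 1)]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

rows = [
    {"emp_id": 1, "salary": 52000},
    {"emp_id": 2, "salary": 58000},
    {"emp_id": 2, "salary": 58000},      # duplicate key
    {"emp_id": 3, "salary": None},       # missing value
    {"emp_id": 4, "salary": 5_000_000},  # candidate outlier
]
print(quality_report(rows, key="emp_id", value_col="salary"))
# → {'missing': 1, 'duplicates': 1, 'outliers': [5000000]}
```

The median-based check is deliberately simple; imputation[3] or entity resolution[4] would then be applied to the anomalies it finds.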

Empowering data users with self-service

In Business Intelligence (BI) environments, self-service means empowering users to be less reliant on the IT organization and more self-sufficient. It mainly focuses on easy access to source data for reporting and analysis, easy-to-use BI tools, and improved support for data analysis[7]. In essence, self-service means that as many employees as possible, whether expert data users or not, have access to high-quality data, which they can combine, transform, and visualize with easy-to-use BI and dashboarding tools to deliver insightful analysis[6].

As explained in the preceding sections, making data searchable, understandable, accessible, trustworthy, and of high quality is the key to empowering data users to draw insights from the data without having to request it from the data owner again and again.

Ensuring data privacy and security

Data Security and Data Protection are two extremely broad and complex fields related to data. Data Security refers to protecting data against unauthorized access and use. To ensure that data is secure, Authentication, Access Control, and Encryption may be used.

Authentication is the process of proving that the claimed identity is true, i.e., verifying that the person is who they claim to be.
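As an illustrative sketch (not a production authentication system), password-based verification can be implemented by storing a salted hash and re-deriving it on login. This uses only the Python standard library; the credentials are made up for the example:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted hash of the password; returns (salt, digest)."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def authenticate(password, salt, stored_digest):
    """Verify the claimed identity by re-deriving and comparing the hash.

    hmac.compare_digest avoids timing side channels in the comparison.
    """
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

salt, stored = hash_password("correct horse battery staple")
print(authenticate("correct horse battery staple", salt, stored))  # True
print(authenticate("wrong password", salt, stored))                # False
```

Real systems layer more on top (rate limiting, multi-factor authentication, dedicated password-hashing schemes), but the core idea of proving a claimed identity is the same.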

Access Control limits who can access which data and in what ways. For example, if you have a database containing customer data, you would grant only certain users the rights to access it.
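A minimal sketch of this idea, assuming a simple role-based model in which permissions are (resource, action) pairs; the roles and resources are hypothetical:

```python
# Hypothetical role-based access control table: role -> allowed (resource, action) pairs.
ROLE_PERMISSIONS = {
    "analyst": {("customers", "read")},
    "admin": {("customers", "read"), ("customers", "write")},
}

def can_access(role, resource, action):
    """Check whether a role may perform an action on a resource."""
    return (resource, action) in ROLE_PERMISSIONS.get(role, set())

print(can_access("analyst", "customers", "read"))   # True
print(can_access("analyst", "customers", "write"))  # False
print(can_access("intern", "customers", "read"))    # False
```

Production systems express the same check through database grants or policy engines rather than an in-memory table, but the question answered is identical: who can do what to which data.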

Using Encryption, data is hidden from unauthorized access by encoding it in a way that cannot be decoded without the corresponding key, i.e., the data is protected from being read, manipulated, or fabricated.

Regulations and practices that protect individual privacy are referred to as Data Protection; they govern how personal data is collected, stored, and processed. Several countries recognize an individual’s right to privacy as a fundamental human right. In such cases, you must adhere to a comprehensive set of data protection rules and regulations, e.g., the General Data Protection Regulation (GDPR)[8].

References

[1] Abraham, Rene, Johannes Schneider, and Jan Vom Brocke. “Data governance: A conceptual framework, structured review, and research agenda.” International Journal of Information Management 49 (2019): 424–438.

[2] Alexander Thamm, Michael Gramlich, and Alexander Borek. The Ultimate Data and AI Guide. Data and AI Press, 2020.

[3] Shrive, Fiona M., et al. “Dealing with missing data in a multi-question depression scale: a comparison of imputation methods.” BMC medical research methodology 6.1 (2006): 1–10.

[4] Getoor, Lise, and Ashwin Machanavajjhala. “Entity resolution: theory, practice & open challenges.” Proceedings of the VLDB Endowment 5.12 (2012): 2018–2019.

[5] Leinhardt, Samuel, and Stanley S. Wasserman. “Exploratory data analysis: An introduction to selected methods.” Sociological methodology 10 (1979): 311–365.

[6] Eckerson, Wayne W. Performance dashboards: measuring, monitoring, and managing your business. John Wiley & Sons, 2010.

[7] Alpar, Paul, and Michael Schulz. “Self-service business intelligence.” Business & Information Systems Engineering 58.2 (2016): 151–155.

[8] Voigt, Paul, and Axel Von dem Bussche. “The eu general data protection regulation (gdpr).” A Practical Guide, 1st Ed., Cham: Springer International Publishing 10 (2017): 3152676.



Syed Fawad Ali

AI enthusiast and data platform designer. Exploring creativity with tech. Follow for insights and everyday adventures