Govern your data: it’s a tough job, but someone has to do it!

Daniele Uboldi
Published in Quantyca
Jun 23, 2020
Daenerys Targaryen governing her data. Spoiler alert: she failed. (“Game of Thrones”, season 8 episode 6, 2019)

I have thought many times about the moment when this would happen: writing my first post on Medium. Before we begin, let me introduce myself.

I am Daniele, a telecommunications engineer (network specialization), and I'm currently working as a Junior Data Engineer at Quantyca, Data @ Core, an Italian IT consulting firm.

In this article I will try to show you the benefits of Data Governance and, at the very least, the main initial steps to deal with it, doing my best not to bore you 😇

Data governance (DG) is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage.

Even if its definition sounds important, in our experience here in Italy it is commonly snubbed: nowadays very few companies care about it, and even in those that do, it is usually left as the last thing to take care of.

Wrongly.

Brief introduction to Data Governance

We have just seen its definition, so let's try to investigate a bit deeper.
Data Governance needs to be a continuous process. Why?
It includes naming conventions, business term standards and a lot of guidelines and best practices around them, so if tomorrow or next week a new business word starts being used in your company, DG needs to take care of it.
From another point of view, if a new table (or, more realistically, many of them) is created in your company database, DG needs to take care of it and let everyone interested know what kinds of data it contains and who is accountable for them.

As you can see, Data Governance has its fingers in many pies, and those pies are not useless at all.
The question arises automatically:

why is she (DG is a lady, don't you know?) so underrated?

I have talked about guidelines and best practices, but who are they for?
Everyone in the company. Data Governance starts from people: without every worker's contribution it would not be possible to start and carry out a governance strategy.
It is a long and difficult process that must be tailored to each company and context; it also has to reconcile many actors' opinions on definitions, terms and best practices, not exactly the easiest task to do.
Here lies the main difference from all the other topics handled by the various teams inside the company: there can be a Data Engineering team that manages data pipelines and tools, a Data Science team that works on data analysis and very challenging algorithms, a Sales team that tries to sell company products to customers, and we could certainly mention many others, but all of them need to cooperate, guided by the Governance team, to make this process real.

If all this does not happen, the whole governance process remains vague, and that's no secret.

Data Governance starting pillars: Business Glossary and Data Catalog

When a newborn Data Governance process learns to walk, its first two steps have well-defined names: Business Glossary and Data Catalog.
They are two sides of the same coin.

Harvey Dent (Aaron Eckhart) in "The Dark Knight" (2008)

The Business Glossary is a document that aims at consolidating and sharing a common enterprise business dictionary. The document is business-oriented and should describe concepts as they are referred to by the business staff. The business people, who know the company and are supervised by the governance team, must fill in this dictionary iteratively.
Each term will also be associated with other valuable metadata like synonyms, metrics, lineage, business rules, etc.

In our approach, this document has two levels of description: entities and attributes. An entity represents a logical concept that is meaningful for describing the organization's business; an attribute is a specific piece of logical information that is almost always used to characterize the entity.

Are you able to see some similarities? No?
What if I make this substitution:

  • logical → physical

Are you able to see it now? Of course you are, it is the other side of the coin, the Data Catalog.

It is a document that aims to share the physical organization of data across the company. It is structured in three levels:

  1. systems: physical objects responsible for storing or processing data (databases? ETL tools? I'm sure they sound familiar to many of you);
  2. entities: physical containers of data in a system (tables?);
  3. attributes: physical fields linked to an entity describing its characteristics (columns?).
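
To make the two-sides-of-the-coin metaphor concrete, here is a minimal sketch in Python of how the two models and the link between them could be represented. All the class and field names are mine, chosen just for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# --- Logical layer: the Business Glossary ---

@dataclass
class LogicalField:
    name: str                       # e.g. "Customer Name"
    definition: str                 # business-oriented description
    synonyms: List[str] = field(default_factory=list)

@dataclass
class LogicalEntity:
    name: str                       # e.g. "Customer Master Data"
    definition: str
    attributes: List[LogicalField] = field(default_factory=list)

# --- Physical layer: the Data Catalog ---

@dataclass
class PhysicalAttribute:
    name: str                       # e.g. a column called "first_name"
    data_type: str                  # e.g. "VARCHAR(50)"
    maps_to: Optional[LogicalField] = None  # the logical-physical link

@dataclass
class PhysicalEntity:
    name: str                       # e.g. a table called "CustomerMaster"
    attributes: List[PhysicalAttribute] = field(default_factory=list)

@dataclass
class System:
    name: str                       # e.g. "CRM"
    entities: List[PhysicalEntity] = field(default_factory=list)
```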

By defining the Data Catalog it is possible to describe the data in detail from a physical point of view.
As you can imagine, having a complete Data Catalog is not something that can be done in a few hours, and often not even in a few days or weeks; this is for sure a problem that any company has to face but, for now, I don't want to spoil anything 🙊

To do a short recap: on one side we have the logical world, the Business Glossary, a dictionary full of business terms inserted by the business staff, while on the other side there is the physical layer, the Data Catalog, which represents what physically contains the company's data.

And here comes another spontaneous question: where do they meet? In the next paragraph (not literally there, just joking): in Data Discovery.

An elegant date between Business Glossary and Data Catalog: Data Discovery

To explain in a very simple way what Data Discovery is, I will steal a gif from a Disney cartoon (pardon me, Walt!).

On the left you can see the Business Glossary, on the right the Data Catalog (from "Lady and the Tramp", 1955)

Exactly, your thoughts are right: the "spaghetto" between Lady and the Tramp is the Data Discovery.

Data Discovery is a term used to describe the process for collecting data from various sources by detecting patterns and outliers with the help of guided advanced analytics and visual navigation of data, thus enabling consolidation of all business information.

If you are new to this topic, this definition may sound a bit blurry. Let me give an example by referring to the two governance pillars we have already defined.

Imagine your company has a highly populated database with many schemas, each one containing a lot of tables. Let's suppose that different schemas are managed by different teams, but there are some cross-cutting activities that lead people belonging to one team to work with another team's schema.
How can they know what the strangely named columns in that peculiarly named table represent?

First answer: Documentation.

My (everbody’s?) opinion: reading tons of pages is never the most interesting and engaging thing to do! 😵
Of course, I’m not judging its general usefulness!

So we come to the second (better) answer: navigating a good Data Governance tool (even better if it also includes the two governance pillars mentioned before) that has done its job and correctly mapped the physical layer to the logical one (our old friends the Data Catalog and the Business Glossary, respectively).

How is this mapping done? Two main ways.

  1. Manually (sounds like a kick in the jewels, doesn't it?)
  2. With a rule-based algorithm (yeeee! 👍)

There is a hidden requirement behind the second option: user engagement.
How can you be sure that your algorithm will map all physical attributes exactly to their corresponding logical ones?
If your answer is "I'm the best Data Scientist in the world, I'll show you", you're challenging the human mind and its imagination for producing very colourful names for tables and columns.

The algorithm can produce a list of candidates, or a single one with the highest matching accuracy (if you're really one of the best Data Scientists), but the user needs to confirm or modify that choice. The degree of user involvement in the process is a company-specific decision.
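
Just to give you an idea of what option 2 could look like, here is a toy sketch. I don't know the internals of any real tool's matching algorithm, so this is purely an assumption of mine: it scores a physical column name against the logical fields and returns a ranked list of candidates for the user to confirm.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase a name and flatten the separators humans love to mix."""
    return name.lower().replace("_", " ").replace("-", " ").strip()

def match_candidates(physical_column, logical_fields, top_n=3):
    """Rank logical field names by string similarity to a physical column name."""
    scored = [
        (field, SequenceMatcher(None, normalize(physical_column), normalize(field)).ratio())
        for field in logical_fields
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]  # the "matching accuracy" scores mentioned above

# The algorithm proposes, the user disposes:
candidates = match_candidates("first_name", ["Customer Name", "Order Date", "Customer ID"])
print(candidates)  # best candidates first, each with its similarity score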

I’ve mentioned the navigation of a Data Governance tool, let me show in practice how it can be done.

Quantyca’s tool for DG: Blindata

Blindata (LinkedIn) is a SaaS platform to manage all of a company's Data Management processes. It is divided into two main areas, Data Governance and Data Compliance, and you already know which one I'm going to focus on (hint: not the second one).

The governance section has four main functionalities:

  • Physical Model (a.k.a. Data Catalog): here all physical systems, entities and attributes can be inserted, described and navigated;
  • Logical Model (a.k.a. Business Glossary): it contains all logical entities, called Data Categories, and attributes, called Logical Fields;
  • Data Flows: this functionality is the one that implements Data Lineage, which includes the data origin, what happens to the data and where it moves over time;
  • Analyze: this is the Blindata Data Governance navigation tool.

The mapping between what belongs to the physical layer and what belongs to the logical one can be declared in the Physical Model for each entity and/or attribute.

Let’s start from a big (literally) picture and some indications for you to better follow the graph below:

  • ERP, ETL, CRM are defined as systems in the Data Catalog;
  • CustomerMaster, Sales Orders, Customer and >ERP CustomerMaster are defined as entities of their respective systems (again, Data Catalog);
  • first_name and both of the Name attributes are linked to the logical field Nome, which belongs to the logical entity Customer Master Data.

A demo of the Blindata Data Discovery navigation tool

All the arrows are relations between objects; the dashed ones, instead, represent the Data Flows: in the example, the Customer Name of the ERP system is populated by retrieving data from the first_name column in the CustomerMaster table of the CRM system through an ETL job called >ERP CustomerMaster.
This is a very simple example of Data Lineage, which is useful for keeping track of the data in order to help the company with legislation requirements and to monitor business changes.

“Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.”
(The DAMA Dictionary of Data Management, 2nd Edition)
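
To make the idea tangible, here is a tiny sketch of the example above, expressed as a list of data flows that can be walked backwards to the source. It is my own illustration, not how Blindata stores lineage internally:

```python
# Each flow: (source attribute, job that moves the data, target attribute)
data_flows = [
    ("CRM.CustomerMaster.first_name", ">ERP CustomerMaster", "ERP.Customer.Name"),
]

def trace_back(target, flows):
    """Walk the flows backwards from a target attribute to its origin."""
    path = [target]
    for source, job, tgt in flows:
        if tgt == target:
            path = trace_back(source, flows) + [f"--[{job}]-->", target]
    return path

print(" ".join(trace_back("ERP.Customer.Name", data_flows)))
# CRM.CustomerMaster.first_name --[>ERP CustomerMaster]--> ERP.Customer.Name
```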

Once you have all this information at hand, it's easy to hold all the steps of Data Governance together and to answer many questions. For instance: which physical tables contain columns that refer to a specific business concept? What is the meaning of the data contained in table X from the business point of view? And many others.
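
Sticking with the toy model sketched earlier (the System/PhysicalEntity classes are the illustrative ones I defined above), answering the first question could be as simple as a reverse lookup over the mappings:

```python
def tables_for_concept(systems, logical_field_name):
    """Find every physical attribute mapped to a given logical field."""
    return [
        f"{system.name}.{entity.name}.{attribute.name}"
        for system in systems
        for entity in system.entities
        for attribute in entity.attributes
        if attribute.maps_to and attribute.maps_to.name == logical_field_name
    ]
```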

One by one, all the benefits of implementing a good Data Governance strategy are emerging. Recalling a previous question: why is she so underrated?

Time and resources.

As I mentioned before, it is a continuous process and it involves, to different degrees, almost the whole staff of the company. In addition, its benefits are not tangible in the short term. All these features make it the last entry in many companies' project lists.

In the next paragraph I'm going to show you a prototype solution I developed to reduce the time needed to get a complete Data Catalog.

Data Catalog incremental feeding

It's time to unveil our solution for achieving a filled Data Catalog!
Now, if you think about the average size of a corporate database, it's immediately clear that the Data Catalog can't be fed manually.

A baby Data Catalog being fed by the prototype

During one of Quantyca's projects for a client, I had the opportunity to deal with this problem, working with a Vertica database, with its physical structures, and the Blindata platform for the governance strategy.

I developed a Python script that detects changes in the database and updates the Data Catalog on Blindata accordingly, using its native API.

In this prototype two metadata tables are used:

  • Vertica columns table (v_catalog schema): it contains all the metadata of the objects actually present in the database. There is no need to create it or keep it updated because it is automatically managed by Vertica itself (more info here);
  • fields_metadata table (governance_metadata schema): it contains all metadata about physical objects created in the Blindata Data Catalog. It is not present by default in Vertica so it needs to be created.

The application follows an iterative approach by distinguishing between two integration logics: the so-called initial load and the incremental feeding.
If nothing is present in the Data Catalog when the script is executed, the fields_metadata table will be empty and an initial load is required; otherwise, some records have already been inserted in the table and the incremental update logic is applied.

Of course, it needs to be scheduled to run with a certain frequency (daily? weekly?); it depends on how often physical structures in your database are created, dropped or modified.
At the time of writing, the classes of physical changes detected by the application are the following (a minimal sketch of the detection step comes right after the list):

✓ Table creation/deletion

✓ Table schema change

✓ Column creation/deletion

✓ Column datatype modification
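
To give you a more concrete feel for the prototype, here is a stripped-down sketch of its detection step. The two metadata tables are the ones described above, while the diffing logic and the connection details are simplified assumptions of mine (and pushing the detected changes to Blindata through its API is left out):

```python
import vertica_python  # the open-source Vertica client for Python

# What is physically in the database right now (maintained by Vertica itself)
CURRENT_SQL = """
    SELECT table_schema, table_name, column_name, data_type
    FROM v_catalog.columns
"""
# What the Data Catalog already knows (maintained by the prototype)
KNOWN_SQL = """
    SELECT table_schema, table_name, column_name, data_type
    FROM governance_metadata.fields_metadata
"""

def detect_changes(conn):
    """Diff the live database against the metadata snapshot of the Data Catalog."""
    cur = conn.cursor()
    cur.execute(CURRENT_SQL)
    current = {tuple(row[:3]): row[3] for row in cur.fetchall()}
    cur.execute(KNOWN_SQL)
    known = {tuple(row[:3]): row[3] for row in cur.fetchall()}

    if not known:  # empty snapshot: first run, the initial load is required
        return {"initial_load": sorted(current)}

    return {  # otherwise, apply the incremental logic
        "created": sorted(current.keys() - known.keys()),
        "deleted": sorted(known.keys() - current.keys()),
        "modified": sorted(
            key for key in current.keys() & known.keys()
            if current[key] != known[key]  # datatype changed
        ),
    }

# Example usage (connection parameters are placeholders):
# with vertica_python.connect(host="...", port=5433, user="...",
#                             password="...", database="...") as conn:
#     print(detect_changes(conn))
```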

As I said, this is just an initial model of the solution to feed the Data Catalog automatically, and for sure it can be improved and expanded, for example by also including database views or by integrating an algorithm to manage the Data Discovery process (this would be amazing! Not just the goal but also the effort needed to reach it!), among many other things.

Conclusions

This article was not meant to give you a lecture on how to implement your Data Governance strategy, nor to hand you the passe-partout to the governance world, which contains many other topics (Data ownership, Data stewardship, Data quality, etc.).
I just wanted to show you what your first steps in this world should be and with what mindset to approach them.

If you've made it this far, I'm so happy not to have bored you talking about DG (which, confidentially, is a very easy mistake to make), and if you've appreciated my post, here's your chance to clap your hands! 👇
