Introduction to Data Lineage, Data Governance and Data Dictionary Use Cases and Application
As organizations grow, one problem is that there is too much data. If there is no systematic way to manage data, when you want some kind of data, you will not find it because the database is too large. What’s more, the same data (for example, a customer’s name) may appear in different databases with different contents. For example, the customer name stored in database A is Tommy, and the name of database B is Tammy. How the program handles this conflict will be a big issue. The following will discuss the methods of managing data between enterprises (data governance). It will also discuss one of the important concepts of data management — data lineage, and the three lineages subdivided therein: vertical lineage, horizontal lineage and business lineage, and then introduce the use cases and uses of data dictionary in the enterprise.
Whats' data governance?
Simply put, data management is the management of the life cycle of data, and managers need to ensure that the data remains high quality (i.e. accurate and complete) throughout the life cycle.
Take the account opening of a customer as an example:
Step 1. Data Generation: Managers need to ensure that when customers enter data, the data is not missing and accurate. The management method could be to require the customer to enter it twice, to make certain items (e.g. ID number) mandatory.
Step 2. Data Maintenance: Managers need to ensure that customer information is kept up-to-date and accurate. For example, when a customer needs to inquire about transaction records, address and telephone verification may be required. If it is found that the customer has updated personal information at this time, the administrator needs to ensure that the data in the database can be updated to keep the customer information accurate.
Step 3. Data use: When the data needs to be used, the administrator must ensure that there are sufficient access rights restrictions. For example, a specific approval process is required to read data, and a stricter approval process to modify data.
Step 4. Data Archiving and Purging: Administrators need to ensure that data is retained for a certain period of time. For example, if a customer requests to delete an account, the customer’s personal data should be cleared after a certain period of time to free up database space and ensure that the data is kept updated.
For a user, they are often a data user (p.s. of course, a large part of them are data inputrs, but the input program usually has format restrictions, and there is no major problem). Many applications are producing reports, analyzing data trends and reporting performance. When they get a bunch of data, they usually ask “what is the source of this data”, “how is this data calculated”, and sometimes “I don’t trust the data in this report”. To solve these problems, is to clarify the application and database data lineage.
What is data lineage?
The relationship of data to databases and applications is called data lineage.
After the data lineage is organized, the user can know which system the data came from, which calculations and modification of a certain data will affect those processes, which is of great help to the data governance. Managers will know which data can be deleted, which data has duplication, and solve the problem of data inconsistency. And data lineage includes three forms that will be discussed below: vertical lineage, horizontal lineage, and business lineage.
When a user wants to use data, it is often unclear what data is available.
Taking the analysis of customer data as an example, users may not know how an IT system stores a customer’s data. For example, does the IT system store the customer’s income? Does it store the customer’s name and gender? To answer these questions, you only need to understand the structure of the data. (The following figure is expressed in XML Tree — showing the Customer’s data, including Account Name and Name (First Name + Last Name))
And this relationship from Biz Term to IT Term is exactly what vertical lineage wants to get. And the IT Term here is exactly IT metadata . Because what the user wants is not a single data, but the structure of the whole data.
In summary, vertical lineage is:
- Link Biz Terms (business friendly set of terms) to IT metadata;
- Example: Linking Business Glossary (a dictionary with different terms) to Technical Metadata (a dictionary describing the schema of the data).
For a developer, having a data structure is not enough. If you want to create an application (such as a data reporting dashboard), you must know the exact source of the data, including the database or the columns of the application:
And the relationship between these data fields and source database is exactly what horizontal lineage wants. (p.s. Because the information is very technical, horizontal lineage is also called technical lineage).
In summary, horizontal lineage is:
- Connect Data Fields to Source DB/ App Column;
- Examples include the Schema that connects the Reporting Dashboard’s Data Field to the Database.
Business lineage is an application of horizontal lineage and vertical lineage, and its focus is on the reporting dashboard. For example, a user wants to know whether user habits have changed in the third quarter. If the business lineage is organized, the user can see that customer is related to sales per customer and ages in reporting dashboard, and can use the data field of reporting dashboard for different analysis. For example, analyze the relationship between sales and ages, and then use the results to formulate promotional strategies.
In summary, business lineage is:
- Link Biz Terms (Business Friendly Set of Terms) to Reporting Fields;
- Examples include reporting fields, which link the Business Glossary (a dictionary containing different terms) to BI Dashboard.
Data Dictionary Application and Use Case
Data dictionary is a platform that stores these data lineages, and presents vertical lineage, horizontal lineage and business lineage to facilitate data governance.
The following figure is an example. The data dictionary records the process from data entry to data storage, and then to data processing, data analysis, and reporting. When the data flow becomes clear, you will find that some processes may be repeated. For example, there are two reporting dashboards in the figure below to report the same set of data. If the data processing system is removed, which data analysis and reporting system will be affected.
The presentation of these data flows can greatly help managers to remove and optimize data management processes, making data management more efficient and ensuring data consistency.
Thank you for reading my ariticle and I hope you’ve enjoyed it. If you want to know more about data lineage, data governance and data dictionary, you can visit Gudu SQLFlow for more information. As a data lineage analysis tool, Gudu SQLFlow, can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display.