Data Engineering concepts: Part 3, Data Quality and Governance
This is Part 3 of my 10-part series on Data Engineering concepts. In this part, we will discuss Data Quality and Governance.
Contents:
1. Data Quality
2. How to implement Data Quality
3. Data Governance
4. How to implement Data Governance
Here is the link to my previous part on Data Warehousing:
What is Data Quality?
Data Quality is the assurance that data is accurate, complete, fresh, reliable and applicable to the business requirements an organization works on. Investing in data quality saves a lot of time and effort and prevents erroneous results downstream.
Common kinds of data quality checks include:
a. Null checks (e.g. an optional item is null but you need it for analysis)
b. Volume checks (e.g. the number of rows is much higher than expected)
c. Data type checks (e.g. headerless files might not have the columns in the order you're expecting)
d. Range checks (e.g. a transaction value is higher than you're expecting)
e. Category checks (e.g. you might have a non-existent State abbreviation)
f. Freshness checks (e.g. a transaction that happened a few minutes ago is not registered)
g. Uniqueness checks (e.g. ensuring that there are no duplicate rows)
h. Referential integrity checks (e.g. ensuring that a foreign key matches a primary key in another table)
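To make the checks above concrete, here is a minimal sketch in plain Python. The row layout (`id`, `amount`, `state`, `customer_id`, `created_at`), the reference sets and the thresholds are all assumptions made up for illustration; a real pipeline would take them from its own schema and requirements.

```python
from datetime import datetime, timedelta, timezone

# Assumed reference data (hypothetical values for illustration only).
VALID_STATES = {"CA", "NY", "TX"}
KNOWN_CUSTOMER_IDS = {1, 2, 3}  # primary keys of an assumed customers table


def run_checks(rows, now, expected_max_rows=1000, max_amount=10_000,
               max_age=timedelta(minutes=5)):
    """Return a dict mapping check name -> True (passed) or False (failed)."""
    ids = [r["id"] for r in rows]
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    newest = max((r["created_at"] for r in rows), default=None)
    return {
        # Null check: a field needed for analysis must not be missing.
        "null": all(r["amount"] is not None for r in rows),
        # Volume check: the row count should stay within the expected bound.
        "volume": 0 < len(rows) <= expected_max_rows,
        # Data type check: non-null amounts must be numeric.
        "data_type": all(isinstance(a, (int, float)) for a in amounts),
        # Range check: numeric amounts must fall in the expected range.
        "range": all(0 <= a <= max_amount for a in amounts
                     if isinstance(a, (int, float))),
        # Category check: state must be a known abbreviation.
        "category": all(r["state"] in VALID_STATES for r in rows),
        # Freshness check: the newest row must be recent enough.
        "freshness": newest is not None and now - newest <= max_age,
        # Uniqueness check: no duplicate ids.
        "uniqueness": len(ids) == len(set(ids)),
        # Referential integrity: every foreign key must exist in the parent table.
        "referential_integrity": all(
            r["customer_id"] in KNOWN_CUSTOMER_IDS for r in rows
        ),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "amount": 50.0, "state": "CA", "customer_id": 1,
     "created_at": now - timedelta(minutes=1)},
    {"id": 1, "amount": None, "state": "ZZ", "customer_id": 99,
     "created_at": now - timedelta(hours=1)},
]
failed = [name for name, ok in run_checks(rows, now).items() if not ok]
print(failed)
```

Each check is a single boolean expression here to keep the idea visible; in practice you would likely also want to report which rows failed.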
How to implement data quality?
- Notification system
You can design a system that sends notifications, such as Slack messages, whenever a data quality check is violated.
- Data Quality dashboard
You can build a dashboard that shows data quality check results, such as volume tests and range checks.
- Data Quality operators
You can have prebuilt operators that are automatically applied to the data processing pipelines to ensure everything is in place.
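As a rough sketch of how an operator and a notification system could fit together, the decorator below wraps a pipeline step, runs a set of checks on its output and calls a notifier when any check fails. The names `with_quality_checks`, `CHECKS` and `load_transactions` are hypothetical, and the notifier is just a list's `append` standing in for a real channel such as a Slack webhook.

```python
from functools import wraps


def with_quality_checks(checks, notify):
    """Wrap a pipeline step so its output is checked automatically.

    checks: dict of check name -> function(rows) returning True on success.
    notify: callable taking a message string (e.g. a Slack webhook sender).
    """
    def decorator(step):
        @wraps(step)
        def wrapper(*args, **kwargs):
            rows = step(*args, **kwargs)
            failed = [name for name, check in checks.items() if not check(rows)]
            if failed:
                notify(f"{step.__name__}: failed checks: {', '.join(failed)}")
            return rows
        return wrapper
    return decorator


sent_messages = []  # stand-in for a real notification channel

CHECKS = {
    "volume": lambda rows: len(rows) > 0,
    "no_null_amount": lambda rows: all(r.get("amount") is not None for r in rows),
}


@with_quality_checks(CHECKS, notify=sent_messages.append)
def load_transactions():
    # Hypothetical pipeline step that returns one row with a missing amount.
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": None}]


load_transactions()
print(sent_messages)
```

The same list of failed checks could also be written to a table that feeds a data quality dashboard. Whether a failed check should only notify or also stop the pipeline is a design choice; this sketch only notifies.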
What is Data Governance?
Data Governance is the process of defining control, access and standardization policies over data so that it stays secure and effective over time. Data stewards are assigned responsibility for data quality and for enforcing these policies, and data lineage is used to track data from source to destination throughout its lifecycle.
The data governance framework has 3 main components:
1. Policies
Some policies come from government regulations, such as GDPR, and must be implemented to comply with the law. You can also have organization-level policies for general compliance.
2. Rules
Data protection: sensitive data, such as SSNs, has to be protected
Governance: access should be authorized according to time and content
3. Classification
Business classes: e.g. utilization rate is measured differently in different business domains
Data classes: assigned according to metadata
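The rules and classification components can be sketched together in a few lines of Python. Everything here is an assumption for illustration: the class labels (`public`, `internal`, `restricted`), the roles, the 08:00 to 18:00 access window and the masking format are not prescribed by any standard or by this series.

```python
from datetime import time

# Assumed classification of columns by sensitivity (hypothetical labels).
COLUMN_CLASSES = {"name": "internal", "ssn": "restricted", "state": "public"}

# Assumed policy: which data classes each role may read, and during which hours.
ROLE_ALLOWED_CLASSES = {
    "analyst": {"public", "internal"},
    "compliance": {"public", "internal", "restricted"},
}
ACCESS_WINDOW = (time(8, 0), time(18, 0))


def mask_ssn(ssn):
    """Keep only the last four digits of an SSN such as '123-45-6789'."""
    return "***-**-" + ssn[-4:]


def read_record(record, role, at):
    """Return the view of `record` that `role` may see at clock time `at`."""
    start, end = ACCESS_WINDOW
    if not (start <= at <= end):
        raise PermissionError("access outside the permitted time window")
    allowed = ROLE_ALLOWED_CLASSES.get(role, set())
    view = {}
    for column, value in record.items():
        # Unclassified columns are treated as restricted by default.
        column_class = COLUMN_CLASSES.get(column, "restricted")
        if column_class in allowed:
            view[column] = value
        elif column == "ssn":
            view[column] = mask_ssn(value)  # protected, but still usable
        # Any other disallowed column is left out of the view.
    return view


record = {"name": "Ada", "ssn": "123-45-6789", "state": "CA"}
print(read_record(record, "analyst", time(9, 0)))
```

Treating unclassified columns as restricted is a deliberately cautious default: a new column stays hidden until someone classifies it.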
How to implement Data Governance?
To implement a data governance framework, you need a people-centric approach, because people should be able to take responsibility for the data they manage. The framework should also be refined iteratively, based on how well the applied strategies are working.
Data governance can be implemented through 3 different models: Centralised, Decentralised or Federated/Hybrid.
There are a few steps that need to be taken in order to implement a data governance strategy:
1. Identify and prioritize the existing data
Classify the data and create metadata and data catalogs for existing data.
2. Prepare and transform metadata
Create templates for data dictionaries and keep a cleaned and transformed form of the data across the departments of the organization.
3. Choose and build a governance model
Choose a suitable model from those described above and start implementing how data will be stored, maintained and disposed of.
4. Establish a process for distribution of policies
Provide proper onboarding and make sure all teams are on the same page about policies such as GDPR, usage guidelines and restrictions.
5. Identify potential risks
Keep data handling up to date with upcoming security policies so that data is stored securely, and give only limited access to data.
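As an illustration of the first two steps, a data dictionary template and a data catalog can start out as simple structured records. The sketch below uses Python dataclasses; the field names (`owner`, `source`, `retention_days`, `classification`) are assumptions about what such a template might hold, not a standard schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ColumnEntry:
    name: str
    data_type: str
    description: str
    classification: str = "internal"  # hypothetical default label


@dataclass
class DatasetEntry:
    name: str
    owner: str            # data steward responsible for this dataset
    source: str           # upstream system, useful for lineage
    retention_days: int   # how long to keep the data before disposal
    columns: list = field(default_factory=list)


def missing_descriptions(entry):
    """Return the names of columns that still lack a description."""
    return [c.name for c in entry.columns if not c.description.strip()]


transactions = DatasetEntry(
    name="transactions",
    owner="payments-team",
    source="orders-db",
    retention_days=365,
    columns=[
        ColumnEntry("id", "int", "Unique transaction identifier"),
        ColumnEntry("ssn", "str", "Customer SSN", classification="restricted"),
        ColumnEntry("state", "str", ""),
    ],
)

# The catalog is a plain dict keyed by dataset name.
catalog = {transactions.name: asdict(transactions)}
print(missing_descriptions(transactions))
```

A helper like `missing_descriptions` gives a simple way to find gaps in the dictionary before the catalog is shared across departments.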