Why Data Quality is Essential for Regulation Compliance — Part II

Tom Warburton
Mesh-AI Technology & Engineering
8 min read · Jul 18, 2023

Data quality plays an increasingly important role in regulatory compliance and demands its own dedicated approach. In the first part of our mini blog series on data quality, we defined data quality and identified where it overlaps with regulatory compliance.

In our second instalment, we’ll discuss how to assess the quality of data and how to solve some of the biggest challenges presented by a highly regulated environment.

General approach when tackling regulation and data quality

Many potential methods can be used when addressing data quality for regulatory compliance, and the right choice varies depending on the industry, the regulation, where the data in question is stored and how it is processed. From experience, we have seen the following general approach used to assess data quality within a heavily regulated business.

1) Begin with a scoping project to assess data quality for regulatory compliance — which regulations apply, which teams within the business need to be involved, the resources required and so on.

2) Read through the regulatory text, interpreting it and generating requirements for the business. Highlight and derive the KPIs that need to be measured, the data points that need to be captured, the level of security required and so on.

3) Source and identify the system of record for the data that is required, highlighting where it is not currently available. Then map the KPIs to specific fields, or combinations of fields, from the source systems.

4) Identify the domain and data owners for the data and system. These are ideally the people who best understand how the data is captured and where it is used, and who can ultimately suggest changes to improve its quality, either at source or otherwise.

5) Produce data quality analysis across the identified systems and data fields — including the completeness, accuracy, consistency etc. of each field or system (see the sketch after this list).

6) Highlight key issues based on the analysis and produce a backlog for data and domain owners to fix. Then verify that the fixes have improved the results of the initial data quality analysis.
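
To make step 5 concrete, below is a minimal sketch of how a field's completeness and consistency could be profiled with pandas. The column names, reference values and source DataFrame are hypothetical, purely for illustration.

    import pandas as pd

    def profile_field(df: pd.DataFrame, field: str, allowed_values=None) -> dict:
        """Return simple data quality metrics for a single field."""
        total = len(df)
        metrics = {
            "field": field,
            # Completeness: share of rows where the field is populated.
            "completeness": df[field].notna().sum() / total if total else 0.0,
            "distinct_values": df[field].nunique(dropna=True),
        }
        if allowed_values is not None:
            # Consistency: share of rows matching an agreed reference list.
            metrics["consistency"] = df[field].isin(allowed_values).sum() / total if total else 0.0
        return metrics

    # Hypothetical extract from a system of record.
    trades = pd.DataFrame({
        "customer_id": ["C1", "C2", None, "C4"],
        "country_code": ["GB", "GB", "XX", "FR"],
    })

    report = [
        profile_field(trades, "customer_id"),
        profile_field(trades, "country_code", allowed_values={"GB", "FR", "DE"}),
    ]
    print(pd.DataFrame(report))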

Best Practice When Fixing Data Quality Issues

Quality issues can be fixed at multiple stages of a data product’s lifecycle. It is vital that businesses have visibility and documentation of these fixes, and a solid understanding of where the data for a data product is sourced — i.e. clear data lineage. Businesses without clear data lineage for their data products are unlikely to know about prior data quality fixes and transformations applied to the data being consumed. They therefore run the risk of inconsistent data and conflicting outcomes, which, in turn, leads to a lack of confidence in the data or poor adoption of the product.

Ideally, fundamental data quality issues, such as missing or inaccurate data, are fixed as close to source as possible. This ensures the data consumed by different data products is consistent and that the products don’t require further downstream treatment or transformation as part of their lineage. In reality, however, some data quality fixes and transformations do have to be applied downstream of the source. This is acceptable, but it is best practice to capture and document these fixes and transformations within the data product’s lineage so that any future consumer of the data, whether for a different data product or a change to the current one, has full understanding and visibility of the changes.
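
As a rough illustration of capturing downstream fixes in a data product's lineage, the sketch below logs each transformation with its source, rationale and owner. The record structure and names are assumptions; in practice this metadata would typically live in a catalogue or lineage tool.

    from dataclasses import dataclass, field, asdict
    from datetime import datetime, timezone

    @dataclass
    class LineageStep:
        source: str          # upstream system or dataset the data came from
        transformation: str  # what was changed, and why
        applied_by: str      # data or domain owner responsible for the fix
        applied_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    lineage: list[LineageStep] = []

    # Record a downstream fix so any future consumer can see it in the lineage.
    lineage.append(LineageStep(
        source="crm.customer_master",  # hypothetical source system
        transformation="Backfilled missing country_code from billing address",
        applied_by="customer-domain-owner",
    ))

    for step in lineage:
        print(asdict(step))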

Top 5 challenges faced and possible solutions

Below are the top challenges we have encountered when helping regulated businesses approach data quality, particularly when using the above steps. To aid the process of improving data quality, we have suggested a possible solution for each challenge.

1) Challenge: sourcing the correct system of record and identifying the specific data field required by the regulatory body. Businesses often have databases storing huge quantities of data, frequently with multiple data fields representing the same thing, which makes navigating all the systems and fields difficult.

Solution: Introduce a data catalogue to help users navigate the data environment and describe the systems and their associated tables and fields, using metadata to tag data fields appropriately. For example, when a field contains PII, ensure it is tagged as PII; when a field is required by a regulatory body, ensure it is tagged with the name of that body. This makes identifying these fields far more efficient in the future. It is crucial that naming conventions, hierarchies, formats etc. are consistent across all metadata, otherwise this approach may fail.
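
As a rough sketch of the tagging idea, the snippet below attaches tags to catalogued fields and filters on them. The systems, fields and tags are hypothetical; a real data catalogue tool would manage this metadata centrally.

    from dataclasses import dataclass, field

    @dataclass
    class CatalogueField:
        system: str
        table: str
        name: str
        description: str
        tags: set[str] = field(default_factory=set)

    catalogue = [
        CatalogueField("crm", "customer", "email", "Customer email address", {"PII"}),
        CatalogueField("crm", "customer", "country_code", "ISO country code", {"GDPR"}),
        CatalogueField("billing", "invoice", "amount", "Invoice amount in GBP"),
    ]

    def fields_with_tag(tag: str) -> list[CatalogueField]:
        """Return all catalogued fields carrying the given tag."""
        return [f for f in catalogue if tag in f.tags]

    for f in fields_with_tag("PII"):
        print(f.system, f.table, f.name)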

2) Challenge: ensuring that the analysis is action-driven, rather than just a high-level view of data quality with no direction for improvement.

Solution: For this, it is helpful to build a data quarantine — a method of isolating erroneous data records from a production pipeline so that they can be treated. The main aim is to pinpoint the issues within the data so they can be remediated quickly and efficiently. It ensures that data or domain owners know exactly what is incorrect and can see the specific records or fields that need fixing. It can also be used to block erroneous records from being passed through to further applications, systems or other data use cases, preventing any damage. The process would be as follows (a brief sketch follows the list):

  1. analysing data within a system or database where the data is ingested from source systems
  2. building a set of logical rules within the database to identify erroneous data records
  3. removing or flagging the erroneous data from the production pipeline, so that it isn’t pushed to other systems or reporting
  4. building an alerting system to highlight issues to data owners for them to add to their backlog and fix at source
  5. logging the changes that were made within an auditing system to track data fixes
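
A minimal sketch of steps 2 and 3, assuming a pandas-based pipeline; the rule set and column names are invented for illustration.

    import pandas as pd

    # Hypothetical records ingested from a source system.
    records = pd.DataFrame({
        "trade_id": [1, 2, 3, 4],
        "notional": [100.0, -50.0, 250.0, None],
        "currency": ["GBP", "GBP", "???", "EUR"],
    })

    # Step 2: logical rules, each marking the records that break it.
    rules = {
        "notional_missing": records["notional"].isna(),
        "notional_negative": records["notional"] < 0,
        "currency_invalid": ~records["currency"].isin({"GBP", "EUR", "USD"}),
    }

    failed = pd.concat(rules, axis=1)
    is_erroneous = failed.any(axis=1)

    # Step 3: split erroneous records out of the production pipeline.
    quarantine = records[is_erroneous].join(failed[is_erroneous])  # held back for data owners to fix
    clean = records[~is_erroneous]                                 # safe to push downstream

    print(quarantine)
    print(clean)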

3) Challenge: analysing regulatory text and highlighting the KPIs and data fields that need to be captured. This is a time-consuming task and requires multiple iterations to ensure the text has been interpreted correctly.

Solution: Regulatory text is usually written in a format that is difficult to interpret and derive actions from, so it is helpful to transform it into something more digestible. Taking the time to convert the text into a tabular format, with an individual regulation as the lowest level of granularity and additional fields for the chapter, section etc., allows for much easier reading and lets the consumer attach action items to each regulation.
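
As an illustration of this tabular format, the sketch below builds a small table with one row per regulation. The chapters, clauses and action items shown are invented purely as examples.

    import pandas as pd

    regulation_table = pd.DataFrame([
        {"chapter": "3", "section": "3.2",
         "regulation": "Records must be retained for five years",
         "action": "Define retention KPI; check archive completeness"},
        {"chapter": "4", "section": "4.1",
         "regulation": "Transaction reports must include a valid identifier",
         "action": "Add identifier completeness and validity checks"},
    ])

    print(regulation_table.to_string(index=False))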

Another method is to use generative AI to read through the regulatory text and suggest the data quality KPIs and fields that the business needs to capture and monitor. Generative AI can also help produce the tabular version of the text.

Note: once the KPIs have been identified from the regulatory text, it is vitally important to gather input from business stakeholders who are SMEs in the regulation area (likely including legal teams) to validate them. This ensures that data quality analysis, KPIs and subsequent actions aren’t applied to systems and fields that are not required, and that no KPIs, systems or fields have been missed in the process.

4) Challenge: rolling up the view of data quality from a dataset or data field level to a holistic business level, i.e. assessing data quality across an entire business or domain.

Solution: One approach we have used is applying RAG (Red, Amber, Green) statuses at different levels of the business. A RAG status is frequently used to indicate the health of a project against particular criteria: green denotes a positive result, amber a neutral result that often requires further attention, and red a negative result that requires urgent action. In the case of data quality, we use a RAG status to describe the quality of data at the data field, dataset, domain and business level, based on criteria set at each level.

First, a dataset or data field (the lowest granularity) has a RAG status determined by its ruleset, defined by the dataset or domain owner alongside a data quality framework from a central data team. If any rule for the dataset is red, e.g. there is incomplete data within a required field, the dataset is red; if any rule is amber with no reds, the dataset is amber; only when all rules are green is the dataset green. A domain consists of multiple datasets or data fields, and its RAG status is determined by the statuses of those datasets in the same way: any red makes the domain red, any amber with no reds makes it amber, and the domain is green only when all its datasets are green. The same rule then applies to the business, which comprises multiple domains.

Having this RAG status allows senior stakeholders to see a company-wide view of data quality, with the ability to drill down to the areas that need action.
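
A minimal sketch of the roll-up logic, assuming a simple worst-status rule at each level; the datasets, domains and statuses are invented for illustration.

    # Worst-status roll-up: red > amber > green.
    ORDER = {"green": 0, "amber": 1, "red": 2}

    def roll_up(statuses):
        """Return the worst RAG status in a collection."""
        return max(statuses, key=lambda s: ORDER[s])

    # Rule-level results per dataset (invented examples).
    dataset_rules = {
        "customer_master": ["green", "amber", "green"],
        "trade_log": ["green", "green", "green"],
        "pricing_feed": ["red", "green", "amber"],
    }
    dataset_status = {name: roll_up(rules) for name, rules in dataset_rules.items()}

    # Datasets roll up to domains, and domains roll up to the business.
    domains = {"customers": ["customer_master"], "markets": ["trade_log", "pricing_feed"]}
    domain_status = {d: roll_up(dataset_status[ds] for ds in datasets) for d, datasets in domains.items()}
    business_status = roll_up(domain_status.values())

    print(dataset_status)   # {'customer_master': 'amber', 'trade_log': 'green', 'pricing_feed': 'red'}
    print(domain_status)    # {'customers': 'amber', 'markets': 'red'}
    print(business_status)  # red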

5) Challenge: avoiding siloed, one-off projects that try to tackle data quality across all aspects of the business’s data in a single large attempt.

Solution: Tackle data quality with a product approach, focusing only on the data required to meet the needs of the consumer, which in this case means only the data required by regulatory bodies. This focuses effort and improves the efficiency of data quality tasks. When assessing the data quality of a system or dataset, think only about the level of quality required for the output to be usable and successful; don’t attempt to calculate every aspect of data quality, and focus only on the attributes related to the desired outcome. Lastly, this needs to be an iterative process, not one singular project. Making domain or data owners responsible for the data quality of their systems and datasets, with consistent monitoring and checking of their outputs as part of their remit, will lead to a more consistently high-quality output.

Conclusion

It is clear that data quality is a significant component of regulatory compliance, yet it is often not tackled in a systematic and scalable manner, and consequently businesses struggle to become compliant. Improving data quality through continuous effort from both data teams and domain owners (possibly adopting some of the above solutions) will reduce the time pressure and resources needed to demonstrate compliance, and will ensure that the data products consuming the data receive a consistent stream of data of suitable quality.
