Artificial intelligence and data governance
The adoption of artificial intelligence is spreading rapidly across many businesses. This technology is driving constant improvements in decision-making processes and overall performance across a wide variety of industries. It also helps organizations better understand customer needs, improve service quality, and predict and prevent risks.
The implementation of a proper data governance framework is essential to enable organizations to fully unlock the potential of their data. This post explains what data governance is and why it’s relevant to artificial intelligence.
Data governance is the set of procedures designed to manage data properly. Appropriate policies must guarantee the availability, usability, integrity, and security of enterprise data. In machine learning, data governance procedures ensure that all interested stakeholders across the enterprise always have access to high-quality data.
Scales of data quality
In machine learning, just as in computer science, the saying “garbage in, garbage out” holds true. This means that even the most advanced machine learning model will perform poorly when fed with low-quality data. So how can data quality be assessed before the data is actually used? A data quality assessment process starts by defining a list of data dimensions. Data dimensions are features of the original data that can be measured against pre-defined standards. Some of the most common data dimensions are:
* Accuracy. It measures how reliable a dataset is by comparing it against a known, trustworthy reference dataset. It refers to a single data field and usually relates to the number of outliers caused by database failures, sensor malfunctions, wrong data collection strategies, and so on.
* Timeliness. It is the time delay from data generation and acquisition to utilization. Data that is used long after it was collected may be obsolete or may no longer reflect the physical phenomenon it describes.
* Completeness. It refers to the percentage of available data or, equivalently, the absence of missing values.
* Consistency. Data is consistent when the same data located in different storage areas can be considered equivalent (where "equivalent" can mean anything from a perfect match to semantic similarity).
* Integrity. High-integrity data conforms to the syntax (format, type, range) of its definition, as provided by, for example, a data model.
For a more detailed and comprehensive discussion about data dimensions, refer to  and .
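Some of the dimensions listed above can be turned into simple numeric scores. The following is a minimal sketch, not a definitive implementation: the records, the field names (`email`, `age`, `collected_at`), the email pattern, the 0 to 120 age range and the 30-day freshness window are all assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta

# Hypothetical records; field names and values are illustrative assumptions.
RECORDS = [
    {"email": "ann@example.com", "age": 34, "collected_at": datetime(2019, 11, 1)},
    {"email": None, "age": 29, "collected_at": datetime(2019, 11, 10)},
    {"email": "not-an-email", "age": 210, "collected_at": datetime(2019, 6, 1)},
    {"email": "bob@example.com", "age": 51, "collected_at": datetime(2019, 11, 12)},
]

# Deliberately loose pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(records, field):
    """Fraction of records in which the given field is not missing."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def integrity(records):
    """Fraction of records whose present values respect format and range."""
    def is_valid(r):
        email_ok = r["email"] is None or EMAIL_PATTERN.match(r["email"]) is not None
        age_ok = r["age"] is None or (isinstance(r["age"], int) and 0 <= r["age"] <= 120)
        return email_ok and age_ok
    return sum(1 for r in records if is_valid(r)) / len(records)


def timeliness(records, now, max_age):
    """Fraction of records collected no longer than max_age before now."""
    fresh = sum(1 for r in records if now - r["collected_at"] <= max_age)
    return fresh / len(records)


now = datetime(2019, 11, 13)
scores = {
    "completeness": completeness(RECORDS, "email"),
    "integrity": integrity(RECORDS),
    "timeliness": timeliness(RECORDS, now, timedelta(days=30)),
}
print(scores)
```

Accuracy and consistency are left out on purpose: both need a second source (a trusted reference dataset or another storage area) to compare against.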
Measuring the right things
The dimensions to monitor vary depending on business requirements, processes, users, and so on. For example, in social media data, timeliness and accuracy are probably the most important quality features. However, since social media data are usually unstructured, consistency and integrity might not be suitable for evaluation. For biological data, on the other hand, data storage software and data formats are very heterogeneous, so consistency might not be the most appropriate quality dimension.
Once the data and the dimensions to be monitored have been selected, it is important to define a baseline of values or ranges representing good and bad quality data, that is, the quality rules the data needs to be assessed against. Moreover, each dimension will have a different weighting, which determines how much it contributes to data quality as a whole. How rigorous the rules need to be, and how to choose the weighting for each data dimension, depends largely on the importance each organization places on the monitoring phase.
For instance, most would agree that incorrect or missing email addresses have a significant impact on marketing campaigns. In this case, one would set a very low threshold on the tolerated number of missing records and a high weighting on completeness and accuracy. The low threshold keeps the number of missing emails to a minimum, while the high weighting makes the availability and reliability of the existing records count heavily towards the overall quality score.
The same applies to inaccurate personal details that may lead to missed sales opportunities or to an increase in complaints from customers.
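The combination of quality rules and weightings described above can be sketched as a weighted average plus a per-dimension check. This is only an illustration under stated assumptions: the function names, scores, weights and thresholds are hypothetical, and a low tolerance for missing records is expressed here as a high minimum completeness score.

```python
def overall_quality(scores, weights):
    """Weighted average of per-dimension scores (each between 0 and 1)."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * weights[dim] for dim in weights) / total_weight


def failed_rules(scores, thresholds):
    """Names of the dimensions whose score is below the required minimum."""
    return sorted(dim for dim, minimum in thresholds.items() if scores[dim] < minimum)


# Hypothetical numbers for the email marketing example.
scores = {"completeness": 0.97, "accuracy": 0.97, "timeliness": 0.80}
weights = {"completeness": 0.4, "accuracy": 0.4, "timeliness": 0.2}
thresholds = {"completeness": 0.99, "accuracy": 0.95, "timeliness": 0.70}

print(round(overall_quality(scores, weights), 3))
print(failed_rules(scores, thresholds))
```

Keeping the two checks separate means a dataset with a good overall score can still be rejected because one critical dimension, here completeness, misses its own rule.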
Keeping data quality high
Once this preparation phase has been completed, the process enters the data acquisition phase, followed by the data monitoring stage. The latter consists mainly of a data quality assessment process and an issue resolution process.
The data quality assessment process can either produce a data quality report at regular intervals or continuously record the quality scores. Such scores, stored in a database, help in tracking data quality over time.
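As a rough sketch of what storing quality scores in a database could look like, the example below uses Python's built-in `sqlite3` module. The table name `quality_scores`, its columns, the dataset name and the sample values are assumptions made for illustration, not a prescribed schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the score store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quality_scores ("
        "measured_at TEXT NOT NULL, dataset TEXT NOT NULL, "
        "dimension TEXT NOT NULL, score REAL NOT NULL)"
    )
    return conn


def record_scores(conn, measured_at, dataset, scores):
    """Append one row per dimension for a single assessment run."""
    rows = [(measured_at, dataset, dim, value) for dim, value in scores.items()]
    conn.executemany("INSERT INTO quality_scores VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def score_history(conn, dataset, dimension):
    """Return (measured_at, score) pairs in chronological order."""
    cursor = conn.execute(
        "SELECT measured_at, score FROM quality_scores "
        "WHERE dataset = ? AND dimension = ? ORDER BY measured_at",
        (dataset, dimension),
    )
    return cursor.fetchall()


conn = open_store()
record_scores(conn, "2019-11-01", "customers", {"completeness": 0.91, "accuracy": 0.88})
record_scores(conn, "2019-11-08", "customers", {"completeness": 0.95, "accuracy": 0.90})
history = score_history(conn, "customers", "completeness")
print(history)
```

Dates are stored as ISO 8601 strings so that sorting them as text also sorts them chronologically.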
Monitoring and solving issues
The issue resolution process enables either people or automated software tools to flag issues and to investigate and resolve them systematically. Of course, the more informative the issue logs are, the more efficient the resolution of data quality problems will be. As suggested in , certain information should always be included in all logs.
Each issue should have a unique identifier. Sequential numbers are good identifiers because they also show how many issues have been identified so far. Useful statistical information includes the category each issue belongs to and the dates on which a problem was opened and fixed. The dates make it possible to compute the average issue resolution time and compare it with a target. Logging the person who raised an issue makes it easier to report progress and agree on action plans.
Logs must also include the data owners, who are responsible for investigating and fixing issues related to the data they own. An informative log helps estimate the impact of a problem and prioritise resolution efforts correctly.
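The log fields described above map naturally onto a small record type. The sketch below is one possible shape, assuming hypothetical field names (`category`, `raised_by`, `data_owner`, `opened_on`, `fixed_on`) and made-up sample issues; it assigns sequential identifiers and computes the average resolution time for comparison with a target.

```python
from dataclasses import dataclass, field
from datetime import date
from itertools import count
from typing import Optional

# Sequential identifiers: the latest one also tells how many issues exist.
_issue_ids = count(1)


@dataclass
class Issue:
    category: str                     # used to group issues for statistics
    raised_by: str                    # person who flagged the issue
    data_owner: str                   # responsible for investigating and fixing
    opened_on: date
    fixed_on: Optional[date] = None   # None while the issue is still open
    issue_id: int = field(default_factory=lambda: next(_issue_ids))


def average_resolution_days(issues):
    """Mean number of days between opening and fixing, over fixed issues only."""
    durations = [(i.fixed_on - i.opened_on).days for i in issues if i.fixed_on]
    return sum(durations) / len(durations) if durations else None


# Hypothetical log entries.
issues = [
    Issue("missing values", "alice", "crm-team", date(2019, 11, 1), date(2019, 11, 5)),
    Issue("wrong format", "bob", "crm-team", date(2019, 11, 2), date(2019, 11, 4)),
    Issue("missing values", "alice", "sales-team", date(2019, 11, 6)),
]

TARGET_DAYS = 5  # assumed target resolution time
average = average_resolution_days(issues)
print(average, average <= TARGET_DAYS)
```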
Data is appropriate for consumption by third parties and for building machine learning models only after it has passed quality control without significant problems.
We dedicate a large part of our time to assessing data quality. We adopt a continuous approach: from the early stages of data collection, cleaning and transformation up to data integration and model design, we monitor data closely. This 'whole pipeline' approach speeds up the development and debugging of our models and helps keep their performance high. Data scientists who follow this strategy know where to look when a model is not performing as expected.
Of course, the real value of data lies in the support it gives to the decision processes of an organisation. Any enterprise aiming to adopt artificial intelligence for their processes should implement a data governance framework to ensure the quality of their data.
Dealing with data governance is essential for your organisation, because good data leads to better decisions.
Originally published at https://amethix.com on November 13, 2019.