Faster & Stronger Data Governance

Data Governance: Harder, Better, Faster, Stronger — Part IV

Vincent Rejany
8 min read · Sep 16, 2019

How can data governance be executed faster and stronger when there is so much work to do? One key element is a set of data services built on the technology platform to automate data governance processes and enforce them throughout the enterprise. The first rule of automation is, of course, to build once and reuse often: data preparation jobs, ETL processes, and data quality controls should always be designed generically so they can be reused.

However, there should also be a set of built-in capabilities that facilitate the identification of data and the completion of classic data management tasks. A lot of knowledge is already there, waiting to be consolidated and used.

“Automation” encompasses all the disciplines that generate insights on top of metadata, content, user activities and feedback, and data governance policies. It is not restricted to artificial intelligence; it can also rely on classic integration and analytics, reporting, or even deterministic, rule-based (if-then) approaches. “Automation” can also make the user experience more personal and fluid, as it streamlines actions that users perform often, want to perform, or have never thought of.

Let’s look at the types of information that could be analyzed and processed to support “automation”:

· Tables, text files, documents, and other data sets, with their respective metadata

· Metrics generated over data sets through data profiling and data discovery activities

· Reference data and controlled vocabularies such as lists, hierarchies, and lookup tables

· Users’ actions and behaviors captured in application logs

· Users’ feedback, comments, and ratings

· Internal or external policies

This list is not exhaustive, and other sets of information could also be used. There are multiple use cases for combining and cross-analyzing these different sources to generate insight about which data management or governance actions to execute. These use cases can be summarized into four categories: data discovery; suggestion and recommendation; anomaly detection; and development and administration.

Automated Data Governance Use Cases

Data Discovery

Data Discovery covers a wide number of features. It aims at revealing either what is obvious from a human perspective but not documented, or what would be cumbersome to find out. For example, as a user I know that a data set contains an “Email” column because I can easily recognize the column label and the pattern. However, if I had to review a whole data lake to assess where emails can be found, and this information is neither in my data catalog nor documented in my glossary, then it becomes a heavy and error-prone exercise. It would be far more efficient if the machine could do it automatically and analyze all the data sources with consistent logic.

SAS® Viya® Data Explorer — Data Profiling — Identification Analysis

SAS® Viya® Data Explorer already supports such functionality when profiling in-memory CAS tables. It relies on an identification analysis definition named “Field Content”, available in the SAS® Quality Knowledge Base. Depending on the country and the language, different types of variables can be identified, such as: Individual, Organization, Delivery Address, Payment Card Number, Email, Postal Code, City, State/Province, Phone, Social Security Number. This definition can be enhanced to support additional variables, which is a helpful exercise when running compliance programs for data privacy regulations such as the EU GDPR. Extensions of these identification capabilities are currently under investigation through the use of machine learning techniques, especially for extending the types covered and allowing users to build their own identification models. The identification of the languages being used, of sensitive data (such as data related to racial or ethnic origin, religious or philosophical beliefs, political opinions, trade union membership, health, biometric or genetic information, or sexual life or preferences), or of medical records could benefit from this approach.
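To make the idea concrete, here is a minimal, illustrative sketch of pattern-based identification analysis in Python. It is not how the SAS® Quality Knowledge Base works internally; the regular expressions, the threshold, and the column names are simplifying assumptions.

```python
import re
import pandas as pd

# Illustrative patterns only; a real identification analysis definition
# such as "Field Content" combines far richer, locale-aware logic.
PATTERNS = {
    "Email": re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$"),
    "Phone": re.compile(r"^\+?[0-9 ().-]{7,20}$"),
    "Postal Code": re.compile(r"^\d{5}(-\d{4})?$"),  # US-style ZIP, an assumption
}

def identify_column(series: pd.Series, threshold: float = 0.8) -> str:
    """Return the dominant semantic type of a column, or 'Unknown'."""
    values = series.dropna().astype(str)
    if values.empty:
        return "Unknown"
    for label, pattern in PATTERNS.items():
        hit_rate = values.apply(lambda v: bool(pattern.match(v))).mean()
        if hit_rate >= threshold:
            return label
    return "Unknown"

def identify_table(df: pd.DataFrame) -> dict:
    """Run identification analysis over every column of a data set."""
    return {col: identify_column(df[col]) for col in df.columns}

# Example usage
df = pd.DataFrame({
    "contact": ["anna@example.com", "bob@example.org"],
    "zip": ["75008", "10115"],
})
print(identify_table(df))   # {'contact': 'Email', 'zip': 'Postal Code'}
```

In practice, identification definitions also rely on vocabularies, parsing, and country- or language-specific rules, which is why a curated knowledge base goes much further than raw patterns.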

The classification of data sets is an interesting area too. From the identification of the columns, it could be possible to deduce and assign a domain name to data sets, e.g. “Contact Records” or “Invoice Data”. Such tagging at the data set level facilitates search, access, and security across the data ecosystem.

Identification and classification capabilities are key features because they are a mandatory step for data cataloging and for enforcing data privacy principles. The automation of data masking processes such as pseudonymization, or the securing of sensitive data sets, depends on the ability to identify content.
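As an illustration of how identification can drive masking, the sketch below pseudonymizes columns previously flagged as direct identifiers by applying a keyed hash. The key handling, the set of sensitive types, and the reuse of the hypothetical identify_table helper from the sketch above are assumptions, not the SAS implementation.

```python
import hashlib
import hmac

# Assumption: the key is managed outside the data platform (e.g. a vault).
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_identified_columns(df, column_types, sensitive=("Email", "Phone")):
    """Apply pseudonymization to every column identified as sensitive."""
    out = df.copy()
    for col, ctype in column_types.items():
        if ctype in sensitive:
            out[col] = out[col].astype(str).map(pseudonymize)
    return out

# Example usage, reusing the identification sketch above:
# masked = mask_identified_columns(df, identify_table(df))
```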

From an analytical perspective, there is also a wide range of use cases for determining whether the information quality of a data set is suitable for analysis. To support accurate use and analysis of data, we need to ensure that all the data needed for analysis is complete or, if not, to propose the calculation of imputation values for missing values. The computation of analytical metrics such as skewness and kurtosis for interval variables, combined with the completeness rate, helps in assessing whether certain variables should be excluded. The extension of data profiling activities with such analysis is a foundation for generating insights (a minimal sketch follows the list below), such as:

· Assessing the overall quality of one data set

· Scoring the readiness of one data set for analytics

· Clustering data sets with similar structure and data

· Identifying redundant data sets

· Applying data masking automatically
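As a minimal sketch of the profiling metrics mentioned above, the snippet below computes completeness, skewness, and kurtosis per numeric column and derives simple readiness flags. The thresholds and flag names are illustrative assumptions, not an actual scoring method.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

def profile_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Compute completeness, skewness, and kurtosis for interval variables."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        rows.append({
            "column": col,
            "completeness": 1 - df[col].isna().mean(),
            "skewness": skew(values) if len(values) > 2 else None,
            "kurtosis": kurtosis(values) if len(values) > 3 else None,
        })
    return pd.DataFrame(rows)

def analytics_readiness(profile: pd.DataFrame,
                        min_completeness: float = 0.7,   # assumed threshold
                        max_abs_skew: float = 3.0) -> pd.DataFrame:
    """Flag variables that may need imputation, transformation, or exclusion."""
    profile = profile.copy()
    profile["suggest_imputation"] = profile["completeness"] < 1.0
    profile["suggest_transform"] = profile["skewness"].abs() > max_abs_skew
    profile["suggest_exclusion"] = profile["completeness"] < min_completeness
    return profile
```

A data set level score could then simply aggregate these flags, for instance the share of variables with no flag raised.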

Through data profiling, identification of content, and the computation of advanced statistical metrics, data discovery is the cornerstone of automated data management and a prerequisite for making suggestions and recommendations.

Suggestion and recommendation

It can be quite complex to distinguish between suggestions and recommendations. A “recommendation” could be considered benevolent information, an alert or signal that is not obvious, or an outcome that could be predicted; in other words, “This data set would need to be pseudonymized”. Recommendations are usually proposed based on the analysis of past actions in order to recommend new ones. A recommendation is not always an action, and it should inspire trust and confidence. One good illustration is the recommendation engines available on most retail websites such as Amazon. According to research by McKinsey, a mind-boggling 35% of Amazon’s sales and 75% of what users watch on Netflix come from product recommendations based on such algorithms; these statistics were reported in 2013 and may be higher today. Classic use cases would typically propose combining a main data set with complementary or alternative data sets, either because it makes sense or because many other users did it too (a co-usage sketch follows the list), such as:

· Recommend “corporate” sponsored data sets or actions

· Recommend a well-rated, prepared version of the same data

· Recommend another table containing the same type of columns/records to use, substitute, or union

· Recommend a table for enriching the data with additional variables

· Include a “Did you know?” widget in the user interface
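Here is a minimal sketch of the “many other users did it too” logic, assuming a simple usage log extracted from the platform. The data set names and the plain co-occurrence counting are illustrative assumptions, not an actual recommendation engine.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Illustrative usage log: which data sets each user has opened (assumption).
usage_log = {
    "alice": {"customers", "invoices", "zip_reference"},
    "bob":   {"customers", "invoices"},
    "carol": {"customers", "zip_reference"},
}

# Count how often pairs of data sets are used together.
co_usage = defaultdict(Counter)
for datasets in usage_log.values():
    for a, b in combinations(sorted(datasets), 2):
        co_usage[a][b] += 1
        co_usage[b][a] += 1

def recommend(dataset: str, top_n: int = 3):
    """Recommend data sets most frequently used alongside the given one."""
    return [name for name, _ in co_usage[dataset].most_common(top_n)]

print(recommend("customers"))   # ['invoices', 'zip_reference']
```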

We would consider that “suggestions” aim at proposing an action, a next best action, such as “Apply standardization on ZIP code”. For example, SAS® Viya® Data Studio will soon embed suggestions of data preparation steps.

SAS® Viya® Data Studio — Suggestions

Examples of automated data preparation suggestions that would be supported within SAS® Viya® Data Studio (a rule-based sketch of how such suggestions could be derived follows the list):

· Apply SAS® Quality Knowledge Base data quality functions such as gender analysis, parsing, standardization

· Enrich data through address verification and geolocation

· Rename, remove, convert, obfuscate columns

· Impute missing values

· Fix outliers

· Transform for normality: exp, 1/x, x², ln

· Normalize values, so they are all in the same range
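The sketch below shows how such suggestions could be derived deterministically from identification and profiling results, in the rule-based (if-then) spirit mentioned earlier. The rules, thresholds, and semantic type names are assumptions and not the logic embedded in SAS® Viya® Data Studio.

```python
def suggest_prep_steps(column, semantic_type, completeness, skewness=None):
    """Rule-based (if-then) suggestions for one column, from profiling outputs."""
    suggestions = []
    if semantic_type == "Postal Code":
        suggestions.append(f"Apply standardization on {column}")
    if semantic_type in {"Email", "Phone"}:
        suggestions.append(f"Parse and validate {column}")
    if completeness < 1.0:
        suggestions.append(f"Impute missing values in {column}")
    if skewness is not None and abs(skewness) > 3:
        suggestions.append(f"Apply a log transform to {column} for normality")
    return suggestions

print(suggest_prep_steps("zip", "Postal Code", completeness=0.92))
# ['Apply standardization on zip', 'Impute missing values in zip']
```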

From the data discovery metrics, it could also be interesting to propose business and data quality controls based on the analysis of frequency distributions of values and patterns, as well as combinations of variables.

From a data governance perspective, “suggestions” could also help in identifying critical terms to create, based on the analysis of variables across reports and data sets, in prioritizing issues to remediate, or in recommending data retention periods based on the sensitivity of the data.

Anomaly detection

The anomaly detection use case aims at identifying potential risks in data or in data management operations. There are multiple opportunities in this domain, as data discovery provides several metrics describing how spread out variables are, as well as outliers and frequency distributions of values and patterns. Such measures help in identifying values that are out of range or in detecting inconsistencies in columns that have been identified (for example, inconsistent emails, URLs, ZIP codes, codes with specific patterns, or codes referring to defined reference data).

Record-level analysis can also identify potential duplicates and, combined with suggestions, entity resolution processes can be created. When personal data is present, analyzing the risk of re-identification of individuals is also an excellent use case for fulfilling data privacy principles.
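As a minimal illustration of record-level duplicate detection, the sketch below builds a standardized match key and flags records sharing it as candidates for entity resolution. Real matching relies on fuzzier techniques; the columns and normalization rules here are assumptions.

```python
import pandas as pd

# Illustrative contact records (columns and values are assumptions).
contacts = pd.DataFrame({
    "first_name": ["Anna", "anna ", "Bob"],
    "last_name":  ["Smith", "Smith", "Jones"],
    "email":      ["anna.smith@example.com", "ANNA.SMITH@EXAMPLE.COM", "bob@example.org"],
})

def normalize(s: pd.Series) -> pd.Series:
    """Basic standardization before matching: trim and lowercase."""
    return s.astype(str).str.strip().str.lower()

# Build a match key from standardized identifying fields.
contacts["match_key"] = (
    normalize(contacts["first_name"]) + "|" +
    normalize(contacts["last_name"]) + "|" +
    normalize(contacts["email"])
)

# Records sharing a match key are candidate duplicates for entity resolution.
duplicates = contacts[contacts.duplicated("match_key", keep=False)]
print(duplicates)
```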

Another area of automation is the analysis of trends in data quality metrics such as completeness, consistency, and accuracy. Uncommon variations in these metrics are typical signs of data quality issues, for example: the number of records processed on day 2 differs by 50% versus day 1, or the completeness rate of one variable drops by 10% over one period.
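A minimal sketch of this kind of trend monitoring follows, assuming daily profiling metrics are already collected per table; the metric names, values, and the 10% threshold are illustrative assumptions.

```python
import pandas as pd

# Illustrative daily data quality metrics for one table (assumption).
metrics = pd.DataFrame({
    "day": pd.date_range("2019-09-01", periods=5, freq="D"),
    "row_count": [10000, 10200, 4900, 10100, 10050],
    "email_completeness": [0.98, 0.97, 0.98, 0.86, 0.97],
}).set_index("day")

def detect_metric_anomalies(metrics: pd.DataFrame,
                            max_relative_change: float = 0.1) -> pd.DataFrame:
    """Flag day-over-day variations larger than the allowed relative change."""
    relative_change = metrics.pct_change().abs()
    anomalies = relative_change[relative_change > max_relative_change].stack()
    return anomalies.rename("relative_change").reset_index()

print(detect_metric_anomalies(metrics))
# Flags the 52% drop in row_count, its rebound, and the completeness dip and recovery.
```

Note that the rebound days are flagged too, which illustrates the over-alerting caveat discussed just below.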

Analysis of platform logs can also reveal non-compliant use of data or data breaches. The only caveat here is the risk of generating too many anomalies, hence the need to assess their importance.

Development and administration

This final category focuses on facilitating and streamlining data management and governance activities for ETL developers, data stewards, and platform administrators. The intention is to automate repetitive manual tasks, for example:

· Auto-complete ETL or data preparation steps with the most likely configuration

· Auto-map variables when building ETL jobs (see the sketch after this list)

· Propose integration/preparation templates for specific uses, such as de-duplicating a data set, matching and merging two data sets, enriching a master data set, or building SCD-type processes …

· Select the most appropriate compute engine depending on the operations, the databases used, and the volume of data

· Support performance self-tuning within data integration jobs and queries, according to data volumes and storage type

· Alert when scheduled data management processes take longer and longer to run
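For the auto-mapping item above, here is a minimal sketch using plain string similarity between source and target column names. The threshold and naming conventions are assumptions, and production mapping would also consider data types and content.

```python
from difflib import SequenceMatcher

def auto_map_columns(source_cols, target_cols, min_similarity=0.7):
    """Propose a source-to-target column mapping based on name similarity."""
    mapping = {}
    for src in source_cols:
        best_match, best_score = None, 0.0
        for tgt in target_cols:
            score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
            if score > best_score:
                best_match, best_score = tgt, score
        if best_score >= min_similarity:
            mapping[src] = best_match
    return mapping

print(auto_map_columns(
    ["CUST_ID", "EMAIL_ADDR", "POSTAL_CD"],
    ["customer_id", "email_address", "postal_code"],
))
# {'CUST_ID': 'customer_id', 'EMAIL_ADDR': 'email_address', 'POSTAL_CD': 'postal_code'}
```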

When it comes to increasing speed, the possibilities are almost unlimited.

Considering the increasing volume, variety, and velocity of the data to analyze, on top of the metadata, user feedback, ratings, and actions created within the platform, automation is the only way to perform effective data governance and to build the necessary trust in your data and from your stakeholders. It must be combined with a constant focus on democratizing data governance and taking it out of an IT-only perspective.

Data governance can be smarter, and it can be automated by relying on analytics and artificial intelligence, so that personal and sensitive data can be detected, business rules or quality controls can be suggested, and remediation actions can be proposed, all with minimal human interaction. Tremendous times are coming for building such services, which will empower data governance products to suggest and recommend actions for users to perform. Beyond this, they could even surface invisible business rules or relationships between columns or data sets and facilitate the subsequent remediation of issues through prioritization.
