Understanding Your Data

With Automatic Classification in Azure Purview

Carl Follows
Version 1
4 min read · Sep 1, 2022


As companies realise the potential value of managing their data, and the cost of failing to, there is a growing interest in ensuring proper governance over organisational data. Who should be allowed to access what?

Before you can implement any data governance, you need to understand what data your organisation has.

Library bookshelves curving off into the distance
Photo by Susan Q Yin on Unsplash

Understanding what data you have can be a tricky task when you’ve been collecting it for years. The people who designed the systems are perhaps no longer with the company and design documents are patchy at best. It’s like being given control of a library, but without any index to know what’s there.

This is when you need something to automatically scan and classify each item.

Purview

Microsoft’s data governance portal, Purview, comes with several components to help discover and catalogue your enterprise data. Amongst them is a classification engine that uses a combination of regular expressions (RegEx) and Bloom filters to automatically detect certain types of data.

Whilst we can imagine that there must be a standard pattern to phone numbers, and that there must be a finite list of country names to check against (if only we could remember them all), Microsoft has already done much of the hard work for us. It provides a set of rules that we can use to scan our data, covering the likes of banking numbers and passport number formats from around the world.
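To make the two techniques a little more concrete, here is a minimal sketch in Python of how a pattern check and a Bloom filter lookup could sit side by side. It is not Purview's implementation: the `BloomFilter` class, the `PHONE` pattern, the four-country list and the `classify` function are all hypothetical and heavily simplified.

```python
import hashlib
import re


class BloomFilter:
    """A tiny Bloom filter: compact set membership, false positives possible."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical, deliberately short dictionary of country names.
countries = BloomFilter()
for name in ["france", "germany", "ireland", "united kingdom"]:
    countries.add(name)

# Hypothetical, simplified pattern: optional "+", then 7 to 15 digits,
# each optionally preceded by a single space or hyphen.
PHONE = re.compile(r"^\+?\d(?:[ -]?\d){6,14}$")


def classify(value):
    """Return a guessed classification for one value, or None."""
    text = value.strip()
    if PHONE.match(text):
        return "Phone Number"
    if text.lower() in countries:
        return "Country"
    return None
```

The pattern suits data with a fixed shape, whilst the Bloom filter suits a long list of allowed values: it never misses a value that was added, at the cost of occasionally accepting one that was not.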

Scan Rule Set

Depending on the rate of change in your data, you might want to schedule monthly scans to check for new data to be classified. In practice, we can be confident that many of these classifications are highly unlikely to be in our data, and therefore we don’t want to waste compute power (and associated cost) continually scanning for them all unnecessarily.

We are therefore able to configure a scan to use just the set of rules for classifications which we expect might be in the data, to find all the locations where they are stored.

Multi select list to configure which classification rules are to be used
Configuring a scan rule set in Purview
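The idea of a scan rule set can be sketched in a few lines of Python. This is only an illustration of why a smaller rule set means less work, not the Purview API: the rule names, patterns, `scan` function and sample columns are hypothetical, and tagging a column only when every sampled value matches is a simplifying assumption.

```python
import re

# Hypothetical rule catalogue: names and patterns are illustrative only.
ALL_RULES = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "IPv4 Address": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "Credit Card Number": re.compile(r"^\d{4}(?:[ -]?\d{4}){3}$"),
}


def scan(columns, rule_names):
    """Classify each column using only the named rules.

    A column is tagged with the first selected rule that every
    non-empty sampled value matches.
    """
    selected = {name: ALL_RULES[name] for name in rule_names}
    results = {}
    for column, values in columns.items():
        sample = [v for v in values if v]
        for name, pattern in selected.items():
            if sample and all(pattern.match(v) for v in sample):
                results[column] = name
                break
    return results


sample_columns = {
    "contact": ["a@example.com", "b@example.org"],
    "host": ["10.0.0.1", "192.168.1.20"],
    "notes": ["call back Monday", ""],
}

# Only the email rule is evaluated, so the "host" column is left untagged.
email_only = scan(sample_columns, ["Email Address"])

# Adding a second rule costs more comparisons but also finds the IP addresses.
email_and_ip = scan(sample_columns, ["Email Address", "IPv4 Address"])
```

Every rule left out of the selection is a pattern that never has to be run against any value, which is where the saving in compute comes from.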

Compatible Data Sources

Like any index or data catalogue, Purview doesn’t store a copy of the data, just metadata describing the datasets and their attributes. This means that the process of classification (looking for patterns in the data) must take place at the source.

Because of this, the ability to classify the data depends on the data source. Not every source supports classification, and some are still on the Microsoft development roadmap, so check Microsoft’s list of supported data sources. As you might expect, there is a lot of support for Azure data services, with other vendors’ products following on.

Check your data sources are supported for the capabilities you wish to use.

Missing from the list

Once you start looking through the system classifications you’ll realise that data items with strong, internationally agreed definitions (like passports and bank accounts) have the greatest support. Some really obvious ones are missing (the lack of a U.K. postcode surprised me), but Purview is still in its infancy, so expect these to be on the roadmap.

Even if Purview did have classification rules for all standards in every country, you are still likely to have requirements for organisational identifiers that will never be supported by Microsoft. To handle this there is an option for creating custom classifications; I’ll look at that in my next blog post.
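As a flavour of the kind of pattern a custom rule for the missing U.K. postcode might rest on, here is a rough sketch in Python. The `UK_POSTCODE` pattern and the `looks_like_uk_postcode` helper are my own simplification, not an official definition and not a Purview rule: the pattern checks only the general shape of a postcode, so it will accept some codes that do not exist.

```python
import re

# Simplified shape of a U.K. postcode (an assumption, not the official spec):
# one or two letters, a digit, an optional letter or digit,
# an optional space, then a digit and two letters.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.IGNORECASE)


def looks_like_uk_postcode(value):
    """Return True when the value has the general shape of a U.K. postcode."""
    return UK_POSTCODE.match(value.strip()) is not None


# A few sample values to try the pattern against.
samples = ["SW1A 1AA", "M1 1AE", "b33 8th", "12345", "NOT A CODE"]
matches = [s for s in samples if looks_like_uk_postcode(s)]
```

Here `matches` holds the first three samples only. A production rule would need tightening, for example against the list of valid postcode areas.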

Only the start of the Journey

Any automatic indexing system can only do so much for you. It will highlight probable data classifications, but without context. Are those postcodes related to customer addresses or store locations? The former is personally identifiable and therefore confidential, whilst the latter should be shared as widely as possible.

This means manual effort will be required to interpret the results and build a dictionary of organisational data. This should inform users what data is held where and who is responsible for maintaining and controlling it.

Data governance is not something that can be delivered by just buying and configuring software; it requires organisational commitment and ongoing oversight. As in any library without a librarian, indexes go stale and chaos returns to the shelves.

About the Author:
Carl Follows is a Data Analytics Solution Architect here at Version 1.
