Leveraging machine learning to classify your database
SQL Data Catalog is Redgate’s tool for identifying and classifying sensitive data. It provides a comprehensive filtering and automated rule system that relies on a column’s location (instance, database, schema and table) and its metadata (name and data type).
This is fine when dealing with databases that follow a human-readable naming convention. You can set up a rule that marks as “Sensitive” every column whose name contains “email” and whose data type is a string type. The tool will go off and identify all such columns (and even apply the tag if set to auto-apply).
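In SQL Data Catalog the rule itself is configured through the product, but the logic it encodes can be pictured as a simple predicate over column metadata. A minimal sketch in C# (the `ColumnInfo` type and method names below are hypothetical, not the product’s actual API):

```csharp
using System;
using System.Linq;

// Hypothetical representation of a column's metadata.
public record ColumnInfo(string Name, string DataType);

public static class SensitivityRules
{
    private static readonly string[] StringTypes =
        { "char", "nchar", "varchar", "nvarchar", "text", "ntext" };

    // A rule like the one described: name contains "email" and the type is a string type.
    public static bool IsSensitiveEmailColumn(ColumnInfo column) =>
        column.Name.Contains("email", StringComparison.OrdinalIgnoreCase)
        && StringTypes.Contains(column.DataType.ToLowerInvariant());
}
```

The point of the sketch is the limitation that follows: the predicate only ever sees the column’s name and type, never its contents.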
Unfortunately, not all databases are created equal, and sometimes a column’s name gives no hint of the data it contains. At that point our users have to either go look at the data themselves or ask the people who created the column about its purpose. It also means we cannot automate the classification of such columns.
This is a gap in our capability that we’re very eager to fill. And this is where machine learning could prove very useful.
Classification is the most common use case for machine learning. Look up an ML tutorial and the first thing you’ll learn is how to train a model for sentiment analysis, a task called binary classification. But what is classification in the context of machine learning? Classification is the task of applying labels to pieces of data (text, images or video) using a machine learning model. A model is built from example classifications using an algorithm; the process is called training, and the example classifications used to train the model are called datasets.
Training such a model generally requires Python, TensorFlow and a lot of coding. As our product is .NET, integrating with Python is not straightforward. Fortunately, ML.NET came to the rescue and we can finally train and consume ML models in .NET without relying on Python.
Training a model with ML.NET is straightforward. All you need is a dataset and Visual Studio. There is a nice wizard to load the dataset, define which column is the label to predict and which columns hold the data and features, then let it rip. The label in our case is the information type, while the data is a set of example values for each of the types we want to predict. For this example we have not defined any features ourselves, relying instead on AutoML’s default featurization: counting occurrences of each character, transforming text into numeric vectors and identifying the frequency of terms.
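The same flow can also be written directly against the ML.NET API rather than through the wizard. A sketch of what that looks like for our label/data shape (file name, column layout and trainer choice here are illustrative, not what the wizard necessarily generates):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 1);

// Load labelled examples: one information type and one sample value per row (path is illustrative).
var trainingData = mlContext.Data.LoadFromTextFile<ColumnSample>(
    "samples.csv", hasHeader: true, separatorChar: ',');

// Featurize the raw text, then train a multiclass classifier.
var pipeline = mlContext.Transforms.Conversion
        .MapValueToKey("Label", nameof(ColumnSample.InformationType))
    .Append(mlContext.Transforms.Text.FeaturizeText("Features", nameof(ColumnSample.Data)))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy())
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

var model = pipeline.Fit(trainingData);

// Persist the trained model for later consumption.
mlContext.Model.Save(model, trainingData.Schema, "model.zip");

public class ColumnSample
{
    [LoadColumn(0)] public string InformationType; // the label to predict
    [LoadColumn(1)] public string Data;            // an example value from a column
}
```

`FeaturizeText` covers the default featurization mentioned above (character and word n-grams turned into numeric vectors), which is why no hand-crafted features are needed.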
AutoML will then train with several algorithms and return the one with the highest accuracy. It also generates the input and output classes, the code to instantiate a prediction engine and the code used to train the model, along with the trained model itself packaged in a zip file.
This can easily be integrated into existing C# projects. And that is what we did. You can see in the image below a very early prototype of identifying the information type contained in a column using a pre-trained ML model.
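Consuming the trained model from C# is compact, which is what makes the integration easy. A sketch of the prediction side, where `ModelInput` and `ModelOutput` stand in for the classes Model Builder generates (names and properties here are illustrative):

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext();

// Load the model trained earlier (path is illustrative).
ITransformer model = mlContext.Model.Load("model.zip", out var schema);

// A prediction engine scores one sample at a time; pool engines in real services.
var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

var prediction = engine.Predict(new ModelInput { Data = "jane.doe@example.com" });
Console.WriteLine($"{prediction.PredictedLabel} ({prediction.Score.Max():P0} confidence)");

public class ModelInput
{
    public string Data { get; set; }
}

public class ModelOutput
{
    public string PredictedLabel { get; set; } // the predicted information type
    public float[] Score { get; set; }         // per-class confidence scores
}
```

The highest entry in `Score` is the model’s confidence in `PredictedLabel`, which becomes relevant when aggregating predictions across many samples from the same column.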
So, why haven’t we released this yet? There is a saying that an ML model is only as good as the data it’s trained on. ML.NET abstracts away the complexity of picking an algorithm and of coding and running model training, but you still have the responsibility of building (or finding) a dataset that works for your use case.
Our current models, trained on the variety of dataset combinations we experimented with, work well for most data but fail on some corner cases: first names identified as cities, say, or passport numbers as phone numbers. One way we mitigate these errors is by taking many samples from a column and then returning the prediction that occurred most often, with the confidence averaged across those predictions. That way, even if one name is labelled as a city, the other 99 will be labelled correctly and the overall prediction will be accurate.
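The mitigation boils down to a majority vote: group the per-sample predictions by label, take the most frequent label, and average the confidence of the samples that voted for it. A minimal sketch, with an illustrative `Prediction` type rather than our actual internals:

```csharp
using System.Collections.Generic;
using System.Linq;

public record Prediction(string Label, float Confidence);

public static class ColumnClassifier
{
    // Returns the most frequent label among the sampled predictions,
    // with its confidence averaged over the samples that voted for it.
    public static Prediction Aggregate(IReadOnlyCollection<Prediction> samples) =>
        samples.GroupBy(p => p.Label)
               .OrderByDescending(g => g.Count())
               .Select(g => new Prediction(g.Key, g.Average(p => p.Confidence)))
               .First();
}
```

With 99 samples predicted as a first name and one mislabelled as a city, `Aggregate` returns the first-name label, and the lone outlier barely moves the averaged confidence.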
We are very happy with the results and confident we’ll be able to leverage machine learning to provide great value in an ingeniously simple way.