Semantic data types: a systematic approach and a types registry

Ivan Begtin
5 min read · Apr 16, 2022


What are semantic data types?

For years I’ve been working on data engineering projects. Sometimes it was simple to create a new database or dataset from scratch, but more often I had to deal with legacy databases, unmanaged data sources and open datasets, most of them undocumented. One of the common tasks in these projects was understanding the data.

One of the ways to start building this understanding is semantic data types. What are they? IBM provides the following description:

Semantic typing is a method of categorizing data to define how to interpret it. For example, the Person entity semantic type might be assigned to entity types such as Male, Victim and Witness. The semantic type indicates that each of those entity types are different ways of depicting people in the real world.

Source: https://www.ibm.com/docs/en/i2-ibase/9.0.1?topic=types-semantic-in-ibase

Sure, IBM and many other big enterprise vendors provide complex tools to identify and manage semantic data types.

There are similar products inside the Oracle ecosystem, and many other vendors’ products support the same approach too.

Why and how it’s implemented

Most semantic data type detection engines are rule based, with regular expressions or data dictionaries inside. For example, if a table has a field “person_birth” with string values like “2014-01-19”, “2003-04-29”, “1976-7-15”, you could guess that it is a date field and that it has the semantic meaning (semantic type) birthday.
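
To make this concrete, here is a minimal sketch (not any vendor’s actual engine) of such a rule: a small field-name dictionary combined with a regular expression over sample values.

```python
# A minimal sketch of a rule-based semantic type detector:
# a field-name dictionary combined with a regex over sample values.
import re

DATE_VALUE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")          # e.g. 1976-7-15
BIRTHDAY_FIELD_NAMES = {"person_birth", "birthdate", "dob"}  # tiny data dictionary

def guess_semantic_type(field_name, sample_values):
    looks_like_dates = all(DATE_VALUE.match(v) for v in sample_values)
    if looks_like_dates and field_name.lower() in BIRTHDAY_FIELD_NAMES:
        return "birthday"
    if looks_like_dates:
        return "date"
    return None

print(guess_semantic_type("person_birth", ["2014-01-19", "2003-04-29", "1976-7-15"]))
# -> "birthday"
```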

Most rules are quite simple. There are a lot of software libraries and tools to identify and validate URLs, bank card numbers, organization names, person names and so on.
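
For instance, in Python the validators package can check URLs and the python-stdnum package can validate many identifier formats, such as bank card numbers via the Luhn checksum. This is only an illustration of how simple such checks are, not part of any specific detection tool:

```python
# Illustrative checks with off-the-shelf Python libraries:
# pip install validators python-stdnum
import validators
from stdnum import luhn

print(bool(validators.url("https://example.com/data.csv")))  # True for a well-formed URL
print(luhn.is_valid("79927398713"))                           # True: passes the Luhn checksum
print(luhn.is_valid("79927398710"))                           # False: checksum fails
```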

The most common usages for these tools are:

  • to find the data that should be secured (personally identifiable information)
  • to implement quality control metrics
  • to integrate data automatically

So the keywords here are observability, security and integration.

Tools and limitations

One of my latest goals was to analyze data from open data portals to understand how to make data validation and integration with existing databases easier. I needed a tool that could analyze hundreds of thousands of open datasets and provide reports on anything interesting that could be found inside.

So I tried several open-source tools that could help: piicatcher, piidetect, metadata-guardian, Data Profiler. Some of these tools, like piicatcher, are quite advanced and integrate with existing data catalogs like Datahub and Amundsen. But all of them use built-in rules, and their primary purpose is to detect sensitive information. They are helpful if you don’t have much language-specific data and if you only need to analyze PII.

Systematic approach

That’s why last year I started working on a tool and web service, metacrafter. It’s open source and implements data type detection. The idea was to implement a universal semantic type detection engine with an extendable list of rules, similar to intrusion detection systems. Instead of hard-coded rules, it’s based on YAML-defined rules that use simplified regex-like patterns built on the Python pyparsing module.

An example of rules related to the UUID and GUID data types:

https://github.com/apicrafter/metacrafter/blob/main/rules/basic/identifiers.yaml
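
To illustrate the pyparsing side of this idea, here is a minimal sketch of how a UUID-like pattern can be expressed without a raw regular expression. This is illustrative only and is not metacrafter’s actual rule engine or rule format:

```python
# A minimal sketch of detecting UUID-like values with pyparsing.
from pyparsing import Word, Combine, hexnums

def hex_block(n):
    # a run of exactly n hexadecimal characters
    return Word(hexnums, exact=n)

# 8-4-4-4-12 hex blocks separated by dashes, joined into a single token
uuid_pattern = Combine(
    hex_block(8) + "-" + hex_block(4) + "-" + hex_block(4)
    + "-" + hex_block(4) + "-" + hex_block(12)
)

def looks_like_uuid(value: str) -> bool:
    try:
        uuid_pattern.parseString(value.strip(), parseAll=True)
        return True
    except Exception:
        return False

print(looks_like_uuid("123e4567-e89b-12d3-a456-426614174000"))  # True
print(looks_like_uuid("not-a-uuid"))                            # False
```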

The tool supports JSON, JSON lines, BSON and CSV files, as well as SQL databases and MongoDB. It can be used as a Python library or as a command-line tool.

The tool exists as open source and as our internal web service. The only difference is the number of rules: the open source version has about 20 public rules, while our web service has about 200 basic rules and 312 datetime detection rules. Write to me at ivan@begtin.tech if you would like to beta test the web service.

After the initial versions of this tool and extensive testing on internal databases and several big open data portals like data.gov.uk, data.opendatasoft.com and many others, I came to understand that rule detection alone is not enough. My tool and others allow you to implement rules, but they say nothing about the nature of the detected data. One semantic data type could be linked to dozens of rules. For example, you could detect a date by field value, by field name in English, by field name in German, and so on for other languages.

So that’s how the semantic data types registry appeared.

Semantic data types registry

This registry (http://registry.apicrafter.io) is a unified catalog of the semantic data types detected by most open source and commercial semantic type detection tools.


Instead of a simple list of data types, it includes the following metadata (a rough model is sketched after the list):

  • patterns are the different ways to write identifiers of the same object. For example, there are at least 3 identifiers for airports: FAA, ICAO and IATA codes.
  • countries are jurisdictions that use specific identifiers. For example, the US DUNS number, the Russian tax ID, the Australian business number or the UK ward code. All of these examples are country specific, and the semantic types or patterns are linked to a specific jurisdiction.
  • language is a characteristic of a semantic type that indicates that the data type could be language specific, for example to English, Spanish or French.
  • categories describe the context and nature of the data. For example, PII is a category, as are “finances” or “companies”.
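
As a rough illustration only, a registry entry could be modeled like this. The field names here are my assumption for the sketch, not the registry’s actual schema:

```python
# A rough, assumed model of a registry entry; field names are illustrative,
# not the actual schema of registry.apicrafter.io.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticType:
    identifier: str                                        # e.g. "uuid"
    name: str                                              # human-readable name
    categories: List[str] = field(default_factory=list)   # e.g. ["identifiers"]
    countries: List[str] = field(default_factory=list)    # e.g. ["US"], empty if global
    languages: List[str] = field(default_factory=list)    # e.g. ["en"], empty if language-neutral
    patterns: List[str] = field(default_factory=list)     # known ways to write the identifier

airport_code = SemanticType(
    identifier="airport",
    name="Airport code",
    categories=["transport"],
    patterns=["FAA", "ICAO", "IATA"],
)
```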

This logic was derived from the metacrafter rules. The metadata is helpful if you need to speed up your semantic type detection engine. For example, you don’t need to use French rules if you analyze Brazilian data, and you don’t need to use chemistry-related rules if you analyze financial datasets.

So this additional information helps to filter out unnecessary rules.

You can use the website https://registry.apicrafter.io to browse all collected semantic data types or download the full dataset from http://registry.apicrafter.io/registry.json.
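
As a minimal sketch of this kind of filtering, the dump could be downloaded and narrowed down by country. The exact JSON structure and key names here are assumptions; adjust them to the real schema:

```python
# A minimal sketch: download the registry dump and keep only entries relevant
# to one country. Key names ("countries") are assumed, not the actual schema.
import requests

REGISTRY_URL = "http://registry.apicrafter.io/registry.json"

def load_registry():
    response = requests.get(REGISTRY_URL, timeout=30)
    response.raise_for_status()
    return response.json()

def filter_by_country(entries, country_code):
    # Keep entries that are global (no country restriction) or match the country.
    selected = []
    for entry in entries:
        countries = entry.get("countries", [])
        if not countries or country_code in countries:
            selected.append(entry)
    return selected

if __name__ == "__main__":
    registry = load_registry()
    # If the dump is a dict keyed by type id rather than a list, take its values.
    entries = registry if isinstance(registry, list) else list(registry.values())
    print(len(filter_by_country(entries, "BR")), "types relevant to Brazilian data")
```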

Every data type is available as a web page and as a JSON file.

This project is still under development. Not all semantic types are covered, not all metadata is collected, and there is no tool that supports detection of all of these semantic types. Even metacrafter, the tool I developed, can help detect only about half of them.

Right now, integration with the semantic data types registry is implemented in the metacrafter tool. You can try it right now on any database or dataset.

Example of a metacrafter file scan with the “urls” data type

I think this approach is important for improving the understandability of data. The registry could be used by developers of external or internal data type detection tools. It could help to filter rules, to enrich data type detection results, and to give advice or recommendations to users through the UI of your data tool.

I encourage you to join efforts to build a universal registry of semantic data types and to reuse the existing registry as a dataset or an API.

Next steps

What should be done next?

  1. Add more data types. Only several language- and country-specific data types are supported right now, not many scientific identifiers are supported, and so on.
  2. Improve and add more information about data types: examples, explanations and documentation.


Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, the Modern Data Stack and Open Government. Join my Telegram channel: https://t.me/begtin