Semantic data types metadata sources: Wikidata, Wikipedia and others

Ivan Begtin
May 21, 2022

Recently I wrote about a systematic approach to semantic data types. I keep updating my own semantic type detection project, Metacrafter, and building the registry of semantic data types, metacrafter-registry.

It already has a lot of updates, and the most important of them are:

  • regular expressions for most data types and patterns
  • examples of data types
  • links to Wikidata properties.

Metacrafter registry structure

The existing structure of the registry already includes a lot of data and keeps growing. The next steps will probably require regular data synchronization with other metadata-related projects like Wikidata.
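
To make the idea of a registry entry more concrete, here is a minimal sketch in Python. It is not the actual metacrafter-registry schema: the class name `DataTypeEntry`, its field names and the four-letter pattern for ICAO airport codes are my assumptions for illustration. It only shows how a regular expression, examples and a Wikidata property could live together in one record, and how the examples can be checked against the pattern.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataTypeEntry:
    """Hypothetical registry entry; field names are illustrative only."""

    id: str
    name: str
    regexp: Optional[str] = None  # pattern that valid values should fully match
    examples: List[str] = field(default_factory=list)
    wikidata_property: Optional[str] = None  # e.g. "P239"

    def failing_examples(self) -> List[str]:
        """Return the examples that do NOT fully match the entry's regexp."""
        if self.regexp is None:
            return []
        pattern = re.compile(self.regexp)
        return [e for e in self.examples if pattern.fullmatch(e) is None]


# Assumed pattern: an ICAO airport code is four uppercase letters.
icao = DataTypeEntry(
    id="icaoairport",
    name="ICAO airport code",
    regexp=r"[A-Z]{4}",
    examples=["EGLL", "KJFK", "UUEE"],
    wikidata_property="P239",
)

# An empty list means every example is consistent with the pattern.
print(icao.failing_examples())
```

A check like this is cheap to run over the whole registry and catches entries whose examples and patterns have drifted apart.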

Why Wikidata matters

Wikidata is one of the most interesting data projects linked with Wikipedia. It’s a public structured database of everything, very similar to Freebase, which was created by Metaweb Technologies, acquired by Google and has now been succeeded by Wikidata. It lets you define a data model for every type of object and link objects using automated tools and manual edits.

Each Wikidata entity has a number of properties; the full list of properties is available on Wikidata. Some of these properties are classification codes or unique identifiers of certain objects.

For example, P239 is the ICAO airport code, assigned by the International Civil Aviation Organization to each airport. This unique code is commonly used in datasets and databases related to transport and aviation.

Or P2771, the Data Universal Numbering System (DUNS) number, a commercial identifier issued by Dun & Bradstreet (D&B) for legal entities in the USA. It is commonly used in US government procurement data systems. It could soon be replaced by another code, LEI (Legal Entity Identifier), which is also a Wikidata property, P1278.
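
The link between a data type and a Wikidata property can be as simple as storing the property ID. The sketch below is only an illustration: the dictionary keys and the helper name `wikidata_property_url` are hypothetical, while the property IDs are the ones mentioned above. It builds the property page address from the ID and rejects anything that does not look like a property ID.

```python
import re

# Property IDs mentioned in the text; the dictionary keys are made-up type IDs.
WIKIDATA_LINKS = {
    "icaoairport": "P239",   # ICAO airport code
    "duns": "P2771",         # DUNS number
    "lei": "P1278",          # Legal Entity Identifier
}

PROPERTY_ID_RE = re.compile(r"P[1-9][0-9]*")


def wikidata_property_url(property_id: str) -> str:
    """Build the Wikidata page URL for a property ID such as 'P239'."""
    if PROPERTY_ID_RE.fullmatch(property_id) is None:
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return f"https://www.wikidata.org/wiki/Property:{property_id}"


for type_id, pid in WIKIDATA_LINKS.items():
    print(type_id, wikidata_property_url(pid))
```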

Most meaningful data types that are used in databases have properties with descriptions and examples in Wikidata. Most, but not all.

I am working on linking Wikidata properties to every data type possible; for example, here is the link for ISIN numbers.

International Securities Identification Number YAML https://registry.apicrafter.io/datatype/isin
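
ISIN is a good example of a type where a regular expression alone is not enough, because the last character is a check digit. Below is a minimal sketch of such a check, assuming the commonly described layout (two letters, nine alphanumeric characters, one check digit) and the Luhn algorithm applied after letters are expanded to numbers (A=10 ... Z=35). The function name `isin_is_valid` is mine, and this is not necessarily how Metacrafter validates ISINs.

```python
import re

# Assumed layout: 2-letter prefix, 9 alphanumeric characters, 1 check digit.
ISIN_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def isin_is_valid(value: str) -> bool:
    """Check the ISIN shape and its Luhn check digit."""
    if ISIN_RE.fullmatch(value) is None:
        return False
    # Expand every character to its base-36 value: '0'-'9' stay, A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in value)
    total = 0
    # Luhn: walking from the right, double every second digit.
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(isin_is_valid("US0378331005"))  # True
print(isin_is_valid("US0378331004"))  # False, wrong check digit
```

The regular expression filters out obviously wrong values cheaply, and the checksum removes most strings that merely look like an ISIN.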

Already 21% (46 of 217) of data types and patterns are linked with Wikidata properties.

Outside Wikidata

Wikidata, just like Wikipedia, covers public data, something available for free and online. But there is a lot of data, and there are many data types, that fall outside openness and freedom of use. The best-known type of such data is personal data. You will not find most personal identifiers in Wikidata.

Another limit is identifiers that are very specific to a country or topic. For example, the US Procurement Instrument Identifier (PIID) is outside of Wikidata’s scope. It will probably be added someday, but not now and maybe not soon, since the procurement system is too specific for common users of Wikidata or Wikipedia.

That’s why Wikidata is great but not complete, and it’s impossible to link every semantic data type with Wikidata without updating Wikidata properties.

So there are a lot of other sources of data types.

Let me mention some of them:

Complexity

Semantic data types are about more than just their automated detection. Some semantic data types are hard to identify automatically, since it’s too easy to fall into the false detection trap. Too many data dictionaries use very simple identifiers, like 2–3 digits or letters. But a registry of semantic data types could help not only with automated data identification but with writing data documentation too.
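
The false detection trap is easy to show with a two-letter code. The sketch below is a hedged illustration, not Metacrafter’s actual logic: the function names, the `min_share` threshold and the idea of a "weak" pattern that must be confirmed by a hint in the field name are all my assumptions. A column is accepted only if almost all values match the pattern and, for weak patterns, the field name also supports the guess.

```python
import re
from typing import Sequence


def match_share(values: Sequence[str], pattern: str) -> float:
    """Share of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    rx = re.compile(pattern)
    return sum(1 for v in non_empty if rx.fullmatch(v)) / len(non_empty)


def detect(
    field_name: str,
    values: Sequence[str],
    pattern: str,
    name_hints: Sequence[str] = (),
    weak_pattern: bool = False,
    min_share: float = 0.95,
) -> bool:
    """Accept a column only if enough values match; weak patterns also need a name hint."""
    if match_share(values, pattern) < min_share:
        return False
    if weak_pattern:
        lowered = field_name.lower()
        return any(hint in lowered for hint in name_hints)
    return True


TWO_LETTERS = r"[A-Z]{2}"  # far too generic to trust on its own

# Every value matches the pattern, but the field name gives no support.
print(detect("status", ["OK", "NO", "ID"], TWO_LETTERS, ["country"], weak_pattern=True))
# Same pattern, now confirmed by the field name.
print(detect("country_code", ["US", "DE", "FR"], TWO_LETTERS, ["country"], weak_pattern=True))
```

Even with such rules some columns will stay ambiguous, which is one more reason to keep human-readable descriptions and examples in the registry.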

Please feel free to share your ideas about data type detection and join the development of metacrafter-registry and metacrafter.

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, Data, Modern Data stack and Open Government. Join my Telegram channel: https://t.me/begtin