
Bring ontology into textual chaos

Learning ontology classes from text

by clustering lexical substitutes derived from language models

Artem Revenko
8 min read · Dec 6, 2022

--

Based on our paper. Big thanks to my co-authors Victor Mireles, Anna Breit, Peter Bourgonje, Julian Moreno-Schneider, Maria Khvalchik, and Georg Rehm.

If you are not interested in the descriptions and prefer a hands-on approach, head straight to the “Try It Yourself!” section.

The amount of data produced by humankind in 2022 is estimated at around 100 zettabytes (1 zettabyte = 10²¹ bytes) (link). Different authors on the web (for example, link1, link2, link3, link4, link5) claim that around 80% of this data is unstructured. Unstructured data includes many types of data such as video, images, text, etc. Though videos and images take up a significant share of the storage space, we can still assume that there exists a huge amount of textual data. And this textual data becomes even more prominent when we speak about enterprises.

What knowledge is expressed in this textual data? Can we automatically extract and analyze this knowledge? Clearly, this task is very challenging. Several methodologies, such as Event Extraction, (zero-shot) Named Entity Recognition, and Relation Extraction, aim at tackling it. However, most of the developed methods rely on some predefined scheme for the data to be extracted: Event Extraction uses predefined types of events and their roles; zero-shot NER requires a certain representation of unseen types to recognize their entities efficiently. Can we then at least come up with a good scheme/ontology to describe this knowledge? Such a scheme/ontology would not only enable a human-machine interface by making the data machine readable as well as human interpretable, but could also power further downstream tasks:

  • Semantic Search: an ontology would provide facets for a better search experience; see an example usage scenario below
  • Matching: various items could be matched to each other using their ontological descriptions and relations (for example, matching employees to projects)
  • Similarity Estimation and Duplicate Detection: commonalities between items of the same type could be computed more precisely using ontologies
  • Question Answering: we could combine text and ontology to provide more precise answers to questions, see also our poster Polylingual Hybrid Question Answering System
  • Data Interoperability: usage of the same or interlinked ontologies enables efficient integration of data
  • Database Population: finally, ontologies enable efficient information extraction using the methodologies mentioned above (NER, Event Extraction, RelEx, etc.)

Usage Scenario — Sports Betting

We choose one particular usage scenario to demonstrate the task. Take a company that provides a service for betting on the outcomes of sport events. To enable users to bet efficiently (and, therefore, more often) it is crucial to provide efficient search functionality, so that users can quickly find relevant news, reports, analytics, and statistics and make their betting decisions.

Which categorization will enable users to search quickly and precisely?

One way to categorize the news would be to elaborate a categorization manually. However, this might be a labour-intensive exercise and would enforce a categorization that is not guaranteed to be optimal w.r.t. any criteria. Moreover, we might want to have a personalized search experience so that the facets are personal. For this purpose we could learn an ontology from the texts. We would first identify the entities of interest for a user: these could be individual sportsmen, teams (clubs or national), sport disciplines, etc. Then we would collect the news and other textual documents where those entities appear and induce a classification of the entities that corresponds to their contextual usages. This way the induced classification would be tailored to the user’s entities of interest and their classes.

Depending on the user’s interests, the search facets might differ

Task Statement

Given

  1. a corpus of domain-specific documents and
  2. annotations of entities in these documents,

our task is to find a domain-specific categorization of the entities of interest.

source: P. Cimiano. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, 2006.

In terms of the “ontology learning cake” (see this book by P. Cimiano) we aim to learn concepts and also hierarchies of concepts, though we refer to “concepts” as “classes” of entities.

Method — Grouping Entity Senses into Classes

Our overall idea is to solve this task in two steps: first, define domain-specific contextual senses of entities and, second, group these senses into classes.

So what would be the best way to capture the contextual meaning of the entities of interest? We essentially want to describe the sense of an entity in a given context, for each entity and each context. This formulation is very close to another task, Word Sense Induction (WSI); I have published a dedicated post on how we can leverage pre-trained language models to solve it. We will reuse ideas and techniques from WSI for this task as well.

However, in another post Anna Breit introduced a task and a model that are superior to the ideas of WSI: the Target Sense Verification (TSV) approach. TSV is superior in that it does not require a complete sense inventory (and, therefore, it does not require WSI) to disambiguate the entities. So what is the point of going back to WSI instead of using TSV models directly? The answer is that, though we do not need a complete sense inventory, we do need to find descriptions for some senses, namely the ones that lie in the domain of our interest. And we can use the class induction procedure for exactly this purpose: to define the domain-specific classes of our interest and assign entities to them.

Class induction diagram

Processing Flow

Here I briefly and superficially introduce the processing flow of our method. If you would like to dive deeper, you could either (1) have a look at our paper, (2) have a look at the code implementing the flow, or (3) reach out to me.

(1) Create Substitutes. At this step we use a pre-trained language model to produce contextual lexical substitutes. We produce a total of 2*k substitutes, where k is the first parameter of our method.
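To make this step concrete, here is a minimal sketch of how contextual substitutes can be obtained with an off-the-shelf masked language model via Hugging Face transformers. The model name, the value of k, the example sentence, and the simple single-token masking are assumptions made for illustration; the actual implementation in our repository may generate substitutes differently.

```python
from transformers import pipeline

K = 10  # first parameter of the method; we ask the model for 2*k substitutes

# An off-the-shelf masked LM chosen for the example; the repo may use another model.
fill_mask = pipeline("fill-mask", model="bert-base-cased")


def substitutes(sentence: str, entity: str, k: int = K) -> list[str]:
    """Mask the entity mention and return the top 2*k single-token fillers."""
    masked = sentence.replace(entity, fill_mask.tokenizer.mask_token, 1)
    predictions = fill_mask(masked, top_k=2 * k)
    return [p["token_str"].strip() for p in predictions]


# Hypothetical example: the substitutes characterize the contextual sense of "Barcelona"
print(substitutes("Messi scored twice for Barcelona in the league match.", "Barcelona"))
```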

(2) Extract Senses. Second, we cluster the substitutes (produced in the previous step) into senses of entities. For each sense we take m sense descriptors as the description of the sense, where m is the second parameter of our method.
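As an illustration of this step, the sketch below clusters one substitute set per occurrence of an entity into senses and keeps the m most frequent substitutes of each cluster as sense descriptors. The Jaccard distance, the agglomerative clustering (scikit-learn ≥ 1.2 for the metric argument), and the distance threshold are assumptions made for this example; the paper describes the actual clustering procedure.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import AgglomerativeClustering

M = 5  # second parameter of the method: descriptors kept per sense


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Distance between two substitute sets (0 = identical, 1 = disjoint)."""
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def extract_senses(substitute_sets: list[set[str]],
                   distance_threshold: float = 0.8) -> list[list[str]]:
    """Cluster the occurrences of one entity into senses; return descriptors per sense."""
    if len(substitute_sets) == 1:
        labels = [0]
    else:
        dist = np.array([[jaccard_distance(a, b) for b in substitute_sets]
                         for a in substitute_sets])
        labels = AgglomerativeClustering(
            n_clusters=None, metric="precomputed", linkage="average",
            distance_threshold=distance_threshold,
        ).fit_predict(dist)
    senses = []
    for label in sorted(set(labels)):
        members = [s for s, l in zip(substitute_sets, labels) if l == label]
        counts = Counter(word for s in members for word in s)
        senses.append([word for word, _ in counts.most_common(M)])
    return senses


# Toy example: two club-like occurrences of "Barcelona" and one city-like occurrence
print(extract_senses([
    {"Madrid", "Chelsea", "Juventus", "Arsenal"},
    {"Madrid", "Chelsea", "Liverpool", "Arsenal"},
    {"Spain", "Madrid", "Catalonia", "Valencia"},
]))
```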

The first two steps are described in further detail in this blogpost.

(3) Induce Classes. Third, we cluster the senses of entities into classes. We restrict each class to have at least th class descriptors, where th is the third parameter. This clustering procedure closely resembles step (2).
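Because this step mirrors step (2), the sketch below reuses the same Jaccard-based agglomerative clustering, now applied to sense descriptor sets. The rule for choosing class descriptors (keep substitutes that occur in at least half of the member senses) is an illustrative assumption rather than the exact rule from the paper.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import AgglomerativeClustering

TH = 3  # third parameter of the method: minimum number of class descriptors


def jaccard_distance(a: set[str], b: set[str]) -> float:
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def induce_classes(sense_descriptors: list[set[str]],
                   distance_threshold: float = 0.8):
    """Cluster sense descriptor sets into classes; drop classes with too few descriptors."""
    dist = np.array([[jaccard_distance(a, b) for b in sense_descriptors]
                     for a in sense_descriptors])
    labels = AgglomerativeClustering(
        n_clusters=None, metric="precomputed", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(dist)
    classes = []
    for label in sorted(set(labels)):
        members = [s for s, l in zip(sense_descriptors, labels) if l == label]
        counts = Counter(word for s in members for word in s)
        # Illustrative rule: a class descriptor must occur in at least half of the member senses.
        descriptors = [word for word, c in counts.items() if c >= len(members) / 2]
        if len(descriptors) >= TH:  # enforce the th parameter
            classes.append((descriptors, members))
    return classes
```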

Overall, we consider that our method has the following advantages:

  • WSID included: handles polysemy of entities of interest;
  • Interpretable results: the descriptions of the produced classes and senses are interpretable by humans, no prior knowledge required;
  • Focuses on senses, not occurrences: we cluster senses of entities, therefore emphasizing the importance of the contents of texts, not frequency of entities;
  • No particular text structure is assumed or required;
  • Parameters (k, m, th) allow for granularity control.

Evaluation — How well does it work?

Evaluating the produced classes is not trivial as no gold standard exists. We decided to use openly available knowledge graphs, in particular Wikidata, as our reference: if we can induce a classification that resembles a relevant part of the Wikidata ontology, then we consider that our method performs well. As Wikidata is curated manually, in case of success we could automate a part of a tedious and knowledge-demanding manual task.

We take the WikiNER corpus for our experiment. WikiNER consists of Wikipedia pages with entities annotated with coarse-grained NER types. As the method only requires the entities, we do not use the information about the types. Next we induce the classes for the annotated entities. We do this for many different combinations of our parameters k, m, and th. For every such combination we obtain a classification, i.e. a set of classes with their entities and their descriptions. To evaluate the quality, we check how many of the entities that were grouped together by our method are also found in a single (best matching) Wikidata category. We compute p-values for each class; a low p-value (typically p < 0.05) lets us reject the hypothesis that the class is just a random collection of entities and accept that its entities really belong together.
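The paper spells out the statistical test in detail; as a rough illustration of the idea, a hypergeometric test answers the question “how likely is this much overlap with a Wikidata category by chance?”. The function below is such a sketch under that assumption; the exact test used in the experiments may differ.

```python
from scipy.stats import hypergeom


def class_p_value(class_entities: set[str],
                  category_entities: set[str],
                  all_entities: set[str]) -> float:
    """P(overlap >= observed) when drawing len(class_entities) entities at random."""
    N = len(all_entities)                        # all annotated entities in the corpus
    K = len(category_entities & all_entities)    # entities in the Wikidata category
    n = len(class_entities)                      # size of the induced class
    k = len(class_entities & category_entities)  # observed overlap
    return hypergeom.sf(k - 1, N, K, n)


# p < 0.05 -> the class is unlikely to be a random collection of entities
```

For each induced class, one would evaluate this p-value against its best-matching Wikidata category.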

Below you can see the resulting plots for English and German. We can observe that for almost all combinations of our parameters the method produces arguably meaningful classifications.

English:

Evaluation of results in English using Wikidata as the reference ontology

German:

Evaluation of results in German using Wikidata as the reference ontology

Try It Yourself!

The code to execute the procedure, as well as comments that will get you started if you want to try your own corpus, is available here:

I have copied the outcomes with th=3 and th=5 into the tables that follow. If you run the procedure yourself you might not get exactly the same results, but they should be quite similar. Each row in the table is an induced class; a class has descriptors (the human-interpretable descriptions of the class) and the entities that it contains. The entities carry the suffix ::<NE type> indicating the original annotation in WikiNER. We did not use these types in the processing, but preserved them for completeness. If you run the code as provided you will also see additional suffixes ##i indicating the sense; I have stripped these off for better readability.

Classes induced for 25 WikiNER documents with a minimum of 3 descriptors for each class
Classes induced for 25 WikiNER documents with a minimum of 5 descriptors for each class

The induced classes are ordered, with the better candidates on top. For some classes, especially after the top 10, we observe that the number of descriptors is sometimes larger than the number of entities. In practice we might want to filter out such very specific classes.

Overall the induced classes seem to make sense. You are welcome to investigate the results yourself. Keep in mind that we only used roughly 25 Wikipedia pages with 973 unique entities and 1951 occurrences of these entities in the corpus.

I will now provide a tiny analysis of selected classes from the induced classifications to also demonstrate how the parameter th can be used to produce hierarchies of classes.

First, note that class 1 in both classifications has to do with “Arctic”, “North”, and “Alaska”; however, with th=5 the results become more specific, as the additional descriptors “Northwest” and “Pacific” are added. Certain entities only broadly related to territories, such as “Alaska Native Heritage Center” and “flag of Alaska”, are excluded. Class 1 with th=5 is a subclass of class 1 with th=3.

Now consider class 3 in both classifications. Both classes contain either people’s names or entities related to certain people. However, with th=5 the class contains only proper names; for example, “Anton Anderson Memorial Tunnel” is excluded.

Similar observations hold for further classes, for example, classes 4 and 5. Interestingly, class 7 with th=3 acts as a superclass for classes 7 and 19 with th=5: class 7 with th=5 contains terms broadly related to space, whereas class 19 with th=5 contains spacecraft names.

--

Artem Revenko
Semantic Tech Hotspot

PhD in applied math and CS. Interested in Semantic Web, NLP, Information Extraction, Machine Learning and friends