Using Wikipedia Category Graph for Semantic Understanding
A key element of Relcy’s search engine is a semantic understanding of entities and topics which help us create a more relevant experience. We use Wikipedia as one of the sources for entities and topics. Each Wikipedia article has categories assigned to it which we then use to develop a semantic understanding of the topic.
We use categories provided by dbpedia.org to understand what a Wikipedia article is about. But the coverage of dbpedia categories is around 65% (2.8 Million out of 4.3 million). We wanted to understand the other 35% too. So we decided to develop automated techniques based on inference from the Categories provided by Wikipedia for its articles.
The Wikipedia category graph is very rich with around 1 million categories. Each page may have multiple categories assigned to it, and many of them are very granular. For example, the page for the American Revolution has the following categories assigned to it, many of them providing a temporal specialization of a large category
For the point of view creating experiences, we are interested in less granular and more generic categories (eg: History, Politics, Landmarks, Movies, Politicians, Actor). We can use these coarse categories to create semantically aware presentation. Hence, we decided to come with a much less granular Relcy ontology and map each Wikipedia page to a category in that.
Traversing the category graph
One approach we tried was to traverse the category graph from the categories on each page and find out the closest node representing a coarse category. However, we saw lots of noise in this approach because of the way wiki category graph is organized.
Relation between a Category of Page/Category is not guaranteed to be an “Is Instance Of” relationship; it could also contain other relevant “Related To”relationships that can become misleading when it comes to a semantic classification of the entity. This happens because the nature of the relationship is not explicitly captured in the Category Graph.
Example 1 : A Relationship based on an “Is Instance Of”
‘New York’ -> ‘Populated places on the Hudson River’, ‘Cities in New York’ . These are pretty relevant categories when it comes to inferring a less granular category. If we traverse a few levels above, we will reach ‘City’/’Populated Place’ which helps us understand that New York is a city.
Example 2: A Relationship based on “Related To”
New York is a port city on the Atlantic coast and thus the category Port Cities of United States Atlantic Coast. Atlantic Ocean is Two levels above and from a type inferencing point of view, is completely misleading without any extra information of the semantics of the relation to the category (which does not exist).
Rule Based Inferencing
Most of the wikipedia category, especially the leaf level categories, have names that are english phrases, which are granular in their description. Our next approach was to make sense of the linguistic structure of the category names themselves.
The Category names follow certain regular patterns that can guide us to the less granular category.
Take for instance the pattern for the “<label1> IN <label2>” , <label1>. This is very likely going to be a less granular, parent category.
‘Landmarks in San Francisco’. This is a very clear indication that the category is landmark .
We developed rules like this to understand the types of the pages.
- “NUMBER_births” : People
- ‘companies_based_in_PLACE’ : Organization
- ‘organizations_established_in_PLACE’ : Organization
- ‘compositions_by_ARTIST’ : Music
- ‘LABEL1_of_LABEL2’ : Find if LABEL1 can be mapped to a wiki category we are interested in
— Government of United States : ‘Government’ -> Topics
— History of Human Rights : ‘History’ -> Topics
The initial guess determined by the pattern was validated by ensuring the less granular category was indeed reachable via graph traversal from the leaf category. If multiple coarse categories were inferred, we ranked them based on the number of leaf level categories that mapped to the high level one.
To evaluate the precision of results obtained by this approach of type inference, we did an editorial evaluation on a sample of 1000 randomly selected wiki articles. Except for very few outliers (< 10), the pattern matching based approach gave the correct coarse category.
Bag of words
While the precision of the previous approach was very high, there were still a large number of articles which did not have categories which we could pattern-match (0.54 million ie ~13%). For these cases, we decided to use a simple bag of words approach.
We split the categories to n-gram tokens, checked if these tokens are valid categories and looked to see if the tokens are reachable (path length < 8) from wiki article by traversing the category graph. We added these as low precision categories for these wiki pages.
‘New York City’ has a category ‘Former capitals of the United States’. From this we pick the word ‘capitals’, verify if the category capitals exists as well as if it’s reachable from ‘New York City’.
A simple implementation of this with n-gram tokens gave us about 85% precision over the subset of pages that could not be classified using the pattern matching technique and subsequently resulted in understanding an additional 420K Wikipedia articles.
For our serving stack, we use a mix of the aforementioned approaches. If we have Dbpedia categories we utilize them else we look at Pattern Matched categories. If these don’t exist, we look to Bag of Words based categories. From a mix of these approaches, we are able to have 97% coverage in our understanding of Wikipedia categories as represented in the below table.
Stay tuned for further posts from Relcy Engineering!