Mining and Geovisualization of Domain Specific DBPedia Concepts

Nadjet Bouayad-Agha
Published in Analytics Vidhya · Nov 7, 2019

TL;DR

In this article I present some work I did to extract domain-specific concepts from DBPedia and show them on an interactive map of the world.

The domain is “Environmental issues”. You can play with the map here, select a topic, a date, and hover over the dots on the map to view the issues, their categories, locations and short abstracts.

Main ingredients:

  • a spoonful of SPARQL queries on DBPedia and Geonames triplestores,
  • a handful of entity linking with the DBPedia Spotlight API,
  • an ounce of geo-visualization with the Python Bokeh visualization library.

Now for the recipe, please read along…

Motivation

Developing a taxonomy of terms, concepts or topics in your own domain may be key to helping you in your business. For example, you might want to tag the content of your documents with specific topics to aid search and navigation.

You can ask domain experts, or you can leverage documents or user queries in the domain to induce those concepts, using domain-specific ontologies to link the concepts and entities found in the text to your own domain. However, the coverage you can achieve will depend on the amount of text you have and the specifics of your domain.

Alternatively, you can fetch those concepts more systematically from the ontologies available to you.

One such knowledge repository is DBPedia, one of the biggest open knowledge bases available, describing over 4 million things, from IT to celebrities to the environment. Each concept in DBPedia is typically associated with a Wikipedia page. Furthermore, most of these concepts are organized into an ontology, so one can search DBPedia, using the SPARQL query language, for, say, the descendants of a particular concept.
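For illustration, here is a minimal sketch of such a query using the Python SPARQLWrapper library against the public DBPedia endpoint (the endpoint URL and the use of a transitive property path are my assumptions; the queries in this article actually ran against a local Virtuoso triplestore, as described below):

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: fetch categories reachable from "Environmental issues" via one
# or more skos:broader links, i.e. its descendants in the category
# hierarchy. The public endpoint is used here for illustration only.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?cat WHERE {
  ?cat skos:broader+ <http://dbpedia.org/resource/Category:Environmental_issues> .
}
LIMIT 100
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["cat"]["value"])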

In this post I gather specialist terms from DBPedia about environmental events and issues. These are then filtered, categorized and geolocated using Geonames before being displayed on an interactive map.

Dataset

In order to gather the information, you first need to install the necessary DBPedia and Geonames dumps on a local Virtuoso triplestore. This is explained in this short post.

For DBPedia, the dumps timestamped 30 August 2019 were used, more specifically the categories, types and transitive types, labels and broader relations.

Short abstracts and sameAs relations were not available for this dump at the time of this work, so this information was obtained from the latest 2016 dump. For concepts added between 2016 and 2019, short abstracts were obtained by querying the live DBPedia SPARQL endpoint.
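As an example, the short abstract of a concept can be fetched from the live endpoint with a query along these lines (a sketch; it assumes that, as in standard DBPedia loads, the short abstract is stored as rdfs:comment):

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: fetch the English short abstract of a concept from the live
# DBPedia endpoint. Short abstracts are assumed to be stored as
# rdfs:comment (the long abstract would be dbo:abstract).
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Great_Pacific_garbage_patch> rdfs:comment ?abstract .
  FILTER (lang(?abstract) = "en")
}
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"])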

Extracting Environmental Concepts

Environmental issues can be time-specific disasters such as the Fukushima nuclear disaster, or more pervasive long-term situations such as Deforestation in Nigeria or the Great Pacific garbage patch. By following the DBPedia SKOS broader taxonomy relations up from those nodes, I found that these concepts belong to the Environmental issues category.

So can we just iteratively query the concepts subsumed by the Environmental issues category and be done?

A loose, non-transitive hierarchy

Unfortunately, things are not so straightforward, because the broader relation between categories is very loose and non-transitive, as the diagram below illustrates.

Snippet of broader relations between some DBPedia categories (Neo4j)

In this example, the category “Ginger ale” has the broader categories “Ginger” and “Carbonated drinks” (but ginger ale is not a kind of ginger). “Carbonated drinks” in turn has a broader relation with “Soft drinks” and with “Carbon dioxide” (which carbonated drinks contain rather than are a kind of). We can follow the broader hierarchy up from “Carbon dioxide” all the way to “Environmental issues”. However, it does not follow that “Ginger ale” is an environmental issue.

A semi-automatic iterative extraction approach

Extracting all the categories in the hierarchy in a single pass and then manually filtering them is not a solution, since too many categories are obtained (over 1.3 million). More importantly, a category can subsume thousands of other categories, so by removing it we automatically remove the ones underneath it.

So, to obtain relevant environmental issue categories, we proceeded iteratively as follows (a Python sketch of this loop follows the list):

  1. Starting from a set of root categories containing only “Environmental issues”, extract all categories down to a depth of N.
  2. Manually review the new categories and decide which ones to keep and which to discard.
  3. Set the root categories to the categories kept in step 2 and go back to step 1. Stop when no new categories are obtained.
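Here is a minimal Python sketch of that loop, where get_subcategories() (a skos:broader query of bounded depth against the local triplestore) and manual_review() (the human-in-the-loop filter of step 2) are hypothetical helpers standing in for the actual implementation:

# Sketch of the semi-automatic iterative extraction loop.
def extract_categories(get_subcategories, manual_review, depth=2):
    roots = {"Category:Environmental_issues"}
    kept, seen = set(), set()
    while roots:
        # Step 1: extract all categories down to `depth` below the roots.
        candidates = set()
        for root in roots:
            candidates |= get_subcategories(root, depth)
        new = candidates - seen
        if not new:  # stop when no new categories are obtained
            break
        seen |= new
        # Step 2: manually review the new categories.
        accepted = manual_review(new)
        kept |= accepted
        # Step 3: the categories kept in step 2 become the next roots.
        roots = accepted
    return kept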

Using this method with N=2, we gathered 2679 categories in 6 rounds, of which only 1363 were kept. These allowed us to retrieve 27765 environmental issue candidates, i.e., DBPedia concepts whose subject is one of the 1363 kept categories.

The final stage of sifting through the 27k+ candidates is currently manual, although grouping by subject category or by candidate name helped speed up the process (see the future work section at the end of this article on how this could be automated using text mining). The decision to keep or discard a concept was based on its label, categories and short abstract.

We end up with just over 19k concepts. These were organized according to their categories into 28 topics, as shown in the table below. The table shows that 90% of the concepts relate to Biota (animal and plant conservation status).

Note that the count of concepts in the table is higher than the total number of concepts because a concept can be assigned more than one topic. For example, the Fukushima disaster is assigned the topics “nuclear” and “disaster”.

Concept topics

Geolocalization

Now that we have the concepts, their categories and topics, we can extract each concept’s location and geo-coordinates.

Where do we find the concept’s location?

The location information is extracted from the concept’s short abstract. Failing that, the labels of the concept’s categories are used, for example “Endangered flora in Australia”.

The short abstract can consist of multiple sentences, so we iterate over the sentences, starting from the first, and stop as soon as a location is found. The intuition is that the concept’s main location is prominent in the text and so tends to appear first.

For example, for the concept Julie_N._oil_spill, we select only “Portland,_Maine” from the first sentence of the short abstract, and not “South_Portland,_Maine” from the second sentence:

“The Julie N. is a Liberian tanker that was involved in an oil spill occurring on the Fore River on September 27, 1996, in Portland, Maine. The 560 ft (170 meters) ship was carrying over 200,000 barrels (27800 tonnes) of heating oil and was headed towards a docking station in South Portland to unload its contents.”
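A minimal sketch of this first-hit heuristic, assuming hypothetical helpers split_sentences() and find_locations() (the latter wrapping the Spotlight entity-linking step described in the next section):

# Sketch: scan the short abstract sentence by sentence and stop at the
# first sentence that yields a location. Both helpers are hypothetical.
def locate_concept(short_abstract, split_sentences, find_locations):
    for sentence in split_sentences(short_abstract):
        locations = find_locations(sentence)
        if locations:
            return locations  # the main location tends to appear first
    return []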

Finding the Geonames location via Entity Linking

To identify phrases in the text as locations, we perform entity linking with the DBPedia Spotlight API. This tool automatically annotates text with DBPedia concepts. For example, we might query:

curl http://localhost:2222/rest/candidates --data-urlencode "text=The president of United States" --data "confidence=0.7"

This will output the following:

<surfaceForm name="United States" offset="17">
  <resource label="United States"
            uri="United_States"
            contextualScore="0.12995535973599495"
            percentageOfSecondRank="5.591384938510379E-5"
            support="563995"
            priorScore="0.0028691179552717944"
            finalScore="0.9998957005362644"
            types="Wikidata:Q6256, Schema:Place, Schema:Country, DBpedia:PopulatedPlace, DBpedia:Place, DBpedia:Location, DBpedia:Country" />
</surfaceForm>

Since we’ve set the confidence parameter to 0.7, we only get the named entities whose final score is above that threshold.

The annotation indicates that the surface form “United States” at offset 17 corresponds to the DBPedia concept “United_States”.
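The same call can be made from Python, for example with the requests library against the JSON-producing /rest/annotate endpoint of the same local Spotlight instance (a sketch; field names follow the standard Spotlight REST response):

import requests

# Sketch: annotate a text with the local Spotlight instance and keep
# only annotations typed as places.
resp = requests.post(
    "http://localhost:2222/rest/annotate",
    data={"text": "The president of United States", "confidence": 0.7},
    headers={"Accept": "application/json"},
)
for res in resp.json().get("Resources", []):
    if "Schema:Place" in res.get("@types", ""):
        print(res["@surfaceForm"], res["@URI"])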

For every named entity concept, we then look up its Geonames identifier via a sameAs relation.

With the Geonames identifier we can obtain the concept’s geo-coordinates and country code using the Geonames SPARQL endpoint.
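Sketched as a single SPARQL query that combines the two lookups over the local triplestore (the endpoint URL and the filter on Geonames URIs are my assumptions):

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: find the Geonames identifier of a DBPedia concept via
# owl:sameAs, then fetch its coordinates and country code from the
# Geonames data loaded in the same local Virtuoso instance.
endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gn:  <http://www.geonames.org/ontology#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?geoname ?lat ?long ?country WHERE {
  <http://dbpedia.org/resource/Portland,_Maine> owl:sameAs ?geoname .
  FILTER (STRSTARTS(STR(?geoname), "http://sws.geonames.org/"))
  ?geoname geo:lat ?lat ; geo:long ?long ; gn:countryCode ?country .
}
""")
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["lat"]["value"], row["long"]["value"], row["country"]["value"])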

Selecting the most specific Geonames location(s)

We might get more than one set of geo-coordinates for a given concept. We then need to eliminate the locations that subsume more specific ones: for example, if we have geo-coordinates for both “San Diego” and “California”, we remove the coordinates for “California” to obtain a more precise geovisualization.

This is done using a SPARQL query that checks whether there is a Geonames parent relation between the two identifiers. If so, the parent identifier is eliminated.
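A sketch of that check as a SPARQL ASK query over the Geonames gn:parentFeature relation (the endpoint URL, the transitive property path and the example identifiers are assumptions):

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: is one Geonames feature an ancestor of another? If so, the
# ancestor's coordinates are eliminated. The identifiers below are
# illustrative (San Diego under California).
endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX gn: <http://www.geonames.org/ontology#>
ASK {
  <http://sws.geonames.org/5391811/> gn:parentFeature+ <http://sws.geonames.org/5332921/> .
}
""")
if endpoint.query().convert()["boolean"]:
    print("parent relation found: eliminate the parent identifier")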

Date stamping

We use Spacy’s NER to detect time and date expressions in the short abstracts. As the number of detected dates is relatively small, we manually clean them to specify the start and end date of each concept, focusing mostly on dated events such as disasters or nuclear tests. For example, the concept “1949–51 Soviet nuclear tests” has start date 1949 and end date 1951. We end up with a few hundred time-stamped concepts (400+).
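The detection step is a standard spaCy NER pass over the abstracts; a minimal sketch (the model name is an assumption):

import spacy

# Sketch: detect DATE/TIME expressions in a short abstract with spaCy's
# named entity recognizer.
nlp = spacy.load("en_core_web_sm")
abstract = ("The Julie N. is a Liberian tanker that was involved in an oil "
            "spill occurring on the Fore River on September 27, 1996, in "
            "Portland, Maine.")
dates = [ent.text for ent in nlp(abstract).ents
         if ent.label_ in ("DATE", "TIME")]
print(dates)  # e.g. ['September 27, 1996']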

Interactive Visualization

We present the topics in a dropdown list. The user can also move the date slider to display only the issues in the specified range. Upon submitting the selection, all issues related to the chosen topic are displayed on the map. The user can hover over the dots to read each issue, its category and location and, if interested, its abstract too. A link to the Wikipedia page is provided at the end of the abstract.

The visualization was implemented using the Bokeh Python visualization library and deployed as a web application using Flask.
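A minimal sketch of the map plot in Bokeh (the tile provider, the column names and the pre-projected Web Mercator coordinates are assumptions; the actual app also wires in the topic dropdown and the date slider):

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.tile_providers import CARTODBPOSITRON  # Bokeh 1.x-era API

# Sketch: plot concepts as dots on a web map, with a hover tooltip
# showing each issue, its category and its location. The coordinates
# are illustrative Web Mercator values.
source = ColumnDataSource(data=dict(
    x=[-7821000.0], y=[5412000.0],
    name=["Julie N. oil spill"],
    category=["Oil spills in the United States"],
    location=["Portland, Maine"],
))

p = figure(x_axis_type="mercator", y_axis_type="mercator",
           title="Environmental issues")
p.add_tile(CARTODBPOSITRON)
p.circle(x="x", y="y", size=8, alpha=0.7, source=source)
p.add_tools(HoverTool(tooltips=[("Issue", "@name"),
                                ("Category", "@category"),
                                ("Location", "@location")]))
show(p)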

A short video of the interactive visualization

Future work

This is an initial proof-of-concept application using concept mining from DBPedia and Geonames in the environmental domain.

As future work, I would like to do the following:

  1. Mine Wikipedia articles for specific textual content related to environmental issues.
  2. Improve the visualization experience.
  3. Automate integration of new concepts.
  4. Apply the approach to other domains.

Mining Wikipedia Articles for Domain-specific content

Finding sections, paragraphs or sentences about the domain in the Wikipedia articles would have three main advantages:

  1. Finding concepts that are not in the domain hierarchy but are still relevant to the topic. For example, we might find articles talking about businesses, towns or rivers affected by environmental issues.
  2. Automating the decision about whether a concept is about the domain. The current approach involves quite a bit of manual curation, checking each article. We could develop a set of classifiers to determine whether a section, paragraph or sentence in a Wikipedia article is about the topic of interest.
  3. Associating the concepts with more fine-grained content. Currently the user can grasp the relevance of a concept to the domain via its label, categories and short abstract. By mining snippets of text that are specific to the domain, we can give the user more relevant information.

Improving the visualization experience

Much can be done to improve visualization:

  • We can use different dot sizes: the more concepts at a location, or the more Wiki page links from a concept to other environmental concepts, the bigger the dot.
  • We can color dots according to categories (although currently there is the problem that a single dot can represent multiple concepts).
  • We can associate concepts with the pictures available on their Wikipedia pages, to make the visualization more interesting.
  • We can exploit Wikipedia/DBPedia multilinguality to provide the page and concepts in different languages.
  • We can allow search and zoom in per location.
  • We can show connections between concepts using Wikilinks.

Continuous automatic updating

An important piece of functionality would be to automate the integration of new concepts. For example, we might want to incorporate monthly updates, or include topics found by mining Wikipedia articles (the first point of this future work). Deciding whether a new concept is an environmental issue, and what kind of issue it is, could be done automatically by training a supervised model on the already categorized concepts, for example using graph and text embeddings.

Develop other domain taxonomies and visualizations from DBPedia

Using a similar approach we could develop taxonomies and visualizations for other domains from DBPedia and other Linked Open Data, such as Civil Rights or Diseases. Anyone?
