Mining and Visualizing DBPedia Environmental Issues

Nadjet Bouayad-Agha
Nov 7

Domain Specific Concept Mining from Linked Open Data

TL;DR

In this article we present our approach to mining DBPedia for domain-specific concepts. The end result is a map, available on GitHub Pages here, where the user can select a topic and a date, then hover over the dots on the map to view the issues, their categories, locations and short abstracts.

Ingredients:

  • a spoonful of SPARQL queries,
  • a sprinkle of Neo4J representation,
  • some slices of graph embeddings,
  • a chunk of clustering,
  • a handful of entity linking,
  • a sprinkle of Named Entity Recognition,
  • an ounce of geo-visualization.

Now for the recipe, please read along…

Motivation

Developing a taxonomy of terms, concepts or topics in your own domain may be key to helping you in your business. For example, you might want to tag the content of your documents with specific topics to aid search and navigation.

You can ask domain experts, or you can induce those concepts from documents or user queries in the domain, using domain-specific ontologies to link the concepts and entities found in the text to your own domain. However, the coverage you can achieve will depend on the amount of text you have and the specifics of your domain.

Alternatively, you can fetch those concepts more systematically from the ontologies you have available.

One such knowledge repository is DBPedia, one of the biggest open knowledge bases available, describing over 4 million things, from IT and celebrities to the environment. Each concept in DBPedia is typically associated with a Wikipedia page. Furthermore, most of these concepts are organized into an ontology, so one can search DBPedia, using the SPARQL query language, for, say, the descendants of a particular concept.

In this post I gather specialist terms regarding environmental events and issues from DBPedia. These are then filtered, categorized and geolocated using Geonames before being displayed on an interactive map.

Extracting the Candidate Concepts

Environmental issues can be time-specific disasters, such as the Fukushima Nuclear Disaster, or more pervasive long-term situations, such as Deforestation in Nigeria or the Great Pacific Garbage Patch. By going up from those nodes along the DBPedia SKOS broader taxonomy relations, I found that these concepts fall under either the Environmental Disasters or the Environmental Issues category. So I recursively extracted all concepts associated with these two categories using SPARQL queries against the 2016 DBPedia dump. I ended up with 84613 concepts and 6530 categories.
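The recursive extraction can be sketched as a repeated one-step SPARQL query per category. The sketch below is ours (the helper name and the exact query shape are assumptions), but the SKOS and Dublin Core predicates are the ones DBPedia actually uses for sub-categories and category membership:

```python
# Sketch: one step of the recursive category crawl. skos:broader links a
# sub-category to its parent; dct:subject links a concept to its category.
PREFIXES = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
"""

def narrower_query(category_uri: str) -> str:
    """Return a SPARQL query for the direct children of one category:
    its sub-categories and its member concepts."""
    return PREFIXES + f"""
SELECT DISTINCT ?child ?kind WHERE {{
  {{ ?child skos:broader <{category_uri}> . BIND("category" AS ?kind) }}
  UNION
  {{ ?child dct:subject <{category_uri}> .  BIND("concept"  AS ?kind) }}
}}
"""

q = narrower_query("http://dbpedia.org/resource/Category:Environmental_issues")
```

Such queries can be run against a public endpoint (or a local load of the dump) with SPARQLWrapper or plain HTTP; the crawl then recurses breadth-first over the returned categories, keeping a visited set because the category graph contains cycles.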

Looking at the resulting categories quickly reveals that irrelevant categories and concepts are included. So how do we sift through this information?

Separating the Wheat from the Chaff

We can look at the groupings of the categories in the DBPedia ontology. But which groupings do we look at? At what level of granularity?

What we can do is map the graph properties of each category to a low-dimensional representation such as a graph embedding, and then perform automatic clustering on those embeddings.

Embeddings have been used successfully in Natural Language Processing to map words or sentences to vectors of real numbers. Machine learning algorithms such as clustering or classification can then be applied to these low-dimensional representations.

Embeddings have also been used successfully to convert graph properties to low-dimensional representations. We will use node2vec to convert the properties of the category nodes into embeddings, then feed those to a standard clustering algorithm.

Node2vec takes as input a list of graph edges with integer ids, so before using it we must obtain this list. One way to do so is to upload the candidate categories to a graph database such as Neo4J and then query the resulting graph to retrieve the ids of the related category pairs.
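The id-assignment step can be sketched in a few lines. The category pairs below are illustrative; in the pipeline they would come from the Neo4J query over the uploaded SKOS broader relations:

```python
# Sketch: map category URIs to integer ids and emit the integer edge list
# that node2vec expects as input.
def build_edge_list(pairs):
    """pairs: iterable of (narrower_uri, broader_uri) category pairs."""
    ids = {}

    def idx(uri):
        # Assign the next free integer id on first sight of a URI.
        return ids.setdefault(uri, len(ids))

    edges = [(idx(a), idx(b)) for a, b in pairs]
    return ids, edges

pairs = [
    ("Category:Oil_spills", "Category:Environmental_disasters"),
    ("Category:Deforestation", "Category:Environmental_issues"),
    ("Category:Environmental_disasters", "Category:Environmental_issues"),
]
ids, edges = build_edge_list(pairs)
# edges can now be written out one "src dst" pair per line for node2vec.
```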

Once we obtain these embeddings for each of the 6530 categories, we apply hierarchical clustering on those categories. We choose hierarchical clustering given the hierarchical relation between the categories.

Dendrogram resulting from clustering categories using graph embeddings

The figure above shows the resulting dendrogram with two cutoff lines. We manually select multiple cut-off points in the dendrogram and perform corresponding clustering.
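The clustering step itself is standard. In the toy sketch below, a small synthetic matrix stands in for the 6530 node2vec category embeddings; Ward linkage builds the dendrogram and a maxclust cut corresponds to one of the manual cut-off lines:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated synthetic groups of 8-dimensional "embeddings",
# standing in for the real node2vec vectors of the 6530 categories.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 8)),   # one tight group of categories
    rng.normal(3.0, 0.1, size=(5, 8)),   # a second, well-separated group
])

Z = linkage(embeddings, method="ward")            # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```

In the real pipeline the same linkage matrix would be cut at several values of k, and the resulting groupings compared side by side.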

We display the multi-level clusters for each category in a table for manual review, an excerpt of which is shown below.

Categories with their cluster number at different cut-off points (k)

We end up with about 500 categories, spanning over 17k unique concepts, 80% of which relate to Biota (animal and plant conservation status). The rest comprises locations, people, materials, chemical compounds and more directly relevant environmental topics such as electronic waste or oil platform disasters.

Finding topics

The 500 categories thus obtained were organized into more general topics, e.g., pollution, waste, nuclear, biota, etc, by first removing all terms in the category label whose part of speech is not a common noun. So “Endangered plants of Mexico” becomes “plants”.
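As a rough illustration of this label-to-topic reduction, the crude heuristic below (a stand-in for the actual part-of-speech tagging, not what the pipeline used) already recovers the example above by dropping function words and capitalised tokens:

```python
# Crude stand-in for POS-based filtering: drop function words and capitalised
# tokens (treated as proper nouns or leading adjectives). A real POS tagger
# is needed to keep common nouns in first position, e.g. "Pollution in Asia".
FUNCTION_WORDS = {"of", "in", "by", "the", "and", "with", "for", "on", "to"}

def topic_tokens(label: str) -> list:
    kept = []
    for token in label.split():
        if token.lower() in FUNCTION_WORDS:
            continue
        if token[0].isupper():
            continue
        kept.append(token.lower())
    return kept
```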

After manually editing this list, we end up with 37 topics, of which we show the top 12 in the table below. For example, the concept “Hypermobility” is categorized under “Air pollution” and “Environmental issues with population”, so we end up with four topics for that concept: air pollution, (environmental) issue, pollution and population.

Top 12 topics with number of concepts assigned

Geolocalization using Entity Linking

Now that we have the concepts, their categories and topics, we can extract the concepts’ geolocations from their short abstracts in the DBPedia repository.

To do so, we first use the DBPedia Spotlight API to find all the concepts in the text, then we look for concepts with geo-locations using a SPARQL query, either by querying DBPedia directly, or by finding an equivalent Geonames concept and querying the Geonames API.

We iterate over each sentence, starting from the first one, and stop once a location is found. So for example, for concept Julie_N._oil_spill, we select only “Portland,_Maine” in the first sentence of the short abstract and not “South_Portland,_Maine” in the second sentence:

“The Julie N. is a Liberian tanker that was involved in an oil spill occurring on the Fore River on September 27, 1996, in Portland, Maine. The 560 ft (170 meters) ship was carrying over 200,000 barrels (27800 tonnes) of heating oil and was headed towards a docking station in South Portland to unload its contents.”

With this approach, +15k concepts out of the total of 17k were geolocalized.
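The first-location strategy boils down to a short scan. In the sketch below, the per-sentence entity lists mimic what DBPedia Spotlight returns for the Julie N. abstract, and the `GEOLOCATED` set is a toy stand-in for the SPARQL/Geonames coordinate lookup:

```python
# Toy stand-in for the coordinate lookup: in the pipeline this would be a
# SPARQL query against DBPedia or a Geonames API call per entity.
GEOLOCATED = {"Portland,_Maine", "South_Portland,_Maine"}

def first_location(sentence_entities, has_coords):
    """Scan sentences in order and return the first geolocated entity."""
    for entities in sentence_entities:
        for entity in entities:
            if has_coords(entity):
                return entity
    return None

sentences = [
    ["Julie_N.", "Fore_River", "Portland,_Maine"],   # sentence 1
    ["South_Portland,_Maine"],                       # sentence 2, never reached
]
loc = first_location(sentences, GEOLOCATED.__contains__)
```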

Date detection with Named Entity Recognition

We use Spacy to detect time and date expressions in the short abstracts. As the number of detected dates is relatively small, we perform manual cleaning and end up with +1.3k concepts with date information, expressed as date start and end ranges. For example, the concept “List of nuclear weapons tests of the United Kingdom” has start date 3 October and end date November 1991.
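For intuition, a lightweight regex (a stand-in for the spaCy DATE entities used in the pipeline, not the actual implementation) is enough to pull simple start/end mentions out of an abstract:

```python
import re

# Stand-in for spaCy's DATE entities: match "<day> <Month> <year>" style
# expressions, with day and year optional ("3 October 1952", "November 1991").
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(rf"\b(?:\d{{1,2}} )?(?:{MONTHS})(?: \d{{4}})?\b")

text = ("Nuclear weapons tests were conducted from 3 October 1952 "
        "until November 1991.")
dates = DATE_RE.findall(text)
```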

Interactive Visualization

We present the topics in a dropdown list. The user can also move the date slider to display only issues in the specified range. Upon submitting the selection, all issues related to the chosen topic are displayed on the map. The user can hover over the dots to read each issue, its category and location and, if interested, its abstract too. For those further interested, a link to the Wikipedia page is provided at the end of the abstract.

The visualization was implemented using the Bokeh Python visualization library and deployed as a web application using Flask.

Future work

This is an initial proof-of-concept application using concept mining from DBPedia and Geonames in the environmental domain. There are lots of things that can be done to improve the visualization and quality of the provided information, to bring about more automation and up-to-date knowledge, to expand the current domain and to expand to new domains, and to exploit the gathered information for text mining.

  • The concepts were selected from clusters obtained using automatic graph-based clustering. Some irrelevant concepts still remain, and the relation of some concepts (e.g., a person or an organization) to specific environmental issues is not straightforward.
  • The linking of locations in the short abstract using DBPedia Spotlight is not always flawless (there are some false positives).
  • The geo-localization of concepts picks the first location available for the concept, which might not be the most suitable one (e.g., it might be “Africa” when the more specific geolocation is “Kenya”).

With respect to the environmental domain, we can enrich the information by mining the abstracts for associated organizations and laws and exploring the relations between them. Animal and plant biota do not have dates associated with them; we could get dates (e.g., of an animal being marked as endangered or vulnerable) from another information source.

We can also expand our set of concepts by looking at concepts connected to our selected environmental concepts via an rdfs:seeAlso relation, or via links from the concept’s Wikipedia page. With the former, we gathered about 600 extra concepts, some of which are relevant, such as extreme weather, wildfire and water crisis. With the latter, we extracted +171k concepts, not all of which are relevant. However, there are interesting concepts, such as lakes or regions with environmental issues, endangered animals, people involved in disasters, etc.
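The seeAlso expansion can be sketched as a batched one-hop query (the seed URIs below are illustrative members of our concept set; note that the property lives in the rdfs namespace):

```python
# Sketch: one-hop expansion via rdfs:seeAlso from a batch of seed concepts.
SEEDS = [
    "http://dbpedia.org/resource/Great_Pacific_garbage_patch",
    "http://dbpedia.org/resource/Deforestation_in_Nigeria",
]

def see_also_query(seed_uris):
    """Build one SPARQL query returning everything the seeds point at."""
    values = " ".join(f"<{u}>" for u in seed_uris)
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?new WHERE {{
  VALUES ?seed {{ {values} }}
  ?seed rdfs:seeAlso ?new .
}}
"""

q = see_also_query(SEEDS)
```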

Much can be done to improve visualization:

  • We can use different-sized dots: the more concepts per location, or the more Wiki page links from that concept to other environmental concepts, the bigger the dot.
  • We can color dots according to categories (although there is a problem that a single dot can represent multiple concepts)
  • We can associate concepts with pictures available on their Wikipedia pages, to make the visualization more fun.
  • We can exploit Wikipedia/DBPedia multilinguality to provide the page and concepts in different languages.
  • We can allow search and zoom in per location.
  • We can show connections between concepts.

The concepts only go up to 2016. We could incorporate updates from DBPedia Live. Deciding whether a new concept is an environmental issue, and what kind of issue it is, could be done automatically by training a supervised model on the already categorized concepts.

Given the taxonomy of concepts thus obtained, we could mine texts such as news articles for domain specific concepts.

Using a similar approach we could develop taxonomies and visualizations for other domains from DBPedia and other Linked Open Data, such as Civil Rights or Diseases. Anyone?
