Exploring Wikidata for Natural Language Processing

Petar Todorov
U Change
Jul 28, 2017

Part I: Getting and visualising hierarchies.

I. Introduction

Data, data everywhere! In the company I work for* (called simply U), we want to know how other companies adopt new technologies, and which ones. To do this, we sit under a huge flow of unstructured data: texts that contain a lot of information we consider useful, but that is not immediately usable by a human when it arrives in such quantities. This is why we need to extract information from texts in a form that is useful for further algorithmic treatment, and Natural Language Processing provides the necessary framework to approach this problem.

Models like word2vec and GloVe learn embeddings of words based on their co-occurrence information and give extraordinary results. But frequently, we don’t simply need to know about the co-occurrence of words, but also about broader subjects. While exploring options to achieve this objective, a thought struck me: why not use the vast repository of knowledge that is easily accessible in Wikidata? After all, once we “know” what a given word “means”, we can expand its meaning by adding relevant properties from the Wikidata knowledge base. Indeed, Wikidata provides not only the literal expression of the word, but also all the information associated with the corresponding entity and its links to other entities.

Based on this approach, I developed a two-pronged strategy to augment my texts, using the Wikidata knowledge base, with information not literally present in them:

  1. Extract all the items in the Wikidata hierarchy (arborescence) under a given root. (In two example runs, I used two roots, “Computer Science” and “Software”; visual renderings of the results can be seen here and here respectively.)
  2. Create a table with all the properties I find relevant. (The results for “Computer Science” and “Software” can be seen here and here.)

NOTE: The code used in the Jupyter notebook below and the class behind it are released on GitHub under the MIT license. Please feel free to experiment.

Ever since the birth of the Wikidata project, I have been an ardent follower of its evolution. I have always been fascinated by its ethos, which is to create an abstract representation of all the knowledge of the Wikimedia Foundation (all language versions of Wikipedia, Wikimedia, and the forthcoming Wiktionary) in one single database, and to place this colossal store of knowledge in the public domain. Wikidata also allows you to access the database via its SPARQL endpoint or, if you wish, to download the whole (>125 GB) dump as a JSON file and process it locally on your computer. The latter approach is very expensive in computational power for no added value.
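If you want to try the SPARQL endpoint yourself, here is a minimal, illustrative example (my own sketch, not code from the repository) that lists the direct subclasses of “software” (Q7397) using the SPARQLWrapper package; the choice of property P279 (“subclass of”) is mine for illustration:

```python
# Minimal example of querying the public Wikidata SPARQL endpoint.
# Assumes the SPARQLWrapper package is installed (pip install sparqlwrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279 wd:Q7397 .                      # direct subclasses of "software"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], "-", row["itemLabel"]["value"])
```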

The size of this knowledge base has also been an ingredient of fascination. Unlike DBpedia, which has long been the reference for structured knowledge and which, as of July 2017, has ~4.5 million entries, Wikidata has a much larger scope, currently totaling ~26 million entries.

Wikidata is thus a knowledge base whose model can be briefly described the following way: the base is composed of entities, which have a “fingerprint” (a label in multiple languages, descriptions in multiple languages, and aliases) and statements. The base has a graph structure, meaning that every entity (identifier starting with Q) is connected to other entities through “properties” (identifier starting with P). A primer on the Wikidata data model can be found here.

To illustrate this model, here is a sample subgraph connecting “software” and “Ubuntu,” and showing some properties of “Ubuntu”:

In summary:

  1. All entities have a unique ID: “software” is Q7397.
  2. Entities are linked to other entities by properties, which are also identified in a unique way: instance of is property P31. Thus, “Ubuntu (Q381) is an instance of operating system (Q9135)” means that the item {"id": "Q381",…} will have a claim P31 linking it to Q9135.
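To see what such a claim looks like in the raw JSON, here is a small illustrative snippet (again my own, not part of the repository) that fetches Ubuntu’s entity data through the wbgetentities API and prints the targets of its P31 claims:

```python
# Fetch the claims of Ubuntu (Q381) and print the targets of its
# "instance of" (P31) statements, e.g. Q9135 (operating system).
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q381",
            "props": "claims", "format": "json"},
)
claims = resp.json()["entities"]["Q381"]["claims"]
for statement in claims.get("P31", []):
    snak = statement["mainsnak"]
    if snak.get("datavalue"):                      # skip "novalue"/"somevalue" snaks
        print(snak["datavalue"]["value"]["id"])
```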

II. Getting the information

As I said, my goal is to get all the software technologies that exist in Wikidata. To do this, I start from a given node (e.g., “Q7397”: “software”) and recursively explore all its subnodes. Note that there is no limit on the depth of the recursion other than Python’s native recursion limit. Since the graph is not necessarily a tree (a tree is a graph with no cycles), the recursion could fall into an infinite loop; by convention, I chose to stop exploring a node if it has already been visited on the current path. The particular path the recursive function takes to reach a final node can also be of interest: for instance, it can be useful to know that one path from ‘software’ to ‘C++’ passes through the node ‘object-oriented programming language’. This is useful if one wants to count all mentions of _any_ OOPL in a document, for instance.

Also, depending on the application, we might want to prevent the recursive function from exploring nodes that are out of scope for the given application. This can be achieved with the forbidden parameter, which takes a list of nodes to ignore. (In our example, we don’t explore Q28923 = chart or Q7889 = video game, since those nodes are out of our scope.)
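To make the two previous paragraphs concrete, here is a minimal sketch of such a traversal, with the per-path cycle guard and a forbidden set. It is my own simplification: it only follows “subclass of” (P279) and does not reproduce the actual API of the WikidataTreeBuilderSPARQL class.

```python
# Sketch of a depth-first exploration of the Wikidata hierarchy.
# Simplified for illustration: only "subclass of" (P279) links are followed.
import requests

WDQS = "https://query.wikidata.org/sparql"

def children(qid):
    """Return the Q-ids of the direct subclasses (P279) of `qid`."""
    query = "SELECT ?child WHERE { ?child wdt:P279 wd:%s . }" % qid
    r = requests.get(WDQS, params={"query": query, "format": "json"})
    r.raise_for_status()
    return [b["child"]["value"].rsplit("/", 1)[-1]
            for b in r.json()["results"]["bindings"]]

def explore(qid, path=(), forbidden=frozenset(), tree=None):
    """Recursively visit the subnodes of `qid`, recording the path used to
    reach each node; skip nodes already on the current path (cycle guard)
    and nodes listed in `forbidden`."""
    if tree is None:
        tree = {}
    tree[qid] = list(path)
    for child in children(qid):
        if child in path or child in forbidden:
            continue
        explore(child, path + (qid,), forbidden, tree)
    return tree

# Example: explore "software" (Q7397), ignoring chart (Q28923) and video game (Q7889).
# software_tree = explore("Q7397", forbidden={"Q28923", "Q7889"})
```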

After cloning the WikidataTreeBuilderSPARQL repository from GitHub, doing all this in Jupyter is straightforward. In the same repository as the classes, you can find a sample notebook exploring the ‘Software’ node, as well as a notebook exploring the ‘Computer Science’ node.

A preview of the `Software` node exploration notebook

III. Conclusion

Let’s conclude with an example and take a look at the information we extracted for “scikit-learn” from the file ComputerScienceTable.xlsx.

In row 18, we discover that to reach ‘scikit-learn’ we passed through the nodes Computer Science and machine learning; that its French description is “librairie Python d’apprentissage statistique” (a Python library for statistical learning); that its aliases are scikits.learn, sklearn, and scikit; that its latest version, 0.18.1, was released on 15 November 2016; that it is an instance of library, Python library, and machine learning; that it is written in Python, Cython, C, and C++; that it is released under the BSD license; and that you can find it on GitHub at https://github.com/scikit-learn/scikit-learn. All of this is meaningful information about scikit-learn that you might use to augment your textual data. This approach could also have potential applications in active learning, where it can help increase the size of the corpus by detecting new examples for the training set. Don’t hesitate to play with the files and notebooks provided with this post, and to comment if you find it useful.
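If you prefer to pull such properties live rather than from the spreadsheet, a query along the following lines (my own illustration; matching by label may return several items) dumps all the direct statements of the item labelled “scikit-learn”:

```python
# Print every direct ("wdt:") statement of the item labelled "scikit-learn".
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?propLabel ?valueLabel WHERE {
  ?item rdfs:label "scikit-learn"@en .
  ?item ?p ?value .
  ?prop wikibase:directClaim ?p .                 # keep only direct statements
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["propLabel"]["value"], "->", row["valueLabel"]["value"])
```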

In a sequel, I’ll provide a use case for this approach. Stay tuned!

Acknowledgements: For help with this publication, and the work behind it, I would like to thank Kary Bheemaiah, Ari Bajo Rouvinen, and Guanguan Zhang.

*Disclaimer: The thoughts written here are my own personal opinions and do not represent the views of my employer, U.

Find me on LinkedIn: https://www.linkedin.com/in/petar-todorov-ph-d-13326949/



Trained as an astrophysicist, working on AI/Natural Language Processing, Wikipedian, LGBT activist. I like python.