This post will walk you through collecting structured Wikipedia data via a couple of nice APIs produced by the Wikimedia group. I have used these APIs for projects in my work as a data scientist at Bertelsmann, and they have attracted interest throughout the organization. I wrote an internal version of this article for my colleagues, but since this data is free and open for anyone to use, I figured I would also write a public version so that anyone can access it too!
So, here goes.
What is this data source I speak of?
The first source I will walk you through is the Wikidata SPARQL API.
SPARQL is a lot like SQL, except evil. But don’t worry, you really don’t need to know a whole lot about SPARQL to use this service, thanks to the Wikidata SPARQL query interface, found here: https://query.wikidata.org/.
The easiest way to get started with this service is to click the query assist button (see gif).
That will open up an interface for you to search for the topic you are mainly interested in, using the “Filter” tab. Here you can search for anything (books, people, dogs, cats of the internet, trees, clouds, monkeys, etc.), and the service will build a query for you and print the first 100 results in the bottom half of the page. You can adjust this limit to your liking, but keep in mind that sometimes the query might be too big and the service will time out. :’(
The Filter tab is for your overarching topic. The tab just below it, the “Show” tab, is where you can add some nuances.
So, if my Filter query was for TV Programs, my Show query might be for something like Genre or Language. You can add as many of these as are relevant to the information you want to pull.
You can also add more specific arguments to your Filter query to narrow the search. For example, if I search for politicians and female, this will return all female politicians that exist in Wikidata. Then I might be interested in the birthplaces of these female politicians, so in the Show tab I type “Birth Place”, which updates the query accordingly.
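If you would rather run that kind of query from code than from the web page, here is a minimal sketch in Python. The query is my hand-written approximation of the female politicians example, not necessarily what the query assistant generates. The endpoint https://query.wikidata.org/sparql is the one behind the web interface, and the function names and the placeholder User-Agent string are my own assumptions for illustration (Wikimedia asks clients to send a descriptive User-Agent, so swap in your own contact details).

```python
import json
import urllib.parse
import urllib.request

# SPARQL endpoint behind the https://query.wikidata.org/ web interface.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Female politicians and their birthplaces.
# P106 = occupation, Q82955 = politician, P21 = sex or gender,
# Q6581072 = female, P19 = place of birth.
QUERY = """
SELECT ?person ?personLabel ?birthplaceLabel WHERE {
  ?person wdt:P106 wd:Q82955 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P19 ?birthplace .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""


def run_sparql(query, user_agent="wikidata-example/0.1 (you@example.com)"):
    """Send a query to the endpoint and return the decoded JSON response.

    The default user_agent is a placeholder; replace it with your own.
    """
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def flatten_bindings(result):
    """Turn a SPARQL JSON result into a list of plain {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in result["results"]["bindings"]
    ]
```

Calling flatten_bindings(run_sparql(QUERY)) should give you one dictionary per row, with keys such as personLabel and birthplaceLabel. Nothing hits the network until you call run_sparql yourself.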
There are a ton more example queries that you can explore via the Examples tab in the SPARQL interface. I got started in my own work by adapting one such query, so I recommend using that to learn. Wikidata has also published a lot of documentation and tutorials. This one is pretty good:
** Note: This API Is FREE/OPEN SOURCE **
SPARQL is the first step: it gets me the relevant information for my topics of interest. But if my goal is to understand which Wikipedia pages under my topic are getting a lot of views, I need to pull that information from a separate API.
Welcome to the Wikimedia REST API!
This API is pretty simple, but powerful. It allows you to retrieve page view data (as well as other kinds of data) about the Wikipedia pages you put into it. It’s very easy to use.
The API takes 7 arguments:
Project: This is the specific Wikipedia project you want to pull from. For example, if I want the page views to come from English Wikipedia, I would specify ‘en.wikipedia.org’ as the project. This can be adapted for any language.
Access: Lets you differentiate between the platforms the views came from: desktop, mobile-app, mobile-web, or all-access for everything combined.
Agent: Specify whether you want human users only (user) or bots and crawlers as well (all-agents).
Article: The name of the page you want to look up. For example, if I am searching for TV shows and I want to pull views for the page of the show The Masked Singer, I would need to pass that name into the API like this: The_Masked_Singer. Every space between words needs to be replaced with a “_”.
Granularity: Can be either daily or monthly.
Start: The earliest date for which you want your views. Note that the earliest year you can specify is 2015. The date must follow the format YYYYMMDD or YYYYMMDDHH.
End: The latest date for which you want your views, in the same format.
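To make the seven arguments concrete, here is a small sketch of how they can be joined into a request URL in Python. The function name build_pageviews_url and the BASE_URL constant are my own names for illustration, not part of the API. The only real logic is replacing spaces with underscores and percent-encoding anything unusual in the page title.

```python
from urllib.parse import quote

# Prefix shared by every per-article pageviews request.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(project, access, agent, article, granularity, start, end):
    """Assemble the per-article pageviews URL from the seven arguments."""
    # Spaces in a page title become underscores; other special characters
    # (slashes, question marks, accents) are percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE_URL, project, access, agent, title, granularity, start, end])


url = build_pageviews_url(
    project="en.wikipedia.org",
    access="all-access",
    agent="all-agents",
    article="The Masked Singer",
    granularity="daily",
    start="20200601",
    end="20200602",
)
print(url)
```

Passing the title with natural spaces is just a convenience of this sketch; a title that already contains underscores goes through unchanged.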
Here is an example of what your API URL will look like once all the above features are filled in:
You can copy/paste this in your terminal/ command line to see the output:
curl -X GET "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/The_Masked_Singer/daily/20200601/20200602" -H "accept: application/json"
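The same request can be made from Python. This is a minimal sketch using only the standard library. The response is a JSON object with an "items" list, and each item carries a "timestamp" and a "views" count among other fields. The function names and the placeholder User-Agent string are assumptions of mine, so replace the latter with something that identifies you.

```python
import json
import urllib.request

URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia.org/all-access/all-agents/The_Masked_Singer/daily/20200601/20200602"
)


def fetch_pageviews(url, user_agent="pageviews-example/0.1 (you@example.com)"):
    """Request the URL and return the decoded JSON payload.

    The default user_agent is a placeholder; replace it with your own.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def views_by_timestamp(payload):
    """Map each timestamp (YYYYMMDDHH) in the payload to its view count."""
    return {item["timestamp"]: item["views"] for item in payload["items"]}
```

Running views_by_timestamp(fetch_pageviews(URL)) should return one entry per day in the requested range, keyed by timestamps such as 2020060100.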
** Note: This API Is FREE/OPEN SOURCE **
Ok, Cool! But How Do They Work Together Programmatically?
To show you how they work together, I made a repo on my GitHub that you can go to and try for yourself! It includes functions for retrieving all the info for your query. The groundwork you must do is explore the Wikidata SPARQL interface to come up with a query. In the repo you will see where to copy/paste that query so the code can get you the results.
Here is the link to my GitHub repo where this code and walkthrough live: https://github.com/AdamKirstein/Wikipedia_data_collection_example
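To give a feel for the glue between the two APIs, here is a rough sketch. It is not the code from the repo, just one way the pieces could fit, and every function name in it is my own. It assumes your SPARQL query also asks for each item's English Wikipedia page through the schema:about sitelink pattern, so that each row has an "article" URL, and it takes a hypothetical get_views callable that wraps whatever pageviews request you use and returns a list of view counts for a title.

```python
from urllib.parse import unquote, urlsplit

# Assumed query shape: the last two lines ask for the English Wikipedia
# page linked to each item. P106/Q82955 = occupation: politician.
QUERY = """
SELECT ?item ?article WHERE {
  ?item wdt:P106 wd:Q82955 .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 100
"""


def title_from_sitelink(article_url):
    """Extract the page title from a URL like https://en.wikipedia.org/wiki/Some_Page."""
    path = urlsplit(article_url).path
    return unquote(path.split("/wiki/", 1)[1])


def rank_by_views(rows, get_views):
    """Return (title, total_views) pairs, most viewed first.

    rows:      flattened SPARQL rows, each with an "article" URL
    get_views: hypothetical callable, title -> list of view counts
    """
    totals = []
    for row in rows:
        title = title_from_sitelink(row["article"])
        totals.append((title, sum(get_views(title))))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)
```

Keeping the pageviews call behind get_views means the ranking logic can be tried out with canned numbers before you point it at the live API.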
I hope you find this helpful! This is a really rich data source, so I imagine it can support many use cases!