Wikipedia API Tutorial

Working with APIs can feel intimidating, but they are extremely useful for data analysis. In this tutorial we will work with one of the easiest APIs to get started with: the Wikipedia API. We will begin with a step-by-step guide to retrieving data from a Wikipedia page, then walk through an exploratory data analysis checklist to make sure our extracted data is correct and in a usable format. We will finish the tutorial by using the data to answer an analysis question.

STAGE 1 — Access the Wikipedia API

1. Import Python Libraries:

To start, there are several Python libraries that we need to import so we can use them later on. Requests is a Python library that lets us send HTTP requests to access the API. The API responses come back as JSON, so we also import the json library. BeautifulSoup is a Python package for parsing HTML to extract data and is commonly used when web scraping. Datetime is a Python module for converting dates and times into a usable format for our analysis. The other libraries we import are all fairly common when working with Python.
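The exact import block isn't shown here, but based on the libraries described above, a minimal version might look like this (altair is included because we will use it for plotting in Stage 3):

```python
# A minimal set of imports for this tutorial; install anything missing with pip,
# e.g. `pip install requests pandas altair beautifulsoup4`
import json                    # work with the JSON returned by the API
from datetime import datetime  # convert timestamps into datetime objects

import requests                # send HTTP requests to the Wikipedia APIs
import pandas as pd            # dataframes for our analysis
import altair as alt           # plotting in Stage 3
from bs4 import BeautifulSoup  # parse HTML when web scraping
```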

2. Access the API and Extract Revision History:

To make it easier to extract data from multiple Wikipedia pages through the API, we will begin by writing two functions that access the API and extract the revision history data. The functions then convert the API response from JSON into a user-friendly Pandas dataframe that we will use to analyze the data later on. To run the code, call get_page_revisions with the name of the Wikipedia page you want to extract data from. For this tutorial we will be extracting data from three pages: “National Collegiate Athletic Association”, “College Basketball”, and “College Football”.
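The original splits this work into two helper functions; the single-function sketch below shows the same idea using the MediaWiki action API. The rvprop fields, the 500-revision page size, and the variable names are assumptions for illustration, not the exact original code:

```python
def get_page_revisions(page_title, endpoint='https://en.wikipedia.org/w/api.php'):
    """Pull the full revision history for one article and return it as a dataframe."""
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': page_title,
        'rvprop': 'ids|timestamp|user|size',  # fields to keep for each revision
        'rvlimit': 500,                       # max revisions per request for regular users
        'rvdir': 'newer',                     # oldest revisions first
        'format': 'json',
        'formatversion': 2,
    }
    revisions = []
    while True:
        response = requests.get(endpoint, params=params).json()
        revisions.extend(response['query']['pages'][0].get('revisions', []))
        if 'continue' in response:            # follow the continuation token until done
            params.update(response['continue'])
        else:
            break
    df = pd.DataFrame(revisions)
    df['timestamp'] = pd.to_datetime(df['timestamp'])  # datetime objects for Stage 3
    return df

# Note: Wikipedia article titles are case-sensitive after the first letter
ncaa_rev = get_page_revisions('National Collegiate Athletic Association')
basketball_rev = get_page_revisions('College basketball')
football_rev = get_page_revisions('College football')
```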

3. Extract Pageviews History:

Since we are already communicating with the Wikipedia APIs, we will write another function that pulls the pageviews data for all three articles we will be analyzing. We can then call the function for the same three articles we used in step 2.
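A sketch of such a function, using the Wikimedia pageviews REST API (which only has data going back to July 1, 2015); the User-Agent string and the variable names are placeholders:

```python
from urllib.parse import quote

def get_pageviews(page_title, start='2015070100', end='2019021000'):
    """Pull daily pageview counts for one article and return them as a dataframe.
    Dates are YYYYMMDDHH strings; data is only available from July 2015 onward."""
    url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
           'en.wikipedia/all-access/all-agents/'
           f'{quote(page_title.replace(" ", "_"), safe="")}/daily/{start}/{end}')
    # Wikimedia asks clients to identify themselves with a descriptive User-Agent
    headers = {'User-Agent': 'wikipedia-api-tutorial (example@example.com)'}
    response = requests.get(url, headers=headers).json()
    df = pd.DataFrame(response['items'])
    df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H')
    return df

ncaa_views = get_pageviews('National Collegiate Athletic Association')
basketball_views = get_pageviews('College basketball')
football_views = get_pageviews('College football')
```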

Great! We have now accessed the Wikipedia API, converted the responses into Pandas dataframes, and extracted the revision history and pageviews data for three Wikipedia articles. Our next stage is to analyze the extracted data by going through our exploratory data analysis (EDA) checklist. The five steps of the checklist help us formulate our analysis question, confirm that the data we intended to collect was actually extracted, identify any apparent issues in the data, and use a simple analysis technique to find an easy solution in the data.

STAGE 2 — Exploratory Data Analysis Checklist

1. Formulate an Analysis Question

For this tutorial we want to answer the question: in which months do the most pageviews and revisions occur on each of the three Wikipedia articles?

2. Check the Packaging

This step is important because it gives you an idea of the number of rows and columns in each of the three dataframes. You should have a rough estimate of how much data was extracted, so this step confirms that the actual data and your estimate align.
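Assuming the dataframe names from the Stage 1 sketches, `.shape` is a quick way to check the packaging:

```python
# shape returns (rows, columns) for each dataframe
for name, df in [('NCAA', ncaa_rev), ('Basketball', basketball_rev), ('Football', football_rev)]:
    print(name, 'revisions:', df.shape)

for name, df in [('NCAA', ncaa_views), ('Basketball', basketball_views), ('Football', football_views)]:
    print(name, 'pageviews:', df.shape)
```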

3. Look at the Top and Bottom of Data

Our API requests pulled all of the revision history data from the creation of each page through February 10, 2019, and the pageviews data from July 1, 2015 through February 10, 2019. We want to look at the top and bottom of our dataframes to make sure we have both the earliest and the most recent edits and pageviews.
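For example, with the NCAA dataframes from the Stage 1 sketches (the same checks apply to the other two articles):

```python
# The oldest revisions should be at the top (we requested them oldest-first)
# and the most recent at the bottom; pageviews should span July 2015 to February 2019
print(ncaa_rev.head())
print(ncaa_rev.tail())
print(ncaa_views.head())
print(ncaa_views.tail())
```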

4. Check the “n”s

When checking your “n”s you want to validate the other preconceived estimates you have about the data. For our NCAA revision history data, we want to find how many different people made edits to the article and compute basic statistics on the revision size changes. Before running the code, you can infer that with 2,638 total revisions made to the page there would be many different editors; my guess would be around 1,000. For the revision size changes, the minimum should be negative, the maximum positive, and the average close to zero. After running the code we find that our estimates are validated by the data.
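A sketch of those checks, assuming the size change per revision is computed as the difference between consecutive 'size' values (the 'size_change' column is introduced here for illustration):

```python
# How many distinct editors have touched the NCAA article?
print(ncaa_rev['user'].nunique())

# Change in article size from one revision to the next (negative = content removed)
ncaa_rev['size_change'] = ncaa_rev['size'].diff()
print(ncaa_rev['size_change'].describe())  # min, max, and mean of the size changes
```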

5. Try an Easy Solution

The last step of our basic EDA checklist is to try an easy solution: who is the top editor for each of the three articles? To answer this question we will use value_counts() on the ‘user’ column of each dataframe.
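For example, with the revision dataframes assumed earlier:

```python
# value_counts() sorts editors by number of revisions, most active first
print(ncaa_rev['user'].value_counts().head())
print(basketball_rev['user'].value_counts().head())
print(football_rev['user'].value_counts().head())
```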

We find an interesting commonality among our three pages: they all have the same top editor, Dale Arnett. After searching for his Wikipedia user profile, I found that Dale Arnett is a lawyer from Kentucky who edits Wikipedia as a hobby in his free time. He is an avid sports fan (which explains why he is the top editor of these three large college sports pages) and he estimates he has edited more than 11,000 Wikipedia articles!

STAGE 3 — Answer Your Analysis Question

Now that we have extracted data from the API and checked off all of the items on our EDA checklist, we can answer our analysis question: in which months do the most pageviews and revisions occur on each of the three Wikipedia articles?

We will start by comparing the overall revision and pageview numbers. For the revision history, NCAA has the most edits, followed by Football and then Basketball; for pageviews, NCAA has significantly more views, followed by Basketball and then Football. It is surprising that the College Football article has almost twice as many edits as College Basketball despite having roughly a quarter of the pageviews. I assumed that a high number of pageviews would lead to a high number of edits, but that doesn’t hold in this case.
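The totals behind that comparison can be computed directly from the dataframes (again using the variable names assumed in the earlier sketches):

```python
# Total revisions and total pageviews per article
pairs = [('NCAA', ncaa_rev, ncaa_views),
         ('Basketball', basketball_rev, basketball_views),
         ('Football', football_rev, football_views)]
for name, rev, views in pairs:
    print(f"{name}: {len(rev)} revisions, {views['views'].sum()} pageviews")
```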

Pageviews per Month
Revisions per Month

Let’s now break down the total pageviews and revision history data to answer our questions of which month has the most views and which month has the most edits for each of our three articles. The get_page_revisions and get_pageviews functions from Stage 1 loaded the timestamp data and converted it to datetime objects that we can manipulate to answer our questions. To find the top months for revisions, we extract the month from each datetime object and then use value_counts(), which returns a Series of counts for each unique value. For the month with the most views, we again extract the month from the datetime objects, but this time we need a groupby/aggregation to sum all of the views for each month. We then use Altair to plot our results.
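A sketch of both calculations for the NCAA article (the same pattern applies to the other two dataframes); the column names follow the earlier sketches rather than the original code:

```python
# Revisions per month: extract the month from each revision timestamp and count them
rev_months = (ncaa_rev['timestamp'].dt.month
              .value_counts()
              .rename_axis('month')
              .reset_index(name='revisions'))

# Pageviews per month: group the daily view counts by month and sum them
view_months = (ncaa_views
               .groupby(ncaa_views['timestamp'].dt.month)['views']
               .sum()
               .rename_axis('month')
               .reset_index())

# Bar charts of both results, side by side (displays as the last expression in a notebook)
(alt.Chart(view_months).mark_bar().encode(x='month:O', y='views:Q')
 | alt.Chart(rev_months).mark_bar().encode(x='month:O', y='revisions:Q'))
```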

We have now answered our questions! The NCAA and College Basketball Wikipedia articles have the most views in March, while College Football has the highest views in August. This makes sense because College Basketball’s March Madness is the biggest event for NCAA sports and August is the start of the College Football season. There is a significant correlation between NCAA views and the College Basketball and Football pageviews.

As for the months with the most edits, the NCAA and College Basketball articles are edited the most in March, matching their top pageview months. The College Football article, however, has the most revisions in January, a different month from its peak in views. These findings make sense: March is College Basketball’s March Madness tournament, when there is a lot going on for the sport, and January is when the College Football Playoff takes place and the article needs to be updated with the results. Again, there is a significant correlation between NCAA revisions and the College Football and Basketball article revisions.

We have now completed our tutorial on how to access and work with data extracted from the Wikipedia API. I hope that after completing this tutorial you feel more comfortable working with APIs and less intimidated by them. I would like to thank and give credit to CU Professor Brian Keegan, as this tutorial uses resources made available in his class notes.
