Scraping Wikipedia: The Tutorial

When you first set out to scrape Wikipedia, you need to pick something you want to compare, record, and analyze. For this tutorial I scraped Versace's Wikipedia page for its revisions and trends and then compared it to Gucci's Wikipedia page.

First, import all of the libraries and functions you will need. Without this step, you will not be able to scrape the data properly or build your own data frames.
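Here is a minimal sketch of that import cell, assuming the standard requests, pandas, and matplotlib stack:

```python
# Libraries assumed for this tutorial: requests for talking to the MediaWiki API,
# pandas for building data frames, and matplotlib for graphing the results
import requests
import pandas as pd
import matplotlib.pyplot as plt
```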

Import all the functions needed to perform this action

Next, start with the current content of the page itself by building the query and the parameters for that query; these show you what a request to the API actually looks like. For this step I used Versace's Wikipedia page to demonstrate how to implement it in your code. As you can see below, we can also scrape out the hyperlinks to other articles. It is super important to keep the MediaWiki API documentation open for reference throughout. We then use the requests library to get the current HTML markup of the article from the API.
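Here is a rough sketch of what those query parameters might look like, using the MediaWiki parse endpoint (the exact parameter choices here are an assumption, so check the API documentation):

```python
# The address where the English Wikipedia API receives requests
url = "https://en.wikipedia.org/w/api.php"

# Parameters asking the API to parse the Versace article and return
# its current HTML text as JSON (parameter choices are an assumption)
parameters = {
    "action": "parse",
    "page": "Versace",
    "format": "json",
    "prop": "text",
}
```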

Query Parameters

This is wonderful; however, this step has only set up the request that we are making to our API. It has not yet sent or received any information, which is where the URL and the query come into play. The URL is the address where our desired API receives requests, while the parameters define the details of the query. For the response we will be using .json(), which gives us a dictionary of dictionaries.
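A minimal sketch of sending that request and decoding the JSON, reusing the url and parameters defined above, might look like this:

```python
# Send the GET request to the API with our query parameters
response = requests.get(url, params=parameters)

# Decode the response body into a dictionary of dictionaries
json_response = response.json()

# The article's current HTML markup lives inside this nested structure
page_html = json_response["parse"]["text"]["*"]
```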

This shows all the current information on the Versace Wikipedia page

I then created a for loop that printed the first few paragraphs of the Versace Wikipedia article, to make sure I had the right page and the right content.
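As a sketch, that loop might look something like this, assuming BeautifulSoup is used to pull the paragraphs out of the HTML returned above:

```python
from bs4 import BeautifulSoup  # assumed helper for parsing the article's HTML

# Parse the HTML and print only the first few paragraphs to confirm
# we grabbed the right article
soup = BeautifulSoup(page_html, "html.parser")
for paragraph in soup.find_all("p")[:3]:
    print(paragraph.get_text())
```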

Some of the text in the Versace Wikipedia page

Next, we wanted to dive deeper into the revision history of the Versace Wikipedia page, since every Wikipedia article keeps a revision history containing metadata about who made each previous change and when. To do this we had to have the MediaWiki documentation for revisions open and create a set of parameters that asks for revisions instead of text. This is similar to the previous parameter code block, but you can clearly see (below) that where it once said ‘text,’ it now says ‘revisions.’
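A sketch of the revision parameters might look like this (note that the revisions property lives under the query action, and the rvprop and rvlimit values here are assumptions based on the MediaWiki revisions documentation):

```python
# Parameters asking for the article's revision history instead of its text;
# rvprop and rvlimit values are assumptions based on the revisions docs
revision_parameters = {
    "action": "query",
    "titles": "Versace",
    "format": "json",
    "prop": "revisions",
    "rvprop": "ids|timestamp|user|size",
    "rvlimit": 500,
}
```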

Revision request for the Versace Wikipedia page

Now I make the request and decode it with .json(), which overwrites the json_response we created for the current-content portion of our scraping. We then check json_response, which is a dictionary with both “continue” and “query” keys: the “continue” key appears when there are more than 500 revisions in the article’s history and gives us an index that the next query will pick up from, while the “query” key contains the revision history we care about. That history is buried in a nested data structure of lists and dictionaries, and digging it out gives us a list of revision dictionaries in the end.
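Here is a rough sketch of that request, the check on the returned keys, and a continuation loop that digs the revisions out of the nested structure:

```python
# Send the revisions query and decode it, overwriting the earlier json_response
response = requests.get(url, params=revision_parameters)
json_response = response.json()

# The top level has a "query" key (the data we asked for) and, when the
# article has more than 500 revisions, a "continue" key telling us where
# the next request should pick up
print(json_response.keys())

# The revisions are buried under query -> pages -> <page id> -> revisions
revisions = []
while True:
    pages = json_response["query"]["pages"]
    page_id = list(pages.keys())[0]
    revisions.extend(pages[page_id]["revisions"])

    # Stop once the API no longer tells us to continue
    if "continue" not in json_response:
        break

    # Otherwise fold the continuation values into the next request
    next_parameters = dict(revision_parameters, **json_response["continue"])
    json_response = requests.get(url, params=next_parameters).json()

# revisions is now one long list of dictionaries, one per revision
print(len(revisions))
```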

Checking our queries
Checking our queries part 2

After that, I converted all of this data into a data frame so that I could look at the size of the article versus the number of revisions made to it.
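As a sketch, assuming the revisions list built above, the conversion might look like this:

```python
# Each revision dictionary becomes one row; the "size" column records the
# article's size (in bytes) at that revision
revisions_df = pd.DataFrame(revisions)

# Sort oldest to newest so the row index counts revisions in order
revisions_df = revisions_df.sort_values("timestamp").reset_index(drop=True)
print(revisions_df.head())
```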

The data frame for the Versace Wikipedia page

After this step, I graphed the data frame so that I could see the trend more easily.

The code for the graph below
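Since the plotting code can vary, here is a minimal matplotlib sketch assuming the revisions_df built above:

```python
# Plot the article's size against the revision number (the data frame index)
plt.figure(figsize=(10, 5))
plt.plot(revisions_df.index, revisions_df["size"])
plt.xlabel("Number of revisions")
plt.ylabel("Article size (bytes)")
plt.title("Versace Wikipedia page size vs. revisions")
plt.show()
```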
The Versace Wikipedia page size vs. revisions

Next, I wanted to do the same thing for the Gucci Wikipedia page, so I followed all of the same steps for the current content code and the revision history code to end up with a visualization similar to the graph above.

The code for the graph below
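As a sketch, the only real change is the article title; everything else is reused from the Versace steps above:

```python
# Reuse the revision pipeline, swapping the article title for "Gucci"
gucci_parameters = dict(revision_parameters, titles="Gucci")
json_response = requests.get(url, params=gucci_parameters).json()

gucci_revisions = []
while True:
    pages = json_response["query"]["pages"]
    page_id = list(pages.keys())[0]
    gucci_revisions.extend(pages[page_id]["revisions"])
    if "continue" not in json_response:
        break
    next_parameters = dict(gucci_parameters, **json_response["continue"])
    json_response = requests.get(url, params=next_parameters).json()

# Build the data frame and plot it the same way as before
gucci_df = pd.DataFrame(gucci_revisions).sort_values("timestamp").reset_index(drop=True)

plt.figure(figsize=(10, 5))
plt.plot(gucci_df.index, gucci_df["size"])
plt.xlabel("Number of revisions")
plt.ylabel("Article size (bytes)")
plt.title("Gucci Wikipedia page size vs. revisions")
plt.show()
```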
The Gucci Wikipedia page size vs. revisions

In conclusion, by scraping the data from the Versace Wikipedia page and the data from the Gucci Wikipedia page, we were able to compare the two. What I concluded was that the Gucci page not only had many more revisions but also received them at a relatively steady rate, while revisions to the Versace page came in occasional bursts. The Gucci page's size also fluctuated much more than the Versace page's, which is interesting. This could mean that more people like Gucci than Versace, or it could mean that the Gucci page had more inaccurate information that needed to be revised than the Versace page did. I would not have been able to deduce any of this if I had not scraped these Wikipedia articles. In the end, article scraping is great when you want to build your own data sets and then compare, contrast, analyze, and draw conclusions from the data in the Wikipedia pages you are scraping.