Wikipedia Data Scraping with R: rvest in Action
Scraping the list of people on banknotes for exploratory data analysis using rvest functions
Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation, currently with more than 5 million articles in English. Today, I will work on a data exercise in Wikipedia data scraping using rvest, “a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces” (Wickham, 2014). Before you proceed, it is important to have a basic understanding of HTML and XML web structures. I recommend the W3Schools HTML tutorial, which offers a good, simplified environment for learning, testing, and practising.
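To make the “elegant pipelines” idea concrete, here is a minimal sketch of an rvest pipeline. The URL is the Wikipedia list this article analyses, and the example assumes the current rvest API, in which `read_html()` replaced the original `html()` function.

```r
library(rvest)  # rvest re-exports the magrittr pipe %>%

url <- "https://en.wikipedia.org/wiki/List_of_people_on_banknotes"

url %>%
  read_html() %>%            # parse the page into an XML document
  html_elements("table") %>% # select every <table> node on the page
  length()                   # count how many tables were found
```

Each step takes the output of the previous one, so the whole scrape reads top-to-bottom as a single sentence.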
As I look through articles on Wikipedia, it is clear that there is an abundance of information and data to scrape and analyse. As an example, I decided to work on the list of people on banknotes of different countries (Figure 1), dividing the contents into banknotes in circulation and those no longer in circulation.
Let’s Prepare the Data Frame Scraped from Wikipedia List of People on Bank Notes!
First, it is important to understand the concepts of XPath and CSS selectors; the following article is a good starting point.
Web Scraping With XPath
3 Steps to Extract XPaths
1. Right-click an element on the page and choose Inspect.
2. Move the cursor over the table in the inspector; your browser should highlight the table tag on the page.
3. Right-click the highlighted node → Copy → Copy XPath.
You can paste the output into a text editor. If you do so for the first five tables (Albania, Angola, Argentina, Armenia, and Artsakh), you should have the output printed neatly in this way.
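The copied XPaths typically look like the lines below. These are illustrative only: the exact `div` and `table` indices depend on the page's current structure, so always copy the real paths from your own browser.

```
//*[@id="mw-content-text"]/div[1]/table[1]
//*[@id="mw-content-text"]/div[1]/table[2]
//*[@id="mw-content-text"]/div[1]/table[3]
...
```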
The Language of “rvest”
To start, I need to specify the URL whose HTML structure I want to inspect. I define two variables: url and url2. The read_html() function (called html() in early versions of rvest) parses an HTML page into an XML document. The two function calls below are simple examples of rvest in action: one selects the body HTML tag element, and the other selects the body tag element combined with the content id attribute.
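A minimal sketch of those two selections, assuming the Wikipedia list page as the URL and an element with `id="content"` (which standard Wikipedia pages have):

```r
library(rvest)

# Assumed URL: the Wikipedia list of people on banknotes.
url  <- "https://en.wikipedia.org/wiki/List_of_people_on_banknotes"
page <- read_html(url)   # parse the page into an XML document

# Select the <body> element of the page.
page %>% html_element("body")

# Select the element with id="content" inside the body.
page %>% html_element("body #content")
```

`html_element()` returns the first matching node; its plural counterpart `html_elements()` returns all matches.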
Here’s what it takes to get to the Albania table.
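A sketch of extracting one country's table with the XPath copied from the browser. The XPath string here is illustrative; paste the one you actually copied, since Wikipedia's layout changes over time.

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/List_of_people_on_banknotes")

# Select the first table via its copied XPath (illustrative path),
# then convert the HTML table into a data frame.
albania <- page %>%
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table(fill = TRUE)

head(albania)
```

`html_table()` does the heavy lifting of turning `<tr>`/`<td>` markup into rows and columns; `fill = TRUE` pads ragged rows with NA.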
And the complete R script I wrote to generate the data file:
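The original script was embedded as an image, so here is a minimal sketch of what such a script might look like. The URL, the `table.wikitable` selector, and the output file name are assumptions, and real Wikipedia tables usually need some column-name cleanup before they stack cleanly.

```r
library(rvest)
library(dplyr)

url  <- "https://en.wikipedia.org/wiki/List_of_people_on_banknotes"
page <- read_html(url)

# Parse every wikitable on the page into a data frame.
# (The "table.wikitable" selector is an assumption about the page markup.)
tables <- page %>%
  html_elements("table.wikitable") %>%
  lapply(html_table, fill = TRUE)

# Stack all tables; bind_rows() fills missing columns with NA.
# In practice, mismatched or duplicate column names must be fixed first.
banknotes <- bind_rows(tables)

write.csv(banknotes, "people_on_banknotes.csv", row.names = FALSE)
```

Splitting the result into in-circulation and out-of-circulation notes, as the article does, would then be a matter of filtering on the relevant column.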
I have posted this file on Kaggle. Please feel free to do further exploratory analysis, start a new kernel, and contribute to our worldwide data science community.