Wikipedia Data Scraping with R: rvest in Action

Scraping the list of people on banknotes for exploratory data analysis using rvest functions

Introduction

Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation, currently with more than 5 million articles in English. Today, I will work through a data exercise: scraping Wikipedia with rvest, “a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces” (Wickham, 2014). Before you proceed, it is important to have a basic understanding of how HTML and XML documents are structured. I recommend checking out the W3Schools HTML tutorial, which gives a good, simplified introduction for learning, testing, and training.

Image Source: Chandransh Srivastava, An Engineer seeking Wisdom, See https://www.quora.com/With-knowledge-in-HTML-CSS-and-a-little-JavaScript-what-kind-of-projects-should-I-start-with-to-strengthen-my-skills-in-web-development
Image Source: Brad Yale (2017), See https://medium.com/healthwellnext/how-to-write-html-part-2-understanding-tags-fd8fc583a06a

As I look through different articles on Wikipedia, it is clear to me that there is an abundance of information and data that you can scrape and analyse. As an example, I decided to work on the list of people on banknotes of different countries (Figure 1), dividing the contents into banknotes in circulation and those that are no longer in circulation.

Figure 1: A first look at the Wikipedia page containing the tables of banknote data. Put simply, this is a list of people on the banknotes of different countries. The customary design of banknotes in most countries is a portrait of a notable citizen (living and/or deceased) on the front (or obverse) or on the back (or reverse) of the banknotes, unless the subject is featured on both sides. (Source: https://en.wikipedia.org/wiki/List_of_people_on_banknotes)

Let’s Prepare the Data Frame Scraped from Wikipedia List of People on Bank Notes!

First, it is important to understand the concepts of XPath and CSS selectors, the two ways of telling rvest which element of a page you want.
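To make the difference between the two concrete, here is a minimal sketch that selects the same table once with a CSS selector and once with an XPath. The inline page, the "wikitable" class, and the "content" id are assumptions made up for illustration so that the example runs offline; they are not copied from the Wikipedia page.

```r
library(rvest)

# A tiny inline page (made up for illustration) so the example runs offline.
page <- read_html('
  <html><body>
    <div id="content">
      <h2>Albania</h2>
      <table class="wikitable">
        <tr><th>Person</th></tr>
        <tr><td>Person A</td></tr>
      </table>
    </div>
  </body></html>')

# CSS selector: a table with class "wikitable" inside the element with id "content".
by_css <- page %>% html_nodes("#content table.wikitable")

# XPath: the same node, addressed by its position in the document tree.
by_xpath <- page %>% html_nodes(xpath = '//*[@id="content"]/table[1]')

length(by_css)    # number of nodes matched by the CSS selector
length(by_xpath)  # number of nodes matched by the XPath
```

CSS selectors tend to be shorter and easier to read, while XPaths are what the browser hands you when you copy an element's location, which is the route taken in the rest of this post.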

3 Steps to Extract XPaths

1. Select an element in the page to inspect it.

2. Move the cursor to inspect the table and your screen should highlight the table tag.

3. Right Click → Copy → Copy XPath.

You can paste the output into Notepad. If you do so for the first five tables (Albania, Angola, Argentina, Armenia, and Artsakh), you should get the output shown in Figure 2.

Figure 2: The first 5 XPaths for Albania, Angola, Argentina, Armenia, and Artsakh. You can probably start to see the opportunity to write a for loop that retrieves all these tables in one shot!
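The for-loop idea can be sketched as follows: if the copied XPaths differ only in the table index, that index can be filled in with sprintf(). The XPath template and the inline page below are hypothetical stand-ins, not the actual XPaths from Figure 2; on the real page you would paste the XPath you copied and replace its table index with %d.

```r
library(rvest)

# Made-up stand-in for the article body: three sibling tables under one container.
page <- read_html('
  <html><body><div id="content">
    <table><tr><th>Country</th></tr><tr><td>Alpha</td></tr></table>
    <table><tr><th>Country</th></tr><tr><td>Beta</td></tr></table>
    <table><tr><th>Country</th></tr><tr><td>Gamma</td></tr></table>
  </div></body></html>')

# Hypothetical XPath template; %d is replaced by the table index in the loop.
xpath_template <- '//*[@id="content"]/table[%d]'

tables <- list()
for (i in 1:3) {
  xp <- sprintf(xpath_template, i)                             # build the i-th XPath
  tables[[i]] <- page %>% html_node(xpath = xp) %>% html_table()  # parse that table
}

length(tables)  # one parsed table per XPath
```

Each element of tables is a data frame, so the pieces can be inspected one by one before they are combined.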

The Language of “rvest”

To start, I need to specify the URL of the page whose HTML structure I want to inspect. I store it in two variables, url and url2. The ‘html’ function parses an HTML page into an XML document (in current versions of rvest, ‘read_html’ replaces it). The two calls below are simple examples of ‘rvest’ in action: the first selects the ‘body’ HTML tag element, and the second selects the ‘body’ element together with the ‘content’ id attribute.
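A minimal sketch of these two selections is shown here. It is not the original code from this post: it uses read_html() and an inline page (made up so the example runs without a network connection), and the variable names body_node and content_node are my own.

```r
library(rvest)

# On the live page the call would be:
#   page <- read_html("https://en.wikipedia.org/wiki/List_of_people_on_banknotes")
# An inline page (made up) is used here so the example runs offline.
page <- read_html('
  <html><body>
    <div id="content"><p>Article text</p></div>
    <div id="footer"><p>Footer text</p></div>
  </body></html>')

# 1. Select the <body> element.
body_node <- page %>% html_nodes("body")

# 2. Select the element with id "content" inside <body>.
content_node <- page %>% html_nodes("body #content")

html_name(body_node)           # tag name of the first selection
html_attr(content_node, "id")  # id attribute of the second selection
```

Narrowing the selection to the content container first keeps navigation menus and footers out of whatever is extracted next.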

Here’s what it takes to get to the Albania table.
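As a sketch of that step, assuming the Albania table is the first table under the content container: html_node() picks the table by XPath and html_table() turns it into a data frame. The XPath, the column names, and the cell values below are placeholders, not real banknote data and not the XPath from the live page.

```r
library(rvest)

# Inline stand-in; cell values are placeholders, not real banknote data.
page <- read_html('
  <html><body><div id="content">
    <h3>Albania</h3>
    <table class="wikitable">
      <tr><th>Person</th><th>Currency</th><th>Denomination</th></tr>
      <tr><td>Person A</td><td>Currency X</td><td>100</td></tr>
      <tr><td>Person B</td><td>Currency X</td><td>200</td></tr>
    </table>
  </div></body></html>')

albania <- page %>%
  html_node(xpath = '//*[@id="content"]/table[1]') %>%  # hypothetical XPath
  html_table()                                          # first row of <th> cells becomes the header

# Record which country the rows came from before combining with other tables.
albania$Country <- "Albania"

albania
```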

And the complete R script I wrote to generate the data file:
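The original script is not reproduced in this text, so what follows is only a rough sketch of how such a script could look, under stated assumptions: every country has an h3 heading immediately followed by one table with class "wikitable", and all tables share the same columns. The helper scrape_one, the output file name, and the inline page with its placeholder values are all hypothetical.

```r
library(rvest)

# Inline stand-in for the Wikipedia page; values are placeholders.
page <- read_html('
  <html><body><div id="content">
    <h3>Albania</h3>
    <table class="wikitable">
      <tr><th>Person</th><th>Denomination</th></tr>
      <tr><td>Person A</td><td>100</td></tr>
      <tr><td>Person B</td><td>200</td></tr>
    </table>
    <h3>Angola</h3>
    <table class="wikitable">
      <tr><th>Person</th><th>Denomination</th></tr>
      <tr><td>Person C</td><td>50</td></tr>
    </table>
  </div></body></html>')

# All candidate tables on the page.
table_nodes <- page %>% html_nodes("table.wikitable")

# Hypothetical helper: parse one table and tag it with the nearest preceding heading.
scrape_one <- function(node) {
  df <- as.data.frame(html_table(node), stringsAsFactors = FALSE)
  country <- node %>%
    html_node(xpath = "preceding-sibling::h3[1]") %>%
    html_text()
  df$Country <- trimws(country)
  df
}

# Stack the per-country tables into one data frame.
banknotes <- do.call(rbind, lapply(table_nodes, scrape_one))

# Write the result to a CSV file (a temporary location is used in this sketch).
out_file <- file.path(tempdir(), "banknotes.csv")
write.csv(banknotes, out_file, row.names = FALSE)
```

On the live page the tables may not all share the same columns, in which case rbind() would fail and the columns would need to be harmonised first.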

Output File

Figure 3: The extracted data for the list of people on banknotes.

I have posted this file on Kaggle. Please feel free to do further exploratory analysis, start a new kernel, and contribute to our worldwide data science community.

Caveats in Web Scraping and Web Crawling

Resources

Written by Korkrid Kyle Akepanidtaworn

Cloud Solution Architect (Data & AI) at Microsoft, Former Data Scientist at Accenture Applied Intelligence
