Web Scraping: the PISA 2015 Rankings
The results of PISA 2015 (Programme for International Student Assessment) were released on December 6th, 2016. This cycle focused mainly on the science domain, while also measuring students’ mathematics and reading performance. The rankings for each domain have been updated on Wikipedia. On that page, the ranking table sits under the PISA 2015 heading. Here, I am going to demonstrate how to extract the ranking table from Wikipedia using the “rvest” package in R.
I use three functions from the “rvest” package: (1) read_html(), (2) html_nodes(), and (3) html_table(). read_html() reads an HTML page; html_nodes() finds all the nodes that match a selector (html_node() returns only the first match); html_table() extracts the content of a table node and parses it into a data frame. Before extracting the ranking table, I used selectorgadget to find the selector for the table. If you haven’t heard of selectorgadget, please visit their website for further instructions.
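To make the three functions concrete, here is a minimal sketch that runs them on a tiny inline HTML page instead of the live website. The page content, the country labels and the scores are placeholders invented for illustration, not PISA data.

```r
library(rvest)

# A tiny inline page with two tables; the values are placeholders,
# not PISA data.
page <- read_html('
<html><body>
  <table>
    <tr><th>Country</th><th>Score</th></tr>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </table>
  <table>
    <tr><th>Country</th><th>Score</th></tr>
    <tr><td>C</td><td>30</td></tr>
  </table>
</body></html>')

# html_nodes() returns every node that matches the CSS selector "table"
table_nodes <- html_nodes(page, "table")
length(table_nodes)

# html_table() parses each table node into a data frame and, for a set
# of nodes, returns a list of data frames
tables <- html_table(table_nodes)
tables[[1]]
```

Because html_table() returns a list when given several nodes, a single table is picked out by its position, for example tables[[1]].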
In the image below, “selectorgadget” selects the tables. The selector “table” matches what I want, and the 12 matching table nodes are counted in the “Clear” box (Figure 1).

After using “selectorgadget” to find the CSS selector “table”, I used the “rvest” package to extract the tables from the website (Figure 2). Please note that there are 12 tables, but I only extracted the rankings table, which combines tables[3], tables[4], and tables[5]. These three tables appear on lines 6, 8, and 10 of the code (Figure 2).
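Since the code itself only appears as a screenshot, the following is a rough sketch of the same idea rather than a copy of it. The helper name combine_tables is hypothetical, and combining the tables side by side with cbind() is an assumption that only works when the selected tables have the same number of rows. The stand-in page uses placeholder values; the Wikipedia URL in the comments is assumed to be the article in question, and the table positions on the live page may have changed since 2016.

```r
library(rvest)

# Hypothetical helper: parse every <table> on a page and bind the ones at
# the requested positions side by side. Column-binding is an assumption
# about how the three tables were combined.
combine_tables <- function(page, idx) {
  tables <- html_table(html_nodes(page, "table"))
  do.call(cbind, lapply(tables[idx], as.data.frame))
}

# Stand-in page with placeholder values (not PISA scores)
page <- read_html('
<html><body>
  <table>
    <tr><th>Maths</th><th>MathsScore</th></tr>
    <tr><td>A</td><td>1</td></tr>
    <tr><td>B</td><td>2</td></tr>
  </table>
  <table>
    <tr><th>Science</th><th>ScienceScore</th></tr>
    <tr><td>B</td><td>3</td></tr>
    <tr><td>A</td><td>4</td></tr>
  </table>
  <table>
    <tr><th>Reading</th><th>ReadingScore</th></tr>
    <tr><td>A</td><td>5</td></tr>
    <tr><td>B</td><td>6</td></tr>
  </table>
</body></html>')

rankings <- combine_tables(page, 1:3)
rankings

# For the live article, one would read the page from its URL and pick
# the positions given in the text (assumed URL, positions may differ now):
# page <- read_html("https://en.wikipedia.org/wiki/Programme_for_International_Student_Assessment")
# rankings <- combine_tables(page, 3:5)
```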
Figure 3 is a screenshot of the resulting table in the R environment. It shows the rankings, countries, mathematics scores and science scores.

