Scraping Website Tables with `rvest` [R]

It’s easier to write when annoyed.

JJ
Human in a Machine World
2 min read · May 4, 2016


I don’t understand why some websites that provide databases don’t offer an easy option to download the entire database. They’re already offering the data. And, they must have the data in a somewhat nice format to be putting it on the web…

I recently had to do some number crunching based on securities class action filings found on http://securities.stanford.edu/filings.html. Here’s what their FAQ says about downloading the data.

Seriously?!

Issue #1: This database seems to be put out on the web for research purposes. I’d assume by now that most research teams will have a resource that can collect and aggregate the data, either manually (intern/grad student) or programmatically (web scraper). Why make it more difficult for people to obtain the data that you’re trying to provide?

Issue #2: What kind of academic research purpose would not involve some kind of publication or distribution? I can only think of non-serious ones. In which case, why bother publishing the database on the web to begin with?

I’m choosing to interpret the FAQ response as meaning I’m okay as long as I’m using the database for research and not making money off the data directly.

So, okay. I just need a count of filings by date for my project. When I got to page 5 of copy/pasting into my Excel document, I realized that there were more than 5 webpages of filings. Then it registered that 4,149 total filings at 20 per page means more than 200 pages.

Ah… mental math.
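The back-of-the-envelope check in R:

```r
# 4,149 filings at 20 per page
ceiling(4149 / 20)
#> [1] 208
```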

Web scraper route it is!

Step 1: Find the HTML table element.

The first step involves going to the website and figuring out how to identify the table of interest. I used Google Chrome and Hadley’s `rvest` package. The copied XPath becomes an argument to html_nodes().

Right click table > Inspect element

Inspect Table Element in Chrome

Right click table in Elements tab > Copy > Copy XPath

Copy XPath of Table Element
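With the XPath copied, the table can be pulled into a data frame with html_nodes() and html_table(). The snippet below parses an inline stand-in page so it runs offline; on the real site you’d call read_html() on the URL instead, and the `//table` path is a placeholder for the XPath you copied from Chrome:

```r
library(rvest)

# Stand-in for the filings page so the selection step is reproducible offline.
# For the real site: page <- read_html("http://securities.stanford.edu/filings.html")
page <- read_html('
  <table>
    <tr><th>Filing Name</th><th>Filing Date</th></tr>
    <tr><td>Acme Corp Securities Litigation</td><td>05/04/2016</td></tr>
  </table>')

# Paste the XPath copied from Chrome in place of "//table"
filings <- page %>%
  html_nodes(xpath = "//table") %>%
  html_table()

filings[[1]]  # html_table() returns a list of data frames; take the first
```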

Step 2: Code a scrape function.

Note that rvest’s html() has since been deprecated in favor of read_html().
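A minimal sketch of the scrape function, under two assumptions worth verifying yourself: that the site paginates via a `?page=` query parameter (check the address bar as you click through pages), and that the filings table is the first `<table>` on the page — substitute the XPath you copied in Step 1. The URL pattern, page count, and Sys.sleep pause are my guesses, not anything the site documents:

```r
library(rvest)

# Scrape one page of filings into a data frame.
# ASSUMPTION: pagination works via a "?page=" query parameter.
scrape_filings <- function(page_num) {
  url <- paste0("http://securities.stanford.edu/filings.html?page=", page_num)
  read_html(url) %>%
    html_node(xpath = "//table") %>%  # substitute the XPath copied from Chrome
    html_table()
}

# To collect everything (208 pages), pausing between requests to be polite:
# all_filings <- do.call(rbind, lapply(1:208, function(i) {
#   Sys.sleep(1)
#   scrape_filings(i)
# }))
```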
