I wrote a screen scraper

Michael Rossiter
Jan 2, 2016

For a side project that I am working on, I need to set up a way to automatically collect data from the web. Where possible, application programming interfaces (APIs) present a ‘clean’ way to grab formatted data from the web. However, an API has to be set up by the site owner. In several cases I identified data where the site owner did not create an API but still published data to their website in a consistent format. This is the exact use case for a screen scraper.

So, over winter break, I wrote my first screen scraper.

The website in question presents several hundred rows of historical data per unique ID. There are several thousand unique IDs, each with its own webpage. In theory, I could go to each webpage and copy and paste the data into Excel. Assuming a minute per page, this would have taken me over 30 hours (and cost me my sanity). Instead, I resolved to write a script that would iteratively visit each page, download the text I wanted, format it and write it to a CSV file.

I used lxml in Python. It took a few hours to figure out how to work with lxml and then a couple more to write functions to automate the procedure over the thousands of unique IDs and write the output to CSV.
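
A minimal sketch of what such a script can look like, not the actual scraper: the URL template, the row XPath and the function names (fetch_page, extract_rows, scrape_to_csv) are all hypothetical placeholders, since the real site is not named here. It assumes the data sits in an HTML table. The demo at the bottom swaps in a canned page so nothing is downloaded.

```python
import csv
import io
import urllib.request

from lxml import html

# Hypothetical placeholders: the real site and XPath are not given in the post.
URL_TEMPLATE = "https://example.com/history/{page_id}"
ROW_XPATH = "//table//tr[td]"  # every table row that contains data cells


def fetch_page(page_id):
    """Download the raw HTML for one ID."""
    with urllib.request.urlopen(URL_TEMPLATE.format(page_id=page_id)) as resp:
        return resp.read()


def extract_rows(page_html):
    """Return each data row of the page as a list of cleaned cell strings."""
    tree = html.fromstring(page_html)
    return [
        [cell.text_content().strip() for cell in row.xpath("./td")]
        for row in tree.xpath(ROW_XPATH)
    ]


def scrape_to_csv(page_ids, out_file, fetch=fetch_page):
    """Visit every page and write its rows, tagged with the ID, to one CSV."""
    writer = csv.writer(out_file)
    for page_id in page_ids:
        for row in extract_rows(fetch(page_id)):
            writer.writerow([page_id] + row)


# Demo with a canned page instead of a real download.
SAMPLE_PAGE = (
    "<html><body><table>"
    "<tr><th>Date</th><th>Value</th></tr>"
    "<tr><td> 2015-12-01 </td><td>42</td></tr>"
    "</table></body></html>"
)
demo_output = io.StringIO()
scrape_to_csv(["A1"], demo_output, fetch=lambda page_id: SAMPLE_PAGE)
print(demo_output.getvalue())
```

Passing the fetch function in as a parameter is just a convenience for trying the parsing and CSV steps on saved pages before pointing the loop at thousands of live ones.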

Once I determined the scraper was configured properly and returning the correct CSV output, I ran the program. At this point it was 11PM and I estimated it would take a couple hours to run, so I headed up to bed.

There’s the old adage that the rich make money even while they sleep. Lying in bed, I felt a similar exhilaration as my code ran downstairs, working hard for me. And, when I woke up in the morning, there was my data, perfectly formatted and ready for analysis.

Lessons learned:

Screen scraping (and lxml in particular) leverages the fact that every webpage has a document object model (DOM). The DOM represents the page as a hierarchical ‘tree’ of objects, where each object can be identified by its relationship to its parent element. Thus, every HTML tag can be identified in one of two ways. If the webpage assigns classes and/or ids to its HTML elements, then you can target those tags directly. However, when these are not assigned, you can still specify the element’s unique DOM ‘path’.
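
To make the two ways concrete, here is a small sketch using lxml on a made-up page. The markup, the id ‘summary’ and the class ‘price’ are invented for illustration only.

```python
from lxml import html

# Invented markup: one div has an id and a classed paragraph, the other has neither.
PAGE = """
<html><body>
  <div id="summary"><p class="price">19.99</p></div>
  <div><p>first</p><p>second</p></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Way 1: the page assigns an id or class, so target it directly.
by_id = tree.xpath('//*[@id="summary"]/p/text()')
by_class = tree.xpath('//p[@class="price"]/text()')

# Way 2: no id or class to hook onto, so spell out the position in the tree.
# div[2] is the second div under body; p[2] is the second p inside it.
by_path = tree.xpath("/html/body/div[2]/p[2]/text()")

print(by_id, by_class, by_path)
```

The positional path is more brittle: if the site adds another div above the target, div[2] points at the wrong element, whereas an id-based lookup keeps working.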

Browsers in general, and Chrome in my case, provide useful tools for writing a screen scraper. The webpages I scraped did not have classes or ids associated with their HTML tags. Therefore, I needed to specify the DOM path (expressed as an ‘XPath’) for each object I wanted to pull. Chrome will tell you the XPath for each object in a couple of ways. In the example below, I’m grabbing the XPath for elements on a Vice.com page. It’s a cool article about politics and technology, but not what I wrote my scraper for. I realize the real target would be a better example, but right now we’re keeping the project on the DL, and sharing it might raise questions, so Vice will have to do.

First, when you ‘inspect’ the element, the XPath shows up in a horizontal divider bar.

However, this is not very useful when dealing with heavily nested objects. Better, Chrome lets you directly copy the XPath of each element. The XPath for this element is “//*[@id="yw0"]/div/h3”, meaning it is the h3 element whose parent is a div whose parent, in turn, is the element with id ‘yw0’. The same element can also be reached with the full path “//html/body/section/div/article/header/div/section/div/h3”.
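
A quick way to convince yourself that the two XPaths point at the same element is to run both through lxml. The page below is a hypothetical skeleton that only mimics the nesting the two paths describe; it is not the real Vice markup.

```python
from lxml import html

# Hypothetical skeleton: just enough nesting to satisfy both XPaths.
PAGE = (
    "<html><body><section><div><article><header><div>"
    '<section id="yw0"><div><h3>Headline</h3></div></section>'
    "</div></header></article></div></section></body></html>"
)

root = html.fromstring(PAGE)

# Short form: anchor on the id, then walk down two levels.
short = root.xpath('//*[@id="yw0"]/div/h3')

# Long form: walk down from the top of the document.
full = root.xpath("//html/body/section/div/article/header/div/section/div/h3")

# getpath() reports the absolute path of an element as lxml itself sees it.
doc = root.getroottree()
same_element = doc.getpath(short[0]) == doc.getpath(full[0])
print(same_element, short[0].text)
```

Asking lxml for the path with getpath() is also a handy cross-check against whatever the browser copied, because it reflects the tree lxml actually built from the raw HTML.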

The data I pulled was nested in an HTML table. When writing tables, most developers will leave out the <tbody> element. However, to create a complete DOM, browsers will automatically insert the <tbody> element back into the page. Thus, when you copy the XPath for an element in a table from the browser, it will include the <tbody> element. Since that element does not ‘actually’ exist in the raw page source, lxml won’t be able to follow the XPath and find the target element. Therefore, you need to check the XPath against the actual HTML. I had to, anyway (a stressful 45 minutes).
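
The <tbody> trap is easy to reproduce. In this sketch the table markup and the helper name strip_tbody are made up for illustration. It shows the browser-style XPath coming back empty in lxml, plus two possible workarounds.

```python
from lxml import html

# Raw source as it might be written by hand: no <tbody> anywhere.
RAW = (
    "<html><body><table>"
    "<tr><td>2015-12-01</td><td>42</td></tr>"
    "</table></body></html>"
)
tree = html.fromstring(RAW)

# What a browser would hand you, because its DOM has an inserted <tbody>.
copied_from_browser = "/html/body/table/tbody/tr/td[2]"


def strip_tbody(xpath):
    """Hypothetical helper: drop the tbody steps the browser added."""
    return xpath.replace("/tbody", "")


# lxml parses the raw source, where there is no tbody, so nothing matches.
missed = tree.xpath(copied_from_browser)

# Workaround 1: remove the tbody step from the copied path.
fixed = tree.xpath(strip_tbody(copied_from_browser))

# Workaround 2: use // so the path matches with or without a tbody.
tolerant = tree.xpath("/html/body/table//tr/td[2]")

print(missed, [td.text for td in fixed], [td.text for td in tolerant])
```

Stripping tbody is only safe when the raw source really omits it. The // version is the more forgiving choice, since it matches either way, as long as the table has no nested tables inside it.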

And that’s about it. For this side project I’m looking into our ‘inbound data pipeline’ so I expect to write a number of these. Pretty fun!
