Web Scraping with OutWit Hub
OutWit Hub is a web browser built for extracting data from websites, which makes it a handy tool for anyone who needs to scrape data off webpages.
Many websites display information in lists or tables, but don’t give users the option to download the data. This is a problem, especially with government websites that are supposed to make public information accessible to the public.
OutWit Hub can solve that problem. As an example, we’ll extract data from the website of the Disciplinary Board of the Supreme Court of Pennsylvania.
Extracting Public Data
The page shows a table of recent actions taken by Pennsylvania’s Supreme Court against lawyers in the state. The table lists each attorney’s name, county and the action taken against him or her (e.g., suspension or disbarment).
This is how the webpage looks in OutWit Hub. The browser has a menu of options on the left. Select “data,” then click “tables.”
Voila! OutWit Hub has extracted the data from the webpage.
We chose “tables” because the data are contained in an HTML <table>. If the data were nested in an HTML list (e.g., <ol>) instead of a <table>, then we would choose “lists.”
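For readers curious what pulling rows out of an HTML <table> amounts to, here is a minimal Python sketch that uses only the standard library. It is not how OutWit Hub works internally. The class name TableRows and the sample markup are made up for illustration and are not taken from the Disciplinary Board page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the cell being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample markup (hypothetical names), not the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>County</th><th>Action</th></tr>
  <tr><td>Jane Doe</td><td>Example County</td><td>Suspension</td></tr>
  <tr><td>John Roe</td><td>Sample County</td><td>Disbarment</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

A list such as <ol> would need the same idea with <li> in place of the row and cell tags, which is roughly why the tool offers “tables” and “lists” as separate choices.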
We can select all the data by pressing Ctrl + A (Command + A on a Mac). If we don’t want everything, hold down Shift and click just the rows that we want.
After making our selection, click “catch.”
Now it’s time to sort and filter our data. Click the icon that looks like a miniature spreadsheet with a triangle.
Some data columns are empty. Others have garbled information. Let’s uncheck them.
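The same cleanup can be done after export, too. As a rough sketch, assuming the rows have been loaded into Python as lists of strings, the hypothetical helper below drops every column whose cells are all blank. The sample rows are invented. Garbled columns still have to be spotted by eye, since a program cannot easily tell which text is junk.

```python
def drop_empty_columns(rows):
    """Return the rows without the columns whose cells are all blank."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Keep a column only if at least one cell in it has visible text.
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return [[row[i] for i in keep] for row in padded]


# Invented sample rows with one completely blank column.
rows = [
    ["Name", "", "County", "Action"],
    ["Jane Doe", "", "Example County", "Suspension"],
    ["John Roe", " ", "Sample County", "Disbarment"],
]

cleaned = drop_empty_columns(rows)
for row in cleaned:
    print(row)
```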
Finally, click “export,” then choose a file format. OutWit Hub can save our data in popular file formats such as CSV, HTML and JSON. It can even insert our data into a SQL database.
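To give a feel for how these formats differ, here is a small Python sketch that writes the same records as CSV, as JSON and into a SQLite table, using only the standard library. The column names, the table name actions and the sample records are assumptions made for this example. The files that OutWit Hub itself produces may be laid out differently.

```python
import csv
import io
import json
import sqlite3

# Invented column names and records, for illustration only.
header = ["name", "county", "action"]
records = [
    ["Jane Doe", "Example County", "Suspension"],
    ["John Roe", "Sample County", "Disbarment"],
]

# CSV: one header line, then one line per record.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(header)
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: a list of objects keyed by column name.
json_text = json.dumps([dict(zip(header, r)) for r in records], indent=2)

# SQL: an in-memory SQLite table (hypothetical table name "actions").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (name TEXT, county TEXT, action TEXT)")
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", records)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM actions").fetchone()[0]

print(csv_text)
print(json_text)
print("rows in SQLite table:", count)
```

CSV is the easiest to open in a spreadsheet, JSON suits other programs, and a database makes sense once the data need to be queried.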
Other Uses for OutWit Hub
Here are a couple of other ideas for OutWit Hub. Sports fans can scrape their favorite teams’ scores off the team websites. History buffs can pull a timeline of ancient China off a Wikipedia page.
The free version of OutWit Hub limits us to extracting 100 rows of data each time. The paid versions remove that limitation. I use the free version, and it suits my needs.
This article originally appeared as a post on my blog at www.gary-pang.com.