Web Scraping with OutWit Hub

OutWit Hub is a Web browser built for extracting data from websites. This is a great tool for people looking to scrape data off webpages.

Many websites display information in lists or tables, but don’t give users the option to download the data. This is a problem, especially with government websites that are supposed to make public information accessible to the public.

OutWit Hub can solve that problem. As an example, we’ll extract data from the Disciplinary Board of the Supreme Court of Pennsylvania.

Website for the Disciplinary Board of the Supreme Court of Pennsylvania

Extracting Public Data

We see a table of recent actions taken by Pennsylvania’s Supreme Court against lawyers in the state. The table lists each attorney’s name, county and the action against him or her (e.g., suspension, disbarment).

How the website looks on OutWit Hub

This is how the webpage looks in OutWit Hub. The browser has a menu of options on the left. Select “data,” then click “tables.”

Voila! OutWit Hub has extracted the data from the <table>.

We chose “tables” because the data are contained in an HTML <table>. If the data were nested in an HTML list (i.e. <ul> or <ol>) instead of a <table>, then we would choose “lists.”
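For readers curious about what pulling rows out of a <table> involves, here is a minimal Python sketch that uses only the standard library. It is not how OutWit Hub works internally. The class name TableExtractor, the helper extract_rows and the sample HTML (including the attorney names) are all made up for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample markup; the names below are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>County</th><th>Action</th></tr>
  <tr><td>Jane Doe</td><td>Allegheny</td><td>Suspension</td></tr>
  <tr><td>John Roe</td><td>Philadelphia</td><td>Disbarment</td></tr>
</table>
"""


def extract_rows(html):
    """Return the table rows in html as a list of lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The same idea would apply to a <ul> or <ol>: watch for <li> tags instead of <tr> and <td>.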

We can select all the data by pressing Ctrl-A (or Command-A on a Mac). If we don’t want everything, we can hold down Shift and click just the rows that we want.

After making our selection, click “catch.”

Now it’s time to sort and filter our data. Click the icon that looks like a miniature spreadsheet with a triangle.

Let’s filter our data and remove the information that we don’t need.

Some data columns are empty. Others have garbled information. Let’s uncheck them.
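The same cleanup can be sketched in Python for anyone who wants to repeat it outside OutWit Hub. The function name keep_columns, the column label "Col5" and the sample rows are assumptions made for this example. Empty columns are found automatically, but garbled ones still have to be named by hand.

```python
def keep_columns(rows, unwanted=()):
    """Drop columns that have no data or whose header is in `unwanted`.

    `rows` is a list of lists; the first row is treated as the header.
    """
    header, *data = rows
    keep = []
    for i, name in enumerate(header):
        # A column counts as empty when every data cell is blank or missing.
        has_data = any(i < len(r) and r[i].strip() for r in data)
        if has_data and name not in unwanted:
            keep.append(i)
    # Rebuild every row (header included) from the surviving column indexes.
    return [[r[i] if i < len(r) else "" for i in keep] for r in rows]


# Hypothetical extracted rows: one empty column and one garbled column.
raw_rows = [
    ["Name", "", "County", "Action", "Col5"],
    ["Jane Doe", "", "Allegheny", "Suspension", "x#@!"],
    ["John Roe", "", "Philadelphia", "Disbarment", "??%"],
]

cleaned = keep_columns(raw_rows, unwanted={"Col5"})
for row in cleaned:
    print(row)
```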

Finally, click “export,” then choose a file format. OutWit Hub can save our data in popular file formats such as CSV, HTML and JSON. It can even insert our data into a SQL database.
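To show what those export formats look like, here is a rough Python sketch that writes the same rows as CSV, JSON and a SQL table using only the standard library. It is an illustration, not OutWit Hub’s own export code. The table name actions, its column names and the sample rows are invented, and the database lives in memory rather than in a file.

```python
import csv
import io
import json
import sqlite3

# Hypothetical cleaned rows; the first row is the header.
rows = [
    ["Name", "County", "Action"],
    ["Jane Doe", "Allegheny", "Suspension"],
    ["John Roe", "Philadelphia", "Disbarment"],
]
header, *data = rows

# CSV: one line per row, values separated by commas.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# JSON: a list of objects keyed by the header names.
records = [dict(zip(header, r)) for r in data]
json_text = json.dumps(records, indent=2)

# SQL: insert the rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (attorney TEXT, county TEXT, outcome TEXT)")
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", data)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM actions").fetchone()[0]

print(csv_text)
print(json_text)
print(row_count)
```

To keep the results, write csv_text and json_text to files and pass a filename to sqlite3.connect instead of ":memory:".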

Other uses for OutWit Hub

Here are a couple of other ideas for OutWit Hub. Sports fans can scrape their favorite teams’ scores off the team website. Or history buffs can take ancient China’s timeline off a Wikipedia page.

The free version of OutWit Hub limits us to extracting 100 rows of data each time. The paid versions remove that limitation. I use the free version, and it suits my needs.

This article originally appeared as a post on my blog at www.gary-pang.com.