Using site-scrapers to build sample datasets & visualizations

Allan White
3 min read · Nov 9, 2015


https://classicdozers.wordpress.com — Giant Tonka!

Over the weekend, while rebuilding a site for my wife, I found myself in need of a site-scraper. Site-scrapers are tools that extract structured content from websites or apps; there are many free ones, and they vary widely in complexity, from command-line tools with no UI to simpler ones. Her old website was built years ago with ExpressionEngine 1, the admin CP was broken, and I didn’t want to try to make sense of the messy database or other export tools: data rot at its finest.

I stumbled upon a pretty decent free one from Import.io (you can read a review here). They have a nice set of freemium tools, which let me do a quick scan of her blog archives.

There’s a lot to it, and I think they’ve done a great job of creating tools and a user experience that lets non-technical people extract clean, structured data. Here’s a look at my first try, after I “trained” it by clicking on the entries in the old blog’s archive list:

Once you’ve built your dataset, you can use it right away in a variety of ways, including GET/POST requests, an import into Google Sheets, and even an API that returns JSON.
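To give a feel for what working with a JSON dataset like this can look like, here is a minimal Python sketch that turns a list of scraped rows into CSV, ready for a spreadsheet. The sample data and the field names (`title`, `url`, `date`) are made up for illustration; they are not Import.io's actual output format, and the helper `rows_to_csv` is hypothetical.

```python
import csv
import io
import json

# Hypothetical scraped rows: the field names and values are assumptions
# for illustration, not the real shape of any particular tool's export.
SAMPLE_JSON = """
[
  {"title": "First post", "url": "https://example.com/blog/first-post", "date": "2009-03-01"},
  {"title": "Second post", "url": "https://example.com/blog/second-post", "date": "2009-04-12"}
]
"""


def rows_to_csv(json_text, fields=("title", "url", "date")):
    """Convert a JSON array of objects into CSV text with a header row."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        # Keep only the requested fields; missing ones become empty cells.
        writer.writerow({name: row.get(name, "") for name in fields})
    return buffer.getvalue()


csv_text = rows_to_csv(SAMPLE_JSON)
print(csv_text)
```

The resulting text can be saved to a `.csv` file and opened in Excel or imported into Google Sheets.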

Unfortunately for me, I wasn’t able to get the tool to “dig” a level deeper and extract content using their Crawler tool; some issue with my server kept it from probing the site. As I was in a rush, and there were only about 20 entries, I fell back on my half-automated process for converting web content to Markdown files in nvALT.

Import.io has a nice UI for helping you understand the kind of data source you’re trying to scrape — a list of things, or a single page like a product page. It has the concept of “training” to build a tool that is customized for a given site or source.
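For the curious, the “list of things” case can be sketched by hand in a few lines of Python using only the standard library. This is not how Import.io works internally; it is a simplified illustration that assumes the archive is a `<ul>` with a known class name (`archive` here, a made-up value) and that lists are not nested. The sample HTML and the `ArchiveLinkParser` class are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArchiveLinkParser(HTMLParser):
    """Collect (title, url) pairs for links inside a <ul> with a given class."""

    def __init__(self, base_url, list_class="archive"):
        super().__init__()
        self.base_url = base_url
        self.list_class = list_class  # assumed class name of the archive list
        self.in_list = False
        self.current_href = None
        self.current_text = []
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and self.list_class in (attrs.get("class") or "").split():
            self.in_list = True
        elif tag == "a" and self.in_list and attrs.get("href"):
            # Resolve relative links against the site's base URL.
            self.current_href = urljoin(self.base_url, attrs["href"])
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.entries.append((title, self.current_href))
            self.current_href = None
        elif tag == "ul":
            # Simplification: any closing </ul> ends the archive list.
            self.in_list = False


# Made-up archive page for illustration.
SAMPLE_HTML = """
<ul class="archive">
  <li><a href="/blog/giant-tonka">Giant Tonka!</a></li>
  <li><a href="/blog/old-dozers">Old dozers</a></li>
</ul>
<ul class="footer"><li><a href="/about">About</a></li></ul>
"""

parser = ArchiveLinkParser("https://example.com")
parser.feed(SAMPLE_HTML)
for title, url in parser.entries:
    print(title, url)
```

The “training” step in a point-and-click tool plays roughly the role of the hard-coded class name here: it tells the scraper which repeated elements on the page are the entries you care about.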

“Behind the Beats”

One neat example from the Import.io blog is Behind the Beats, a nice visualization of “music samples used by beats producers through the years”. They used Import.io to generate the datasets from whosampled.com, then Excel, RAW, Gephi, and Adobe Illustrator to create the visualizations.

I’m always fascinated by “bridge” tools: tools that connect things to other things. Other examples include Zapier, Gerty, Slack, and [the now-defunct] Yahoo Pipes. They help break down walls between the isolated islands of the vast archipelago of cloud tools and sites out there, and help us understand and make use of them in new ways.

What data tools do you use? What are your favorite visualization tools?


Allan White

Design Lead for Datica Health, living in Portland, Oregon. Let’s talk on Twitter at @allanwhite.