Using site-scrapers to build sample datasets & visualizations
Over the weekend, while rebuilding a site for my wife, I found myself in need of a site-scraper. Site-scrapers are tools that extract structured content from websites or apps; there are many free ones, and they vary widely in complexity, from command-line tools with no UI to simpler ones. Her old website was built years ago with ExpressionEngine 1, the admin CP was broken, and I didn’t want to try to make sense of the messy database or other export tools: data rot at its finest.
I stumbled upon a pretty decent free one from Import.io (you can read a review here). They have a nice set of freemium tools, which let me do a quick scan of her blog archives.
There’s a lot to it, and I think they’ve done a great job of creating tools and a user experience that lets non-technical people extract clean, structured data. Here’s a look at my first try, after I “trained” it by clicking on the entries in the old blog’s archive list:
Once you’ve built your dataset, you can instantly use it in a variety of ways, including GET/POST requests, an import into Google Sheets, and even an API that returns JSON:
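As a rough illustration of what "using the dataset" can look like once it is available as JSON, here is a minimal Python sketch that turns a list of scraped rows into CSV text, which a spreadsheet can then import. The payload shape and the field names (`title`, `url`, `date`) are assumptions made up for this example; they are not Import.io's actual response format.

```python
import csv
import io
import json

# Hypothetical payload: a plain list of scraped rows with made-up field names.
# This is NOT Import.io's real response format, just a stand-in for the example.
SAMPLE_JSON = """
[
  {"title": "First post", "url": "https://example.com/blog/first-post", "date": "2009-03-01"},
  {"title": "Second post", "url": "https://example.com/blog/second-post", "date": "2009-04-12"}
]
"""


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, keeping only the given columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently drop any keys we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = json.loads(SAMPLE_JSON)
csv_text = rows_to_csv(rows, ["title", "url", "date"])
print(csv_text)
```

In practice the JSON would come from wherever the scraper exposes it rather than from an inline string; the conversion step stays the same.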
Unfortunately for me, I wasn’t able to get their Crawler tool to “dig” a level deeper and extract the content of each entry; some issue with my server kept the tool from probing it. As I was in a rush, and there were only about 20 entries, I fell back on my half-automated process for converting web content to Markdown files in nvALT.
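For a small archive like this, a scripted fallback does not need much. The following is a minimal sketch, not the process described above: it uses Python's standard-library `html.parser` to collect the links from an archive list and print them as a Markdown list. The sample HTML, the `LinkCollector` class, and the `to_markdown_list` helper are all hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (title, href) pairs for every <a href="..."> in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so titles come out on one clean line.
            title = " ".join("".join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def to_markdown_list(links):
    """Render (title, href) pairs as a Markdown bullet list of links."""
    return "\n".join(f"- [{title}]({href})" for title, href in links)


# Made-up archive markup standing in for the old blog's archive page.
SAMPLE_HTML = """
<ul class="archive">
  <li><a href="/blog/first-post">First post</a></li>
  <li><a href="/blog/second-post">Second   post</a></li>
</ul>
"""

parser = LinkCollector()
parser.feed(SAMPLE_HTML)
markdown = to_markdown_list(parser.links)
print(markdown)
```

This only covers the "list of things" half of the job; fetching each entry and converting its body to Markdown would still be a separate step.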
Import.io has a nice UI for helping you understand the kind of data source you’re trying to scrape — a list of things, or a single page like a product page. It has the concept of “training” to build a tool that is customized for a given site or source.
“Behind the Beats”
One neat example from the Import.io blog is Behind the Beats, a nice visualization of “music samples used by beats producers through the years”. They used whosampled.com and Import.io to generate the datasets, and Excel, RAW, Gephi, and Adobe Illustrator to create the visualizations:
I’m always fascinated by “bridge” tools: tools that connect things to other things. Other examples include Zapier, Gerty, Slack, and the now-defunct Yahoo Pipes. They help break down walls between the isolated islands of the vast archipelago of cloud tools and sites out there, and help us understand and make use of them in new ways.