What I learned from scraping over 15,000 web pages

In the grand scheme of the web, 15,000 web pages is a drop in the bucket. However, you can learn a lot by sourcing and scraping that many from across a diverse set of sites. In building a Python app to find and Tweet out interesting data science content, I had to gather a lot of potential articles, videos and blog posts to work with and then scrape those to learn more about them. Here’s some of what I learned along the way.

Finding pages to scrape is a task in itself

I talked a little bit about how I sourced URLs to go after in the post I linked above, but getting started was harder than I initially thought. One of my main sources was links people posted on Twitter. It's a good start, but there are downsides.

  • People often post irrelevant URLs or none at all in their Tweets. Twitter is used for more than sharing links after all.
  • Some topics and keywords in particular lead to “spam” links. For example, search Twitter for things like “entrepreneurship” and “remote work” and you’ll find a lot of get rich quick schemes. In my case those links are of no value.
  • A great number of people tweet out the same URLs, so getting distinct ones is a challenge when you want to gather a lot in a short amount of time.

There are plenty of other ways to find good URLs to go after. It seems a bit recursive, but you can do so by scraping some common sites, grabbing URLs and then scraping those URLs. YouTube, Reddit, and your favorite blogs and RSS feeds will do nicely.
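As a sketch of that recursive approach, here's how you might pull article links out of an RSS feed using only Python's standard library. The feed content below is an inline stand-in for what you'd get by requesting a real blog's RSS endpoint:

```python
import xml.etree.ElementTree as ET

def links_from_rss(feed_xml: str) -> list[str]:
    """Extract the <link> of each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item") if item.findtext("link")]

# A tiny inline feed stands in for a real request to a blog's RSS endpoint.
sample = """<rss version="2.0"><channel>
  <item><title>Post A</title><link>https://example.com/a</link></item>
  <item><title>Post B</title><link>https://example.com/b</link></item>
</channel></rss>"""

print(links_from_rss(sample))  # each of these URLs can then be scraped in turn
```

Each extracted URL goes into the queue, gets scraped, and may in turn yield more URLs.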

You’ll find the limits of your scraper and piss off some people along the way

I wrote my scraper in Python and relied mostly on the requests and Beautiful Soup packages. Neither gives you exactly what you want out of the box, but they're excellent building blocks. Together they give you the power to request any URL, grab the resulting HTML and pull what you want from it. Sounds easy, right? Well, here are a few mistakes and poor assumptions I made along the way.

I didn’t anticipate people would share links to large files — We all share blog posts and articles on Twitter and Reddit but what about ZIP files, Google Docs and massive PDFs? Well, I don’t but apparently it’s a thing. If you just go out and try to load all that data into a Beautiful Soup object you’ll pay the price. This was the first (of a few) times my hosting service sent me a friendly email telling me they killed one of my jobs and I should be more careful. Side note — It’s better to move from a shared server to a private one :)

Checking the content-length header before you download is a quick and dirty way to avoid massive files, but you may still fall into a trap if that header doesn't come back in the response or its value isn't accurate (it happens).
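A minimal sketch of that check, assuming a requests-style headers mapping. The 10 MB threshold and the function name are mine, not from the original scraper:

```python
MAX_BYTES = 10 * 1024 * 1024  # arbitrary 10 MB cap

def looks_small_enough(headers, max_bytes=MAX_BYTES):
    """Return True only when Content-Length is present and under the cap.

    Errs on the side of skipping: a missing or malformed header fails the check.
    """
    length = headers.get("Content-Length") or headers.get("content-length")
    try:
        return int(length) <= max_bytes
    except (TypeError, ValueError):
        return False  # header absent or not a number; safer to skip

# In practice you'd call requests.head(url).headers and pass the result in:
print(looks_small_enough({"Content-Length": "4096"}))  # True
print(looks_small_enough({}))                          # False: no header, skip
```

A HEAD request is cheap, so this check costs little compared to pulling a huge body into a Beautiful Soup object.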

Some sites do their best to reject scrapers, but not many — I wrote an article a while back titled Ethics in Web Scraping in which I discussed best practices for respecting site owners when you're scraping, as well as respecting scrapers as an owner. To my surprise, the vast majority (over 99%) of site owners are very reasonable and allow scraping as long as you're not slamming their site. There are, however, some that block you unless you fake your headers to look like a regular browser, but I still prefer to be honest and identify myself truthfully. In scraping these 15K+ pages I didn't receive a single complaint from a site owner, despite sharing my email in the request headers.
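Identifying yourself honestly comes down to the headers you send with each request. The bot name, URL, and email below are placeholders, not the ones I actually used:

```python
# Honest request headers: name the bot and give site owners a way to reach you.
# The app name, URL, and email here are placeholders.
def polite_headers(contact_email: str) -> dict:
    return {
        "User-Agent": "my-content-bot/1.0 (+https://example.com/about)",
        "From": contact_email,  # a standard, if rarely used, header for contact info
    }

headers = polite_headers("you@example.com")
# Then pass them on every request, e.g.:
#   requests.get(url, headers=headers, timeout=10)
print(headers["From"])
```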

There are a lot of sites out there

Out of the 15,000+ URLs I went through, there were over 3,000 distinct domains. I was surprised there wasn't more dominance from the well-known sites; it's actually a bit refreshing to see the results. Keep in mind this is for data science related content, so even some "big" sites like the Wall Street Journal are sparse in this context, but nonetheless I wasn't expecting such a variety of personal blogs and other sites. Here's a look at the distribution.

[Chart: # of URLs by site]

I wish more pages supported Open Graph, and that Open Graph was enough

If you’re looking to get common meta-data from a web page, Open Graph is a blessing. If you haven’t heard of Open Graph, it’s what makes sharing URLs on social networks so easy and consistent. Want the title, author and image for an article when you share it on Twitter? Just grab those properties as defined in the Open Graph protocol from the HTML and be on your way.
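Grabbing those properties with Beautiful Soup takes only a few lines. The sketch below parses an inline snippet rather than a live page, and collects every og: meta tag into a dict:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def open_graph_properties(html: str) -> dict:
    """Collect every <meta property="og:..."> tag into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag["property"]: tag.get("content", "")
        for tag in soup.find_all("meta", property=True)
        if tag["property"].startswith("og:")
    }

snippet = """<head>
  <meta property="og:title" content="A Data Science Post" />
  <meta property="og:image" content="https://example.com/cover.png" />
</head>"""

print(open_graph_properties(snippet))
```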

The bad news?

  • Only about 80% of pages I scraped supported Open Graph
  • You can’t get all that you need from Open Graph properties if you’re doing deeper analysis. Title, the date published, and share images are great, but if you’re a data geek like me you’ll need to extract more from the HTML to get the raw materials you need. That’s a topic for another day, but a massive challenge given a lack of standards.

Don’t worry about paywalls

I had originally considered skipping over URLs from well-known paywalled sites, but you can at least get basic meta-data (often Open Graph properties) from such pages as well as summary and keyword information. And don’t worry, the site owners are usually willing to let you scrape what they provide publicly without any pushback.

Review and revise your scraping logic often

When I first built my scraper and fed it URLs to go after, I worked out a few bugs and then let it run unattended for a few days unless it blew up. What I didn't realize was that I was missing out on some valuable data, and I had to go back and re-scrape as I made improvements.

For example, some pages don't support Open Graph, but there are common fallback elements you can grab instead. A good example is the title of the content. I'd prefer to get it from the og:title property defined in Open Graph, but if it's not there, the <title> tag usually does a reasonable job.
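That fallback takes one extra line with Beautiful Soup. Here's a sketch parsing an inline snippet for illustration:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def page_title(html: str) -> str:
    """Prefer og:title, fall back to the <title> tag, else an empty string."""
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:title")
    if og and og.get("content"):
        return og["content"]
    return soup.title.get_text(strip=True) if soup.title else ""

no_og = "<html><head><title>Plain Old Title</title></head></html>"
print(page_title(no_og))  # falls back to the <title> tag
```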

In addition, I realized that I had to tweak my logic for dealing with the large file size issue I noted above. I suggested checking the content-length header before you try to load and parse a page, and noted that not all responses include it. If you want to be really safe, just skip any page without that header. If you'd rather take a chance, read the body in chunks and bail out once it gets too big, though it's a little more work that way.
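The chunked approach can be sketched like this. The byte cap is arbitrary, and against a real page the iterator would come from requests.get(url, stream=True).iter_content(...):

```python
MAX_BYTES = 10 * 1024 * 1024  # arbitrary cap

def read_capped(chunks, max_bytes=MAX_BYTES):
    """Accumulate chunks until the cap is hit; return None for oversized bodies."""
    body = b""
    for chunk in chunks:
        body += chunk
        if len(body) > max_bytes:
            return None  # too big: bail out instead of exhausting memory
    return body

# With requests this would look like:
#   with requests.get(url, stream=True, timeout=10) as r:
#       body = read_capped(r.iter_content(chunk_size=65536))
print(read_capped([b"hello ", b"world"]))  # b'hello world'
```

Because streaming stops as soon as the cap is exceeded, an unexpectedly huge file costs you at most max_bytes of memory instead of a killed job.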

In summary

Web scraping is nothing new, and it's a very common practice. Nonetheless, it's still one of the most valuable ways to collect data for just about any analysis or content collection exercise. There's also no single way to get it right, so make sure you use best practices, act ethically and revisit your logic often.