A Primer on Web Scraping

By Eamon Cagney

As the resident tech guy at Trendify, I’ve spent a good part of my time scouring the internet for data, something I had virtually no experience with when I started. After teaching myself a few languages and combing through a lot of websites, I’ve picked up some tips along the way. Here they are, for any of you tech-oriented people who might be looking to do something similar:

Watch out for the newline character:

Just a funny thing that tripped me up when I started learning new languages. If you’ve ever switched from a C-style language to something else (in my case mostly PHP and Ruby), you may have noticed something a little different when reading from a file. I had a file full of URLs and had written a script to follow each URL and check for relevant data. However, it wasn’t working, and for the life of me I couldn’t figure out why. After going through pretty much the entire process, I found that every request was coming back with a 4xx error status in the response headers. What was weirder, when I manually entered the addresses into the program it worked fine. After scratching my head for a little while, I figured the only explanation for two seemingly identical addresses behaving differently had to be something invisible. So I tried chopping the last two characters off each line I was reading in, and it worked perfectly. I hadn’t anticipated (since I hadn’t really properly learned the language) that PHP keeps the line-ending character(s) on each line it reads from a file, something the line-reading routines I was used to in C++ and Java strip for you. Not a huge issue, but a funny story, and something worth looking out for when you’re trying to learn new languages.
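To make the fix concrete, here is a minimal sketch in Python, which (like PHP) leaves the line ending attached when you loop over a file. The function name read_urls and the example.com addresses are made up for illustration; this is not the script I actually wrote.

```python
def read_urls(lines):
    """Return the non-empty URLs from an iterable of lines (e.g. an open file)."""
    urls = []
    for line in lines:
        url = line.strip()  # drops "\n", "\r\n", and stray spaces or tabs
        if url:             # skip blank lines
            urls.append(url)
    return urls


# Hypothetical sample: what the lines of a Windows-style text file look like
# before any cleanup. The trailing "\r\n" is invisible when printed.
raw_lines = ["http://example.com/a\r\n", "http://example.com/b\r\n", "\r\n"]

# The raw line and the hand-typed address look the same but are not equal.
looks_identical_but_differs = raw_lines[0] != "http://example.com/a"

clean_urls = read_urls(raw_lines)
```

Stripping whitespace is safer than chopping a fixed number of characters, since it works whether the file ends lines with one character or two. With a real file you would pass the open file handle in place of raw_lines.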

Be as broad as you can without being inaccurate:

One thing a scraper doesn’t have is human intuition. A person looking for John Smith might see John R. Smith and think: probably the same person. A computer that’s been told to check specifically for “John Smith” will see John R. Smith and say: no match. A big part of scraping for data is making sure you actually have the right data in front of you. The verification process is very important, but being creative can help ensure that you don’t turn away the right data. A simple fix for issues with middle names and nicknames (that is, searching for a nickname and getting a full name; the other way around is a bit more complicated) is to split the name on the space into two separate variables and check that both appear in the result. “John” and “Smith” are both contained in “John R. Smith”, and “Matt” is contained in “Matthew”. When I was trying to verify I had the right company, I would generally strip out every “ Inc”, “, Inc”, “ Inc.”, and “, Inc.”, and do the same with Ltd and the like. The trick is to make the match as unspecific as possible while still keeping it very unlikely that you’ll mistake the wrong data for the right stuff.
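Both tricks above can be sketched roughly like this in Python. The function names are hypothetical, and the suffix list is an assumption: only Inc and Ltd come from my own cleanup, while LLC and Corp are added here as plausible extras.

```python
import re


def name_matches(query, candidate):
    """True if every space-separated part of query appears in candidate."""
    candidate_lower = candidate.lower()
    parts = query.lower().split()
    # Require at least one part so an empty query never counts as a match.
    return bool(parts) and all(part in candidate_lower for part in parts)


# Assumed suffix list: optional comma, whitespace, the suffix, optional period,
# anchored to the end of the name.
_COMPANY_SUFFIX = re.compile(r",?\s+(inc|ltd|llc|corp)\.?$", re.IGNORECASE)


def normalize_company(name):
    """Strip a trailing company suffix and lowercase the name for comparison."""
    return _COMPANY_SUFFIX.sub("", name.strip()).lower()


same_person = name_matches("John Smith", "John R. Smith")
nickname_hit = name_matches("Matt Smith", "Matthew Smith")
different_person = name_matches("John Smith", "Jane Smith")
same_company = normalize_company("Acme, Inc.") == normalize_company("Acme Inc")
```

The substring check is deliberately loose, so a very short part like “Al” would also match “Sally”. That is the breadth-versus-accuracy trade-off in action, and how short a part you are willing to accept depends on your data.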

And finally, know when to stop:

There’s definitely something more satisfying about writing a program to go fetch data for you rather than opening up your browser of choice and fetching it yourself. But is it worth it to create something to go through a new website if it’s only going to get you 50 more data points? Probably not. Like I mentioned before, sometimes a little human intuition is just a more efficient choice, especially for small jobs. At some point it’ll save you some precious time to throw in the towel and just do a little grunt work. It may not be fun, but it’s worth it.

And there you have it, some fun/useful things I’ve come across in my adventures. Hope this has been helpful (or at least a little entertaining). Happy hunting!