Things No One Told You About Web Scraping

Image credit: http://webdata-scraping.com/

After burying my head for hours (sometimes days) in utterly ugly HTML, JavaScript and CSS codebases, inspecting every single element on the page and tracking network traffic across multiple requests, I finally managed to scrape the data out of the target website. During this month-long scraping task (as part of data acquisition), I learned some things about web scraping that dozens of blogs, tutorials and documentation pages had failed to convey. So, here are some things no one told you about web scraping.

Scraping is more than just using libraries

There is a plethora of libraries for scraping content off the web, in almost every major programming or scripting language. But the point is, you should never depend on the library alone to pull the target data in your desired format. You need to be ready to hack the library, and the process itself, when it comes to that. No library can provide everything you need to extract the data you see on your screen. Libraries handle the underlying things for you, like sessions, text extraction and tag handling, but you gotta know what’s happening under the hood. Your choice of library should depend only on its maturity and ease of use, and that’s about it. If you scrape long enough, you will learn that every library has a bottleneck, and it is better to be ready with an alternative approach than to be stuck on a library issue.

Scraping is a responsible task

You should always understand the gravity of what you are doing when you scrape. You are taking someone else’s data for your own use (sometimes with the owner’s permission, sometimes without). While doing so, be responsible and do not harm the target website in any way. Realize that every request you send to the owner’s server costs them real money. Be specific in your requests, and make your program sleep between them so that it hits the server gently instead of bombarding it. Always catch your exceptions and store your results incrementally, so that you do not resend the same sequence of requests every time your script crashes and you restart it. If you are taking data from someone, be responsible about it.
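Here is a minimal sketch of those habits in Python, using only the standard library. The names fetch_page, load_done and scrape_all, the JSON Lines output file and the two-second default delay are all my own assumptions for illustration, not something a particular library prescribes. Tune the delay to whatever the target site can comfortably take.

```python
import json
import time
import urllib.request
from pathlib import Path


def fetch_page(url, timeout=10):
    """Download one page and return its body as text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def load_done(out_path):
    """Return the set of URLs already saved by an earlier run."""
    path = Path(out_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        return {json.loads(line)["url"] for line in handle if line.strip()}


def scrape_all(urls, out_path, fetch=fetch_page, delay=2.0):
    """Fetch each URL once, pausing between requests and saving as we go."""
    done = load_done(out_path)
    failed = []
    with open(out_path, "a", encoding="utf-8") as handle:
        for url in urls:
            if url in done:
                continue  # saved by a previous run, so do not hit the server again
            try:
                body = fetch(url)
            except Exception as error:
                # Record the failure and move on instead of crashing the whole run.
                failed.append((url, str(error)))
            else:
                handle.write(json.dumps({"url": url, "body": body}) + "\n")
                handle.flush()  # push the line out of Python's buffer right away
                done.add(url)
            time.sleep(delay)  # pause before the next request
    return failed
```

If the script dies halfway, running it again with the same output file only requests the URLs that are not saved yet. The returned list of failures can be inspected and retried later at a slower pace.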

Scraping is a Sherlock’s job

Make no mistake, scraping data off real websites (especially age-old, still-maintained government websites and super scraping-proof modern ones) is no child’s play, and it is nothing like the smooth ride presented in the tutorials. You have to be a keen observer and a patient investigator (much like Mr. Holmes ;) ) to understand the flow of data that arrives on your target page. You also need your thinking cap on while pinning down the element path for the data you want. Sometimes the smallest details, like fonts or styles, give away data that the most sophisticated regexes and element selectors won’t. Having said that, the key to that kind of observation is experimentation. The more you experiment and think outside the box, the more magical your scraping experience will be.
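To make the font and style point concrete, here is a rough sketch that picks out text purely by its inline style, using Python's built-in html.parser. The class name StyledTextCollector and the sample markup are made up for illustration; a real page will need its own marker and probably a more forgiving parser.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class StyledTextCollector(HTMLParser):
    """Collect the text of every element whose inline style contains a marker."""

    def __init__(self, style_marker):
        super().__init__()
        self.style_marker = style_marker.replace(" ", "").lower()
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.matches = []   # finished text of each matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.depth or self.style_marker in style:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.matches.append("".join(self.buffer).strip())
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


sample = (
    '<table><tr><td style="font-weight: bold">Total</td><td>10</td></tr>'
    '<tr><td style="color:red; font-weight:bold"><span>Tax</span> due</td>'
    "<td>2</td></tr></table>"
)
collector = StyledTextCollector("font-weight:bold")
collector.feed(sample)
print(collector.matches)
```

Here the label cells carry no class or id worth targeting, but the bold styling alone is enough to separate them from the value cells.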

Scraping can be depressing

Scraping can sometimes be a real pain in the ass and irritate you to the core. Even after applying every method in your arsenal, the data isn’t quite what you wanted, or is too dirty to be of any use. When you feel like it’s impossible and it ‘CANNOT’ be done, just take a break and restart with a fresh mindset. Trust me, you will never get the data on the first attempt, and possibly not on the second or third either, but you have to keep investigating the data flow and structure for entry points and patterns in the data you need. Remember, leave the workstation when you get stuck like this. Restart with a fresh mind and a hot coffee.

Scraping is data preprocessing in disguise

There will be times when you have everything in place and yet your data isn’t the way you wanted. Or you may run into encoding issues and your write will fail. Your scraped data is of no use if it is not properly formatted and encoded. You need to understand these text processing workflows and issues to be able to skim the data out of the scraped junk. If you cannot manipulate the text to fit your needs, your scraped data and all your efforts will be rendered useless.
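As a small illustration of the kind of cleanup I mean, here is a sketch with two helpers built on the Python standard library. The names clean_text and decode_bytes are hypothetical, and the list of candidate encodings is only a guess that happens to suit many Western European pages; check what your target site actually serves.

```python
import html
import re
import unicodedata


def clean_text(raw):
    """Normalize one scraped string into tidy, comparable text."""
    text = html.unescape(raw)                   # turn entities like &amp; into characters
    text = unicodedata.normalize("NFKC", text)  # fold lookalikes, e.g. no-break space to space
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


def decode_bytes(payload, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn, then fall back to replacement characters."""
    for name in encodings:
        try:
            return payload.decode(name)
        except UnicodeDecodeError:
            continue
    return payload.decode("utf-8", errors="replace")


print(clean_text("  Price:&nbsp;&amp;  10\n\t"))
print(decode_bytes("café".encode("cp1252")))
```

When you write the cleaned text out, pass an explicit encoding (for example open(path, "w", encoding="utf-8")) instead of relying on the platform default, which is a common reason for writes failing on one machine and not on another.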

I hope these points help and guide you the next time you take up scraping data off someone’s website. Feel free to comment with your suggestions and views on any of the above.