Motivation — At the time of writing this post, there are 1,314 posts on Upwork for web scraping projects. After contacting a few of the posters and reviewing their requirements, it became quite obvious that their needs were not complex, just extensive.
I started building one scraper at a time, targeting the websites the clients wanted to scrape. After three happy clients, I realized that a simple format for defining the data to be scraped, plus a generic script, allowed me to cut the development time from a few hours to a few minutes.
Basic scraping follows these steps:
- Requesting the page — requests library, selenium
- [optional] Asserting that data is loaded — custom script
- Parsing the HTML — BeautifulSoup library
- Retrieving the required data — BeautifulSoup library
- Storing the data — custom script
Each step on its own is quite easy to implement; the tricky part is creating a reusable script that covers most of the cases.
Requesting the page — requests library, selenium
Using the requests library is the easiest way to go. Example:
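A minimal fetch with requests might look like the sketch below; the URL in the comment is a placeholder, not a real client site.

```python
import requests

def fetch_page(url: str) -> str:
    """Return the raw HTML of a page, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on 4xx/5xx responses
    return response.text

# html = fetch_page("https://example.com")  # hypothetical target
```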
Attempt to use this with any website that loads data dynamically and you will quickly see the issue here. It requests the page as is, without allowing any JS to run on the page. Solution? Use selenium.
The code now looks like this:
[optional] Asserting that data is loaded — custom script
To assert that the data is loaded, a simple check for the existence of certain tags or text can be sufficient. For instance, consider the case of Google Careers Search Results where the results are loaded dynamically. Example:
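One way to sketch such a check with BeautifulSoup — the `search-result` class here is a hypothetical placeholder; inspect the real page to find the element that wraps each result:

```python
from bs4 import BeautifulSoup

def results_loaded(html: str) -> bool:
    """Return True if at least one search-result element is present.

    The 'search-result' class is a hypothetical placeholder, not the
    actual markup used by Google Careers.
    """
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_="search-result") is not None

print(results_loaded('<div class="search-result">Software Engineer</div>'))  # True
print(results_loaded('<div class="spinner">Loading</div>'))                  # False
```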
Parsing the HTML — BeautifulSoup library
Parsing the HTML was already done in the previous step to check that the data had loaded; however, since that step is optional, parsing the HTML deserves a step of its own.
For parsing HTML in Python, there is the BeautifulSoup library, which is both easy to use and rather powerful. It takes one line of code to parse the HTML. Example:
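A minimal sketch with a made-up document, showing that the parse itself is a single call:

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Careers</h1><p>Openings</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # the one line that parses the document

print(soup.h1.get_text())  # Careers
```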
Retrieving the required data — BeautifulSoup library
Retrieving data is straightforward using BeautifulSoup’s find() function. Example:
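A sketch with hypothetical markup — the tag names and classes are invented for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div class="job">
  <span class="title">Data Engineer</span>
  <span class="location">Remote</span>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag matching the given name and attribute filters
title = soup.find("span", class_="title")
location = soup.find("span", class_="location")
print(title.get_text(), "-", location.get_text())  # Data Engineer - Remote
```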
Putting Everything Together
If you’ve made it this far, you can see that implementing your own scraper in Python is a pretty easy task. The challenge comes when you want to put it all together into a generic scraper that you can use over and over again, finishing scraping jobs in a tenth of the time. That’s exactly what I went on to do; however, going through all the implementation details here would be overkill, as well as a duplication of the documentation for the scraper I built.
You can find the generic scraper with its documentation in this Github repository.
On the other hand, if you choose to write your own based on the knowledge in this article, you will need a format for representing the tags that contain the data you want. In my case, I reused the one that BeautifulSoup uses, with additions that you can read about in the documentation for my scraper function here.
Web scraping can be trivial or complex. This article explains how to build a reasonably powerful, mostly generic scraper that can handle most of the scraping requested of the everyday developer. For the more complex jobs, I would recommend checking out a dedicated scraping library such as Scrapy, or building your own.