Building your first website scraper

Have you ever wanted to extract data from an online directory into an Excel spreadsheet (.csv file), but kept putting it off because it seemed too daunting a task? If yes, read on.

Before we start, what do you need to know?

Fear not if you do not have any of the below; I will share easy-to-understand resources that you can use to bring yourself up to speed.

  1. A basic understanding of Python programming and its concepts
  2. An understanding of how HTML is set up

You can skip the below if you are confident in the above 2 requirements.

A basic understanding of Python programming and its concepts

You will need to go through the videos below to bring yourself up to speed. All the content in these videos can also be read in this book (Python for Informatics), and if you have no knowledge of Python at all and want to invest a few weeks (~1–4 weeks) into it, I highly recommend this series by Dr. Chuck ( https://www.youtube.com/watch?v=G721cooZXgs&list=PLt7ksebytagusNfodMu720fjOLwWlnb7S ).

If you have a good understanding of the basics and are pressed for time, run through the sections below.

Section I : Understanding urllib library

The video below should walk you through the basics of urllib. (An important side note AFTER WATCHING THE VIDEO: urllib.urlopen(your_url) is valid too.)
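To make this concrete, here is a minimal fetch with urllib. The URL is just a placeholder; the Python 3 spelling is urllib.request.urlopen, while Python 2 uses the urllib.urlopen form mentioned above.

```python
# Minimal page fetch with urllib (Python 3 spelling; in Python 2
# the same call is urllib.urlopen, as noted above).
from urllib.request import urlopen

html = urlopen("https://example.com").read()  # placeholder URL
print(html[:200])  # peek at the first 200 bytes of raw HTML
```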

Section II : Understanding how to convert a list of Python dictionaries into a .csv

Click on the link below and read through the first answer.
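The gist of the approach is Python's built-in csv.DictWriter. Here is a rough sketch with made-up data:

```python
import csv

# A made-up list of dictionaries standing in for scraped results
results = [
    {"name": "Joe's Pizza", "phone": "555-0100"},
    {"name": "Sam's Diner", "phone": "555-0199"},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "phone"])
    writer.writeheader()       # write the column headers
    writer.writerows(results)  # one row per dictionary
```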

Section III : BeautifulSoup module

BeautifulSoup is a Python module which makes scraping web pages (or even XML data, to an extent) a breeze.

BeautifulSoup is not installed on your machine by default. Follow the steps below to install it.

Mac/Linux : open a terminal and type the following: sudo pip install BeautifulSoup (for the newer BeautifulSoup 4, use sudo pip install beautifulsoup4)

PC : https://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows
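Once it is installed, a quick sanity check is to parse a small HTML snippet. The document below is made up for illustration; note that BeautifulSoup 4 is imported from the bs4 package:

```python
from bs4 import BeautifulSoup  # BeautifulSoup 4 import; BS3 uses: from BeautifulSoup import BeautifulSoup

# A tiny made-up HTML document, inlined so the example needs no network call
html = """
<html><body>
  <div class="listing">
    <a href="/business/1">Joe's Pizza</a>
    <a href="/business/2">Sam's Diner</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; .get reads an attribute
for link in soup.find_all("a"):
    print(link.text, link.get("href"))
```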


An understanding of how HTML is set up

Go through the 2 videos below to understand this.


Let’s get to it: how do we get our first web scraper up and running?

The video below should give you a good understanding of how the above technologies combine to create a web scraping tool.

Having watched the above video, you should now be very comfortable with the concept of website scraping using Python, but in order to truly convert directory-like websites into .csv files we must add one more concept: web crawling.
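As a taste of how the pieces fit together, here is a minimal sketch that fetches a page with urllib and parses it with BeautifulSoup (the URL is a placeholder):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# example.com stands in for whatever page you want to scrape
soup = BeautifulSoup(urlopen("https://example.com").read(), "html.parser")

print(soup.title.text)           # the page's <title>
for link in soup.find_all("a"):  # every link on the page
    print(link.get("href"))
```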

So what is a web crawler and how does it work?

0:24 to 2:06 is really all you need to know at this stage.

Scraping a web directory and storing it into a .csv file

We can achieve this with the algorithm/logic below:

  1. Scrape the parent page and store all the links it points to in a list
  2. Iterate through that list and scrape each link for the data you need, storing it in a Python default dictionary (or a plain dict), which in turn is appended to a list (let’s call it results)
  3. The defaultdict / dict is written into a .csv file using a function (a sketch of all three steps appears below)
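Here is a rough sketch of all three steps together. The URL and the class names (listing, name, phone) are hypothetical; inspect your target site's HTML to find the real ones:

```python
import csv
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/directory"  # hypothetical parent page

# Step 1: scrape the parent page and store the links it points to in a list
soup = BeautifulSoup(urlopen(BASE_URL).read(), "html.parser")
links = [urljoin(BASE_URL, a.get("href"))
         for a in soup.find_all("a", class_="listing")]  # made-up class name

# Step 2: iterate through the list, scraping each page into a dictionary
# that is appended to the results list
results = []
for url in links:
    page = BeautifulSoup(urlopen(url).read(), "html.parser")
    results.append({
        "name": page.find("h1", class_="name").text,     # made-up class name
        "phone": page.find("span", class_="phone").text,  # made-up class name
        "url": url,
    })

# Step 3: write the list of dictionaries into a .csv file
with open("directory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "phone", "url"])
    writer.writeheader()
    writer.writerows(results)
```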

Key-points to remember

  1. Be polite: If someone has set up robots.txt on their website, please do NOT crawl the parts of the site it disallows. By setting it up, the owner is telling you which pages they do not want scripts to access.
  2. Avoid making multiple requests to the server while testing: Making many requests over a short period of time to the server hosting a website can be viewed by the server as a denial-of-service attack, and it might blacklist your IP address. Try limiting the number of requests you make in a given period of time (a simple way to do this is sketched below).
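One simple way to do that is to sleep between requests. A sketch, with placeholder URLs:

```python
import time
from urllib.request import urlopen

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

for url in urls:
    html = urlopen(url).read()
    # ... scrape the page here ...
    time.sleep(2)  # wait 2 seconds between requests to go easy on the server
```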