Web Scraper Using Beautiful Soup 4
Web scraping is a technique used to obtain information from web pages; it saves a lot of time and provides you with abundant data. We will be making a web scraper in Python using Beautiful Soup 4, a Python library for extracting data from HTML pages that saves days of work on the code. We will be scraping AccuWeather to get the weather for Chennai, India.
We will start by installing the required package:
pip install beautifulsoup4
Next, we will import the libraries required for our script.
We have imported:
urllib2 to fetch the URL which we want to scrape.
BeautifulSoup for scraping and navigating the scraped data.
csv for storing our scraped data into a CSV file, which will be discussed later. And,
datetime to obtain the current date and time at the moment we scrape the data.
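For reference, in Python 3 the urllib2 module has become urllib.request, so a sketch of these imports under Python 3 might look like this:

```python
from urllib.request import urlopen  # Python 3's replacement for urllib2's urlopen
from bs4 import BeautifulSoup       # installed via: pip install beautifulsoup4
import csv
from datetime import datetime
```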
Let’s start by inspecting the webpage and the information we want to scrape and store.
Or whatever page you want to scrape. Because I am using Chrome, right-clicking and choosing Inspect will show me a summary of the source of the web page. Since I want to get the temperature of Chennai, I will inspect the block that displays the temperature; it would look something like this.
Notice that the temperature is displayed in a <span> block and the class is large-temp. Make a note of this, as it will be used in our script. The next thing I want is the other stats like pressure, wind, etc., so I will inspect that block; it looks something like this.
Here, the entire stats section is an unordered list block <ul> and the class is stats. We should note this as well, because it will be used in our script. The next thing I want to scrape is the sunrise/sunset time. When I inspect that block, it looks something like this.
The entire block is again an unordered list <ul>, but this time the class is time-period. After noting this too, we are ready for some action. Let’s write our script!
Let’s start by specifying the URL that we have to scrape.
Next, we will get our website content, store it in a variable called webpage, and then parse it using Beautiful Soup.
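A sketch of this step, assuming Python 3's urllib.request in place of urllib2 and a hypothetical AccuWeather URL:

```python
from urllib.request import urlopen  # Python 3's equivalent of urllib2's urlopen
from bs4 import BeautifulSoup

# hypothetical URL -- substitute the page you actually inspected
url = "https://www.accuweather.com/en/in/chennai/206671/weather-forecast/206671"

def fetch_and_parse(page_url):
    # urlopen() fetches the page; read() returns the raw HTML
    webpage = urlopen(page_url).read()
    # BeautifulSoup parses the HTML using the built-in html.parser
    return BeautifulSoup(webpage, "html.parser")

# soup = fetch_and_parse(url)  # uncomment to fetch (requires network access)
```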
We have used the urlopen() function of urllib2 to open the url and then used BeautifulSoup() with the html.parser to parse the webpage. Our next step is finding the blocks that we need in the webpage. We require 3 blocks in total: the temperature, the stats, and the sunrise/sunset time.
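A sketch of these lookups, run here against a tiny stand-in snippet (my assumption, not the real AccuWeather markup) that uses the class names we noted — large-temp, stats, and time-period:

```python
from bs4 import BeautifulSoup

# stand-in markup using the class names noted while inspecting the page
html = """
<span class="large-temp">33&deg;</span>
<ul class="stats"><li>Wind: 12 km/h</li><li>Humidity: 70%</li></ul>
<ul class="time-period"><li>Sunrise 5:48 AM</li><li>Sunset 6:23 PM</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag with the given name and class
temperature_block = soup.find("span", {"class": "large-temp"})
stats_block = soup.find("ul", {"class": "stats"})
sunrise_block = soup.find("ul", {"class": "time-period"})
```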
Here, we have used the find() function to locate each block we need, by specifying the type of block, such as <ul>, the type of identifier (class in this case), and the class name, such as time-period, storing the result in a variable such as sunrise_block. Recall that we noted these class names when we were inspecting the web page.
The next step is to strip the surrounding tags and the leading and trailing whitespace from the data obtained, so that we get only the information we require. We can do this using the text attribute together with the strip() function. This removes the irrelevant data and prints only the information we require: the current temperature, stats, and sunrise time for Chennai, India.
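For example, on the temperature block the combination of the text attribute and strip() might look like this (the sample markup is my assumption):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="large-temp">  33&deg;  </span>', "html.parser")
block = soup.find("span", {"class": "large-temp"})

# .text drops the surrounding tags; strip() removes leading/trailing whitespace
temperature = block.text.strip()
print(temperature)  # prints 33°
```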
Storing As CSV
Now that we have our output, the next step in our script is storing the temperature information into a CSV file every time we run our code. This will help us keep track of the temperature. We will also insert the current date and time with the temperature reading.
But before we do that, we have a problem to solve. The CSV writer follows the ‘utf-8’ unicode codec, but the content we scrape from the webpage using BeautifulSoup is not in ‘utf-8’. So, there is a clear need to encode it to ‘utf-8’ before storing it in the CSV.
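Under Python 2 (which the use of urllib2 implies), the scraped strings are unicode objects and can be encoded like this; note that in Python 3 this step is usually unnecessary if you open the CSV file with encoding='utf-8':

```python
# hypothetical scraped value containing a non-ASCII degree sign
temperature = "33\u00b0C"

# encode the unicode string into utf-8 bytes before writing to the CSV
encoded = temperature.encode("utf-8")
```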
After we are done encoding, we can now store the scraped details into a CSV along with the date and time. We will use csv.writer() for this purpose.
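A sketch of the CSV step in Python 3 style, using hypothetical scraped values in place of the real blocks:

```python
import csv
from datetime import datetime

# hypothetical scraped values standing in for the parsed blocks
temperature = "33\u00b0C"
stats = "Wind: 12 km/h; Humidity: 70%"
sunrise = "Sunrise 5:48 AM"

# "a" (append) mode preserves rows written on earlier runs
with open("weather.csv", "a", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    # one row per run: the readings plus the date and time of the scrape
    writer.writerow([temperature, stats, sunrise, datetime.now()])
```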
We have used the open() function to open a file called weather.csv in ‘a’ (append) mode. Opening a file in append mode ensures that the data already present in the file is not overwritten and the new data is written after it. We have stored the file object in a variable called csv_file, which is then used to initialize a csv.writer() object called writer. After all the initializations, we write the data into the file using the writerow() function, to which the temperature, stats, and sunrise values are passed. Also, using the datetime library function datetime.now(), we have stored the current date and time of the script run. The CSV file that is generated, i.e. weather.csv, would look something like this.
Notice that there are separate columns for Temperature, Stats, Sunrise/Sunset and Entry Time, exactly the output we wanted.
That brings us to the end of our web scraper. You can experiment by trying various URLs and introducing loops. If you are using Windows, you can also use Task Scheduler to schedule your Python script to get the weather every day, and do all sorts of cool things!