Web Scraper Using Beautiful Soup 4
Web scraping is a technique used to obtain information from web pages; it saves a lot of time and provides you with abundant data. We will be making a web scraper in Python using Beautiful Soup 4, a Python library for extracting data from HTML pages that saves days of work on the code. We will be scraping AccuWeather to get the weather for Chennai, India.
We will start by installing the required package:
pip install beautifulsoup4
Next, we will import the libraries required for our script.
We have imported:
urllib2 to fetch the URL which we want to scrape.
BeautifulSoup for scraping and navigating the scraped data.
csv for storing our scraped data into a CSV file, which will be discussed later. And,
datetime to obtain the current date and time at the moment we scrape the data.
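For reference, in Python 3 the urllib2 module has become urllib.request, so a sketch of these imports under Python 3 might look like this:

```python
from urllib.request import urlopen  # Python 3's replacement for urllib2's urlopen
from bs4 import BeautifulSoup       # installed via: pip install beautifulsoup4
import csv
from datetime import datetime
```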
Let’s start by inspecting the webpage and the information we want to scrape and store.
Or whatever page you want to scrape. Because I am using Chrome, right-clicking and choosing Inspect will show me a summary of the source of the web page. Since I want to get the temperature of Chennai, I will inspect the block that displays the temperature; it would look something like this.
Notice that the temperature is displayed in a <span> block and the class is large-temp. Make a note of this, as it will be used in our script. The next thing I want is the other stats like pressure, wind, etc., so I will inspect that block; it looks something like this.
Here, the entire stats section is an unordered list block <ul> and the class is stats. We should note this as well, because it will be used in our script. The next thing I want to scrape is the sunrise/sunset time. When I inspect that block, it looks something like this.
The entire block is again an unordered list <ul>, but this time the class is time-period. After noting this too, we are ready for some action. Let’s write our script!
Let’s start by specifying the URL that we have to scrape.
Next, we will get our website content, store it in a variable called webpage, and then parse it using Beautiful Soup.
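A sketch of this step, assuming Python 3's urllib.request in place of urllib2 and a hypothetical AccuWeather URL:

```python
from urllib.request import urlopen  # Python 3's equivalent of urllib2's urlopen
from bs4 import BeautifulSoup

# hypothetical URL -- substitute the page you actually inspected
url = "https://www.accuweather.com/en/in/chennai/206671/weather-forecast/206671"

def fetch_and_parse(page_url):
    # urlopen() fetches the page; read() returns the raw HTML
    webpage = urlopen(page_url).read()
    # BeautifulSoup parses the HTML using the built-in html.parser
    return BeautifulSoup(webpage, "html.parser")

# soup = fetch_and_parse(url)  # uncomment to fetch (requires network access)
```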
We have used the urlopen() function of urllib2 to open the url and then used BeautifulSoup() with the html.parser to parse the webpage. Our next step is finding the blocks that we need in the webpage. We require 3 blocks in total: the temperature, the stats, and the sunrise/sunset time.
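A sketch of these lookups, run here against a tiny stand-in snippet (my assumption, not the real AccuWeather markup) that uses the class names we noted — large-temp, stats, and time-period:

```python
from bs4 import BeautifulSoup

# stand-in markup using the class names noted while inspecting the page
html = """
<span class="large-temp">33&deg;</span>
<ul class="stats"><li>Wind: 12 km/h</li><li>Humidity: 70%</li></ul>
<ul class="time-period"><li>Sunrise 5:48 AM</li><li>Sunset 6:23 PM</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag with the given name and class
temperature_block = soup.find("span", {"class": "large-temp"})
stats_block = soup.find("ul", {"class": "stats"})
sunrise_block = soup.find("ul", {"class": "time-period"})
```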
Here, we have used the find() function to locate each block we need, by specifying the type of block, such as <ul>, the type of identifier (class in this case), and the class name, such as time-period, storing the result in a variable such as sunrise_block. Recall that we noted these class names when we were inspecting the web page.
The next step is to strip the surrounding tags and the leading and trailing whitespace from the data obtained, so that we get only the information we require. We can do this using the text attribute together with the strip() function. This removes the irrelevant data and prints only the information we require: the current temperature, stats, and sunrise time for Chennai, India.
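For example, on the temperature block the combination of the text attribute and strip() might look like this (the sample markup is my assumption):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="large-temp">  33&deg;  </span>', "html.parser")
block = soup.find("span", {"class": "large-temp"})

# .text drops the surrounding tags; strip() removes leading/trailing whitespace
temperature = block.text.strip()
print(temperature)  # prints 33°
```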
Storing As CSV
Now that we have our output, the next step in our script is storing the temperature information into a CSV file every time we run our code. This will help us keep track of the temperature. We will also insert the current date and time with the temperature reading.
But before we do that, we have a problem to solve. The CSV writer follows the ‘utf-8’ unicode codec, but the content we scrape from the webpage using BeautifulSoup is not in ‘utf-8’. So, there is a clear need to encode it to ‘utf-8’ before storing it in the CSV.
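Under Python 2 (which the use of urllib2 implies), the scraped strings are unicode objects and can be encoded like this; note that in Python 3 this step is usually unnecessary if you open the CSV file with encoding='utf-8':

```python
# hypothetical scraped value containing a non-ASCII degree sign
temperature = "33\u00b0C"

# encode the unicode string into utf-8 bytes before writing to the CSV
encoded = temperature.encode("utf-8")
```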
After we are done encoding, we can now store the scraped details into a CSV along with the date and time. We will use csv.writer() for this purpose.
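A sketch of the CSV step in Python 3 style, using hypothetical scraped values in place of the real blocks:

```python
import csv
from datetime import datetime

# hypothetical scraped values standing in for the parsed blocks
temperature = "33\u00b0C"
stats = "Wind: 12 km/h; Humidity: 70%"
sunrise = "Sunrise 5:48 AM"

# "a" (append) mode preserves rows written on earlier runs
with open("weather.csv", "a", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    # one row per run: the readings plus the date and time of the scrape
    writer.writerow([temperature, stats, sunrise, datetime.now()])
```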
We have used the open() function to open a file called weather.csv in ‘a’ (append) mode. Opening a file in append mode ensures that the data already present in the file is not overwritten and the new data is written after it. We have stored the file object in a variable called csv_file, which is then used to initialize a csv.writer() object called writer. After all the initializations, we write the data into the file using the writerow() function, to which the temperature, stats, and sunrise values are passed. Also, using the datetime library function datetime.now(), we have stored the current date and time of the script run. The CSV file that is generated, i.e. weather.csv, would look something like this.
Notice that there are separate columns for Temperature, Stats, Sunrise/Sunset and Entry Time, exactly the output we wanted.
That brings us to the end of our web scraper. You can experiment by trying various URLs and introducing loops. If you are using Windows, you can also use Task Scheduler to schedule your Python script to get the weather every day, and do all sorts of cool things!