Web Scraper Using Beautiful Soup 4
Web scraping is a technique used to obtain information from web pages; it saves a lot of time and provides you with abundant data. We will be making a web scraper in Python using Beautiful Soup 4, a Python library for extracting data from HTML pages that saves days of work on the code. We will be scraping AccuWeather to get the weather for Chennai, India.
We will start by installing the required package:
pip install beautifulsoup4
Next, we will import the libraries required for our script.
We have imported:
urllib2 to fetch the URL which we want to scrape.
BeautifulSoup for scraping and navigating the scraped data.
csv for storing our scraped data into a CSV file, which will be discussed later. And,
datetime to obtain the current date and time at the moment we scrape the data.
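For reference, in Python 3 the urllib2 module has become urllib.request, so a sketch of these imports under Python 3 might look like this:

```python
from urllib.request import urlopen  # Python 3's replacement for urllib2's urlopen
from bs4 import BeautifulSoup       # installed via: pip install beautifulsoup4
import csv
from datetime import datetime
```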
Let’s start by inspecting the webpage and the information we want to scrape and store.
Or whatever page you want to scrape. Because I am using Chrome, right-clicking and choosing Inspect will show me a summary of the source of the web page. Since I want to get the temperature of Chennai, I will inspect the block that displays the temperature; it would look something like this.
Notice that the temperature is displayed in a <span> block and the class is large-temp. Make a note of this, as it will be used in our script. The next thing I want is the other stats like pressure, wind, etc., so I will inspect that block; it looks something like this.
Here, the entire stats section is an unordered list block <ul> and the class is stats. We should note this as well, because it will be used in our script. The next thing I want to scrape is the sunrise/sunset time. When I inspect that block, it looks something like this.
The entire block is again an unordered list <ul>, but this time the class is time-period. After noting this too, we are ready for some action. Let’s write our script!
Let’s start by specifying the URL that we have to scrape.
Next, we will get our website content, store it in a variable called webpage, and then parse it using Beautiful Soup.
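A sketch of this step, assuming Python 3's urllib.request in place of urllib2 and a hypothetical AccuWeather URL:

```python
from urllib.request import urlopen  # Python 3's equivalent of urllib2's urlopen
from bs4 import BeautifulSoup

# hypothetical URL -- substitute the page you actually inspected
url = "https://www.accuweather.com/en/in/chennai/206671/weather-forecast/206671"

def fetch_and_parse(page_url):
    # urlopen() fetches the page; read() returns the raw HTML
    webpage = urlopen(page_url).read()
    # BeautifulSoup parses the HTML using the built-in html.parser
    return BeautifulSoup(webpage, "html.parser")

# soup = fetch_and_parse(url)  # uncomment to fetch (requires network access)
```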
We have used the urlopen() function of urllib2 to open the url and then used BeautifulSoup() with the html.parser to parse the webpage. Our next step is finding the blocks that we need in the webpage. We require 3 blocks in total: the temperature, the stats, and the sunrise/sunset time.
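A sketch of these lookups, run here against a tiny stand-in snippet (my assumption, not the real AccuWeather markup) that uses the class names we noted — large-temp, stats, and time-period:

```python
from bs4 import BeautifulSoup

# stand-in markup using the class names noted while inspecting the page
html = """
<span class="large-temp">33&deg;</span>
<ul class="stats"><li>Wind: 12 km/h</li><li>Humidity: 70%</li></ul>
<ul class="time-period"><li>Sunrise 5:48 AM</li><li>Sunset 6:23 PM</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag with the given name and class
temperature_block = soup.find("span", {"class": "large-temp"})
stats_block = soup.find("ul", {"class": "stats"})
sunrise_block = soup.find("ul", {"class": "time-period"})
```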
Here, we have used the find() function to locate each block we need, by specifying the type of block, such as <ul>, the type of identifier (class in this case), and the class name, such as time-period, storing the result in a variable such as sunrise_block. Recall that we noted these class names when we were inspecting the web page.
The next step is to strip the surrounding tags and the leading and trailing whitespace from the data obtained, so that we get only the information we require. We can do this using the text attribute together with the strip() function. This removes the irrelevant data and prints only the information we require: the current temperature, stats, and sunrise time for Chennai, India.
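For example, on the temperature block the combination of the text attribute and strip() might look like this (the sample markup is my assumption):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="large-temp">  33&deg;  </span>', "html.parser")
block = soup.find("span", {"class": "large-temp"})

# .text drops the surrounding tags; strip() removes leading/trailing whitespace
temperature = block.text.strip()
print(temperature)  # prints 33°
```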
Storing As CSV
Now that we have our output, the next step in our script is storing the temperature information into a CSV file every time we run our code. This will help us keep track of the temperature. We will also insert the current date and time with the temperature reading.
But before we do that, we have a problem to solve. The CSV writer follows the ‘utf-8’ unicode codec, but the content we scrape from the webpage using BeautifulSoup is not in ‘utf-8’. So, there is a clear need to encode it to ‘utf-8’ before storing it in the CSV.
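Under Python 2 (which the use of urllib2 implies), the scraped strings are unicode objects and can be encoded like this; note that in Python 3 this step is usually unnecessary if you open the CSV file with encoding='utf-8':

```python
# hypothetical scraped value containing a non-ASCII degree sign
temperature = "33\u00b0C"

# encode the unicode string into utf-8 bytes before writing to the CSV
encoded = temperature.encode("utf-8")
```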
After we are done encoding, we can now store the scraped details into a CSV along with the date and time. We will use csv.writer() for this purpose.
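A sketch of the CSV step in Python 3 style, using hypothetical scraped values in place of the real blocks:

```python
import csv
from datetime import datetime

# hypothetical scraped values standing in for the parsed blocks
temperature = "33\u00b0C"
stats = "Wind: 12 km/h; Humidity: 70%"
sunrise = "Sunrise 5:48 AM"

# "a" (append) mode preserves rows written on earlier runs
with open("weather.csv", "a", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    # one row per run: the readings plus the date and time of the scrape
    writer.writerow([temperature, stats, sunrise, datetime.now()])
```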
We have used the open() function to open a file called weather.csv in ‘a’ (append) mode. Opening a file in append mode ensures that the data already present in the file is not overwritten and the new data is written after it. We have stored the file object in a variable called csv_file, which is then used to initialize a csv.writer() object called writer. After all the initializations, we write the data into the file using the writerow() function, to which the temperature, stats, and sunrise values are passed. Also, using the datetime library function datetime.now(), we have stored the current date and time of the script run. The CSV file that is generated, i.e. weather.csv, would look something like this.
Notice that there are separate columns for Temperature, Stats, Sunrise/Sunset and Entry Time, exactly the output we wanted.
That brings us to the end of our web scraper. You can experiment by trying various URLs and introducing loops. If you are using Windows, you can also use Task Scheduler to schedule your Python script to get the weather every day, and do all sorts of cool things!