Web Scraping Times of India with Python and BeautifulSoup4

Mayuresh Kadam · Published in Analytics Vidhya · 5 min read · Mar 28, 2020

In this article, we will go through an example I originally put together back in November 2019 but am sharing with you all now. With the coronavirus having brought the world to its knees, many businesses have been shut and companies and MNCs have been advised to work from home. Given the situation, many newspaper agencies in India have stopped distributing newspapers, and people from non-IT backgrounds are reading the news online. This article, however, is mainly focused on people enhancing their skill sets in Python and web scraping. Anyone working in the industry, or anyone with basic knowledge of Python, can scrape a website with the help of Python and the BeautifulSoup4 library.

But before we dive deep into the program, let me give you a brief idea of what web scraping is.

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.

How Do Web Scrapers Work?

Automated web scrapers work in a way that is simple in principle yet complex in practice. After all, websites are built for humans to understand, not machines.

First, the web scraper will be given one or more URLs to load before scraping. The scraper then loads the entire HTML code for the page in question. More advanced scrapers will render the entire website, including CSS and JavaScript elements.

Then the scraper will either extract all the data on the page or specific data selected by the user before the project is run.

Ideally, the user will go through the process of selecting the specific data they want from the page. For example, you might want to scrape an Amazon product page for prices and models but are not necessarily interested in product reviews.

Lastly, the web scraper will output all the data that has been collected into a format that is more useful to the user.

Most web scrapers will output data to a CSV or Excel spreadsheet, while more advanced scrapers will support other formats such as JSON which can be used for an API.
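To make this concrete, here is a minimal sketch of that whole pipeline in Python, using the requests and beautifulsoup4 libraries. The URL, the assumption that headlines live in <h2> tags, and the output filename are all placeholders for illustration, not any real site's markup:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Step 1: load the page's HTML (example.com is a placeholder URL).
response = requests.get("https://example.com")

# Step 2: parse the HTML and extract the data we care about. We assume
# here that the headlines sit in <h2> tags; a real page will differ.
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Step 3: output the collected data to a CSV file.
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    for headline in headlines:
        writer.writerow([headline])
```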

This is an informative exercise for beginners who are looking to learn how to scrape websites. I am breaking the tutorial down into smaller parts so that you can understand each step better.

Prerequisites: Python3, BeautifulSoup4.

I am assuming you have Python installed on your machine. If not, please install it from https://www.python.org/downloads/. To install the beautifulsoup4 library, run: pip install beautifulsoup4

BeautifulSoup: Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
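As a quick taste of those idioms, here is a tiny self-contained example; the HTML snippet is made up purely for illustration:

```python
from bs4 import BeautifulSoup

html = "<div><ul><li>First</li><li>Second</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.div.ul)                # navigate the tree by tag name
print(soup.find("li"))            # search for the first matching tag
for li in soup.find_all("li"):    # iterate over every matching tag
    print(li.get_text())
```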

A few pointers I'd like to give my fellow Pythonistas:

  1. Read the website’s terms and conditions to understand how you can legally use their data. Most sites prohibit you from using the data for commercial purposes.
  2. Please make sure that you do not download their data at a rapid rate, as that may overload the website and could potentially get you blocked! One simple way to stay polite is sketched right after this list.
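The sketch below simply pauses between requests with time.sleep(); the URLs and the two-second delay are placeholders, not values from any particular site:

```python
import time

import requests

# Placeholder URLs; substitute the pages you actually need to fetch.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # pause so we don't hammer the server with requests
```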

Inspecting the Website

The first thing that we need to do is figure out where we can locate the links to the files we want to download inside the multiple levels of HTML tags. Simply put, there is a lot of code on a website page and we want to find the relevant pieces of code that contain our data. If you are not familiar with HTML tags, refer to the W3Schools tutorials. It is important to understand the basics of HTML in order to web scrape successfully.

On the website, right-click and select “Inspect”. This allows you to see the raw code behind the site.

Once you have clicked on “Inspect”, you should see the developer console pop up.
Notice that in the top left corner of the console there is an arrow symbol.

If you click on the arrow symbol and then click on an area of the site itself, the code for that particular item will be highlighted in the console. Alternatively, you can go directly to the website and carry out these steps there.

Final Stop! The Python Code

After successfully accessing the URL, we parse the HTML with BeautifulSoup so that we can work with a manageable BeautifulSoup data structure. BeautifulSoup is a versatile library, and I highly recommend you check out its documentation. We then use the .findAll() method to locate all our <div> tags; with the same technique, we use .findAll() to locate the <ul> tags within those div tags, keeping a counter capped at 10 because we only want the top 10 headlines for the day. As a final step, we apply the same technique to find all the <li> tags within the <ul> tags and print the list of top 10 headlines.
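The full notebook is on my GitHub, but here is a hedged sketch of that flow using the modern .find_all() spelling of .findAll(). The URL, and the assumption that the headlines sit in <li> items under <ul> lists inside <div> tags, follow the description above; the site’s actual markup may well have changed since 2019:

```python
import requests
from bs4 import BeautifulSoup

# The exact URL is an assumption; the original notebook may target a
# different Times of India page, and its markup may have changed.
url = "https://timesofindia.indiatimes.com/home/headlines"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

headlines = []
# Mirror the walkthrough: find the <ul> tags inside <div> tags, then
# the <li> tags inside those. Nested tags can be visited more than
# once, so we skip any headline we have already collected.
for div in soup.find_all("div"):
    for ul in div.find_all("ul"):
        for li in ul.find_all("li"):
            text = li.get_text(strip=True)
            if text and text not in headlines:
                headlines.append(text)

# Keep only the top 10 headlines for the day.
for i, headline in enumerate(headlines[:10], start=1):
    print(f"{i}. {headline}")
```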

Here’s the result:

Got the top 12 headlines for the day instead of 10!

As a Pythonista, your journey into web scraping has just begun; there are other web scraping libraries as well. I recommend you study Scrapy, as it is more advanced than BeautifulSoup. You can find my Jupyter Notebook for this example on my GitHub.

Alternatively, you can also get a copy of Hands-On Web Scraping with Python to learn in depth about web scraping using BeautifulSoup4 as well as Scrapy.

More about the book here: https://amzn.to/3K6HiCu

Thanks for reading and happy web scraping!
