Web Scraping in Python for Novices

Vinodnethichinna
Published in Analytics Vidhya
4 min read, Jun 11, 2020

You are in the right place if you are doing web scraping for the first time. I am putting together every part where I ran into issues when I first tried web scraping. Don’t worry if you are using Python for the first time. Not sure about the syntax? Be cool. Don’t know the libraries? Be cool. Just stay with me; we will explore things one by one and make the fruit delicious.

We are going to use Jupyter Notebook in this article. If you don’t have it, you can install it from the command line with pip, for example by typing pip install notebook. Make sure Python is installed on your system first; if it is not, you can download it from the official Python website, python.org.

Once the installation finishes, you can write your code in Jupyter Notebook. To open it, run the jupyter notebook command in the command prompt.

After this, the Jupyter Notebook home page opens in your browser.

Here you can create a new notebook or open an existing one. We will create a new notebook with the steps below.

Click the New button on the right, then select Python 3. That’s all: the new notebook is ready and we can start coding.

We will start by importing the libraries we need for web scraping: requests and Beautiful Soup (installed as the beautifulsoup4 package and imported as bs4).
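
As a minimal sketch, the imports could look like this. The pip command in the comment is an assumption about how you install packages; run it from the command line, not inside Python.

```python
# Install once from the command line if the imports fail (assumed setup):
#   pip install requests beautifulsoup4

import requests                 # downloads web pages over HTTP
from bs4 import BeautifulSoup   # parses HTML into a searchable tree
```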

Now the time has arrived. Choose the site you want to scrape and have its URL ready. Here I am using the CBC URL for political news, and we will extract the heading and description of the latest articles.

requests.get fetches the page source, and we store the response in the page variable.
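
Here is a small sketch of that step. The URL is my assumption for the CBC politics section, and fetch_page is a hypothetical helper name; the timeout and status check are extras that keep the notebook from hanging or parsing an error page.

```python
import requests

# Assumed URL for the CBC politics section; replace it with the page you want to scrape.
url = "https://www.cbc.ca/news/politics"


def fetch_page(url, timeout=10):
    """Download a page and return the requests.Response object."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response


# In the notebook you would then run:
# page = fetch_page(url)
# print(page.status_code)
```

The call itself is left commented out so the cell does not hit the network until you are ready.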

Next, we parse the HTML content of the page and store it in a BeautifulSoup object named soup.
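
A minimal sketch of the parsing step. To keep it runnable without a network call, sample_html is a made-up stand-in for page.content; in the real notebook you would pass page.content instead.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for page.content (which is also bytes).
sample_html = b"""
<html>
  <head><title>Politics - CBC News</title></head>
  <body><h1>Politics</h1></body>
</html>
"""

# "html.parser" is the parser that ships with Python, so nothing extra to install.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)   # text inside the <title> tag
print(soup.h1.get_text())  # text inside the first <h1> tag
```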

Now we fetch all anchor tags on the webpage with the soup object’s find_all method. We pass “a” because anchor tags are what the page source uses for links.
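
As a rough sketch, find_all("a") returns every anchor tag in the parsed page. The HTML here is made up so the example runs on its own; the variable name anchors is my choice.

```python
from bs4 import BeautifulSoup

# Made-up page fragment with three anchor tags.
sample_html = """
<a href="/news/politics/budget-vote-1.0000001">Budget vote</a>
<a href="/sports/hockey-final-1.0000002">Hockey final</a>
<a name="top">Anchor without a link</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of Tag objects.
anchors = soup.find_all("a")

for anchor in anchors:
    print(anchor.get_text())
```

Note that the last tag has no href attribute, which matters when we collect the links.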

Next, we pull the link out of each anchor tag and store the links in a list named listofLinks.
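
One way to build listofLinks, sketched on made-up HTML. The base_url value and the use of urljoin to turn relative links into full URLs are my assumptions; skip that part if the site already gives you absolute links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.cbc.ca"  # assumed site root, used to resolve relative links

# Made-up page fragment: one relative link, one absolute link, one anchor with no href.
sample_html = """
<a href="/news/politics/budget-vote-1.0000001">Budget vote</a>
<a href="https://www.cbc.ca/sports/hockey-final-1.0000002">Hockey final</a>
<a name="top">Anchor without a link</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

listofLinks = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")  # None when the tag has no href attribute
    if href:
        listofLinks.append(urljoin(base_url, href))

print(listofLinks)
```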

We are only interested in political news, so we filter the list above, keep only the links that contain “politics”, and store them in another list named linksrequired.
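
The filter can be sketched in plain Python. The links below are made up, and skipping duplicates is an extra I added because the same article is often linked more than once on a page.

```python
# Made-up links standing in for the real listofLinks.
listofLinks = [
    "https://www.cbc.ca/news/politics/budget-vote-1.0000001",
    "https://www.cbc.ca/sports/hockey-final-1.0000002",
    "https://www.cbc.ca/news/politics/budget-vote-1.0000001",
    "https://www.cbc.ca/news/politics/senate-debate-1.0000003",
]

linksrequired = []
for link in listofLinks:
    # Keep links that mention "politics" and have not been stored already.
    if "politics" in link and link not in linksrequired:
        linksrequired.append(link)

print(linksrequired)
```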

Now just relax: we have the links we need. What remains is to extract the heading and description from each URL.
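
Here is a sketch of that loop under some loud assumptions: I assume the heading sits in the first h1 tag and the description in a meta tag named "description", which may not match the real CBC pages, so inspect the page source and adjust. parse_article and scrape_articles are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Return (heading, description) from one article's HTML, or None where missing."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")  # assumption: the heading is the first <h1>
    meta = soup.find("meta", attrs={"name": "description"})  # assumption: meta description
    heading = h1.get_text(strip=True) if h1 else None
    description = meta.get("content") if meta else None
    return heading, description


def scrape_articles(links, timeout=10):
    """Fetch every link and collect headings and descriptions in two lists."""
    headings, descriptions = [], []
    for link in links:
        response = requests.get(link, timeout=timeout)
        if response.status_code != 200:
            continue  # skip pages that did not load
        heading, description = parse_article(response.content)
        headings.append(heading)
        descriptions.append(description)
    return headings, descriptions


# Made-up article page so the parsing part can be tried without the network.
sample_article = """
<html>
  <head><meta name="description" content="MPs approved the budget on Tuesday."></head>
  <body><h1> Budget vote passes </h1></body>
</html>
"""
print(parse_article(sample_article))
```

In the notebook you would call scrape_articles(linksrequired) to run it on the real links.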

After the data is extracted, the next step is to load it into a DataFrame. For this we use the pandas library.
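
A minimal sketch of the loading step. The two lists hold made-up values in place of the scraped results, and the column names Heading and Description are my choice.

```python
import pandas as pd

# Made-up values standing in for the scraped headings and descriptions.
headings = ["Budget vote passes", "Senate debate continues"]
descriptions = [
    "MPs approved the budget on Tuesday.",
    "Senators debated the bill late into the night.",
]

# Each dictionary key becomes a column; both lists must have the same length.
df = pd.DataFrame({"Heading": headings, "Description": descriptions})

print(df.shape)
print(df.head())
```

From here, df.to_csv("articles.csv", index=False) would save the table to a file if you want to keep it.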

So we completed our task: we scraped the website, extracted the heading and description of each news article, and stored them in a pandas DataFrame.

So far so good. But suppose we are scraping a website and need to click a button. How would we do it? In the example above, what if we need to click the LOAD MORE button at the bottom of the page to load more articles?

In my next article, I will discuss web drivers so we can click buttons on the webpage.

I hope you all enjoyed reading this article. Please feel free to share it, or comment if you find any mistakes or have any questions.

Thank you.

#Python #WebScraping #LearningPython
