Web Scraping using Beautiful Soup

Shweta Pardeshi
Published in Analytics Vidhya
3 min read · May 5, 2020
Image Source: https://hackernoon.com/web-scraping-bf2d814cc572

Web scraping, or web data extraction, is used to automate the process of extracting data from websites. It saves you the trouble of downloading or copying data by hand.

In this post, we will jump straight into the code. For more background on web scraping and its applications, check out this post.

We are going to extract the names and website links of the bloggers from this site.

Before diving into the code, we need to head over to the site we want to scrape. Developer tools help you understand the structure of a website. In Chrome, you can open them by right-clicking the page and selecting the Inspect option. This displays the HTML content of the page.

Site URL: https://indianbloggers.org/

Step 1: Importing required libraries.

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
import csv
import re

Step 2: Parse HTML Code With Beautiful Soup

url = 'https://indianbloggers.org/'
content = requests.get(url).text
soup = BeautifulSoup(content, 'html.parser')

Step 3: Find Elements

You can find elements by ID or HTML class name using the soup.find() and soup.find_all() methods. Here I am saving the extracted data in a blog_list.csv file with the headers 'name' and 'web link'.
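As a quick, self-contained illustration of find() versus find_all() (the HTML snippet, element IDs, and class names below are made up for the example, not taken from the target site):

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page to demonstrate the two lookup methods.
html = """
<div id="content">
  <a class="blog" href="https://example-blog.com">Example Blog</a>
  <a class="blog" href="https://another-blog.com">Another Blog</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

container = soup.find('div', id='content')   # first element matching the id
links = soup.find_all('a', class_='blog')    # every element matching the class

print(len(links))         # 2
print(links[0]['href'])   # https://example-blog.com
```

find() returns the first match (or None), while find_all() returns a list of every match, which is why the scraper below loops over find_all('a').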

data = {'title': [], 'links': []}
poll = {'blogspot': 0, 'wordpress': 0, 'others': 0}

with open('Blog_list.csv', 'w', newline='') as file:
    fieldnames = ['name', 'web link']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for link in soup.find_all('a'):
        href = link.get('href') or ''  # skip anchors without an href
        if (len(link.text.strip()) > 1 and
                bool(re.match('^http', href)) and not
                bool(re.search('indianbloggers|twitter|facebook', href))):
            data['title'].append(link.text)
            data['links'].append(href)
            writer.writerow({'name': link.text, 'web link': href})
            # finding the type of blog
            if re.search('blogspot', href):
                poll['blogspot'] += 1
            elif re.search('wordpress', href):
                poll['wordpress'] += 1
            else:
                poll['others'] += 1

blog_list = pd.DataFrame(data).set_index('title')
print(blog_list.head())
blog_list.to_csv('blog_list.csv', encoding='utf-8')  # save the pandas copy as well
print(str(len(blog_list.index)) + ' rows written')
print(poll)
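The blog-type tally at the end of the loop can also be written as a small standalone helper, which makes the regex logic easier to test on its own (a sketch; classify_blog is a hypothetical name, not part of the original script):

```python
import re

def classify_blog(href):
    """Bucket a blog URL by its hosting platform, mirroring the loop's checks."""
    if re.search('blogspot', href):
        return 'blogspot'
    elif re.search('wordpress', href):
        return 'wordpress'
    return 'others'

print(classify_blog('https://foo.blogspot.com'))   # blogspot
print(classify_blog('https://bar.wordpress.com'))  # wordpress
print(classify_blog('https://example.com'))        # others
```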

Here is a snapshot of the blog_list.csv file.
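As a sanity check, the CSV can be loaded back with pandas. The sketch below writes a tiny sample file in the same format (the sample row is made up) and reads it back exactly as you would read the real output:

```python
import csv
import pandas as pd

# Write a one-row sample in the same 'name' / 'web link' format for illustration.
with open('blog_list_sample.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'web link'])
    writer.writeheader()
    writer.writerow({'name': 'Example Blogger', 'web link': 'https://example.com'})

# Read it back the same way you would read the real blog_list.csv.
df = pd.read_csv('blog_list_sample.csv')
print(df.head())
```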

You can also find a web scraper for this site, which collects recipe names, ingredients, and recipe links, here.

This post is co-authored by Sameer Ahire.

Shweta Pardeshi
Master's student at UCSD | Educative Author | 35k+ views on Medium | Analytics Vidhya Author | IIT Gandhinagar | https://www.buymeacoffee.com/shwetapar1