Web Scraping using Beautiful Soup

Shweta Pardeshi
Published in Analytics Vidhya
3 min read · May 5, 2020
Image Source: https://hackernoon.com/web-scraping-bf2d814cc572

Web scraping, or web data extraction, is used to automate the process of extracting data from websites. It saves you the trouble of downloading or copying data by hand.

In this post, we will jump straight into the code. For more background on web scraping and its applications, check out this post.

We are going to extract the names and website links of the bloggers from this site.

Before diving into the code, we need to head over to the site we want to scrape. Developer tools help you understand the structure of a website. In Chrome, you can open them by right-clicking the page and selecting the Inspect option. This displays the HTML content of the page.

Site URL: https://indianbloggers.org/

Step 1: Importing required libraries.

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
import csv
import re

Step 2: Parse HTML Code With Beautiful Soup

url = 'https://indianbloggers.org/'
content = requests.get(url).text
soup = BeautifulSoup(content, 'html.parser')

Step 3: Find Elements

You can find elements by ID or HTML class name using the soup.find() and soup.find_all() methods. Here I am saving the extracted data in a blog_list.csv file with the headers 'name' and 'web link'.
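As a quick, self-contained illustration of find() versus find_all() (the HTML snippet, element IDs, and class names below are made up for the example, not taken from the target site):

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page to demonstrate the two lookup methods.
html = """
<div id="content">
  <a class="blog" href="https://example-blog.com">Example Blog</a>
  <a class="blog" href="https://another-blog.com">Another Blog</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

container = soup.find('div', id='content')   # first element matching the id
links = soup.find_all('a', class_='blog')    # every element matching the class

print(len(links))         # 2
print(links[0]['href'])   # https://example-blog.com
```

find() returns the first match (or None), while find_all() returns a list of every match, which is why the scraper below loops over find_all('a').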

data = {'title': [], 'links': []}
poll = {'blogspot': 0, 'wordpress': 0, 'others': 0}

with open('Blog_list.csv', 'w', newline='') as file:
    fieldnames = ['name', 'web link']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for link in soup.find_all('a'):
        href = link.get('href') or ''  # skip anchors without an href
        if (len(link.text.strip()) > 1 and
                bool(re.match('^http', href)) and not
                bool(re.search('indianbloggers|twitter|facebook', href))):
            data['title'].append(link.text)
            data['links'].append(href)
            writer.writerow({'name': link.text, 'web link': href})
            # finding the type of blog
            if re.search('blogspot', href):
                poll['blogspot'] += 1
            elif re.search('wordpress', href):
                poll['wordpress'] += 1
            else:
                poll['others'] += 1

blog_list = pd.DataFrame(data).set_index('title')
print(blog_list.head())
blog_list.to_csv('blog_list.csv', encoding='utf-8')  # save the pandas copy as well
print(str(len(blog_list.index)) + ' rows written')
print(poll)
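The blog-type tally at the end of the loop can also be written as a small standalone helper, which makes the regex logic easier to test on its own (a sketch; classify_blog is a hypothetical name, not part of the original script):

```python
import re

def classify_blog(href):
    """Bucket a blog URL by its hosting platform, mirroring the loop's checks."""
    if re.search('blogspot', href):
        return 'blogspot'
    elif re.search('wordpress', href):
        return 'wordpress'
    return 'others'

print(classify_blog('https://foo.blogspot.com'))   # blogspot
print(classify_blog('https://bar.wordpress.com'))  # wordpress
print(classify_blog('https://example.com'))        # others
```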

Here is a snapshot of the blog_list.csv file.
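As a sanity check, the CSV can be loaded back with pandas. The sketch below writes a tiny sample file in the same format (the sample row is made up) and reads it back exactly as you would read the real output:

```python
import csv
import pandas as pd

# Write a one-row sample in the same 'name' / 'web link' format for illustration.
with open('blog_list_sample.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'web link'])
    writer.writeheader()
    writer.writerow({'name': 'Example Blogger', 'web link': 'https://example.com'})

# Read it back the same way you would read the real blog_list.csv.
df = pd.read_csv('blog_list_sample.csv')
print(df.head())
```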

You can also find a web scraper for this site, which collects recipe names, ingredients, and recipe links, here.

This post is co-authored by Sameer Ahire.

Shweta Pardeshi
Master's student at UCSD | Educative Author | 35k+ views on Medium | Analytics Vidhya Author | IIT Gandhinagar | https://www.buymeacoffee.com/shwetapar1