Web Scraping with Python

How to scrape data from a website and dump it into a CSV

Rahul Kapoor

Published in

Analytics Vidhya

3 min readApr 14, 2021

Photo showing random web data — Photo by Markus Spiske on Unsplash

Web Scraping is the process of gathering information from the internet.

Note: If you are scraping a page that is out there on the World Wide Web just for educational purposes, then it seems fine. But still, you should consider checking their Terms of service as few websites don’t like it when automatic scrapers gather their data, while others don’t mind.

Let me give you an easy example of where it can be used. Say you want to buy a popular product from a website that goes out of stock as soon as it comes up. One way would be to visit the website, do some filtering, check the availability of your product on daily basis.

For that, you have to spend a few minutes daily so that you can grab it the next time it pops up on the website. Another way could be that instead of looking up daily you can automate it with python web scraping.

It is very helpful to automate your task of looking out on the web, expand your search to any number of pages you want on the site, etc.

Scraping a Test Page

link

The requests module allows you to send HTTP requests using Python.

The HTTP request returns a Response Object with all the response data (content, encoding, status, and so on).

import requests

res = requests.get('link to your web page')

Now we will be using a library called BeautifulSoup in Python to do web scraping.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

To read more about its features you can check out: link

If you don't have much time to check out its documentation and its features just get one thing clear: BeautifulSoup can parse anything on the web you give it.

A simple example of scraping the title from the test page.

from bs4 import BeautifulSoup

page = requests.get("https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")soup = BeautifulSoup(page.content, 'html.parser')title = soup.title.text # gets you the text of the <title>(...)</title>

Yes, it is that easy.

Now you can try scraping the head, body, title, etc. on your own before we deep dive into scraping products and storing them into a CSV.

Github link to the repository.

import requests
from bs4 import BeautifulSoup
import csv

res = requests.get('https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')

soup = BeautifulSoup(res.content, 'html.parser')

Now we can use the soup variable properties to start scraping data.

all_products = []

products = soup.select("div.thumbnail")for product in products:
    name = product.select("h4 > a")[0].text.strip()
    price = product.select("h4.price")[0].text.strip()
    description = product.select("p.description")[0].text.strip()
    reviews = product.select('div.ratings')[0].text.strip()
    image = product.select("img")[0].get("src")

    all_products.append({
        "product_name": name,
        "price": price,
        "description": description,
        "reviews": reviews,
        "image": image
    })

columns = all_products[0].keys()
with open("my_data.csv", "w", newline="") as file:
    dict_writer = csv.DictWriter(file, columns)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

Let me explain what just happened above.

We can see that all the product details are present in a div with a class name thumbnail. Mentioned below is the HTML representing each product.

<div class=”thumbnail”> <img alt=”item” class=”img-responsive” src=”/webscraper-python-codedamn-classroom-website/cart2.png”/> <div class=”caption”> <h4 class=”pull-right price”>$1139.54</h4> <h4> <a class=”title” href=”/webscraper-python-codedamn-classroom-website/test-sites/e-commerce/allinone/product/593" title=”Asus AsusPro Advanced BU401LA-FA271G Dark Grey”>Asus AsusPro Adv…</a> </h4> <p class=”description”> Asus AsusPro Advanced BU401LA-FA271G Dark Grey, 14", Core i5–4210U, 4GB, 128GB SSD, Win7 Pro 64bit, ENG </p> </div> <div class=”ratings”> <p class=”pull-right”>7 reviews</p> <p data-rating=”3"> <span class=”glyphicon glyphicon-star”></span> <span class=”glyphicon glyphicon-star”></span> <span class=”glyphicon glyphicon-star”></span> </p> </div>

name: Inside <h4> followed by <a> tag. So we used product.select(“h4 > a”)[0].text.strip()

price: Inside <h4> with class price

description: Inside <p> with class description

reviews: Inside <div> with class ratings

image: Inside <img> with attribute src

Then there is code to write our data appended into all_products list to a CSV. This is done with help of the csv module in python.

Let me know if you face any difficulties in the comments.

Web Scraping with Python

How to scrape data from a website and dump it into a CSV

Scraping a Test Page

Written by Rahul Kapoor