Web Scraping: A Step-by-Step Guide Using Selenium and MuiCircularProgress

Prakruti Pathak
4 min read · Jan 12, 2024

--

Introduction

In this blog post, we will walk through a step-by-step procedure to perform web scraping using Selenium, a powerful browser automation tool. Our target website is Magicpin, and we aim to scrape data related to electronics stores in New Delhi. The Python script provided uses Selenium to navigate the website and waits for the MuiCircularProgress spinner (the Material UI loading indicator the site shows while content loads) to disappear before scraping. The scraped data is then extracted and stored in a CSV file for further analysis.

Before we dive into the code, ensure you have the following set up:

  1. Python installed on your system.
  2. ChromeDriver downloaded (see the steps below).

Download ChromeDriver

  1. Determine Your Chrome Browser Version:
  • Open Google Chrome.
  • Click on the three vertical dots in the top-right corner (menu icon).
  • Navigate to “Help” > “About Google Chrome.”
  • Note down the version number (for example, Version 120.0.6099.201).

2. Visit the ChromeDriver Download Page:

  • For Chrome 115 and newer, ChromeDriver builds are published on the Chrome for Testing availability page (https://googlechromelabs.github.io/chrome-for-testing/); older releases are listed on the ChromeDriver downloads page.

3. Download the Appropriate Version:

  • Download the version that matches your Chrome browser version and your operating system (Windows, macOS, Linux).

4. Extract the ChromeDriver Zip File:

  • Once the download is complete, locate the downloaded zip file.
  • Extract the contents of the zip file to a folder of your choice.

5. Set the ChromeDriver Path in Your Selenium Script:

  • In your Selenium script, set the path to the ChromeDriver executable.
  • Modify the driver_path variable in your script to point to the location where you extracted ChromeDriver; a minimal sketch follows below.
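
As a rough sketch (Selenium 4 syntax; the download location below is just a placeholder, so adjust driver_path to wherever you extracted the executable):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the extracted ChromeDriver executable (adjust to your own location).
driver_path = "C:/Users/yourname/Downloads/chromedriver-win64/chromedriver.exe"

# Selenium 4 takes the driver path through a Service object.
service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service)
print(driver.capabilities["browserVersion"])  # quick check that Chrome launched
driver.quit()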

Install the libraries

Install Selenium and Beautiful Soup using the following commands:

pip install selenium
pip install beautifulsoup4
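
To confirm that both packages are importable (a quick sanity check, not part of the scraper itself):

# Sanity check: make sure the installed packages can be imported.
import selenium
import bs4

print("Selenium version:", selenium.__version__)
print("bs4 version:", bs4.__version__)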

Setting Up the Environment

The script begins by importing necessary libraries and defining functions for setting up the Chrome WebDriver, waiting for the loading spinner, and extracting href links.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import csv
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time


def setup_driver(driver_path):
    """
    Set up the Chrome WebDriver with options and return the driver.
    """
    options = webdriver.ChromeOptions()
    service = Service(executable_path=driver_path)
    driver = webdriver.Chrome(service=service, options=options)
    return driver


def wait_for_spinner(driver):
    """
    Wait for the loading spinner to disappear.
    """
    try:
        element_present = EC.presence_of_element_located((By.CLASS_NAME, 'MuiCircularProgress-root'))
        WebDriverWait(driver, 10).until_not(element_present)
    except Exception as e:
        print(f"An error occurred: {e}")

def extract_href_links(driver, base_xpath):
    """
    Extract href links from the specified base_xpath without a limit.
    """
    href_links = []
    i = 1

    while True:
        xpath = f'{base_xpath}[{i}]/div[1]/div/a'

        try:
            href_element = driver.find_element(By.XPATH, xpath)
            href_link = href_element.get_attribute("href")
            href_links.append(href_link)
            i += 1
        except NoSuchElementException:
            # Break the loop when no more elements are found
            break
    return href_links

Scraping Data and Writing to CSV

The script then iterates through the extracted href links, navigates to each product page, and scrapes relevant information. The data is written to a CSV file.

To find an element and its class name on a webpage (a toy example follows this list):

  • Right-click on the element (e.g., the product name) you want to extract information from.
  • Select “Inspect” from the context menu and note the tag and class name highlighted in the Elements panel.
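
For instance, if the product name shows up in the Elements panel as an <h1> with class "v2", that maps directly to a BeautifulSoup lookup like the one used in the function below. A tiny, self-contained illustration (the markup string is made up for the example):

from bs4 import BeautifulSoup

# Hypothetical markup resembling what "Inspect" reveals for the product name.
html = '<h1 class="v2">Example Electronics Store</h1>'
soup = BeautifulSoup(html, 'html.parser')

name_tag = soup.find('h1', class_='v2')  # tag and class name found via "Inspect"
product_name = name_tag.text.strip() if name_tag else ""
print(product_name)  # Example Electronics Store
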
def scrape_and_write_data(driver, href_links, csv_file_path):
    """
    Scrape data from each href link and write it to a CSV file.
    """
    with open(csv_file_path, mode='w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Product Name", "Product Type", "Product Locality", "Product Rating", "Merchant Visits", "Reviews"])

        for link in href_links:
            driver.get(link)
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            product_name = soup.find('h1', class_='v2').text.strip() if soup.find('h1', class_='v2') else ""
            product_type = soup.find('p', class_='merchant-establishment hide-mb').text.strip() if soup.find('p', class_='merchant-establishment hide-mb') else ""
            product_loc = soup.find('a', class_='merchant-locality').text.strip() if soup.find('a', class_='merchant-locality') else ""
            product_rating = soup.find('p', class_='rating-desc').text.strip() if soup.find('p', class_='rating-desc') else ""
            merchant_visits = soup.find('div', class_='merchant-visits').text.strip() if soup.find('div', class_='merchant-visits') else ""
            reviews = ' '.join([review.text.strip() for review in soup.find_all('p', class_='review')])

            writer.writerow([product_name, product_type, product_loc, product_rating, merchant_visits, reviews])

Navigating to the Website and Extracting Links

Next, the script initializes the WebDriver, opens the Magicpin website, and waits for the loading spinner to disappear.

To get the XPath:

  • To obtain the base_xpath for a specific element on a webpage, you can use the browser developer tools to inspect the HTML structure. The base_xpath is essentially the unique path to the HTML element you want to target.
  • In the Elements panel, right-click the highlighted node and select “Copy” > “Copy XPath” to copy the full XPath to the clipboard. (A quick way to verify the XPath is sketched below.)
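
Before wiring the XPath into the full script, it helps to confirm that it actually matches the listing cards. A minimal sketch (using the same hypothetical driver path as above and the base_xpath value used in main() below):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

driver_path = "C:/Users/yourname/Downloads/chromedriver-win64/chromedriver.exe"
driver = webdriver.Chrome(service=Service(executable_path=driver_path))
driver.get("https://magicpin.in/india/New-Delhi/All/Electronics/")
time.sleep(10)  # give the listings time to render before counting

base_xpath = '//*[@id="react-around-search-results"]/main/section/div[1]/article'
cards = driver.find_elements(By.XPATH, base_xpath)
print(f"XPath matched {len(cards)} listing cards")
driver.quit()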

def main():
    url = "https://magicpin.in/india/New-Delhi/All/Electronics/"
    driver_path = "C:/Users/yourname/Downloads/chromedriver-win64/chromedriver.exe"
    base_xpath = '//*[@id="react-around-search-results"]/main/section/div[1]/article'
    csv_file_path = "electronic.csv"

    # Set up the driver
    driver = setup_driver(driver_path)

    # Open the website and wait for spinner to disappear
    driver.get(url)
    time.sleep(60)
    wait_for_spinner(driver)

    # Extract href links
    href_links = extract_href_links(driver, base_xpath)

    # Scrape and write data to CSV
    scrape_and_write_data(driver, href_links, csv_file_path)

    # Quit the driver
    driver.quit()

    print(f"Data has been successfully stored in '{csv_file_path}'.")


if __name__ == "__main__":
    main()
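
Once the run finishes, a few lines of Python are enough to spot-check the output (a small sanity check against the electronic.csv file produced above):

import csv

# Print the header row and the first few scraped rows.
with open("electronic.csv", newline="", encoding="utf-8") as csv_file:
    reader = csv.reader(csv_file)
    for i, row in enumerate(reader):
        print(row)
        if i >= 3:
            break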
