Web Scraping Using Python (Selenium)

Dipam Sarkar (idatawiz)
Published in GatorHut · 5 min read · Aug 1, 2023

Web Scraping using Selenium

Web Scraping, also known as “Crawling” or “Spidering,” is a technique for web harvesting, which means collecting or extracting data from websites. Here, we use bots to extract content from HTML pages and store it in a database (or a CSV file, or some other file format). A scraper bot can be used to replicate entire website content, which is why many varieties of digital businesses have been built around data harvesting and collection.

Python web scraping can help us extract an enormous volume of data about customers, products, people, stock markets, etc. The data then has to be put to optimal use for the betterment of the service. There is ongoing debate about whether web scraping is legal, but the fact is that it can be used for many legitimate use cases.

Applications of Web Scraping

Sentiment analysis: While most websites used for sentiment analysis, such as social media sites, have APIs that allow users to access data, this is not always enough. To obtain real-time data on conversations, research, and trends, it is often more suitable to scrape it from the web.

Market Research: E-Commerce sellers can track products and pricing across multiple platforms to conduct market research regarding consumer sentiment and competitor pricing. This allows for very efficient monitoring of competitors and price comparisons to maintain a clear view of the market.

Technological Research: Driverless cars, face recognition, and recommendation engines all require data. Web scraping often offers valuable information from reliable websites and is one of the most convenient and widely used data collection methods for these purposes.

Machine Learning: While sentiment analysis is a popular machine-learning algorithm, it is only one of many. One thing all machine-learning algorithms have in common, however, is the large amount of data required to train them. Machine learning fuels research, technological advancement, and overall growth across all fields of learning and innovation. In turn, web scraping can fuel data collection for these algorithms with great accuracy and reliability.

How to perform Web Scraping using Selenium and Python

Pre-Requisites:

· Set up a Python Environment.

· Install Selenium v4. If you have conda or Anaconda set up, using the pip package installer is the most convenient way to install Selenium. Simply run this command (in the Anaconda Prompt, or directly in a Linux terminal):

pip install selenium

Then import the modules we will need:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

But wait a minute: what should we fetch? That is the main question, and I guess you have it too.

When I learnt NLP and web scraping techniques, I was curious to try something interesting, so in this blog we will see how to scrape reviews from the IMDB website; the movie I have chosen is “Oppenheimer”.

So let us open the URL first:

reviewss = {'reviw': []}
driver = webdriver.Chrome()
driver.get('https://www.imdb.com/title/tt15398776/reviews?ref_=tt_urv')

In the above code, I first create a dictionary called reviewss with an empty list under the key 'reviw', which will store the scraped reviews.

In Selenium, webdriver.Chrome() is a constructor that creates an instance of the Chrome WebDriver, a tool that enables interaction with the Chrome web browser. The Chrome WebDriver acts as a bridge between your Python code and the Chrome browser, allowing you to automate various tasks, such as navigating to web pages, filling out forms, clicking buttons, and extracting data. And driver.get() navigates to the IMDB review page.

User Reviews IMDB

When we run the above code, it navigates us to the IMDB page of the movie “Oppenheimer,” and we are ready to start scraping reviews. However, we need to consider that one page may not have enough reviews for our analysis. Typically, a page might have around 10 to 15 longer reviews, or even up to 50 smaller ones. This may not be sufficient for sentiment analysis. To fetch more reviews, we need to click the “Load More” button on the website, which dynamically loads additional reviews without refreshing the entire page. By doing so, we can gather a more substantial dataset for our sentiment analysis.

Load more button click IMDB

def is_load_more_visible():
    return EC.visibility_of_element_located((By.XPATH, '//button[@class="ipl-load-more__button"]'))

max_button_clicks = 100
button_clicks = 0

Maybe you are wondering: what is this XPath? Let me explain its importance.

XPath is a language used to locate elements within an XML or HTML document. It is crucial for precise and dynamic element selection in web scraping and automation tasks. It enables efficient navigation, filtering, and interaction with web elements across different browsers, making it a powerful tool for data extraction and automation.
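To see how such an expression picks out elements, here is a small offline illustration using Python's standard-library xml.etree.ElementTree, which supports a subset of XPath. The class name mirrors the one used on IMDB's review page, but the toy markup itself is invented:

```python
import xml.etree.ElementTree as ET

# A toy fragment imitating IMDB's review markup (invented for illustration)
html = """
<html>
  <body>
    <div class="text show-more__control">Great movie!</div>
    <div class="text show-more__control">A masterpiece.</div>
    <div class="other">Not a review</div>
  </body>
</html>
"""

root = ET.fromstring(html)
# ElementTree supports the //tag[@attr='value'] subset of XPath via './/'
reviews = root.findall(".//div[@class='text show-more__control']")
print([div.text for div in reviews])  # ['Great movie!', 'A masterpiece.']
```

The same `//div[@class="..."]` pattern, handed to Selenium's By.XPATH, is what we use below to match every review container on the live page.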

Identify Xpath & Class IMDB
while button_clicks < max_button_clicks:
    try:
        # Wait up to 10 seconds for the "Load More" button, then click it
        load_more_button = WebDriverWait(driver, 10).until(is_load_more_visible())
        load_more_button.click()
        time.sleep(2)  # give the newly loaded reviews time to render
        button_clicks += 1
    except Exception:
        break

Here I have used a loop to click the “Load More” button multiple times to load additional content on a webpage. The purpose is to fetch more reviews for analysis. It waits for the button to be visible (up to 10 seconds) and then clicks it. The loop continues until either the maximum number of button clicks (max_button_clicks) is reached or an exception occurs. The time.sleep(2) ensures a short pause after each click to allow the new content to load.
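The control flow of that loop can be exercised without a browser. In this sketch, a stand-in function plays the role of the "Load More" wait-and-click, "succeeding" a few times and then raising once the content runs out; all the names here are invented for illustration:

```python
# Stand-in for the wait-for-button step: succeeds a few times, then "times out"
class FakeTimeout(Exception):
    pass

clicks_available = 3  # pretend the page only has 3 more batches of reviews

def wait_for_load_more():
    global clicks_available
    if clicks_available == 0:
        raise FakeTimeout("button no longer visible")
    clicks_available -= 1  # one batch consumed per "click"

max_button_clicks = 100
button_clicks = 0

# Same shape as the Selenium loop: keep clicking until the cap or an exception
while button_clicks < max_button_clicks:
    try:
        wait_for_load_more()
        button_clicks += 1
    except FakeTimeout:
        break

print(button_clicks)  # 3
```

The key design point is that the loop has two exits: the click cap (so a very long page cannot run forever) and the exception (so a fully loaded page stops the loop naturally).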

We are almost there; just wait a few more seconds and we can see the magic 😀

review_elements = driver.find_elements(By.XPATH, '//div[@class="text show-more__control"]')

for i in review_elements:
    reviewss['reviw'] = reviewss['reviw'] + [i.text]

driver.quit()

review_elements = driver.find_elements(By.XPATH, '//div[@class="text show-more__control"]')

This line finds all the review elements on the webpage by using the XPath expression

'//div[@class="text show-more__control"]'

and stores them in the review_elements list.

for i in review_elements:

This line starts a loop that iterates through each review element in the review_elements list.

reviewss['reviw'] = reviewss['reviw'] + [i.text]

This line appends the text content of each review element i to the list reviewss['reviw'], which is part of the reviewss dictionary. It accumulates all the review texts.

driver.quit()

This line closes the WebDriver, which means it closes the web browser that was being controlled by Selenium after the reviews have been fetched.
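Once the browser is closed, the collected texts are just an ordinary Python list, so persisting them — for example to a CSV file, as mentioned at the start — needs only the standard library. The reviews below are made up for illustration, standing in for the scraped contents of reviewss['reviw']:

```python
import csv

# Example data standing in for reviewss['reviw'] after scraping
reviewss = {'reviw': ["Great movie!", "A masterpiece.", "Too long for me."]}

# Write one review per row, with a header row on top
with open("oppenheimer_reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["review"])
    for text in reviewss['reviw']:
        writer.writerow([text])

# Reading it back confirms the rows round-trip correctly
with open("oppenheimer_reviews.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(len(rows))  # 4  (header + 3 reviews)
```

From a file like this, the reviews can be loaded straight into pandas or any other analysis tool for the sentiment-analysis step.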

Sounds scary, right? Well, I would suggest you watch a few videos or read blogs on how to get the XPath from any website; that is the main thing you will need. For the rest of the code you can take reference from Google or ChatGPT/Bard, but remember: don't copy everything; learn the technique and the logic behind it.

User Reviews Scraped From IMDB

As you can see, we have successfully fetched all the reviews from the IMDB website and are ready to do further analysis.

In my next blog I will briefly explain how we can do sentiment analysis using these reviews.

Till then Best of luck, Discovering Insights, Empowering Decisions!!
