The Art of Web Scraping in Python using Selenium

Sadiquetimileyin
Published in Data Epic · 6 min read · Dec 5, 2023

What is web scraping? Web scraping is the process of extracting data from websites. It is a powerful technique that transforms how data is gathered and processed, and with the abundance of data on the internet, it has become a vital tool for both consumers and corporations.

Selenium is a Python library that can be used to perform automated testing on web applications. It can simulate user inputs such as entering text, clicking buttons, and submitting forms, which also makes it well suited for scraping sites. Selenium is an open-source tool that works on all major operating systems and browsers.

Why use Selenium for web scraping?

Python offers a large number of modules and frameworks that make it simple to extract data from websites, which makes it a popular programming language for web scraping.

Using Python and Selenium for web scraping offers several advantages over other web scraping techniques:

  1. Dynamic websites: Dynamic websites are built with JavaScript and other scripting languages, and they frequently have elements that only become visible once the page finishes loading or the user interacts with it. Because Selenium can interact with these components, it is an effective tool for scraping data from dynamic websites.
  2. User interactions: Selenium is capable of simulating scrolling, clicks, and form submissions. This enables you to scrape websites that demand input from users, such as login forms (see the sketch after this list).
  3. Debugging: You can step through the scraping process with Selenium in debug mode, seeing what the scraper is doing at each stage. When something goes wrong, this helps you troubleshoot it. (reference: nanonets.com)
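As a quick taste of those user interactions, here is a minimal sketch of driving a page the way a human user would. The page URL, field name, and element ids are hypothetical placeholders, and the imports and the By selector are explained step by step later in this article:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # hypothetical login page

# Type into a form field and press Enter, just like a human user
username = driver.find_element(By.NAME, value="username")  # assumed field name
username.send_keys("my_user")
username.send_keys(Keys.RETURN)

# Clicking a button works the same way
driver.find_element(By.ID, value="submit-btn").click()  # assumed button id

# For dynamic pages, an explicit wait gives elements time to appear
results = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))  # assumed element id
)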

Setting up Selenium:

First and foremost, install the selenium package using pip, either in your local terminal or directly in your IDE:

pip install selenium

Then, we install webdriver-manager. WebDrivers have a common problem: you need to manually update the driver to match the version of Chrome you are using, so in this article we will cover a way to keep the Chrome driver updated automatically. webdriver-manager downloads and installs the appropriate WebDriver for you. To install it, use the following command:

pip install webdriver-manager

Setup complete.

Moving forward, here is the step-by-step setup for Selenium.

Step 1: The imports

The imports below are the basic ones needed when initializing Selenium. We will be focusing mainly on the Chrome browser.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

Step 2: Web-driver manager

To set up webdriver-manager for the Chrome browser, use the following imports:

from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

The code below creates a driver that keeps your Chrome driver up to date automatically, regardless of which version of Chrome you are using:

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

Step 3: Basic Selenium functions

  1. The get() function is used to load a website once its link has been provided, as shown below:
link = "https://google.com" # This is just a reference
driver.get(link)

2. The find_element() and find_elements() functions. The difference between these two is that find_element() returns the first match for the selector on the webpage (e.g. for h1, it returns the first h1 on the page), while find_elements() fetches all the h1’s on the page, and we use a for loop to access the data it returns. Implementation below:

# For the find element
driver.find_element(By.TAG_NAME, value='h1')

# For find elements
driver.find_elements(By.CSS_SELECTOR, value='h1')

# Don't worry if the By selector looks like a mystery now; we will cover it later in this article

The three functions above are the main ones you will need for basic Selenium usage.

As for the for loop solution mentioned above, the loop basically stores the output of find_elements() directly into a list:

# For links
links = driver.find_elements(By.CSS_SELECTOR, value='.directory-content-box-inner a')
linked = []
for link in links:
    linked.append(link.get_attribute('href'))

# For texts
element = driver.find_elements(By.TAG_NAME, value='h3')
elements = []
for txt in element:
    elements.append(txt.text)

Step 4: Navigating using the By selector

The By selector consists of 8 major selector types: NAME, ID, CSS_SELECTOR, TAG_NAME, CLASS_NAME, XPATH, LINK_TEXT, and PARTIAL_LINK_TEXT. The following are the various ways to locate elements on a page.

<input type="text" name="passwd" class="passwd-class" id="passwd-id" /> # An example of basic HTML markup

By.ID:
This method is used when the id of an HTML element is given, as shown below.

element = driver.find_element(By.ID, value="passwd-id")

By.NAME:
This method is used when a name is indicated in the HTML code, as shown below.

element = driver.find_element(By.NAME, value="passwd")

By.CSS_SELECTOR:
This method uses CSS selectors to search through a webpage and pick out a specific part of the code, even when it looks similar to other parts of the page.

element = driver.find_element(By.CSS_SELECTOR, value="input#passwd-id")
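CSS selectors can match on more than a tag and an id. Here are a few more patterns, written against the sample input element above purely for illustration:

driver.find_element(By.CSS_SELECTOR, value=".passwd-class")         # by class
driver.find_element(By.CSS_SELECTOR, value="form input")            # descendant of a form
driver.find_element(By.CSS_SELECTOR, value="input[name='passwd']")  # by attribute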

By.XPATH:
When using this method we can write the XPath manually, but we have to be very careful; the best practice is to copy it directly from your browser's inspect page, as shown below.

[Image: copying an element's XPath from the Chrome inspect page]
element = driver.find_element(By.XPATH, value="//input[@id='passwd-id']")

By.CLASS_NAME:
This method selects the part of the HTML code by the name of the class indicated, as shown below.

element = driver.find_element(By.CLASS_NAME, value="passwd-class")

By.TAG_NAME:
This method selects the part of the HTML code by the name of the tag, as shown below. Examples of tag names are h1, a, etc.

element = driver.find_elements(By.TAG_NAME, value="input")

By.LINK_TEXT:
This method uses the text attached to anchor tags in HTML, as shown in the image below.

[Image: selecting an element by its link text]
element = driver.find_element(By.LINK_TEXT, value="click here")

By.PARTIAL_LINK_TEXT:
This is the same as above, except that it does not require the full link text; a portion of it is enough for the match to work.

element = driver.find_element(By.PARTIAL_LINK_TEXT, value="here")

Step 5: Viewing text in your Python IDE

This step is necessary in case we want to save data from the website into a file, a list, and so on. We use the following to pick outputs from the website.

  1. The .text attribute at the end of an element lets Python turn the element into the readable text a user sees on the page. An example is shown below.
element = driver.find_element(By.TAG_NAME, value='h1')
element_name = element.text

2. The get_attribute() function is mostly used to pull links out of anchor tags via the "href" attribute.

element = driver.find_element(By.TAG_NAME, value='a')
link = element.get_attribute('href')

The code above extracts the link found in the first anchor tag.

Step 6: Saving into a file

When saving all the output into a file, the CSV format is usually preferable because it is easier to work with later, especially during data cleaning.

with open("Canada_companies.csv", 'a') as file:
    file.write(f"{element_name},{link}\n")

Just an addition: for faster loading times we can add the headless option. This makes the web driver open the website in the background instead of in a visible window, so it does all the scraping for you without opening the webpage.

def chrome_driver():
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    drivers = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=option)
    return drivers
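Putting all the steps together, here is a minimal end-to-end sketch. The URL and the link selector are placeholders to adapt to your target page, not a definitive scraper:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

def chrome_driver():
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=option)

driver = chrome_driver()
driver.get("https://example.com/directory")  # placeholder URL

# Collect headings and links, assuming each h3 has a matching anchor
names = driver.find_elements(By.TAG_NAME, value='h3')
links = driver.find_elements(By.CSS_SELECTOR, value='.directory-content-box-inner a')  # placeholder selector

with open("output.csv", 'a') as file:
    for name, link in zip(names, links):
        file.write(f"{name.text},{link.get_attribute('href')}\n")

driver.quit()  # always close the browser when you are done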

Conclusion:

To sum up, using Selenium for web scraping is an effective way to retrieve data from websites. It can save you a lot of time and effort by enabling you to automate the data collection process. With Selenium, you may interact with websites in a manner similar to that of a human user and retrieve the necessary data more quickly.

Thank you for reading, more to come!
