Discover 3 Top Methods for Scraping Webpages with XPath — A Step-by-Step Guide with Code Examples

Learn how to obtain XPath expressions and how to use three popular scraping tools with them: BeautifulSoup (paired with lxml), Selenium, and Scrapy, with code examples.

Miguel Amezola
5 min read · Feb 6, 2023

Web scraping is a technique for extracting information from websites. This information can be used for a variety of purposes, such as data analysis, content aggregation, or even price comparison.

One of the most powerful tools for web scraping is XPath, a query language for selecting nodes from an XML or HTML document. XPath allows you to select elements based on their tag name, attributes, and position in the document tree. In this article, we’ll explore how to get XPath expressions and how to use three of the most popular scraping tools with them: BeautifulSoup (with a little help from lxml), Selenium, and Scrapy.
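For example, each of the following expressions targets elements in a different way (the tag and attribute names here are just illustrations):

//h2                      selects every h2 element in the document
//a[@class="nav-link"]    selects each a element whose class attribute is "nav-link"
//ul/li[1]                selects the first li child of each ul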

Getting XPath Expressions

Most modern browsers have built-in developer tools that allow you to inspect the HTML of a webpage and obtain the XPath expression of an element. To get the XPath expression in Google Chrome, for example, right-click on the element and select “Inspect” or press Ctrl + Shift + I, then right-click on the HTML element in the "Elements" tab and select "Copy > Copy XPath".
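Chrome’s “Copy XPath” usually produces an id-anchored, position-based path. For a hypothetical heading nested inside a page, the copied expression might look like this:

//*[@id="content"]/div[1]/h2

Paths generated this way are brittle, because they break whenever the page layout shifts; hand-written expressions that target stable attributes tend to hold up better.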

Beyond the browser, there are several dedicated tools for building and testing XPath expressions against a webpage or document:

  1. XPath Helper (Chrome Extension): A Chrome browser extension that provides an easy-to-use interface for testing XPath expressions.
  2. XPath Tester: A simple online tool that allows you to enter an HTML or XML document and a desired XPath expression to see the selected elements.
  3. Scrapy Shell: The Scrapy shell is a command-line tool that provides an interactive shell for testing XPath expressions. It allows you to load an HTML or XML document and test XPath expressions in real time (see the quick example after this list).
  4. Online XML Tools: A website that provides a variety of XML tools, including an XPath tester.
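For instance, you can point the Scrapy shell at a live page and query it interactively. The session below assumes the page has an h1 element:

scrapy shell 'https://example.com'
>>> response.xpath('//h1/text()').get()
'Example Domain'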

Scraping Webpages with BeautifulSoup

BeautifulSoup is a popular and widely used Python library for web scraping. It does not support XPath expressions on its own, but it pairs well with lxml, which does: you can parse the page with BeautifulSoup and hand the result to lxml for XPath queries. Here is an example:

from bs4 import BeautifulSoup
from lxml import etree
import requests

# Make a GET request to the website
response = requests.get('https://example.com')

# Create a BeautifulSoup object using the lxml parser
soup = BeautifulSoup(response.text, 'lxml')

# Convert the soup into an lxml tree, which supports XPath queries
dom = etree.HTML(str(soup))

# Use XPath to select the desired element(s)
results = dom.xpath('//your/xpath/expression')

# Access the element(s) content
for result in results:
    print(result.text)

In this example, we first make a GET request to the website using the requests library. We then parse the response with BeautifulSoup, convert the parse tree into an lxml document, and call its xpath method to select the desired element(s). Finally, we access the content of each element by printing its text property.
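XPath can also pull attribute values directly, which is handy for collecting links; with lxml, an attribute query returns plain strings rather than elements. A minimal sketch of the same request using lxml on its own:

from lxml import etree
import requests

response = requests.get('https://example.com')

# Parse the HTML directly into an lxml tree
dom = etree.HTML(response.text)

# Attribute queries like @href return strings, not elements
links = dom.xpath('//a/@href')
print(links)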

Scraping Webpages with Selenium

Selenium is a browser automation tool that provides a method for searching the HTML of a webpage using XPath expressions. It is often used for more complex scraping tasks that require interaction with a webpage, such as filling out forms or clicking buttons (an example of that follows below). Here is an example of how to use Selenium to scrape a webpage:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the webdriver
driver = webdriver.Chrome()

# Navigate to the website
driver.get("https://example.com")

# Wait for the page to load
time.sleep(3)

# Use XPath to select the desired element(s)
results = driver.find_elements(By.XPATH, "your xpath expression")

# Access the element(s) content
for result in results:
    print(result.text)

# Close the webdriver
driver.quit()

In this example, we first initialize the Chrome webdriver using webdriver.Chrome(). Then, we navigate to the desired website using the get method and use find_elements with By.XPATH to select the desired element(s) based on the XPath expression (the older find_elements_by_xpath helper was removed in Selenium 4). Finally, we access the content of the element(s) by printing the text property and close the webdriver using the quit method.
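A fixed time.sleep is a blunt instrument: it always waits the full duration and can still fire too early on a slow page. Selenium’s WebDriverWait polls for a condition instead. A minimal sketch, assuming the page eventually contains at least one element matching your expression:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for matching elements to appear, then proceed immediately
results = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "your xpath expression"))
)

for result in results:
    print(result.text)

driver.quit()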

The with keyword can be used with the Selenium WebDriver to close the driver automatically when it is no longer needed. The with statement wraps a block of code and guarantees that resources are cleaned up after the block finishes executing, so the browser and any open connections are closed even if an error occurs inside the block.

Here’s an example of how you can use the with statement to quit the Selenium WebDriver:

from selenium import webdriver

with webdriver.Firefox() as driver:
    driver.get("https://www.example.com")
    # Perform actions on the website

# The driver is automatically closed when the with block is exited

In this example, a Firefox WebDriver is created and used to navigate to "https://www.example.com". The with statement guarantees that the driver is closed when the block of code finishes executing, which prevents stray browser processes, memory leaks, and other resource-related issues.
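Because Selenium drives a real browser, it can also interact with the page before scraping, as mentioned above. A minimal sketch of filling out a search form (the two XPath locators are hypothetical placeholders for whatever the real page uses):

from selenium import webdriver
from selenium.webdriver.common.by import By

with webdriver.Chrome() as driver:
    driver.get("https://example.com")

    # Type a query into a search box and submit the form
    driver.find_element(By.XPATH, '//input[@name="q"]').send_keys("xpath")
    driver.find_element(By.XPATH, '//button[@type="submit"]').click()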

Scraping Webpages with Scrapy

Scrapy is a Python framework for large-scale web scraping that provides a method for searching the HTML of a webpage using XPath expressions. It is designed to handle complex scraping tasks and can be used to extract data from multiple pages. Here is an example of how to use Scrapy to scrape a webpage:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        'https://example.com',
    ]

    def parse(self, response):
        # Use XPath to select the desired element(s)
        results = response.xpath('//your/xpath/expression')

        # Access the element(s) content
        for result in results:
            yield {'text': result.get()}

In this example, we create a Scrapy spider named `ExampleSpider` and define a `parse` method that selects the desired element(s) with the XPath expression and yields each match as a dictionary; the selector’s get() method returns the matched content as a string.
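Because Scrapy is built for multi-page crawls, the same parse method can also follow links and scrape the pages it finds with the same callback. A minimal sketch, assuming the site exposes a “next page” link (that locator is hypothetical):

def parse(self, response):
    # Yield the matches on the current page
    for result in response.xpath('//your/xpath/expression'):
        yield {'text': result.get()}

    # Queue up each "next page" link to be parsed by this same method
    for href in response.xpath('//a[@class="next"]/@href').getall():
        yield response.follow(href, callback=self.parse)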

To run the Scrapy spider, you will need to use the following command in your terminal or command prompt:

scrapy runspider spider_name.py

Replace “spider_name.py” with the name of the file containing your spider (e.g. “example.py” for the spider above). This command starts the spider and begins the web scraping process, printing the scraped items to the terminal or command prompt. If you want to save the data to a file, you can use Scrapy’s built-in feed export functionality.
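For example, the -o flag writes the yielded items to a file, inferring the format from the extension:

scrapy runspider example.py -o results.json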

Conclusion

In this article, we covered how to get XPath expressions and how to use three popular methods for scraping webpages with XPath: BeautifulSoup (paired with lxml), Selenium, and Scrapy. Each method has its own strengths and weaknesses, so you should choose the one that best fits your specific scraping needs. Whether you are scraping a single page or many, XPath expressions will greatly simplify the process of extracting the desired information from the HTML of a webpage.

XPath: The Harmonious Hymn of Web Scraping’s Might

XPath, the tool for web scraping delight
Data extraction, done with ease and insight
Top three methods, to lead the way
Code examples to make your work, A-Okay

Expressions key, with precision and grace
Elements selected, with just the right pace
BeautifulSoup, Selenium, Scrapy, oh so bright
Web scraping projects, done with all your might!

Thank you for reading! If you enjoyed this article, please consider following me on Medium for more content like this. I regularly write about machine learning, natural language processing, and cloud security using Python, React, and AWS. Your support will help me continue to create valuable and engaging content in these areas, and I would be honored to have you as a follower. Thank you!

