Disclaimer:

The content provided in this blog post is for educational purposes only. The techniques and methods discussed are intended to help readers understand the concepts of scraping dynamically loaded content using tools like Selenium and BeautifulSoup.

It is important to note that web scraping can potentially raise legal and ethical concerns. Before attempting to scrape any website, it is crucial to review and adhere to the website’s terms of service, robots.txt file, and any other guidelines or restrictions they may have in place. Respect the website owner’s rights and ensure that your scraping activities align with their policies.

The author and the publisher of this blog post shall not be held responsible for any misuse or unethical use of the information provided. Readers are solely responsible for their own actions and should exercise caution and discretion when engaging in web scraping activities.

By reading and implementing the techniques discussed in this blog post, you acknowledge and agree to the above disclaimer.

Preview of what's to come

Introduction

Dynamically loaded content is content on a web page that is loaded or generated dynamically after the page initially loads. This means that the content is not present in the HTML source code, and is instead added or modified using JavaScript or other similar client-side technologies.

Traditional web scraping involves fetching the HTML source code at your desired URL and parsing that returned HTML to extract your information. Based on the explanation of dynamically loaded content above, you can probably guess that it poses a problem for this traditional approach, and you'd be right. Because dynamically loaded content isn't added until after the initial page load, it will not appear in the HTML you fetch, preventing you from getting all of the data you may be searching for. This is where the combination of Selenium and BeautifulSoup comes into play.

Selenium is a popular browser automation framework that allows you to control web browsers programmatically. This means you can automate interactions on web pages such as filling forms, clicking buttons, and scrolling. It lets you simulate user interactions and open a URL in a headless browser, which will load all dynamically generated content for your scraper to use.

BeautifulSoup is a Python library used for parsing HTML and XML content, providing convenient methods and syntax that let you easily navigate and extract data from your parsed HTML. BeautifulSoup cannot load dynamic content on its own, which is why you use it in conjunction with Selenium: Selenium loads the dynamic content, and BeautifulSoup parses it.

Setting up ChromeDriver and Chrome for Selenium

When you are working with Selenium for web scraping, it is essential to have ChromeDriver and Chrome set up on your machine so that Selenium can automate your browser. Note that your setup may vary slightly depending on your operating system. Also note that if you are a Windows user but work through WSL, you will need to install Chrome and ChromeDriver for Linux.

Note: Before getting set up, it's imperative that you know which version of Chrome you have installed, as you will need to install the matching version of ChromeDriver. To find your Chrome version, follow these steps:

  1. Open Chrome and click on the three-dot menu in the top-right corner
  2. From the dropdown, navigate to “Help” > “About Google Chrome”
  3. Take note of the Version shown

Windows Setup

  1. Download ChromeDriver: You can find the official ChromeDriver download here: https://sites.google.com/chromium.org/driver/. Make sure you download the ChromeDriver version that corresponds with your Chrome browser version, as well as the correct build for your processor (32-bit or 64-bit).
  2. Extract ChromeDriver: After downloading ChromeDriver through their JSON endpoints, extract it and put it in a convenient location on your system. If you'd like, you could put it in the root folder where you'll be writing your scraper.
  3. Add ChromeDriver to PATH: Move the ChromeDriver executable to a directory in your system's PATH environment variable, which will allow you to run it from any location in your terminal. Note: If you put it in your project folder instead, you will have to configure Selenium with the explicit path to the ChromeDriver executable.

Mac Setup

  1. Install ChromeDriver with Homebrew: Open your terminal application and execute the following command to install ChromeDriver using Homebrew. This command will automatically install the latest stable version of ChromeDriver that is compatible with your installed version of Chrome.
brew install --cask chromedriver

2. Verify the installation: Ensure ChromeDriver was correctly installed by running the following command:

chromedriver --version

Linux / WSL Setup

  1. Download ChromeDriver: You can find the official ChromeDriver download here: https://sites.google.com/chromium.org/driver/. Make sure you download the ChromeDriver version that corresponds with your Chrome browser version.
  2. Extract ChromeDriver: After downloading ChromeDriver through their JSON endpoints, extract it and put it in a convenient location on your system. If you'd like, you could put it in the root folder where you'll be writing your scraper.
  3. Add ChromeDriver to PATH: Move the ChromeDriver executable to a directory in your system's PATH environment variable, which will allow you to run it from any location in your terminal. Note: If you put it in your project folder instead, you will have to configure Selenium with the explicit path to the ChromeDriver executable.

Note: you may need to make sure that chromedriver has executable permissions. To do so, you can run this command with the path to where you stored ChromeDriver:

sudo chmod +x /path/to/chromedriver

Using Selenium to Load Dynamic Content

Now that you have ChromeDriver installed, let's talk about some general basics of using Selenium. Note: we will go more in-depth with a complete example later in this blog.

  1. Install Selenium: To start, you will need to cd into your project's directory and run one of the following commands to install it, depending on your preference.
Using pip (Python package manager):

pip install selenium

Using pipenv (Python package manager with virtual environment):

pipenv install selenium

Using Anaconda (conda package manager):

conda install -c conda-forge selenium

2. Import the required modules: While your required modules may vary for your specific scraping tasks, there are some modules from Selenium you'll need to import, such as ‘webdriver’, ‘Service’, and ‘Options’.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

3. Instantiate the WebDriver: In order to use WebDriver in Selenium, you’ll need to create an instance of the WebDriver by specifying the path to the ChromeDriver executable, which acts as a bridge between your scraper script and the Chrome browser.

options = Options()
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=options)

4. Configure WebDriver options: Selenium allows you to customize the options for the WebDriver, such as running it in headless mode, which means it doesn't open a visible browser window. Running in headless mode can be beneficial for performance, resource usage, testing, and debugging, and it lets you automate web scraping tasks without visual interference, which is nice for running the script in the background. Note that arguments need to be added to the Options object before it is passed to webdriver.Chrome(), so in practice this call goes between creating the options and instantiating the driver, as shown in the sketch below.

options.add_argument("--headless")
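
Putting steps 3 and 4 together, the ordering looks something like this (a minimal sketch; the ChromeDriver path is a placeholder, and the imports are the ones shown in step 2):

chrome_driver_path = "/path/to/chromedriver"   # placeholder path to your ChromeDriver

options = Options()
options.add_argument("--headless")             # configure options before creating the driver
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=options)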

5. Load a web page: Use the WebDriver's get() method to navigate to the desired web page by providing the URL as the argument. Selenium will load the page and wait for the initial page load to finish; note that content injected afterwards by JavaScript may still need an explicit wait before it shows up in the page source.

url = "https://www.example.com"
driver.get(url)
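
If the content you need is rendered after the initial load, you can wait for a specific element to appear before grabbing the page source. A minimal sketch using Selenium's WebDriverWait (the CSS selector here is a placeholder you'd replace with one from your target page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a placeholder element to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.card-list"))
)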

6. Retrieve the page source with dynamic content: Now that the page has loaded with all its dynamic content, you can use WebDriver's page_source attribute and store the HTML source code as a string in a variable to access later.

page_source = driver.page_source

7. Perform cleanup: Now that you've retrieved the desired HTML content, you can close your WebDriver instance, which will close the Chrome browser that Selenium opened.

driver.quit()

Extracting Data with BeautifulSoup

  1. Install BeautifulSoup: To start, you will need to cd into your project's directory and run one of the following commands to install it, depending on your preference.
Using pip (Python package manager):

pip install beautifulsoup4

Using pipenv (Python package manager with virtual environment):

pipenv install beautifulsoup4

Using Anaconda (conda package manager):

conda install -c anaconda beautifulsoup4

2. Import BeautifulSoup in your scraper.py: To do this, you can place the following import statement:

from bs4 import BeautifulSoup

3. Use Selenium to extract the page source: Since BeautifulSoup requires HTML for parsing, you will have to run through Selenium first to acquire the dynamically loaded source HTML and save it to a variable to pass to BeautifulSoup.

4. Pass the page source to BeautifulSoup for parsing: To instantiate a BeautifulSoup object, you can call BeautifulSoup() and pass in the page source variable and the parser to be used, since BeautifulSoup supports different parsers. Some parsers include html.parser, lxml, and html5lib. It can be helpful to assign the result to a variable to make it easier to reuse throughout your scraper.

soup = BeautifulSoup(page_source, "html.parser")

5. Use BeautifulSoup to locate and extract the desired content: BeautifulSoup gives you multiple ways to target content within the HTML, whether by tag name, class, ID, text content, attribute values, or a combination of them. The methods you use will vary depending on the HTML you've gathered, so it's imperative to understand its structure and use trial and error to grab what you need. Note: it can be very helpful to use a debugger like ipdb to print out what you're targeting and make sure you're getting the right content.

# Find an element by tag name
element = soup.find("tag_name")

# Find an element by CSS class
element = soup.find(class_="class_name")

# Find an element by ID
element = soup.find(id="element_id")

# Extract the text content of an element
text_content = element.text

# Extract attribute values from an element
attribute_value = element["attribute_name"]
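
Beyond find(), BeautifulSoup also offers find_all() for grabbing every match and select()/select_one() for CSS selectors, which are handy when you need to combine conditions (the tag and class names below are placeholders):

# Find every element matching a tag and class
elements = soup.find_all("div", class_="class_name")

# Use a CSS selector to combine tag, class, and nesting conditions
element = soup.select_one("div.class_name > a")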

Putting It All Together

Now that we’ve discussed the basics, I will try my best to break down a scraper I wrote to use in a TCG collection tracker I’m working on right now. If you’d like to view the code for the scraper outside of this blog you can view it on my GitHub here: https://github.com/Evan-Roberts-808/Collection-Tracker/blob/main/server/scraper.py

Note: Since scraping is entirely reliant on the structure of the HTML source code you're parsing, the scraper you will see will only work with the specific site it was written for. While the methods will be similar for your scraper, you will have to break down your page's source code in order to select your elements properly.

Breakdown:

To start, I first import everything that will be required for the scraper, including Selenium's WebDriver, BeautifulSoup, ipdb, and other resources like the models for my SQLAlchemy tables.

After our imports we define our Scraper class, which acts as a container for all of the scraper's methods.

Our first method within it is the __init__ method, which gets called when an instance of the Scraper class is created, initializing the object and its initial state. This __init__ method takes in two parameters, chrome_driver_path and base_url. Within it, we assign base_url to the self.base_url attribute, initialize self.cards as an empty list used to store the scraped cards, set self.image_directory to represent where in the directory the images will be saved, and set self.script_directory, which uses the os.path module to find where the scraper file lives, making it easier to resolve relative paths.

Note: base_url is passed in specifically for this project due to the nature of the site's structure; storing the base_url made it easier to access the images, since the src attributes did not provide the full URL.
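
A minimal sketch of what such an __init__ might look like (attribute names follow the description above; the image directory value is a placeholder):

import os

class Scraper:
    def __init__(self, chrome_driver_path, base_url):
        self.chrome_driver_path = chrome_driver_path
        self.base_url = base_url
        self.cards = []                          # stores the scraped cards
        self.image_directory = "images"          # placeholder directory for saved images
        # Directory the scraper file lives in, used to resolve relative paths
        self.script_directory = os.path.dirname(os.path.abspath(__file__))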

The get_page method takes a url parameter representing the URL of the page to be fetched. Within get_page, an instance of Selenium's Options class is created, and we use it to add the --headless option so the scraper runs in a browser that is not visible, since this scraper doesn't require any user input. An instance of Service is created to represent the ChromeDriver service, and we pass it our chrome_driver_path attribute. An instance of webdriver.Chrome is then created to represent the browser that Selenium will be controlling, and we pass in the service and options we declared above. We use this driver to call .get() with the provided url, instructing the browser to navigate to the specified URL; the page source is assigned to the page_source variable, .quit() is called to terminate the browser, and page_source is returned.
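
The method follows the same steps covered in the Selenium section earlier; a sketch along those lines, continuing the Scraper class above and reusing the Selenium imports shown earlier (not the verbatim source, which lives in the GitHub repo linked above):

    def get_page(self, url):
        options = Options()
        options.add_argument("--headless")       # run without a visible browser window
        service = Service(self.chrome_driver_path)
        driver = webdriver.Chrome(service=service, options=options)

        driver.get(url)                          # navigate to the target URL
        page_source = driver.page_source         # grab the fully rendered HTML
        driver.quit()                            # close the browser
        return page_source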

Our next method is download_image, which is responsible for downloading an image from a given URL and saving it to the directory we declared in our __init__. This method takes in two parameters: url and filename. The url parameter is the src URL for the image, and the filename parameter is what the file will be named; in this project it's important to follow a specific filename pattern, since the filename is used to generate the URL our API uses to load the images on the front-end.

Within this method we define a local variable, filepath, that holds the complete path where the image will be saved. It is constructed using the os.path.join() function, which takes in self.script_directory, self.image_directory, and filename to create the full path.

requests.get(url) then sends an HTTP GET request to the specified URL to retrieve the image content. response.raise_for_status() then checks whether the image content was successfully retrieved; if not, it raises an exception, preventing the method from trying to save an image.

The with open(filepath, 'wb') as file statement opens the file at the specified filepath in binary write mode ('wb'), which allows file.write(response.content) to write the image content to the file.
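
Put together, the method looks roughly like this (again a sketch continuing the class above; it assumes requests has been imported):

    def download_image(self, url, filename):
        # Build the full path where the image will be stored
        filepath = os.path.join(self.script_directory, self.image_directory, filename)

        response = requests.get(url)
        response.raise_for_status()              # bail out if the download failed

        with open(filepath, "wb") as file:       # binary write mode for image data
            file.write(response.content)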

The next method is get_cards, which is responsible for doing the scraping using the URLs provided and creating a Card object for our database from the scraped data. The method takes a single argument, a list of URLs to parse, and uses a for loop to iterate over each one to acquire data. I had to take this approach because the website being scraped didn't have any pattern in its URLs, so the next URL couldn't be predicted automatically and each one had to be manually added to a list.

All of the processing within this method happens inside a with app.app_context() statement that creates a context for the Flask application app, which makes sure our database is properly initialized and accessible. A variable called base_raw_url is also declared; it acts as part of the image_url and is concatenated with the filename when the image is saved, creating a URL where you can view the image, which is stored in our database.

A for loop then iterates over each URL in our list and attempts to run each of our processes on it. Within this loop is a try: block that stops the process from going forward if the source HTML isn't properly pulled from the provided URL. This is where our other methods come into play: we first run our get_page method, passing in the current URL, and assign that URL's source code to a variable called card_html. We then pass card_html, along with the html.parser parser, into a BeautifulSoup object and assign it to a variable called soup to make parsing the HTML easier.
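
The skeleton of that loop looks something like this (a simplified sketch; the base_raw_url value is a placeholder, the selector logic discussed below goes where the comment sits, and the except block is covered further down):

    def get_cards(self, urls):
        with app.app_context():
            base_raw_url = "https://example.com/images/"   # placeholder base for stored image URLs

            for url in urls:
                try:
                    card_html = self.get_page(url)
                    soup = BeautifulSoup(card_html, "html.parser")

                    # ... selectors for title, description, element, etc. go here ...

                except Exception as e:
                    print(f"Error scraping {url}: {e}")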

We can now finally start targeting parts of the HTML and pulling the data we want from them. To help you find specifically what you're targeting, there are a few methods you can use. One is to use the dev tools on the web page: use the inspect-element selector to click on what you want and jump right to it in the HTML in your dev tools. From there you can look through the structure to determine the best way to grab it. Another method is to use a debugger like ipdb and set a trace after you've acquired the source code; within the debugger terminal you can then print out the HTML, copy it, and paste it into an HTML document (you may need to run it through a formatter to make it easier to read). I personally prefer this method since it allows me to search the document with ctrl+f to find specifically what I'm searching for and easily copy and paste the tag name, class name, attribute name, etc. into my scraper.

Back to the example: this one pulls the title from the page. soup.find() is used to search through the HTML structure represented by the soup object; in this case .find() will return the first <h1> tag it encounters. .text then retrieves the text content from that <h1> tag, and .strip() removes any white space on either end of the extracted text. This text is assigned to a variable called title; if no <h1> is found, title is set to None.

I found it useful to wrap each selector in your scraper in a try: except: block, so that if a specific URL doesn't contain what you're searching for, it won't cause any errors. This was useful in this case since it's a TCG scraper and some cards have attributes others do not.
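
For the title described above, that pattern might look like this (a sketch of code sitting inside the try: block of get_cards; the exact code is in the repo linked above):

try:
    title = soup.find("h1").text.strip()
except AttributeError:   # raised when no <h1> is found
    title = None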

The next example of a selector in this scraper tries to find the card's description. While reading through the HTML I saw that the description is within a <p> tag that is within a <dd> tag with the classes load-external-scripts and image-post_description. To make my selection as specific as possible I used a .find() that looks for a <dd> tag with specifically those classes. After it's found, .p is used to retrieve the first <p> tag within the matched <dd>, and just like our title, .text and .strip() are used to pull the text content and strip any white space from the start or end before it's assigned to the description variable.
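
That selector might look roughly like this (a sketch; note that passing a multi-class string to class_ matches the exact class attribute value):

try:
    description = (
        soup.find("dd", class_="load-external-scripts image-post_description")
        .p.text.strip()
    )
except AttributeError:   # the <dd> or its <p> wasn't found
    description = None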

This one was a bit of a unique case. In the card's info, the element was represented as an image, which was a bit of an issue since I specifically needed a text label of the card's element. While looking at the <img> tag in the HTML, I noticed the images have a title attribute that matched the element name, which was perfect for what I needed. To grab this I targeted a <dt> tag that specifically had the string ‘Element(s):’; this is because each of the card details was within a <dt> tag, so I needed a way to differentiate them, and matching the string within them was a perfect option. From there I used .find_next, which finds the next <img> tag, and grabbed its title attribute with ['title']. That allows me to take the string value from the title attribute and assign it to my element_title variable.
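
A sketch of that selector (again wrapped in try/except in case a card has no element listed):

try:
    element_title = soup.find("dt", string="Element(s):").find_next("img")["title"]
except (AttributeError, TypeError, KeyError):   # missing <dt>, <img>, or title attribute
    element_title = None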

If you're interested in seeing the other 8 selectors, please view the scraper on my GitHub here: https://github.com/Evan-Roberts-808/Collection-Tracker/blob/main/server/scraper.py. With the explanations above, hopefully you'll be able to decipher them all as inspiration for your own scraper.

After all of our selectors have run and assigned their values to the appropriate variables, we can create an object, in this case a Card object, by initializing it with all of the various attributes and the corresponding values that have been scraped. After this Card object is created, it is added to the database session using db.session.add() and then committed with db.session.commit(), persisting it to the database. The Card object is then appended to the cards list to keep track of which cards have already been processed.
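
In outline, that final step looks something like this (still inside the try: block of get_cards; the Card field names are illustrative, since the real model has more attributes than the few selectors walked through here):

card = Card(
    title=title,
    description=description,
    element=element_title,       # hypothetical field names for illustration
    image_url=image_url,         # built from base_raw_url plus the saved filename
)
db.session.add(card)
db.session.commit()
self.cards.append(card)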

Remember that try: statement at the very top of this method? This is the except that goes with it. If for any reason an error occurs during the scraping that kills the scrape of a page, the except takes over and prints out a message letting you know that the error occurred, and which URL it happened at, before attempting to scrape the next URL in the list.

At the very bottom of the code is where the urls list lives; this is where the URLs that need to be scraped are placed so they can be passed into the methods. We also initialize the Scraper class with the path to ChromeDriver as well as our base_url, which you may remember from when we first wrote the class at the start of this section, and assign it to a variable called scraper. After that, the .get_cards method is called on the scraper variable with the urls list passed in, so that when you run the scraper it runs through all the methods you see above and scrapes the data.
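
Which, in sketch form, looks like this (the paths and URLs are placeholders):

urls = [
    "https://example.com/cards/some-card",       # placeholder card pages
    "https://example.com/cards/another-card",
]

scraper = Scraper(
    chrome_driver_path="/path/to/chromedriver",  # placeholder path
    base_url="https://example.com",              # placeholder base URL
)
scraper.get_cards(urls)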

Considerations and Best Practices

When it comes to scraping, there are things to keep in mind to make sure you're doing it both successfully and ethically.

  1. Respect ToS: Some websites state in their ToS not to scrape them. If that's the case, please be mindful of that and consider getting the data from elsewhere.
  2. Limit requests to be mindful of server load: Depending on the scale of what you're scraping, it may cause excessive requests to the website's servers, lowering the performance of the site, which is considered unethical. Consider using rate limits or delays between each request to lighten your footprint; delays also help emulate human behavior and can prevent you from triggering rate limits (see the short sketch after this list).
  3. Store scraped data responsibly: Respect data privacy and security by handling scraped data appropriately. If personal or sensitive information is involved, ensure compliance with applicable laws and regulations.
  4. Monitor and adapt: Websites may change their structure, styling, etc., which may cause your scraper to stop working in the future. If you plan on needing your scraper again, it can be beneficial to run it through a debugger and make sure everything is still properly targeted before running it, so you don't pull unnecessary data or have it fail outright.
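
As mentioned in item 2, one simple way to add delays is to sleep between requests. A minimal sketch (scrape() is a hypothetical stand-in for whatever does the per-URL work, and the delay range is arbitrary):

import random
import time

for url in urls:
    scrape(url)                          # hypothetical per-URL scraping call
    time.sleep(random.uniform(2, 5))     # pause 2-5 seconds between requests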
