Selenium logo

Lessons Learnt While Using Selenium For Scraping

Manjunath Hegde
Published in CivicDataLab
8 min read · Mar 8, 2023

Here is a TL;DR version of the article.

  1. Use Selenium only for JS-enabled sites. For static HTML pages, BeautifulSoup or Scrapy is enough.
  2. Use web drivers in headless mode to save RAM.
  3. Avoid using multiple driver instances when the site is paginated.
  4. Keep mouse movements on the driver to the minimum extent possible.
  5. Avoid using absolute XPaths.
  6. Use WebDriverWait instead of time.sleep() if you need to wait for the page to load.
  7. Try to implement a retry mechanism in the scraper for seamless scraping.

Web scraping has become a go-to method for data enthusiasts to source data from the web. With the amount of data on the web growing every day, it is important to know which tools to use for scraping it. This blog covers some of the lessons I learnt while using one such popular scraping tool, Selenium. The list will keep growing as new challenges come up, but for a beginner it should be enough to get started with Selenium.

1. Why Selenium

The first question to ask before choosing Selenium (or any tool, for that matter) is: why Selenium? If the website you want to crawl is static (i.e. built with plain HTML and CSS), Selenium is the worst choice; BeautifulSoup (bs4) or Scrapy can do the job efficiently. In fact, Selenium is not made for scraping at all, it is a testing tool. It is a jugaad used by engineers to scrape JavaScript-enabled sites, because Scrapy/bs4 cannot see content that the site renders with JS. So choose Selenium only if the site uses JS to load its content. How to test that? Simple: use a tool like Postman to make a request to the site URL and check whether the HTML in the response matches what you see in your browser's dev tools. If not, get Selenium. Let's consider an example where we fetch datasets from https://data.gov.in/.

Response of a js-enabled webpage in a browser
https://data.gov.in — devtools

In the screenshot above you can see the elements highlighted in devtools. But when we hit the same URL with Postman and search for 'Road Accidents in India 2019' in the response, we see that although the URL returned a response (status code 200), it is not what we need.

response of a js-enabled webpage in postman
https://data.gov.in — response in Postman

In such cases, Scrapy/BeautifulSoup is of no help, and we can consider using Selenium.
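
The same check can be scripted instead of done manually in Postman. Here is a minimal sketch: it fetches the page without a JS engine and looks for a piece of text that is visible in the browser (the dataset title from the example above); if the text is missing, the page is rendered client-side and Selenium is the better fit.

import requests

URL = "https://data.gov.in/"
resp = requests.get(URL, timeout=30)

# The title is visible in the browser, so if it is absent from the raw HTML,
# the content is rendered by JavaScript and requests/Scrapy/bs4 won't see it.
if "Road Accidents in India 2019" in resp.text:
    print("Static enough: BeautifulSoup/Scrapy will do")
else:
    print(f"Got status {resp.status_code}, but the content is JS-rendered: use Selenium")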

2. Choosing a driver

Once you choose Selenium to scrape data, you need a driver (a browser). Selenium supports almost all the major browsers: Chrome (ChromeDriver), Firefox (GeckoDriver), Safari, etc. (here is the list). Use any of these drivers in headless mode. In headless mode the browser's UI is not loaded, which reduces the load on the machine; remember that Chrome is notorious for using a lot of RAM. It is also worth noting that when the scraper is deployed on a server, say an EC2 instance, running it headless is essentially mandatory. Hence the lesson:

Use any of the above-stated drivers in headless mode.
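
A minimal sketch of starting headless Chrome with Selenium 4 (the exact flag name and options API may vary slightly across Selenium and Chrome versions):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # use "--headless" on older Chrome builds
options.add_argument("--disable-gpu")    # often recommended on servers/CI machines
driver = webdriver.Chrome(options=options)

driver.get("https://data.gov.in/")
print(driver.title)   # the page loads fully, just without a visible window
driver.quit()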

3. Handling multiple pages

In most cases, the data is spread across multi-level pages [Main page -> Sub page -> some more pages…], and you have to navigate within the site to fetch it. Let's consider another example using data.gov.in¹. If we want to fetch all the resources from the first two catalogs listed here, we need to click on the first catalog.

selecting a catalog in data.gov.in
Selecting a catalog

Now we see that the selected catalog is paginated. In order to mine the required data points we need to navigate through all these pages to retrieve datasets.

Scenario depicting pagination in data.gov.in
Paginated catalog

After mining all the data from the first catalog, we need to return to our main page and repeat the same process for the second catalog.

Navigation, in Selenium's context, is nothing but getting (driver.get) a URL. Don't fall prey to using a separate driver for each new page. The temptation is to keep a sub-driver for the sub-level pages while the main driver keeps track of the main page. Try to find a way to finish the job with a single driver; multiple driver instances consume more resources and hurt performance.

I have noticed on some sites, for example data.gov.in, that clicking the next-page button, i.e. driver.execute_script("arguments[0].click();", next_page_btn), navigates to the next page but the XPaths of the elements we need still don't resolve. A refresh works in some cases, but driver.refresh() costs time. What usually helps here is a change in the view (UI): if there's a button on the page to switch from grid view to list view, clicking it makes the XPaths available much faster than a refresh does. Thus, we arrive at our next lesson:

Avoid using multiple driver instances. Try to get your work done with a single driver.
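
One way to manage this with a single driver is to collect the catalog URLs from the listing page first, then visit them one by one with the same driver. A minimal sketch, assuming an already-created driver and hypothetical selectors (the class name below is not taken from data.gov.in's actual markup):

from selenium.webdriver.common.by import By

# Collect the catalog links up front so no second driver is needed to hold the listing page.
catalog_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.XPATH, "//a[contains(@class, 'catalog-link')]")
]

for link in catalog_links:
    driver.get(link)   # reuse the one driver for every sub-page
    # ... walk through this catalog's paginated pages and scrape the datasets ...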

4. Mouse movements

At times we may need to scroll the page to bring an element into the viewport, or hover over an element to make a pop-up visible and fetch data from it. All of these count as mouse movements, and they usually add a lot to the overall time Selenium takes to fetch data. Simulating scrolls and mouse movements should be the last resort; try to get the data from the DOM itself.

Grid view of a page in data.gov.in which causes the paragraph to be truncated
Grid view of the page and truncated paragraph content

In the screenshot above, the view is set to grid and the details are shown in a pop-up that appears on hovering over the details section. Checking the devtools shows that in this view even the paragraph text is truncated. Hovering over each card is an option, but it costs a lot of time. Just by changing the view to list, the whole paragraph becomes available in the elements panel itself.

List view of a page in data.gov.in and complete paragraph
List view of the page and complete paragraph available in the devtools.

Hence it is better to explore the site thoroughly before starting to scrape, as it can save a lot of time.

Try to keep mouse movement minimal.
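
As a sketch of the idea: a single click on the view-toggle button replaces thousands of per-card hover actions (the selectors below are hypothetical, not the site's actual markup):

from selenium.webdriver.common.by import By

# One click to switch from grid to list view...
list_view_btn = driver.find_element(By.XPATH, "//button[@title='List view']")
driver.execute_script("arguments[0].click();", list_view_btn)

# ...and the full description text now sits in the DOM, no per-card hovering (ActionChains) needed.
descriptions = [
    card.text
    for card in driver.find_elements(By.XPATH, "//div[contains(@class, 'description')]")
]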

5. Picking the right selectors

Be careful while choosing selectors. See if you can find an element by its id attribute, since ids are unique and the lookup is quicker.

Try avoiding absolute XPaths.

Let’s take the example of data.gov.in again.

Finding XPath of a catalog-card on data.gov.in.

The absolute XPath to a catalog card would be as follows.

Absolute xpath for a catalog-card in data.gov.in
Absolute xpath to the card

Whereas the relative XPath would look like this:

Relative xpath for a catalog card of data.gov.in
Relative xpath to the card

As we can see, relative XPaths are more readable. Another advantage of relative XPaths over absolute ones is resilience: if there is a small change in the DOM structure at a higher level, an absolute XPath has to be rewritten entirely. So if the structure of the document is likely to change, or if we want a more adaptable solution, a relative XPath expression is the better choice.
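
A small illustration of the difference (the ids, class names and paths below are hypothetical examples, not data.gov.in's real attributes):

from selenium.webdriver.common.by import By

# Best case: a unique, stable id.
title = driver.find_element(By.ID, "dataset-title")

# Good: a relative XPath anchored on a meaningful attribute.
card = driver.find_element(By.XPATH, "//div[contains(@class, 'catalog-card')]")

# Brittle: an absolute XPath breaks as soon as any ancestor div is added or removed.
card = driver.find_element(By.XPATH, "/html/body/div[1]/div[3]/div[2]/div[1]/div[2]")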

6. Keep your eyes open while you sleep

While scraping in a loop, it sometimes feels necessary to use time.sleep(), because the loop runs faster than the browser (driver) can load the page. The same thing can be achieved with an explicit wait. Be watchful when you use time.sleep even for a single second: it sounds small, but if you are scraping 5000 pages in an iteration and sleep for one second after every page, that alone adds up to roughly 83 minutes. Use WebDriverWait(driver_instance, delay).until() (more about it here) so that scraping resumes exactly when the page is ready.

If you need to scrape a lot of pages, prefer WebDriverWait over time.sleep()
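
A minimal sketch of an explicit wait (the XPath is a placeholder):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Returns as soon as the element appears, and raises TimeoutException after 10 seconds at most.
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'dataset-card')]"))
)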

7. Uninterrupted scraping

This tip applies not only to scraping but to all the code we write. We need to make sure the code doesn't end abruptly somewhere, forcing us to run it again from the beginning. try-except is there to help, but sometimes a repeat mechanism is needed for the scraping to succeed. I have observed in some cases (like transient internet issues on the client side) that even when the XPath is right and WebDriverWait is used, the code raises exceptions such as StaleElementReferenceException or TimeoutException. In such cases it is essential to have a retry mechanism to make sure the data is scraped properly. The following code template implements a simple retry mechanism.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

elem_not_found = True
while elem_not_found:
    try:
        # presence_of_element_located takes a single (By, locator) tuple
        elem = WebDriverWait(driver_instance, 10).until(
            EC.presence_of_element_located((By.XPATH, "xpath_here"))
        )
        # do whatever with the elem
        elem_not_found = False
    except Exception:
        # print logs / set some values etc.
        pass

Small tweaks can be made to the while loop so that it doesn't run forever.
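
For instance, one such tweak is to cap the number of attempts (this reuses the imports from the snippet above; the limit of 3 is an arbitrary choice):

from selenium.common.exceptions import TimeoutException

MAX_RETRIES = 3
for attempt in range(MAX_RETRIES):
    try:
        elem = WebDriverWait(driver_instance, 10).until(
            EC.presence_of_element_located((By.XPATH, "xpath_here"))
        )
        break   # success: stop retrying
    except TimeoutException:
        print(f"Attempt {attempt + 1}/{MAX_RETRIES} failed, retrying...")
else:
    raise RuntimeError("Element not found after all retries")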

Even if you are sure about the selectors, use a retry mechanism so that a scraper that is supposed to run over 1000 pages doesn't fail at, say, page no. 998.

As stated initially, this list can grow with my experience.

Happy scraping!

[1] : The site is updated periodically; the catalogs shown in the given link at the time of writing may not always appear on the same page. The page is used just as an example.
