How did I solve the issue of infinite scroll while scraping a website

Sagar Pandit
3 min read · Mar 19, 2023


I started using Selenium and BeautifulSoup while learning web scraping in Python. My earlier article, https://link.medium.com/rDaPIUo8hyb, covers using BeautifulSoup to scrape a website and extract content from it.

Then I decided to use Selenium to scrape a website, and I started facing issues with a site that had a huge amount of data.

The Problem

The data on the website I needed to scrape loaded slowly and with a delay: each time I scrolled the scroll bar down, the page loaded more of the remaining data, and so on. Whenever I tried to fetch content by class name, CSS selector, or ID, I got a 'NoneType' object is not subscriptable error.

The error happens because, right after you launch the website with the Selenium driver, the data you want to scrape is not yet available. It only becomes available gradually as you scroll down, but by that time your scraping logic has already run, so it fails with the error above and you cannot extract the content.
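To make the failure concrete, here is a minimal, browser-free reproduction. In a real scrape, a lookup such as BeautifulSoup's soup.find(...) returns None when the element has not loaded yet, and subscripting that None raises exactly this error. The variable names below are illustrative, not from the project.

```python
# Stand-in for the result of a failed lookup, e.g. soup.find("div", class_="product")
# on a page where the lazy-loaded content has not appeared yet.
not_yet_loaded = None

try:
    price = not_yet_loaded["data-price"]  # subscripting None fails
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable
```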

The Solution

The solution was to scroll the page after opening the URL through the Chrome driver with Selenium, reading the scroll height in a loop until it stops growing, which means the end of the scroll has been reached and all the data has loaded on the page. Below is the code for it.

import time  # needed for the sleep calls below

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the current bottom of the page.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # 'name' identifies which site is being scraped; it is set earlier
    # in the project. Slower sites get a longer pause to finish loading.
    if name == 'ind':
        time.sleep(4)
    else:
        time.sleep(8)
    # Read the height again; if new content loaded, it will have grown.
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height stopped changing: the whole page has loaded
    last_height = new_height

Here, the idea is to first read scrollHeight via driver.execute_script and save it in the last_height variable, then scroll to that height inside a while loop. Sleep for a few seconds using time.sleep() so the webpage can load content up to that scroll height, then read scrollHeight again and assign it to the new_height variable. Compare the new_height and last_height values: while they differ, continue the while loop. They become equal once the whole page has been scrolled to the end, and at that point all the content is available to scrape.
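The termination logic of this loop can be checked without a browser. The sketch below substitutes a fake driver whose scrollHeight grows a few times and then stabilizes, mimicking a lazy-loading page; the FakeDriver class and its height values are made up for illustration, and only the loop itself mirrors the code above.

```python
import time

class FakeDriver:
    """Stand-in for a Selenium driver: scrollHeight grows as content loads."""
    def __init__(self, heights):
        self._heights = heights  # successive scrollHeight readings
        self._i = 0

    def execute_script(self, script):
        if script == "return document.body.scrollHeight":
            h = self._heights[min(self._i, len(self._heights) - 1)]
            self._i += 1
            return h
        # "window.scrollTo(...)" returns nothing, like the real driver
        return None

def scroll_to_bottom(driver, pause=0.01):
    """Scroll until document.body.scrollHeight stops changing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared: we reached the real bottom
        last_height = new_height
    return last_height

print(scroll_to_bottom(FakeDriver([1000, 2400, 3800, 5200, 5200])))  # 5200
```

The same function works unchanged with a real Selenium driver; in practice the pause would be a few seconds rather than milliseconds, as in the code above.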

This way, all the data you need becomes available, and you can continue using Selenium or other libraries to scrape it.

The project I created uses Selenium to scrape a website called TNT Supermarket, collecting each product's title, weight, price, brand, and image URL, and then uses Selenium again to open a Google Form and populate it with that data.

The whole code is available at https://github.com/sagar160589/python-selenium-infinite-scroll-scraping. It also includes a screen-capture video of the data being submitted to the Google Form.

That’s all, folks. In this article I wanted to describe the issue I faced while scraping an infinite-scroll webpage and how I solved it.


Sagar Pandit

An experienced IT professional working with Java, Spring Boot, web services, microservices, AWS, Python, Docker & Kubernetes.