Scraping a Dynamic Website, Selenium Part-V

Irfan Ahmad · Published in Geek Culture · Jul 17, 2022 · 5 min read

With the passage of time, developers have grown capable enough to overcome the challenges that JavaScript poses on websites. But that is only one end. As scraper developers have learnt and developed a lot about scraping websites, website developers have also developed new strategies to tackle the scrapers.

This time my challenge was to scrape a Hungarian website. Though the language was not a big challenge, the website itself was a challenge to scrape data from. Here is the website I was scraping: https://hu.ryobitools.eu.

What was my goal, what were the challenges, and how did I tackle them?

Goal:

My goal was to get products’ details and images from the website, from almost all the menus.

Challenges:

1- An accept-cookies alert

image showing cookies handler for https://hu.ryobitools.eu

2- The interactive behavior of the menus: to get the list of all menu options I had to click a main menu option, then its sub-menus, then sub-sub-menus, and so on. This had to be repeated for every main menu option.

hu.ryobitools.eu interactive menu

3- Session expiration: to scrape the products’ info I first collected all product links and then scraped each product’s details from its link. But, due to the large number of links and a slow network, the session used to expire after some iterations.

Solutions:

1- Cookies alert: this was easy to handle. The lines of code below just find the cookie alert and accept it.

def open_url(self, url) -> str:
    # self.driver = Chrome()
    time.sleep(0.2)
    self.driver.get(url)
    self.driver.maximize_window()
    time.sleep(.1)
    # dismiss a browser alert if one pops up
    try:
        Alert(self.driver).dismiss()
    except Exception:
        ...
    # cookie handler: accept the OneTrust cookie banner if it is present
    try:
        self.driver.find_element(By.ID, 'onetrust-accept-btn-handler').click()
        time.sleep(random.choice(self.sleeps))
    except Exception:
        ...
    return self.driver.current_url
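
As a side note, the fixed sleeps can be swapped for an explicit wait. Below is a minimal sketch of the same cookie-acceptance step using WebDriverWait; the button ID is the one used above, while the driver setup and the 10-second timeout are my own assumptions:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = Chrome()  # assumption: a plain local Chrome driver
driver.get('https://hu.ryobitools.eu')

try:
    # wait up to 10 seconds for the OneTrust accept button to become clickable
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))
    ).click()
except Exception:
    # no cookie banner appeared within the timeout; carry on
    pass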

2- Interactive menus: to tackle this issue I implemented different functions inside a while loop. Here is the detailed solution.

class Ryobi(Support):
    def __init__(self) -> None:
        super().__init__()
        self.ryo_ini = 0
        self.cat_ini = 1
        self.sub_cat_ini = 10
        self.sub_sub_cat_ini = 0
        self.cat_len = 0
        self.sub_cat_len = 0
        self.sub_sub_cat_len = 0
        self.link_counter = 0

The scraper for this site is an instance of my Support class, which I had once created for scraping websites’ support links, so I did not have to make many changes.

This constructor contains some definitions that form the basis of my script.

self.cat_ini => index counter for main categories
self.sub_cat_ini => index counter for sub-categories
self.sub_sub_cat_ini => index counter for sub-sub-categories
self.cat_len => total length counter for main categories
self.sub_cat_len => total length counter for sub-categories
self.sub_sub_cat_len => total length counter for sub-sub-categories
self.link_counter => counter for links

Here is the importance of these variables.

self.cat_ini is initially set to 1, that is, to start from the 2nd index of the categories found, because the element at the 1st index was some irrelevant link. It remains constant until all of its sub-categories and sub-sub-categories have had their product links scraped, and it increases by 1 when all sub-categories and sub-sub-categories of the current main category are done. It keeps increasing up to the total length of the main categories.

self.sub_cat_ini, the index counter of the sub-category, remains constant until all of its sub-sub-categories have had their product links scraped. When all the sub-sub-categories have been scraped, this indexer is reset to 0.

self.sub_sub_cat_ini, the indexer for sub-sub-categories, just increases by 1 on each pass while its parent sub-category and main-category index counters stay constant.

self.cat_len is the total number of main categories. This is the limit up to which self.cat_ini can increase. When this main index becomes equal to the main categories’ count, the loop breaks and the process ends.

self.sub_cat_len is the total number of sub-categories for a main category. self.cat_ini is incremented by 1 when self.sub_cat_ini becomes equal to self.sub_cat_len:

if self.sub_cat_ini == self.sub_cat_len:
    self.sub_cat_ini = 0
    self.sub_cat_len = 0
    self.cat_ini += 1

self.sub_sub_cat_len is the total number of sub-sub-categories for a sub-category of a main category. self.cat_ini and self.sub_cat_ini remain constant until self.sub_sub_cat_ini becomes equal to self.sub_sub_cat_len.

if self.sub_sub_cat_ini == self.sub_sub_cat_len:
    self.sub_sub_cat_ini = 0
    self.sub_sub_cat_len = 0
    self.sub_cat_ini += 1

Here is the full implementation as pseudo code.

while True:
    get main categories
    if self.cat_len == 0:
        self.cat_len = length(main categories)
    categories[self.cat_ini].click()  # click on the particular main category
    get sub-categories
    if self.sub_cat_len == 0:
        self.sub_cat_len = length(sub_categories)
    sub_cats[self.sub_cat_ini].click()  # click open a sub-category

    # try to find sub-sub-categories of a sub-category
    sub_sub_cats = []
    try:
        sub_sub_cats = self.driver.find_elements(By.CLASS_NAME, 'CategoryDropdownstyles__Link-kkzqgd-9')
    except:
        ...
    if length(sub_sub_cats) > 0 and self.sub_sub_cat_len == 0:
        self.sub_sub_cat_len = length(sub_sub_cats)
    if length(sub_sub_cats) > 0:
        sub_sub_cats[self.sub_sub_cat_ini].click()
        print(f'now on sub sub category: {self.sub_sub_cat_ini}')
        self.sub_sub_cat_ini += 1

    # get products from the current page, that is, either from a sub-category or from a sub-sub-category
    products = get_products(current link)
    i = 1
    for link in products:
        if self.link_counter > 0:
            print(f'saving record number: {i}, in the dataset for the link: ', link)
            ryo = os.path.join(cur_dir, f'scraped_data/hu.ryobytools.eu-sample.xlsx')
            get_details_of_the_product(link, ryo, which='ryobi')
            i += 1

    if self.sub_sub_cat_ini == self.sub_sub_cat_len:
        self.sub_sub_cat_ini = 0
        self.sub_sub_cat_len = 0
        self.sub_cat_ini += 1
    if self.sub_cat_ini == self.sub_cat_len:
        self.sub_cat_ini = 0
        self.sub_cat_len = 0
        self.cat_ini += 1
    if self.cat_ini == self.cat_len:
        break
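
To see the counter bookkeeping in isolation, here is a small, self-contained sketch that walks a hypothetical three-level category tree with the same increment/reset logic. The nested dictionaries and lists are made-up stand-ins for the menus; no Selenium is involved:

# hypothetical menu tree: main category -> sub-categories -> sub-sub-categories
menu = {
    'Power tools': {'Drills': ['Cordless', 'Corded'], 'Saws': ['Circular']},
    'Garden': {'Mowers': ['Electric'], 'Trimmers': []},
}

cat_ini = sub_cat_ini = sub_sub_cat_ini = 0
cat_len = sub_cat_len = sub_sub_cat_len = 0

while True:
    cats = list(menu)
    if cat_len == 0:
        cat_len = len(cats)
    sub_cats = list(menu[cats[cat_ini]])
    if sub_cat_len == 0:
        sub_cat_len = len(sub_cats)
    sub_sub_cats = menu[cats[cat_ini]][sub_cats[sub_cat_ini]]
    if len(sub_sub_cats) > 0 and sub_sub_cat_len == 0:
        sub_sub_cat_len = len(sub_sub_cats)
    if len(sub_sub_cats) > 0:
        # visit one sub-sub-category per pass, like clicking it open in the menu
        print(cats[cat_ini], '>', sub_cats[sub_cat_ini], '>', sub_sub_cats[sub_sub_cat_ini])
        sub_sub_cat_ini += 1
    else:
        print(cats[cat_ini], '>', sub_cats[sub_cat_ini])
    # same reset logic as in the scraper
    if sub_sub_cat_ini == sub_sub_cat_len:
        sub_sub_cat_ini = 0
        sub_sub_cat_len = 0
        sub_cat_ini += 1
    if sub_cat_ini == sub_cat_len:
        sub_cat_ini = 0
        sub_cat_len = 0
        cat_ini += 1
    if cat_ini == cat_len:
        break

Each pass of the loop visits exactly one leaf, which mirrors the scraper: after every navigation the page state is rebuilt from scratch, so only the counters have to survive between iterations.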

3- The third problem, session expiration, is dealt with in this snippet:

try:
    self.driver.get(url)
except Exception:
    # the session has expired; start a new one and retry
    self.driver.start_session({})
    self.driver.maximize_window()
    time.sleep(0.2)
    self.driver.get(url)
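
An alternative, if restarting the session in place does not work, is to rebuild the driver and retry. Here is a minimal sketch, assuming a plain local Chrome driver and a retry limit I picked arbitrarily; the safe_get helper is hypothetical, not part of my scraper:

from selenium.webdriver import Chrome
from selenium.common.exceptions import WebDriverException

def safe_get(driver, url, retries=3):
    # open url, recreating the driver if the session has died; returns the driver in use
    for _ in range(retries):
        try:
            driver.get(url)
            return driver
        except WebDriverException:
            # the old session is gone; throw the driver away and start a fresh one
            try:
                driver.quit()
            except Exception:
                pass
            driver = Chrome()
            driver.maximize_window()
    raise RuntimeError(f'could not open {url} after {retries} attempts')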

So, in this way I got my solution. Here is the link to my repository that contains the solution. I used different approaches in this project. The repository also contains a setup to use Selenium Hub through Docker, which was my preferred option, and it also includes a setup for standalone Selenium.
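
For reference, connecting to a Selenium instance running in Docker usually looks something like the sketch below. The image name and port are the standard ones for the official selenium/standalone-chrome image, but check the repository for the exact setup I used:

# start the container first, for example:
#   docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=options,
)
driver.get('https://hu.ryobitools.eu')
print(driver.title)
driver.quit()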

If you are new here, you can check Part-I, Part-II, Part-III and Part-IV in this series.

Next, I’ll post another article in the series:

Scraping a Dynamic Website, But it’s not Selenium.


Irfan Ahmad

A freelance Python programmer, web developer and web scraper, and a data science and bioinformatics student.