Over time, developers have learned to overcome the challenges that JavaScript poses when scraping websites. But that is only one side of the story. As scrapers have become more capable, website developers have also come up with new strategies to counter them.

This time my challenge was to scrape a Hungarian website. Though the language was not a big challenge, the website itself was difficult to scrape data from. Here is the website I was scraping.

What was my goal, what were the challenges, and how did I tackle them?


My goal was to get the products’ details and images from the website, from almost all of its menus.


I ran into three challenges:

1- An accept-cookies alert.

2- The interactive behavior of the menus: to get the list of all menu options I had to click a main menu option, then its sub-menus, then their sub-sub-menus, and so on. This had to be repeated for every main menu option.

3- Session expiration: To scrape the products’ info I first collected all the products’ links and then scraped the info for each link. But, because of the large number of links and a slow network, the session used to expire after some iterations.


Here is how I tackled each of them.

1- Cookies alert: This was easy to handle. The lines of code below find the cookie alert and accept it.

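from selenium.webdriver.common.by import By

def open_url(self, url) -> str:
    # self.driver = Chrome()
    self.driver.get(url)  # load the page before looking for the alert
    # cookie handler: click the "accept" button of the cookie alert
    self.driver.find_element(By.ID, 'onetrust-accept-btn-handler').click()
    return self.driver.current_url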
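
If the cookie banner is rendered a moment after the page loads, a single lookup can fail before the button exists. Here is a minimal sketch of one way around that, assuming a hypothetical helper named click_when_present (it is not part of the Support class): keep retrying the lookup until the button can be clicked or a timeout passes. Selenium's own WebDriverWait together with expected_conditions.element_to_be_clickable does the same job; the hand-rolled version below only makes the idea visible.

```python
import time


def click_when_present(find_element, timeout=10.0, poll=0.5):
    """Retry find_element().click() until it succeeds or the timeout passes.

    find_element is a hypothetical zero-argument callable that returns the
    button or raises while the button is not there yet.
    Returns True if the click happened, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            find_element().click()
            return True
        except Exception:
            # Broad on purpose for this sketch: "not found yet" and
            # "found but not clickable yet" are both treated as "retry".
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
```

With Selenium this could be called as click_when_present(lambda: self.driver.find_element(By.ID, 'onetrust-accept-btn-handler')). Returning False instead of raising lets the scraper carry on when the banner never shows up, for example because the cookies were already accepted.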

2- Interactive menus: To tackle this issue I implemented several functions and called them in a while loop. Here is the detailed solution.

class Ryobi(Support):
    def __init__(self) -> None:
        self.ryo_ini = 0
        self.cat_ini = 1
        self.sub_cat_ini = 10
        self.sub_sub_cat_ini = 0
        self.cat_len = 0
        self.sub_cat_len = 0
        self.sub_sub_cat_len = 0
        self.link_counter = 0

The scraper for this site is a subclass of my Support class, which I once created to scrape the support links of websites, so I did not have to make many changes.

This constructor contains the definitions that form the basis of my script.

self.cat_ini => index counter for main categories
self.sub_cat_ini => index counter for sub-categories
self.sub_sub_cat_ini => index counter for sub-sub-categories
self.cat_len => total length counter for main categories
self.sub_cat_len => total length counter for sub-categories
self.sub_sub_cat_len => total length counter for sub-sub-categories
self.link_counter => counter for links

Here is why these variables matter.

self.cat_ini is initially set to 1 so that the loop starts from the second main category (index 1), because the first entry was an irrelevant link. It stays constant until the product links of all of its sub-categories and sub-sub-categories have been scraped, and then increases by 1. It keeps increasing until it reaches the total number of main categories.

self.sub_cat_ini, the index counter for sub-categories, stays constant until the product links of all of its sub-sub-categories have been scraped, and then increases by 1. Once all sub-categories of the current main category have been scraped, it is reset to 0.

self.sub_sub_cat_ini, the index counter for sub-sub-categories, simply increases by 1 on each pass while the counters of its parent sub-category and main category stay the same.

self.cat_len is the total number of main categories. It is the upper limit for self.cat_ini: when self.cat_ini becomes equal to self.cat_len, the loop breaks and the process ends.

self.sub_cat_len is the total number of sub-categories of a main category. When self.sub_cat_ini becomes equal to self.sub_cat_len, self.cat_ini is incremented by 1.

if self.sub_cat_ini == self.sub_cat_len:
    self.sub_cat_ini = 0
    self.sub_cat_len = 0
    self.cat_ini += 1

self.sub_sub_cat_len is the total number of sub-sub-categories for a sub-category of a main category. self.cat_ini and self.sub_cat_ini remain constant until self.sub_sub_cat_ini becomes equal to self.sub_sub_cat_len.

if self.sub_sub_cat_ini == self.sub_sub_cat_len:
    self.sub_sub_cat_ini = 0
    self.sub_sub_cat_len = 0
    self.sub_cat_ini += 1

Here is the full implementation as pseudo code.

while True:
    get main categories
    if self.cat_len == 0:
        self.cat_len = length(main categories)
    categories[self.cat_ini].click()  # click on the particular main category
    get sub-categories
    if self.sub_cat_len == 0:
        self.sub_cat_len = length(sub_categories)
    sub_cats[self.sub_cat_ini].click()  # click open a sub-category

    # try to find sub-sub-categories of a sub-category
    sub_sub_cats = self.driver.find_elements(By.CLASS_NAME, 'CategoryDropdownstyles__Link-kkzqgd-9')
    if length(sub_sub_cats) > 0 and self.sub_sub_cat_len == 0:
        self.sub_sub_cat_len = length(sub_sub_cats)
    if length(sub_sub_cats) > 0:
        print(f'now on sub sub category: {self.sub_sub_cat_ini}')
        self.sub_sub_cat_ini += 1

    products = get_products from the current link, that is either from sub-categories or from sub-sub-categories
    i = 1
    for link in products:
        if self.link_counter > 0:
            print(f'saving record number: {i}, in the dataset for the link: ', link)
        ryo = os.path.join(cur_dir, 'scraped_data/')
        get_details_of_the_product(link, ryo, which='ryobi')
        i += 1  # keep the record number in step with the links

    # advance the counters, innermost level first
    if self.sub_sub_cat_ini == self.sub_sub_cat_len:
        self.sub_sub_cat_ini = 0
        self.sub_sub_cat_len = 0
        self.sub_cat_ini += 1
    if self.sub_cat_ini == self.sub_cat_len:
        self.sub_cat_ini = 0
        self.sub_cat_len = 0
        self.cat_ini += 1
    if self.cat_ini == self.cat_len:
        break  # every main category has been scraped
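
The counter logic can be checked without a browser. The sketch below replays the same three-counter loop over a nested list that stands in for the menus. The function name walk_menu and the sample menu are made up for illustration; unlike the real scraper, the sketch starts at index 0, and it assumes every main category has at least one sub-category.

```python
def walk_menu(menu):
    """Yield (category, sub_category, sub_sub_category) index triples in the
    order the counter-based loop visits them.

    menu is a hypothetical stand-in for the site: a list of main categories,
    each a list of sub-categories, each a (possibly empty) list of
    sub-sub-category names. The third index is None when a sub-category has
    no sub-sub-categories.
    """
    if not menu:
        return
    cat_ini = sub_cat_ini = sub_sub_cat_ini = 0
    cat_len = sub_cat_len = sub_sub_cat_len = 0
    while True:
        if cat_len == 0:
            cat_len = len(menu)
        sub_cats = menu[cat_ini]
        if sub_cat_len == 0:
            sub_cat_len = len(sub_cats)
        sub_sub_cats = sub_cats[sub_cat_ini]
        if sub_sub_cats and sub_sub_cat_len == 0:
            sub_sub_cat_len = len(sub_sub_cats)
        if sub_sub_cats:
            yield (cat_ini, sub_cat_ini, sub_sub_cat_ini)
            sub_sub_cat_ini += 1
        else:
            yield (cat_ini, sub_cat_ini, None)
        # Advance the counters, innermost level first.
        if sub_sub_cat_ini == sub_sub_cat_len:
            sub_sub_cat_ini = 0
            sub_sub_cat_len = 0
            sub_cat_ini += 1
        if sub_cat_ini == sub_cat_len:
            sub_cat_ini = 0
            sub_cat_len = 0
            cat_ini += 1
        if cat_ini == cat_len:
            break


# Two main categories; the second sub-category of the first one is a leaf.
sample_menu = [[["a", "b"], []], [["c"]]]
for visit in walk_menu(sample_menu):
    print(visit)
```

Because all of the state lives in a handful of integers, the loop can re-read the menus from the page on every pass and still know exactly where it left off.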

3- Session expiration: The third problem is dealt with in this snippet:
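
As an illustration only (a generic sketch, not the exact code from the repository), one common way to survive an expired session is to catch the error, start a fresh driver, and retry the same link. The names scrape_with_retry, make_driver and scrape_link are hypothetical, and which exception signals an expired session is an assumption; with Selenium, InvalidSessionIdException from selenium.common.exceptions is a likely candidate to pass in.

```python
def scrape_with_retry(links, make_driver, scrape_link,
                      session_errors=(Exception,), max_retries=3):
    """Scrape every link, recreating the driver when the session dies.

    make_driver() returns a fresh driver and scrape_link(driver, link)
    returns the scraped record; both are hypothetical callables supplied
    by the caller. session_errors lists the exception types that should
    trigger a new session.
    """
    driver = make_driver()
    results = {}
    for link in links:
        for attempt in range(1, max_retries + 1):
            try:
                results[link] = scrape_link(driver, link)
                break
            except session_errors:
                if attempt == max_retries:
                    raise  # give up on this link after the last attempt
                # The old session is gone: open a new one and retry the link.
                driver = make_driver()
    return results
```

Retrying per link means the links that were already scraped are never fetched again, which matters when the list is long and the network is slow.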


So, this is how I arrived at the solution. Here is the link to my repository, which contains my solution. I used several approaches for this project. The repository also contains the setup for using Selenium Hub through Docker, which was my priority, as well as a setup for standalone Selenium.

