Scraping a dynamic website using Python and Selenium.

Mirankadri
5 min read · Dec 15, 2022


Introduction

Some time back I was given a task where I needed to collect data from a Spanish hotel-booking website. We have a few options when we come across this kind of task: Beautiful Soup, Scrapy, and Selenium. Beautiful Soup does not work well with dynamic websites, and I don't prefer the way a Scrapy script works. Unlike Scrapy and other methods, where you start plugging in links right away, in Selenium you first have to get a WebDriver for the browser you want to use for the task.

Setting system up for the project

The most important step before initiating any project is to get your system prepared for it. Here are the steps to follow for the project setup:

  1. Python: You can download Python from here. Also make sure you add Python to PATH when installing it.
  2. Create a virtual environment in your project directory. You can create one with python -m venv [name you want to give to your environment].
  3. Activate your virtual environment. Assuming you are in the project directory by now, [environment name]\Scripts\activate will activate it on Windows.
  4. Now install Selenium in this environment. pip install selenium==<required version> will install the version you need. I used version 3 for this project.
  5. Get a ChromeDriver matching the version of Chrome you are using. You can get your driver from here.

Now your system is set up for the project. It is also advisable to follow this directory structure:

Structure of the project directory.

Extracted data goes inside the “Data” directory, and the driver we downloaded goes inside the “driver” directory.

Now, inside the main directory, create a Python file where you will write the script that scrapes the data from the website.

We will start by importing necessary libraries:

Importing libraries needed.

To initiate the script, we first create a driver object using selenium.webdriver in the following way:

driver = webdriver.Chrome(executable_path='driver/chromedriver.exe')

The site we want to collect data from is Rentalia, and it can be found at “https://es.rentalia.com/”.

We first make the driver access the website with the command driver.get('https://es.rentalia.com/').

Now, once the driver is on the website, it can access all the HTML elements present on the webpage. You can refer to this link to learn about the functions we can use to access elements on a page: https://iqss.github.io/dss-webscrape/finding-web-elements.html.

You should be familiar with the element selector functions in Selenium WebDriver by now. The next step is to get references to the elements you want to extract data from. For that, open the developer tools in your browser: press F12, or right-click the element you want to select and choose Inspect, which takes you to its HTML code. From that code you can get the class name, tag, id, or the XPath; the XPath can be copied from the menu that appears when you right-click the HTML element in the inspector.

Here is how you would get the XPath: in the developer tools, right-click the highlighted HTML element and pick the copy-XPath option from the menu that appears.

Starting with the process:

The first and most important point is to open the website in the browser you will make your WebDriver work with, and start observing the processes.

For example, there is a button to accept cookies the moment the page loads. You will only be able to interact with the content inside if you have clicked that button.

You can see the block at the bottom asking you to agree to the terms.

Once you are through it, you will have to fill in the inputs to get results for the area you need data for. The results page has a list of hotels with information on the rooms: price per night, location, a brief description, etc. One thing that is not on that page is the contact number you can use to inquire about a room, so we will have to go to the details page of each hotel room to get the number.

Listings page with the basic information on rooms.
This you will find on the details page.

I have divided the script into the following steps:

  • Accessing the button to accept the terms to get to the webpage. We use XPath to access the element; once we have the element object, we can click it simply by calling its click() method.

```python
# clicking the accept button
cookies_btn = driver.find_element('xpath', '//*[@id="didomi-host"]/div/div/div/div//div[2]/button[2]')
cookies_btn.click()
```
  • Filling the input with the area we would like to get the data for. This will take us to the hotel listings. We replicate the manual interaction of typing and submitting the query in the input bar using ActionChains: type the location, press the down arrow to pick the suggestion, and press Enter.

```python
# input_field is the search box element, located earlier with driver.find_element
input_field.send_keys(loc)
ActionChains(driver).send_keys(Keys.DOWN).send_keys(Keys.ENTER).perform()
```
  • Accessing the elements with the information we need and getting the data out. I took the title, location, link, price per night, and phone number, and stored these values in respective variables.

Extracting information.
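The extraction code itself was a screenshot that did not survive this export. Here is a sketch of what the per-listing extraction might look like; the class names used as locators and the field names are my assumptions, not the article's actual selectors:

```python
def extract_listing(card):
    """Pull the basic fields from one listing element.

    The locator values ('listing-title', etc.) are illustrative placeholders.
    """
    title = card.find_element('class name', 'listing-title').text
    location = card.find_element('class name', 'listing-location').text
    link = card.find_element('tag name', 'a').get_attribute('href')
    price = card.find_element('class name', 'price-per-night').text
    return {'title': title, 'location': location, 'link': link, 'price': price}
```

The phone number is not on the listings page, so collecting it would take a second driver.get to the link gathered here, as described earlier.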

The values we obtained were for only one listing on the page; to get values for every listing, we need to loop over all the listed options.
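The loop was also shown as a screenshot. In outline it could look like the sketch below; find_elements (plural) returns a list of matching elements, and the 'listing-card' locator plus the exact CSV columns are my assumptions:

```python
import csv

def scrape_to_csv(driver, out_path='Data/listings.csv'):
    """Loop over every listing card on the results page and write one CSV row per card.

    The class names used as locators are illustrative, not the site's real ones.
    """
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'location', 'link', 'price'])  # header row
        for card in driver.find_elements('class name', 'listing-card'):
            writer.writerow([
                card.find_element('class name', 'listing-title').text,
                card.find_element('class name', 'listing-location').text,
                card.find_element('tag name', 'a').get_attribute('href'),
                card.find_element('class name', 'price-per-night').text,
            ])
```

Passing the driver in keeps the function easy to test, and the Data/ path matches the directory structure suggested earlier.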

To get the data for each listing, I placed the loop inside a write block: first generate a table containing only the headers corresponding to our variables, and then, as the loop runs, fill the rows with data from the placeholders we defined and used for extraction. There were a few more things I added to make the collection work well and efficiently; you can find the finished code here. For example, I made the script take inputs from the user as arguments. You will find all the required information in that repository, but before getting the code from there, I would like you to implement this on your own.

Also, there is no standard rule for getting the data into a CSV file, so structure it however you find most useful.

I hope this was helpful!

Mirankadri

A data science developer aiming to build a community helping people interested in learning data science, develop their skills to become data scientists.