Scraping Dynamic Websites Through Remote WebDrivers, Selenium Part-IV
In the first three parts I used Selenium locally for web scraping. This time I am going to use a ‘Remote WebDriver’ to scrape some data.
Why Remote WebDriver?
Sometimes we need to run our scraper script on a headless CLI server, where Selenium’s local webdrivers and browsers are hard to set up. In that case we can use a remote webdriver: a webdriver hosted somewhere else, for example inside a Docker container, that our script talks to over the network.
All the selectors and the rest of the script stay the same. The only change is the webdriver’s location. Let’s have a look at our script employing the remote driver.
The script contains a ‘scraper’ class with a method ‘open_support’. Besides ‘self’, ‘open_support’ takes one argument: the name of a platform. The method then finds the support URL of the platform specified by that name.
What’s New Here?
While scraping through local Selenium we imported the webdriver classes and defined the driver like this:
from selenium.webdriver import Chrome, ChromeOptions, Remote, FirefoxOptions
options = ChromeOptions()
driver = Chrome(options=options)
In the current script, we defined the ‘driver’ in the constructor method (the __init__ method) like this:
self.driver = Remote(command_executor='http://localhost:4444/wd/hub', options=options)
The only change we made for web scraping remotely is in defining ‘driver’.
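Since this is the only line that differs between the local and remote versions, one handy pattern (my own sketch, not from the article) is to read the address from an environment variable, so the same script can run against any Selenium server. The variable name SELENIUM_REMOTE_URL is a hypothetical choice:

```python
import os


def executor_url():
    # Hypothetical helper: fall back to the local hub used in this article
    # when no environment variable is set.
    return os.environ.get("SELENIUM_REMOTE_URL", "http://localhost:4444/wd/hub")


# In the constructor, the driver definition would then become:
# self.driver = Remote(command_executor=executor_url(), options=options)
```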
In ‘Remote’, the argument ‘command_executor’ is the URL of the “container” that hosts the Selenium server, here one listening on port 4444 of the local machine.
Briefly, a container is like a lightweight, isolated PC that can run anywhere, on your machine or on a server across the world, without any hardware of its own to care about. More on containers, maybe later.
We are done with scripting. Now we need a container that we will access through ‘command_executor’.
Let’s Prepare Container for Remote WebDriver
1- First we need to install ‘Docker’, a popular container platform. A tutorial about ‘How to install docker in Ubuntu 20.04’ can be found here.
2- Pull the Selenium container image for Chrome with
docker pull selenium/standalone-chrome
While pulling the container image, the terminal displays some details about the image being downloaded, including a ‘tag’ that identifies the version of the image, usually ‘latest’ if none is specified. Near the top of the output, the line
Using default tag: latest
shows the tag of our image. We will need this tag again when starting the container.
3- Start the docker service with
sudo service docker start
4- Initiate the container image with
docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:latest
Here -d runs the container in the background, -p 4444:4444 publishes the Selenium server’s port on the host, and --shm-size="2g" gives the browser enough shared memory. You may replace ‘latest’ with the tag of your image.
Detailed instructions about selenium containers can be found here.
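Steps 2–4 can also be written down as a small docker-compose file, so a single command brings the server up. This is just a sketch assuming the same image, port mapping, and shared-memory size as above:

```yaml
version: "3"
services:
  chrome:
    image: selenium/standalone-chrome:latest
    shm_size: "2g"
    ports:
      - "4444:4444"
```

With this file in place, ‘docker-compose up -d’ starts the container and ‘docker-compose down’ stops and removes it.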
Let’s Run the Scraper
5- With the container up and running, run the script:
> python scraper.py
6- When the scraper is done, it’s better to stop the container. You can either stop the container first and then the docker service:
docker container ls
Copy the container ID from the output, then stop that container:
docker stop container-id
Replace container-id with the ID you just copied. Finally, stop the docker service:
sudo service docker stop
OR directly stop the docker service:
sudo service docker stop
This is it for today. We used a remote webdriver, hosted in a Docker container, to scrape websites.