Published in Geek Culture

Scraping Dynamic Websites Through Remote WebDrivers, Selenium Part-IV

In the first three parts I used Selenium locally for web scraping. This time I am going to use a 'Remote WebDriver' to scrape some data.

Why Remote WebDriver?

Sometimes we need to run our scraper script on a CLI-only server, where Selenium's local webdrivers won't work. In that case we can use a remote webdriver: a webdriver hosted somewhere else, in our case inside a Docker container, that our script talks to over the network.

Let’s Start

All the selectors and the script are the same. The only change is the webdriver's location. Let's have a look at our script employing the remote driver.

The script contains a 'scraper' class with a method 'open_support'. Besides the 'scraper' object itself, 'open_support' takes one argument: the name of a platform. The method then finds the support URL of the platform specified by that name.

What’s New Here?

While scraping through local Selenium we created the driver with 'Chrome' directly. For remote scraping we need 'Remote' as well, so our imports become:

from selenium import webdriver
from selenium.webdriver import Chrome, ChromeOptions, Remote, FirefoxOptions

In the current script, we define the 'driver' in the constructor (the __init__ method) like this:

self.driver = Remote(command_executor='http://localhost:4444/wd/hub', options=options)

The only change we made for web scraping remotely is in defining ‘driver’.

In 'Remote', the argument 'command_executor' is the URL or address of the container that hosts the Selenium server.

Container!!!

Briefly, a container is like a lightweight, isolated PC that may run anywhere on Earth, with no hardware of its own to care about. More on containers, maybe later.

We are done with scripting. Now we need a container that we will access through ‘command_executor’.

Let’s Prepare Container for Remote WebDriver

1- First we need to install 'Docker', a popular container platform. A tutorial about 'How to install Docker in Ubuntu 20.04' can be found here.
2- Pull the Selenium container image for Chrome with

docker pull selenium/standalone-chrome

While pulling the container image, the terminal displays some details about the image being pulled, including a 'tag' that identifies the version of the image ('latest' if not specified).

[Screenshot: pulling a Docker image]

The second line of the pull output,

Using default tag: latest

shows the image's tag as 'latest'. We will use this tag again when starting the container.
3- Start the docker service with

sudo service docker start

4- Start a container from the image with

docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:latest
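Before pointing the scraper at the hub, you can confirm the container is up by polling the Selenium server's status endpoint. This is a small sketch using only the standard library; the URL assumes the port mapping from step 4:

```python
import json
import urllib.request


def hub_ready(url="http://localhost:4444/wd/hub/status", timeout=2):
    """Return True if the Selenium hub reports itself ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except OSError:
        # Connection refused or timed out: the container is not up (yet)
        return False
    return bool(payload.get("value", {}).get("ready"))
```

Calling 'hub_ready()' right after 'docker run' may return False for a few seconds while the browser inside the container starts.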

You may replace 'latest' with the tag of your image.
Detailed instructions about selenium containers can be found here.

Let’s Run the Scraper

5- Run the scraper:

> python scraper.py

6- When the scraper is done, it's better to stop the container. You can either stop the container first and then the Docker service:

docker container ls

Copy the container ID from the output, then run

docker stop container-id  

Replace container-id with the ID you just copied, then stop the Docker service:

sudo service docker stop # stop docker service

Or stop the Docker service directly:

sudo service docker stop

That is it for today: we used a remote WebDriver for scraping websites.

If you have not read them yet, you may like to read Scraping A Dynamic Website, Selenium Part-I, Part-II and Part-III.

Happy Scraping!

Irfan Ahmad

A freelance web scraper, enthusiast data scientist, and an independent Bioinformatics researcher