Automation of Google Image Scraping using Selenium

Dian Octaviani
7 min read · Aug 27, 2021


Method of Web Scraping: using Selenium

Code is available at: https://github.com/dianoctaviani/selenium_img_scraper

Python + Selenium + Chrome Web Driver

Scraping using Selenium

This blog post explores the method of scraping Google Images using Selenium, combined with Python and the Chrome WebDriver, and provides a practical example of how Selenium can be applied. It also aims to explain the basics of Selenium and how it works.

Background

Photo by Julia Zolotova on Unsplash

For context, we currently have a dataset containing 118 searchable pairs of keywords, consisting of both common and scientific botanical names of edible vegetables and fruits. Using the list of scientific names, we need one visual representation of each vegetable or fruit to build a knowledge base.

Note: The reason we rely on the scientific name instead of the common name is to eliminate the possibility of retrieving images of fruit-based brand names, such as Apple Inc.

For the list of names, I created another scraping script (not included in this tutorial) to retrieve the list of common and scientific botanical names of edible vegetables and fruits.

I’ve compiled the dataset into a .csv and made it available for download within the input folder in the provided GitHub repo as well as on Kaggle: https://www.kaggle.com/dianoctaviani/scientific-botanical-names-of-fruits-vegetables

Note: This project is for learning and research only. The datasets and images are not reusable for commercial purposes. Using a web scraper to harvest data off the Internet is not a criminal act on its own. In many cases it is perfectly legal to scrape a website, but the way you intend to use that data may be illegal.

Basics of Selenium

What is Selenium? Selenium is a tool for controlling web browsers through programs and performing browser automation. It is mainly used as a cross-browser testing framework. However, Selenium is also a very capable tool for general web automation, as we can program it to do what a human user can do in a browser (in this case, to programmatically download images from Google).

For the automation work, we will be using a combination of Selenium, Python and Chrome Web Driver.

Locator Strategies

So how exactly does Selenium work? Well, Selenium provides mechanisms to locate elements on a web page, which can be sequenced together to mimic a user’s interactions with the page.

Selenium has a range of locator strategies; to be precise, there are currently eight different ways to locate elements using Selenium (a minimal usage sketch follows the list).

1. find_element_by_id - ID
2. find_element_by_name - Name
3. find_element_by_link_text - Link Text
4. find_element_by_partial_link_text - Partial Link Text
5. find_element_by_tag_name - Tag Name
6. find_element_by_class_name - Class Name
7. find_element_by_css_selector - CSS (Cascading Style Sheets) Selector
8. find_element_by_xpath - XPath (XML Path)
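For example, here is a minimal sketch using the legacy find_element_by_* API named in the list (Selenium 4 replaces these with driver.find_element(By.NAME, ...)); the target page and element name here are illustrative only:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.google.com")

# Locate the search box on Google's homepage by its name attribute
search_box = driver.find_element_by_name("q")
search_box.send_keys("Malus domestica")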

These elements can be identified using the Inspector feature available in the browser’s Developer Tools.

Browser Developer Tools
Selecting a web element using the Inspector

Once the web element is selected, we can copy the element in various formats, such as XPath, outerHTML, JS Path, etc., and analyse the relevant IDs, names, tags or paths that we need to perform the required actions.

Headless Browser

Before we dive into the code, let’s talk about what a headless browser is and why we need to use it. In short, headless browsers are web browsers without a graphical user interface (GUI) and are usually controlled programmatically or via a command-line interface (CLI).

Automating Chrome with the GUI enabled can increase CPU and/or memory usage; both costs come from having to render and display the page at the requested URL, and they grow further when multiple windows or tabs are launched simultaneously.

Therefore, headless browsers are useful because they eliminate the need to launch a GUI each time the automation script runs.
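For illustration, headless mode is enabled through Chrome options; this is a standard snippet rather than the article’s exact code:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")     # run Chrome without a GUI
options.add_argument("--disable-gpu")  # commonly paired with headless mode
driver = webdriver.Chrome(options=options)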

Method of Automation

Setup

Code is available at: https://github.com/dianoctaviani/selenium_img_scraper

Now, let’s set up the necessary libraries required for the script.

Install Selenium using PyPi:

$ pip3 install selenium

Install the WebDriver Manager using PyPi:

$ pip3 install webdriver_manager

Sequence of Action

To obtain an image of each fruit or vegetable, we will create a script that iterates through the list of names, searches each on Google, and downloads the first image that appears in each search result (the images retain their original size and resolution).

I’ve separated the actions summarised above into two separate .py scripts:

1. selenium_img_src_crawler.py
2. selenium_img_downloader.py

This separation keeps each script less complicated for new starters, and it also allows better customisation around bulk search input.

The first script (selenium_img_src_crawler.py) searches for and retrieves image source links in bulk and outputs them to img_src_links.csv, while the second script later processes those links and performs the bulk downloads.

Here’s the step-by-step explanation of what’s included in the first script.

1. Importing Essentials: The following modules are required for the first script.
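A plausible set of imports, inferred from the steps that follow (the repo’s script may differ slightly):

import time

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager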

2. Read and store the list of fruit and vegetable names from the scientific_names column of a file called input/scientific_botanical_names_veggies_fruits.csv (see the sketch after the note below).

Note: The search terms can be customised; simply modify the file to include your own list of search terms and change the column names accordingly.
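A sketch of this step, assuming pandas and the column name described above:

df = pd.read_csv("input/scientific_botanical_names_veggies_fruits.csv")
scientific_names = df["scientific_names"].tolist()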

3. ChromeDriverManager installs the web driver for Chrome, skipping the install if an existing driver is found in the cache. A function called search_google iteratively searches the list of scientific names on Google Images by dynamically assigning each keyword to the search_url.

Also within the function, several options are enabled, such as disable-gpu and headless browsing.
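A sketch of the function’s setup; the search URL template and driver construction follow common 2021-era usage and are assumptions rather than the repo’s exact code:

def search_google(keyword):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

    # tbm=isch restricts the search to Google Images
    search_url = "https://www.google.com/search?q={}&tbm=isch".format(keyword)
    driver.get(search_url)
    # ... clicking and src extraction follow in steps 4 and 5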

4. Expanding on the same search_google function: it clicks on the first image box that appears in the search result by locating the element through its XPath, as sketched below.

Selecting the element using the Inspector and choosing the Copy XPath option
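A sketch of the click; the XPath below is an example copied via DevTools at the time of writing, and Google’s markup changes frequently:

# Click the first thumbnail in the results grid (example XPath)
first_thumbnail = driver.find_element_by_xpath(
    '//*[@id="islrg"]/div[1]/div[1]/a[1]/div[1]/img')
first_thumbnail.click()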

5. The next snippet within the search_google function retrieves the link to the original image source by reading the attribute containing the source (src).

Base64 encoded img src content

This is where it gets a bit challenging. By default, within the thumbnail boxes, Google does not store the image src attribute as a plain URL. It is instead stored as a base64-encoded data URI (e.g. it starts with data:image/jpeg;base64,…), but after many tests, I have found a workaround.

The workaround is to click on the image box and save the src link from the full-size image shown in the side panel, rather than from the thumbnail. We must add a time.sleep() call to allow sufficient time for this action, and this mostly returns direct URL links to the image sources.

Here’s a sketch of that part of the code, assuming 2021-era Google Images markup (the side-panel class name is an example and will change as Google updates its page):
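first_thumbnail.click()
time.sleep(2)  # give the side panel time to load the full-size image

# The side-panel <img> class name is an example and changes over time
side_image = driver.find_element_by_xpath('//img[@class="n3VNCb"]')
src = side_image.get_attribute("src")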

Now we have a list of direct URLs from which to download the images in their original resolution. However, a small percentage of the results will still return src values with base64 encoding (as shown in the image below). We will use a try/except block to decode these later in the second script.

6. The next snippet calls the search_google function inside a for loop and appends the searched keywords and image source URLs to a file called img_src_links.csv located in output/links (a sketch follows the list):

  • using pipe (|) as the delimiter
  • replacing whitespace with underscores (_) in the search terms
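A sketch of the loop, assuming search_google returns the extracted src:

with open("output/links/img_src_links.csv", "a") as f:
    for name in scientific_names:
        src = search_google(name)
        keyword = name.replace(" ", "_")          # whitespace -> underscore
        f.write("{}|{}\n".format(keyword, src))   # pipe-delimited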

The complete first script (selenium_img_src_crawler.py) is available in the GitHub repo linked above.

Now that we have the list of image source links, we can process them and download the series of images in the second script (selenium_img_downloader.py).

1. Importing Essentials: The following modules are required for the second script.
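A plausible set of imports for the second script, inferred from the steps below:

import base64
import logging
import socket
import urllib.request

import pandas as pd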

2. The following snippet reads the list of image src values retrieved by the previous script.
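A sketch, assuming the pipe-delimited layout produced by the first script (the column names are illustrative):

img_src = pd.read_csv("output/links/img_src_links.csv",
                      sep="|", names=["keyword", "src"])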

3. The following function, check_for_b64, processes image src content that contains base64 encoding.
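A minimal sketch of check_for_b64; the exact signature in the repo may differ:

def check_for_b64(src):
    """Return decoded image bytes if src is a base64 data URI, else None."""
    if src.startswith("data:image"):
        _, encoded = src.split(",", 1)
        return base64.b64decode(encoded)
    return None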

4. I defined the socket default timeout to handle and terminate any long-running requests.
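For example (the 15-second value is an assumption):

socket.setdefaulttimeout(15)  # abort any request that hangs longer than this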

5. The download_img() function processes the requests and performs the downloads. It iterates through the list of src values from img_src and identifies whether each is a base64 src or a direct URL. If it is a base64-encoded value, it decodes it; otherwise it performs a web request to retrieve the image via the image source URL.

It also dynamically assigns the name of the output file using the search term and saves the image with a .png extension.

Some 403 Forbidden errors may occur among the invoked web requests; these are due to unauthorised requests and are unavoidable. The try/except block skips any failed requests and logs the errors to logging/logging.log.
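A combined sketch of the behaviour described in this step; the file paths follow the article, while the function body itself is an assumption:

logging.basicConfig(filename="logging/logging.log", level=logging.ERROR)

def download_img(keyword, src):
    out_path = "output/images/{}.png".format(keyword)
    try:
        decoded = check_for_b64(src)
        if decoded is not None:
            # base64-encoded src: write the decoded bytes directly
            with open(out_path, "wb") as f:
                f.write(decoded)
        else:
            # direct URL: fetch the image over HTTP
            urllib.request.urlretrieve(src, out_path)
    except Exception as err:
        # 403 Forbidden errors and timeouts land here; skip and log
        logging.error("Failed to download %s: %s", src, err)

# Example usage over the crawled links
for row in img_src.itertuples():
    download_img(row.keyword, row.src)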

6. Check the output/images path to view the list of downloaded images, and we are done!

The complete second script (selenium_img_downloader.py) is available in the GitHub repo.

Thank You for reading!

Code is available at: https://github.com/dianoctaviani/selenium_img_scraper

Just clone the repo and install the required modules; it should run straight away. Feel free to customise the input dataset and modify the file paths to suit your needs! :)
