Creating a Machine/Deep Learning Dataset from Google Images using Selenium

Archit Sharma
Published in DeepKlarity
Oct 8, 2020

This blog is part of the Clean or Messy Classifier series. Check out the entire series for a detailed walkthrough of deploying a machine learning classifier into a web app using various frameworks.

Selenium is an open-source tool for automating a web browser. It provides a single interface for writing scripts in multiple languages; these scripts are then executed by the respective browser driver.

Importance of the Dataset in Machine/Deep Learning

The dataset and its quality play the most crucial role in determining how good a Machine/Deep Learning model will turn out to be. No matter how good the algorithm or technique is, it will ultimately fail if quality data is not provided to it. Every type of problem calls for a specific type of dataset. Datasets are broadly divided into these categories: numerical data, categorical data, time-series data, and text data.

Hence it becomes essential to have, and work on, the most relevant dataset when solving ML/DL problems. One of the most difficult datasets to create or maintain is the image-based dataset, as its relevance changes drastically from one problem statement to another. Different platforms and websites can be used to gather relevant images, but they are often insufficient. One of the most accurate and rich sources of image data is keyword-based search in search engines or dedicated image search engines such as Google Images and Bing Images.

In this article we will go through a technique to automate the creation of an image dataset from Google Images using Python with Selenium, driven by a query file. The entire code with dependencies can be found at the end of this article.

Automating Image Dataset creation from Google Images

The steps involved in the process

  1. Creating a CSV file that contains data like keywords, number of images to be downloaded, resolution, etc.
  2. Fetching image URLs for a given keyword
  3. Resizing the images fetched from the URL set to the input resolution.
  4. Downloading the resized images onto the local drive.

Creating a CSV file of queries/keywords

The CSV file contains a list of keywords and other parameters that can be set manually, and differently for each keyword, in order to create a dataset customized to our data needs. The parameters included are as follows (a sketch of reading this file appears after the list):

  • Keyword and Number of Images: The keywords to be used for searching related images and the number of images to be downloaded for a given keyword.
  • Resolution: The resolution, in pixels, to which each scraped image will be set.
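
For illustration, a minimal sketch of reading such a query file with Python's csv module, assuming columns named keyword, num_images, x, and y (the actual column names in the project may differ):

import csv

with open("query.csv", newline="") as f:
    for row in csv.DictReader(f):
        query = row["keyword"]               # search term
        max_images = int(row["num_images"])  # how many images to download
        x, y = int(row["x"]), int(row["y"])  # target resolution in pixels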

Fetching image URLs for a given keyword

This is the phase where we scrape Google Images for URLs matching a given keyword, and for that we will be using Selenium. Selenium requires browser drivers to automate the web browser. In this article, we use Chrome. ChromeDriver can be downloaded from this link.

Defining the Chrome web driver:

from selenium import webdriver

wd = webdriver.Chrome(executable_path=DRIVER_PATH)
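
Optionally, Chrome can be run headless so no browser window opens during scraping; a small sketch using standard Selenium options (an addition, not part of the original post):

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a visible window
wd = webdriver.Chrome(executable_path=DRIVER_PATH, options=chrome_options)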

Searching Google Images for a given query or keyword:

wd.get(search_url.format(q=query))
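
The search_url template itself is not shown above; a commonly used Google Images URL pattern (an assumption on my part, not necessarily the project's exact string) is:

search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"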

Using a CSS selector to get the image elements (note that Google's generated class names such as n3VNCb change over time, so this selector may need updating):

actual_images = wd.find_elements_by_css_selector('img.n3VNCb')

While creating a large dataset, the images loaded on a single page view were not enough, so infinite scrolling was added:

import time

def scroll_to_end(wd):
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(sleep_between_interactions)  # pause; assumed defined elsewhere

Extracting image links from the image elements:

for actual_image in actual_images:
    if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
        image_urls.add(actual_image.get_attribute('src'))
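
Putting the scrolling and extraction together, the fetch loop might look roughly like this (fetch_image_urls and its structure are an illustrative sketch, not the project's exact code):

def fetch_image_urls(query, max_images, wd):
    image_urls = set()
    wd.get(search_url.format(q=query))
    for _ in range(10):  # cap scroll attempts so the loop always terminates
        scroll_to_end(wd)
        for img_el in wd.find_elements_by_css_selector('img.n3VNCb'):
            src = img_el.get_attribute('src')
            if src and 'http' in src:
                image_urls.add(src)
        if len(image_urls) >= max_images:
            break
    return image_urls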

Resizing and downloading the image fetched from the URL

After the image URLs have been fetched, the images are resized to a resolution of our choice, both for homogeneity in the dataset and to match our system's configuration. For this purpose we use Python's Pillow library for image manipulation; the requests library fetches each URL so Pillow can open it.

Opening the image directly from the URL:

import requests
from io import BytesIO
from PIL import Image

response = requests.get(url)
img = Image.open(BytesIO(response.content))

Resizing the image to the target resolution:

newsize = (x, y)  # resolution taken from query.csv
img = img.resize(newsize)

Downloading the file onto the local drive:

img.save(download_path + q + "_" + str(i) + ".jpeg", "JPEG")

Images resized and downloaded for two different keywords in the dataset folder.
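
One caveat worth adding (not from the original post): an image fetched as PNG with an alpha channel will raise an OSError when saved as JPEG, so converting to RGB first is a safe habit:

img = img.convert("RGB")  # JPEG has no alpha channel
img.save(download_path + q + "_" + str(i) + ".jpeg", "JPEG")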

Note: The scraper might occasionally throw an error while scraping and return an empty URL set; a try-except block has been used to handle this. If it happens, try re-running the code or changing the parameters in query.csv.
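
A minimal sketch of what that try-except wrapper might look like (the exact structure in the project may differ):

try:
    response = requests.get(url, timeout=10)
    img = Image.open(BytesIO(response.content)).convert("RGB")
    img = img.resize((x, y))
    img.save(download_path + q + "_" + str(i) + ".jpeg", "JPEG")
except Exception as e:
    print(f"Could not download {url}: {e}")  # skip bad URLs and keep going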

Dependencies / Requirements

Download ChromeDriver from the following link.

For the entire code for this project, follow the GitHub link.

That’s it for this article. Please comment and share if you face any issues or errors using this approach. If you have used an alternative approach, please share it as well.
