Photo by Deva Darshan from Pexels

A simple Selenium image scrape from an interactive Google Image Search on Mac.

Alexander Beat · Published in Analytics Vidhya · 6 min read · Aug 25, 2020

I recently came across a challenge while working on my final capstone project for Flatiron’s data science program. For the capstone we were required to compile our own dataset. My project was an image classification model, so I chose to scrape images. My first attempts at using Beautiful Soup to scrape images from Google Image Search fell short because I couldn’t get the scrape to scroll down the page, so it would only return a small set of about ten images, far fewer than I needed to train the model. Other sites recommended I look into Selenium. Not being familiar with it or how it worked, I hit some obstacles while trying to get Selenium to do what I needed. Here’s a rundown of the steps I took to scrape the images using Selenium, and hopefully it will give you some solutions as well. This is meant to be a simple, “bare-bones” method; there are many other use cases and ways to approach this situation. Here is the GitHub link to the image scraper notebook I used for my satellite image classification project: https://github.com/alexanderbeat/riverdelta-satellite-image-classification

1. Getting ChromeDriver.

The first step is to find which version of Chrome you are currently running. Go to Chrome settings, then click “About Chrome” at the bottom of the sidebar. It’ll display your current version. Then proceed to the ChromeDriver download link; ChromeDriver is needed to run Selenium, and the site will tell you which version to download based on your current Chrome version. Unzip the download to get the chromedriver executable (on a Mac there is no .exe extension, just a file named chromedriver).

Click the About Chrome button in the Chrome settings sidebar.
The different versions of Chromedriver to download based on your Chrome browser version.

2. Show hidden files and place ChromeDriver in /usr/local/bin.

With ChromeDriver downloaded, you will need to place it in a hidden folder. The typical way to show hidden folders in your Finder window is typing Cmd + Shift + . (dot). I’m running this on an older ’09 Mac with El Capitan so that shortcut won’t work and the Terminal will have to be used. Open Terminal and type “defaults write com.apple.Finder AppleShowAllFiles YES” and then press Enter.

How the command looks in the Terminal window.

This should allow hidden files to appear in your Finder window. If they do not appear, relaunch Finder (you can run “killall Finder” in Terminal) and they should show up. In Finder, if you click on your startup drive, there will be a hidden folder called “usr”. Go to /usr/local/bin and place ChromeDriver in there.

The hidden usr folder in your startup drive. Your startup drive may have a different name.
The local folder contained within usr.

Afterwards, you can hide the hidden files again by typing “defaults write com.apple.Finder AppleShowAllFiles NO” into Terminal and pressing Enter. Hidden files should no longer appear in Finder.

The command to hide the hidden files shown in Terminal.

3. Use Selenium in your notebook.

Set up and import.

Now, in a Jupyter notebook, you can pip install selenium, import webdriver along with time, and then instantiate the driver.

!pip install selenium

from selenium import webdriver
import time

# instantiate the driver
driver = webdriver.Chrome('chromedriver')
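Since chromedriver now lives in /usr/local/bin, which is on your PATH, calling webdriver.Chrome() with no argument should also locate it.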

Set up your URL.

Then type in the URL you want to scrape. Calling driver.get(url) will open a Chrome window at that URL.

url = 'https://www.your_url.com'
driver.get(url)
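For Google Image Search specifically, the URL looks something like the lines below. The tbm=isch parameter is what selects the Images tab; the query itself is just an illustration.

url = 'https://www.google.com/search?q=river+delta&tbm=isch'
driver.get(url)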

Scroll down the page.

Then you’ll want to use the following code to have the driver interactively scroll down to the bottom of the page so you can access all of the images. The time module imported earlier lets you sleep between scrolls so the page has a chance to load. Here is another link about this topic that I found helpful in getting the scrape to work:

https://github.com/rmei97/shiba_vs_jindo/blob/master/image_scraper.ipynb

page_scroll_sleep = 2

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(page_scroll_sleep)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        # instead of breaking right away, try pressing the "Show more results" button
        try:
            element = driver.find_elements_by_class_name('mye4qd')  # returns list
            element[0].click()
        except:
            break

    last_height = new_height

The class “mye4qd” belongs to a “Show more results” button that appears at the bottom of the page, shown in the image below. The driver code will click this button in order to keep scrolling to the bottom of the page.

The class “mye4qd” of the “Show more results” button shown in Inspect.
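One side note: the find_elements_by_class_name style of locator used in this post was deprecated in Selenium 4, where the equivalent call is driver.find_elements(By.CLASS_NAME, 'mye4qd'), with By imported from selenium.webdriver.common.by.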

Get image class objects.

Now that you’ve scrolled to the bottom of the page and all of the images are loaded, this will find all of the image elements on the page.

# gets link list of images
image_links = driver.find_elements_by_class_name('rg_i.Q4LuWd')
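As a quick optional sanity check, you can print how many thumbnails were found; the count should be far more than the ten or so images a non-scrolling scrape returns.

# number of image thumbnails found after scrolling
print(len(image_links))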

Pull image links from image objects.

The loops below will pull the actual image source link from each image element in the “image_links” variable created above. Sometimes the links live in a “src” attribute, and sometimes in a “data-src” one. To see what I mean, right click on one of the images on the search page and click “Inspect”. These loops gather both types, which you can combine into a single list if you’d like, or pull images from each of them separately.

src_links = [image_links[i].get_attribute('src') for i in range(len(image_links))]
data_src_links = [image_links[i].get_attribute('data-src') for i in range(len(image_links))]
Right click on the image and click to inspect.
This will show the “src” link for that specific image.
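Since the download loop below runs over a single list called “links”, here is a minimal way to combine the two lists, dropping the empty entries (get_attribute returns None when an element lacks that attribute):

# combine src and data-src results, dropping None entries
links = [link for link in src_links + data_src_links if link is not None]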

Save images with urllib.

The sleeps list, combined with numpy, makes the loop pause for a randomly chosen interval before grabbing each image from your list of links, which helps you avoid getting blocked if a site thinks you’re a bot. urllib.request will save the images into the current working directory of your notebook. tqdm is a basic progress bar that shows progress as the loop iterates, scrapes and saves your images.

import urllib.request
import numpy as np
from tqdm import tqdm

sleeps = [1, 0.5, 1.5, 0.7]

The f-string will number each image based on its index position in the list of image links, and the loop runs through the list called “links”, which contains all of the “src” and/or “data-src” links you scraped. And finally, “driver.quit()” will close the interactive Chrome window process.
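The loop also assumes you’ve already defined “data_name”, the filename prefix used below; for example (the prefix here is just an illustration):

data_name = 'river_delta_'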

# urllib saves each image into the current folder, naming it with the data_name string
for i, link in enumerate(tqdm(links)):
    name = data_name + f'{i}.jpeg'
    urllib.request.urlretrieve(link, name)
    time.sleep(np.random.choice(sleeps))

driver.quit()

Conclusion

Scraping images can be troublesome if you use them in an unethical way, so I wouldn’t recommend this for the purpose of stealing images for any sort of monetary gain or similar activity. I am merely showing this as an example of a way I was able to obtain a set of images for use in an educational school project. Keep that in mind, since a lot of Google images tend to have a copyright on them.
