Web Scraping Stock Images Using Selenium and Python

Jacob Tadesse
Published in The Startup · Jun 19, 2020 · 4 min read

A code-along guide to download images from Stock Photo sites using Selenium and Python!

This post was inspired by Fabian Bosler’s article Image Scraping with Python. Fabian does a great job explaining web scraping and provides great boilerplate code for scraping images from Google. For our purposes, we will focus on using Selenium in Python to download free stock photos from Unsplash.

Unsplash is a website dedicated to sharing stock photography under the Unsplash license. The website claims over 110,000 contributing photographers and generates more than 11 billion photo impressions per month on its growing library of over 1.5 million photos. (Wikipedia)

Since Unsplash is an interactive site, Selenium is a better choice than the Beautiful Soup and Requests libraries. Selenium can open a browser and accept commands to move the mouse, click on certain areas, enter text, and so on. For this exercise you will need to download ChromeDriver, Google’s separate WebDriver executable for Chrome. After you have installed the WebDriver, follow these steps:

Steps:

  1. Install the Python Selenium package (pip install selenium)
  2. Initiate the WebDriver
  3. Test Run the WebDriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = '/../../../../../../chromedriver'

# start a Chrome session through the local ChromeDriver, then close it
service = Service(DRIVER_PATH)
wd = webdriver.Chrome(service=service)
wd.quit()

If you saw a window pop up and close, congrats! Your WebDriver is up and running, so we can leverage Fabian’s boilerplate to analyze the image and web structure.
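As a quick illustration of the kinds of commands Selenium accepts, the sketch below opens Unsplash and types a query into its search box. The input[type='search'] selector is a hypothetical stand-in; inspect the live page to find the real one:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

wd = webdriver.Chrome(service=Service(DRIVER_PATH))
wd.get("https://unsplash.com")
# hypothetical selector for the search box; check the live markup
search_box = wd.find_element(By.CSS_SELECTOR, "input[type='search']")
search_box.send_keys("beard")       # type a query
search_box.send_keys(Keys.ENTER)    # submit it
wd.quit()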

To start, we will search for a specific phrase and save the image URLs.

  • *Please note that these scraping steps may need to change if Unsplash updates its CSS structure or class names.*

Steps to Scraping:

  1. Take the search query, web driver, and maximum number of links as inputs
  2. Define a helper function that scrolls down the page when called
  3. Build the URL for your query and create an empty set to store image URLs
  4. Find the images with a CSS selector and read each image URL from its src attribute
  5. Return the set of image URLs
import time

from selenium import webdriver
from selenium.webdriver.common.by import By


def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver.Chrome,
                     sleep_between_interactions: int = 3):
    def scroll_to_end(wd, scroll_point):
        wd.execute_script(f"window.scrollTo(0, {scroll_point});")
        time.sleep(sleep_between_interactions)

    # build the Unsplash query
    search_url = f"https://unsplash.com/s/photos/{query}"
    # load the page
    wd.get(search_url)
    time.sleep(sleep_between_interactions)

    image_urls = set()
    number_results = 0

    for i in range(1, 20):
        scroll_to_end(wd, i * 1000)
        time.sleep(5)
        # "_2zEKz" is Unsplash's current thumbnail class and may change over time
        thumbnails = wd.find_elements(By.CSS_SELECTOR, "img._2zEKz")
        time.sleep(5)
        for img in thumbnails:
            print(img.get_attribute('src'))
            image_urls.add(img.get_attribute('src'))
            number_results = len(image_urls)
            time.sleep(.5)
        # stop scrolling once we have collected enough links
        if number_results >= max_links_to_fetch:
            break

    print(f"Found: {number_results} search results. Extracting links...")
    return image_urls
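As noted above, the img._2zEKz selector depends on Unsplash’s current markup. If it stops matching, one option (a rough sketch, not part of Fabian’s original code) is to fall back to a broader selector:

# if the Unsplash-specific class no longer matches anything, fall back
# to every <img> tag on the page (noisier, but it survives class-name changes)
thumbnails = wd.find_elements(By.CSS_SELECTOR, "img._2zEKz")
if not thumbnails:
    thumbnails = wd.find_elements(By.TAG_NAME, "img")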

Next, we will use the Requests and Pillow libraries for Python to download the images using the image URL data. To do this we will use Fabian’s boilerplate function, persist_image. For my purposes, I used the headers parameter to set a Chrome user-agent string, but you can use another browser’s string (Mozilla, Safari, etc.) instead. The function takes a folder path and an image URL as parameters, which we will define in the next step of the process.

def persist_image(folder_path: str, url: str):
    try:
        headers = {'User-agent': 'Chrome/64.0.3282.186'}
        image_content = requests.get(url, headers=headers).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        # name the file with a short hash of its content to avoid duplicates
        file_path = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

This function uses the io library to load the image content as bytes. Once the byte data is loaded, the Pillow library converts the image to RGB format. The last part of the process builds a file path inside the folder, using a short SHA-1 hash of the image content so each image gets a stable, de-duplicated name, and then saves the image, specifying the file type and quality.
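To see that pipeline in isolation, here is a minimal sketch that generates an image in memory instead of downloading one, so it runs without a network connection; the red test image is just a stand-in for response content:

import io
import hashlib
from PIL import Image

# generate raw image bytes in memory (stands in for requests' .content)
buffer = io.BytesIO()
Image.new('RGB', (100, 100), color='red').save(buffer, 'PNG')
image_content = buffer.getvalue()

# same steps as persist_image: bytes -> Pillow image -> RGB -> JPEG on disk
image = Image.open(io.BytesIO(image_content)).convert('RGB')
file_name = hashlib.sha1(image_content).hexdigest()[:10] + '.jpg'
with open(file_name, 'wb') as f:
    image.save(f, 'JPEG', quality=85)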

Now that we have a function to find image URLs and a function to save the image files, we are ready to write the final script that brings the two together. The search_and_download function in Fabian’s article does just this. It lets us define the folder that will store the new image files and pass in our search term, along with the web driver we will use in the scraping process.

def search_and_download(search_term: str, driver_path: str,
                        target_path='./images-UNSPLASH', number_images=200):
    target_folder = os.path.join(target_path, '_'.join(search_term.lower().split(' ')))
    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(service=Service(driver_path)) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd,
                               sleep_between_interactions=3)

    for elem in res:
        persist_image(target_folder, elem)

As you can see, the first thing this function does is define the image path; I chose “images-UNSPLASH”. You choose this target path yourself, and inside it the function creates a sub-folder named after your search term if one does not already exist. The web driver is then started, and the first function grabs the image URLs. The last step iterates through the image URLs, saving each one to the target folder.
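For example, a multi-word search term is lower-cased and joined with underscores when the sub-folder is created:

# 'Male Face' is saved under ./images-UNSPLASH/male_face/
search_and_download(search_term='Male Face', driver_path=DRIVER_PATH)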

Now that we have our functions, let’s put them to work!

Requirements:

  1. Import Requests library
  2. Import os and io libraries
  3. Import Pillow library
  4. Import hashlib library

For our purposes, we will use a list of search terms to iterate through, find all corresponding images, and save each image to the respective folder:

import requests
import os
import io
from PIL import Image
import hashlib
search_terms = ['beard', 'male face', 'teen face', 'shaved']

for search_term in search_terms:
    search_and_download(search_term=search_term, driver_path=DRIVER_PATH)

That’s a wrap! If your program ran correctly, you will see new folders created and images stored by search term. We walked through setting up Selenium, created three functions (courtesy of Fabian), and downloaded free stock photos from Unsplash! Thank you!

Jacob Tadesse
Data Science Engineer & AI Engineer. Master of Artificial Intelligence Engineering. https://www.linkedin.com/in/jacobtadesse/