Published in The Startup

Web Scraping Stock Images Using Google Selenium and Python

A code-along guide to download images from Stock Photo sites using Selenium and Python!

This post was inspired by Fabian Bosler’s article Image Scraping with Python. Fabian does a great job explaining web scraping and provides great boilerplate code for scraping images from Google. For our purposes, we will focus on using Selenium in Python to download free stock photos from Unsplash.

Unsplash is a website dedicated to sharing stock photography under the Unsplash license. The website claims over 110,000 contributing photographers and generates more than 11 billion photo impressions per month on its growing library of over 1.5 million photos. (Wikipedia)

Since Unsplash is an interactive site, Selenium is a better choice than the Beautiful Soup and Requests libraries. Selenium can open a browser and accepts commands to move the mouse, click on certain areas, enter text, etc. For this exercise you will need to download ChromeDriver, the separate WebDriver for Google Chrome. After you have installed the WebDriver, follow these steps:

Steps:

  1. Install the Python Selenium package (pip install selenium)
  2. Initiate the WebDriver
  3. Test Run the WebDriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = '/../../../../../../chromedriver'  # path to your ChromeDriver binary
service = Service(DRIVER_PATH)
service.start()

# attach a browser session to the running ChromeDriver service
wd = webdriver.Remote(service.service_url)
wd.quit()

If you saw a window pop up and close, congrats! Your WebDriver is up and running, so we can leverage Fabian’s boilerplate to analyze the image and web structure.

To start, we will search for a specific phrase and save the image url.

  • *Please note that these scraping steps may change depending on updates to Unsplash’s CSS structure and queries*
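Because Unsplash’s generated class names (like the `_2zEKz` used below) change between site updates, one defensive pattern is to try a list of candidate selectors and use the first one that matches. This is a minimal sketch, not part of the original article, and the alternative selectors shown are hypothetical examples:

```python
def first_matching_selector(find_elements, selectors):
    """Return the first CSS selector that yields at least one element.

    `find_elements` is any callable mapping a selector string to a list,
    e.g. lambda s: wd.find_elements(By.CSS_SELECTOR, s) with Selenium.
    """
    for selector in selectors:
        if find_elements(selector):
            return selector
    return None

# Simulated page lookup: only the second candidate selector matches.
fake_dom = {"figure img[srcset]": ["<img 1>", "<img 2>"]}
lookup = lambda s: fake_dom.get(s, [])

print(first_matching_selector(lookup, ["img._2zEKz", "figure img[srcset]"]))
# prints: figure img[srcset]
```

If the function returns None, the page layout has likely changed and the selector list needs refreshing.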

Steps to Scraping:

  1. Set query as an input, along with your web driver, and maximum links
  2. Define a function that will scroll to end of page when called
  3. Define the url for your query and an empty set to store image urls
  4. Define the css selector for images, and attributes for image urls
  5. Return list of image urls
import time
from selenium.webdriver.common.by import By

def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver.Chrome, sleep_between_interactions: int = 3):
    def scroll_to_end(wd, scroll_point):
        wd.execute_script(f"window.scrollTo(0, {scroll_point});")
        time.sleep(sleep_between_interactions)

    # build the unsplash query
    search_url = f"https://unsplash.com/s/photos/{query}"

    # load the page
    wd.get(search_url)
    time.sleep(sleep_between_interactions)

    image_urls = set()
    number_results = 0

    for i in range(1, 20):
        scroll_to_end(wd, i * 1000)
        time.sleep(5)
        thumbnails = wd.find_elements(By.CSS_SELECTOR, "img._2zEKz")
        time.sleep(5)
        for img in thumbnails:
            image_urls.add(img.get_attribute('src'))
            number_results = len(image_urls)
            time.sleep(.5)
        # stop scrolling once we have collected enough links
        if number_results >= max_links_to_fetch:
            break

    print(f"Found: {number_results} search results. Extracting links...")
    return image_urls

Next, we will use the Requests and Pillow libraries to download the images using the image url data. To do this we will use Fabian’s boilerplate function, persist_image. For my purposes, I used the headers parameter to assign a user agent, but you can also assign Mozilla, Windows, Safari, etc. The function takes a folder path and an image url as parameters, which we will define in the next step of the process.

def persist_image(folder_path: str, url: str):
    try:
        headers = {'User-agent': 'Chrome/64.0.3282.186'}
        image_content = requests.get(url, headers=headers).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return  # nothing to save

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

This function uses the io library to load the image content as bytes. Once the byte data is loaded, the Pillow library converts the image to ‘RGB’ format. The last part of the process saves each image to the given folder path, specifying the file type and quality.
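The hashed filename in persist_image is worth a closer look: because the name is derived from the image bytes themselves, re-running the scraper maps identical downloads to the same filename instead of piling up duplicates. A quick stand-alone illustration, using made-up byte strings in place of real image content:

```python
import hashlib
import os

# Two hypothetical downloads that happen to contain identical bytes.
content_a = b"fake-image-bytes"
content_b = b"fake-image-bytes"

# Same naming scheme as persist_image: first 10 hex chars of the SHA-1.
name_a = hashlib.sha1(content_a).hexdigest()[:10] + '.jpg'
name_b = hashlib.sha1(content_b).hexdigest()[:10] + '.jpg'

# Identical content hashes to the same filename, so a repeat download
# simply overwrites the earlier file instead of duplicating it.
assert name_a == name_b
print(os.path.join('images-UNSPLASH', 'beard', name_a))
```

Truncating to 10 hex characters keeps names short while leaving collisions extremely unlikely at this scale.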

Now that we have a function to find image urls and a function to save the image files, we are ready to write our final script that will bring these two functions together. The search_and_download function in Fabian’s article does just this. It lets us define the folder to store the new image files, and pass in our search term along with the web driver we will use in the scraping process.

def search_and_download(search_term: str, driver_path: str,
                        target_path='./images-UNSPLASH', number_images=200):
    target_folder = os.path.join(target_path, '_'.join(search_term.lower().split(' ')))
    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(service=Service(driver_path)) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=3)

    for elem in res:
        persist_image(target_folder, elem)

As you can see, the function first defines the image path; I chose “images-UNSPLASH”. This target path is defined by you, and within it a sub-folder is created for each search term. The function checks whether the target folder exists, and if it does not, one is created. The web driver is then started, and the first function grabs the image urls. The last step iterates through the image urls, saving each one to the target folder.

Now that we have our functions, let’s put them to work!

Requirements:

  1. Import Requests library
  2. Import os and io libraries
  3. Import Pillow library
  4. Import hashlib library

For our purposes, we will use a list of search terms to iterate through, find the corresponding images, and save each image to its respective folder:

import requests
import os
import io
from PIL import Image
import hashlib

search_terms = ['beard', 'male face', 'teen face', 'shaved']

for search_term in search_terms:
    search_and_download(search_term=search_term, driver_path=DRIVER_PATH)

That’s a wrap! If your program ran correctly, you will see new folders created and images stored by search term. We walked through setting up Selenium, created three functions (courtesy of Fabian), and downloaded free stock photos from Unsplash! Thank you!
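To sanity-check the run, a small helper (not part of Fabian’s original code) can summarize how many images landed in each search-term folder, assuming the default './images-UNSPLASH' target path from search_and_download:

```python
import os

def count_saved_images(target_path='./images-UNSPLASH'):
    """Return a {folder_name: jpg_count} summary of the download folders."""
    counts = {}
    if os.path.isdir(target_path):
        for folder in sorted(os.listdir(target_path)):
            folder_path = os.path.join(target_path, folder)
            if os.path.isdir(folder_path):
                counts[folder] = sum(
                    1 for name in os.listdir(folder_path)
                    if name.endswith('.jpg')
                )
    return counts

print(count_saved_images())
```

Each key is a sub-folder such as male_face, and each count should be close to (but possibly below) number_images, since some urls may fail to download.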

Jacob Tadesse

Data scientist transitioning from a technology consulting career. https://www.linkedin.com/in/jacobtadesse/
