Selenium and google_images_download
Web Scraping Images from Google
Hello Readers! I am Anand, a very enthusiastic developer and an undergrad student.
Recently, I’ve been trying to figure out different ways to web scrape images from google.com based on a search query using Python, and I’ve stumbled upon more than one method to do so. I’m not sure which method is better overall; in my experience, it depends on the query and the kind of images we want to download.
So for this blog I’ve decided to split the content into separate snippets for ease of use later, and add the complete code at the end for those who want to skip the process.
Method 1: Using “google_images_download” library
Installation:
Using PyPI - $ pip3 install google_images_download
--------------------------------------------------------------------
Using CLI - (cd into working directory)
$ git clone https://github.com/hardikvasa/google-images-download.git
$ cd google-images-download
$ python setup.py install
Usage:
Given below is a code snippet built around a function call whose argument is the query as a string. So all we have to do when embedding this function into a program is import the library and call the function with a set of queries.
# importing google_images_download module
from google_images_download import google_images_download
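Building on that import, a minimal version of such a function could look like this (the helper name download_images is mine; the googleimagesdownload class and the “keywords”, “limit” and “print_urls” arguments come from the library’s documentation):
# instantiate the downloader class once
response = google_images_download.googleimagesdownload()

def download_images(query, limit=20):
    # the documented arguments dict: search term, image count, verbose URL logging
    arguments = {"keywords": query, "limit": limit, "print_urls": True}
    response.download(arguments)   # saves into ./downloads/<query>/ by default

download_images("coronavirus")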
The above code snippet automatically saves the downloaded images in their respective folders.
Method 2: Using Selenium
Installation:
To install Selenium using PyPI
$ pip3 install selenium
Usage:
Consider Selenium to be a package that lets your program do whatever a human user would normally do in the browser: searching for something, going through videos on YouTube, downloading images from some website, and so on, can all be done by using Selenium to simulate a user’s actions.
For this tutorial we are going to pair Selenium with Google Chrome and automate it to fetch the images we require. To do that we need ChromeDriver. There is a specific ChromeDriver for every version of Chrome; my Chrome version is 80, hence I use ChromeDriver version 80.
You can find your Chrome version in the “About Google Chrome” section and download the matching version of ChromeDriver from HERE.
First, we will run a simple program that opens a Chrome window and automates a search for a query of our choice from the code.
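A minimal sketch of that first program, assuming Selenium 3 (executable_path, find_element_by_name) and a placeholder chromedriver path:
from selenium import webdriver
import time

DRIVER_PATH = "/path/to/chromedriver"   # adjust to wherever your chromedriver lives
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

# open Google and type a query into the search box (its name attribute is "q")
wd.get("https://www.google.com")
search_box = wd.find_element_by_name("q")
search_box.send_keys("coronavirus")
search_box.submit()

time.sleep(2)   # give the results page a moment to load
wd.quit()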
The second phase of this program is to search for the query, move into the Images section and collect the image links. To achieve this, just like in the previous code, we’ll use ChromeDriver to navigate through the pages using selectors.
fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver, sleep_between_interactions:int=1)
The function fetch_image_urls takes 4 input arguments, but the 4th (the sleep interval between browser interactions) defaults to 1, so passing only the first 3 will suffice. The function returns the image URLs; a sketch of it follows the argument list below.
1. ‘query’ is the string input in the textbox from the previous snippet.
2. ‘max_links_to_fetch’ is an integer input that defines the number of images required after scraping.
3. ‘wd’ is the webdriver instance, created by pointing Selenium at the chromedriver installed on your PC or Mac.
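Here is a sketch of how fetch_image_urls can be written. Treat the CSS class names as assumptions: “img.Q4LuWd” for the thumbnails and “img.n3VNCb” for the full-size preview reflect Google Images’ markup around 2020 and may have changed since; the Selenium 3 find_elements_by_css_selector helper is also assumed.
import time
from selenium import webdriver

def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver, sleep_between_interactions: int = 1):
    def scroll_to_end(wd):
        # scroll to the bottom so Google loads more thumbnails
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)

    # open the image-search results page for the query
    search_url = "https://www.google.com/search?q={q}&tbm=isch"
    wd.get(search_url.format(q=query))

    image_urls = set()
    results_start = 0
    while len(image_urls) < max_links_to_fetch:
        scroll_to_end(wd)
        thumbnails = wd.find_elements_by_css_selector("img.Q4LuWd")   # assumed class name
        if len(thumbnails) == results_start:
            break   # nothing new loaded; stop rather than loop forever
        for thumbnail in thumbnails[results_start:]:
            try:
                thumbnail.click()   # clicking a thumbnail opens the larger preview
                time.sleep(sleep_between_interactions)
            except Exception:
                continue
            # the preview <img> carries the full-resolution source URL
            for actual_image in wd.find_elements_by_css_selector("img.n3VNCb"):
                src = actual_image.get_attribute("src")
                if src and "http" in src:
                    image_urls.add(src)
            if len(image_urls) >= max_links_to_fetch:
                break
        results_start = len(thumbnails)
    return image_urls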
The third phase of the program is to download the images from the returned image URLs. Right now we have a bunch of image URLs for a query and no scraped images on our system, so we are going to download the images using the Pillow module.
$ pip3 install pillow
----------------------------------------------------------------
Program call:
from PIL import Image
Once we download the images, we need to save them in an organised folder of our choice, and for that we will use simple OS operations from the Python file.
persist_image(folder_path:str,file_name:str,url:str)
The persist_image function has 3 arguments, all of which are mandatory for the function call; a sketch of the function follows the list below.
1. ‘folder_path’ is the path to a common folder where you want to save all the images; mine is “/Users/anand/Desktop/contri/images”.
2. ‘file_name’ is a str value of your choice, but personally I prefer passing file_name = query, because it creates a separate folder with the query as its title.
3. ‘url’ is a string input that we get from the values returned by the ‘fetch_image_urls’ function.
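A sketch of persist_image under those assumptions, using requests to fetch the bytes, Pillow to decode them, and a hash of the content as the file name so repeated URLs don’t collide:
import hashlib
import io
import os
import requests
from PIL import Image

def persist_image(folder_path: str, file_name: str, url: str):
    try:
        # fetch the raw image bytes
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - could not download {url} - {e}")
        return

    try:
        # decode the bytes with Pillow and force a consistent RGB JPEG
        image = Image.open(io.BytesIO(image_content)).convert("RGB")
        # one sub-folder per file_name (e.g. per query)
        target_folder = os.path.join(folder_path, file_name)
        os.makedirs(target_folder, exist_ok=True)
        # hash the content so every image gets a unique, stable name
        file_path = os.path.join(target_folder, hashlib.sha1(image_content).hexdigest()[:10] + ".jpg")
        with open(file_path, "wb") as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} to {file_path}")
    except Exception as e:
        print(f"ERROR - could not save {url} - {e}")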
I ran the whole program, with all the snippets put together, on two queries: “coronavirus” and “OnePlus 8”. The result is shown below.
Thank You for reading!
Just run the whole code and watch the magic happen :)
Libraries to import:
import selenium
from selenium import webdriver
import time
import requests
import os
from PIL import Image
import io
import hashlib
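With those imports in place, the snippets can be tied together roughly like this (search_and_download, the driver path and the “./images” folder are placeholders of mine; fetch_image_urls and persist_image are the functions sketched earlier, and Selenium 3’s executable_path argument is assumed):
DRIVER_PATH = "/path/to/chromedriver"   # adjust to your setup

def search_and_download(query, target_path="./images", number_images=10):
    # open a Chrome session, collect the image URLs, then close the browser
    with webdriver.Chrome(executable_path=DRIVER_PATH) as wd:
        urls = fetch_image_urls(query, number_images, wd, sleep_between_interactions=1)
    # download and save every collected URL into target_path/<query>/
    for url in urls:
        persist_image(target_path, query, url)

if __name__ == "__main__":
    for q in ["coronavirus", "OnePlus 8"]:
        search_and_download(q)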
---------------------------------
Execution:
$ python3 complete_program.py