How to scrape thousands of adoptable pet photos from Petfinder using Petpy API: Part 2— Downloading pet images

Jenny James
Published in Analytics Vidhya
3 min read · Sep 20, 2020

Recall that in Part 1 we left off with a cleaned dataframe containing each pet’s id, breeds and photos. The next step is to decide which photos we want to download.

We need our imports first.

import petpy
import os
import pandas as pd
import urllib.request
import urllib.error
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

We can create a size column for each of the 3 image columns to record the pixel width encoded in each photo URL.

rabbitCLEANED['small_photo_size'] = rabbitCLEANED['primary_photo_cropped.small'].str.split('width=', n=1).str[1].str.split('&', n=1).str[0].astype(int)
rabbitCLEANED['medium_photo_size'] = rabbitCLEANED['primary_photo_cropped.medium'].str.split('width=', n=1).str[1].str.split('&', n=1).str[0].astype(int)
rabbitCLEANED['large_photo_size'] = rabbitCLEANED['primary_photo_cropped.large'].str.split('width=', n=1).str[1].str.split('&', n=1).str[0].astype(int)
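To see what this kind of width parsing does, here is a minimal sketch on a toy Series. The URLs are invented for illustration (real Petfinder URLs will look different), and `str.extract` with a regular expression is swapped in as an alternative to the double `split`. It assumes every URL carries a `width=` query parameter.

```python
import pandas as pd

# Hypothetical photo URLs, invented for this example only.
photos = pd.Series([
    "https://example.com/photos/1/?bust=111&width=300",
    "https://example.com/photos/2/?bust=222&width=450",
    "https://example.com/photos/3/?width=600&bust=333",
])

# Capture the digits that follow 'width=' and convert them to integers.
widths = photos.str.extract(r"width=(\d+)", expand=False).astype(int)

print(widths.tolist())
```

The regular expression does not care where `width=` sits in the query string, which may be handy if the parameter order ever varies.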

The results in our dataframe are the 3 columns seen below:

If we take a look at these columns we can see that the three photo sizes have a pixel width of 300, 450 and 600. We are going to use the medium size for this example because why not? So we are going to filter the data to keep only the 450 size.

medphotos = rabbitCLEANED[rabbitCLEANED['medium_photo_size'] == 450]

Choose an arbitrary number of photos to keep for each breed. I am going to choose 5 since our list is so tiny we couldn’t have more than that for each breed, but the choice will be up to you depending on how many photos you want.

medium=medphotos.groupby('breeds.primary').head(5)
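If the effect of `groupby(...).head(n)` is unfamiliar, this small sketch may help. The dataframe here is a toy stand-in with invented ids and breeds, and the cap is set to 2 so the trimming is visible.

```python
import pandas as pd

# Toy data standing in for the filtered dataframe; ids and breeds are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "breeds.primary": ["Lop", "Lop", "Lop", "Dutch", "Dutch", "Rex"],
})

# Keep at most the first 2 rows of each breed, preserving the original row order.
capped = toy.groupby("breeds.primary").head(2)

print(capped["id"].tolist())
```

The third Lop row is dropped, while breeds with 2 or fewer rows come through untouched.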

We are going to take the columns we need from the filtered dataframe, medium, and turn them into lists.

urls = medium['primary_photo_cropped.medium'].tolist()
breed = medium['breeds.primary'].tolist()
index = medium.index.tolist()

And then we create a list of the lists:

rabbitlist = [index, breed, urls]

We must rearrange the list of lists into a format that lets us easily feed the values into the Pool process as it iterates through them.

new_rabbitlist = []
for i in range(0, len(rabbitlist[0])):
    new_rabbitlist.append([rabbitlist[0][i], rabbitlist[1][i],
                           rabbitlist[2][i]])
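The same rearrangement, from three parallel lists into one record per pet, can also be sketched with the built-in `zip`. The values below are invented stand-ins for the index, breed and URL lists.

```python
# Toy stand-ins for the index, breed and URL lists (values are invented).
index = [101, 102, 103]
breed = ["Lop", "Dutch", "Rex"]
urls = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/c.jpg",
]

# zip pairs up the i-th element of every list, giving one record per pet.
records = [list(row) for row in zip(index, breed, urls)]

print(records[0])
```

Each record keeps the order [index, breed, url], which is the order the download function expects.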

Next, we need to divide the photos into different directories based on primary breed. We call unique() so that we don’t end up with more than one folder for the same breed.

breed_dirs = list(medphotos['breeds.primary'].unique())
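The download step writes into one folder per breed, and `urlretrieve` does not create missing folders by itself, so the folders need to exist beforehand. Here is a minimal sketch of one way to create them with `os.makedirs`. The breed names are invented, and a temporary parent folder is used as an assumption so the example does not touch a real project directory.

```python
import os
import tempfile

# Hypothetical breed names; in the article these come from the dataframe.
breed_names = ["Lop", "Dutch", "Rex"]

# A temporary parent folder keeps this sketch away from real project files.
root = os.path.join(tempfile.mkdtemp(), "rabbit_breeds")

for name in breed_names:
    # exist_ok=True makes the loop safe to re-run if a folder already exists.
    os.makedirs(os.path.join(root, name), exist_ok=True)

print(sorted(os.listdir(root)))
```

In the article's setting, the loop would run over `breed_dirs` with `rabbit_breeds` as the parent folder.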

Now we can define a function that saves each image into its breed’s folder inside a parent directory called rabbit_breeds. It also gives each image a name made of the breed followed by the id. We also catch urllib’s HTTPError so that a failed request for one image doesn’t stop the rest.

def download_breed_images(breed_img):
    # breed_img is one record from new_rabbitlist: [index, breed, url]
    try:
        urllib.request.urlretrieve(breed_img[2],
                                   os.path.join('rabbit_breeds/',
                                                str(breed_img[1]),
                                                str(breed_img[1]) +
                                                str(breed_img[0]) +
                                                '.jpg'))
    except urllib.error.HTTPError as err:
        print(err.code)
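`HTTPError` only covers cases where the server replied with an error status. As a hedged sketch, a download helper could also catch the broader `URLError` and report whether the file was saved. The name `try_download` is hypothetical, and the demonstration uses a local `file://` URL so it runs without a network connection.

```python
import os
import tempfile
import urllib.error
import urllib.request
from pathlib import Path


def try_download(url, destination):
    """Return True if the file was saved, False if the request failed."""
    try:
        urllib.request.urlretrieve(url, destination)
        return True
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:
        # The request never got a usable answer (bad host, missing file, ...).
        print("URL error:", err.reason)
    return False


# Offline demonstration: a local file reached through a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.jpg"
source.write_bytes(b"not a real image")
target = os.path.join(workdir, "copy.jpg")

ok = try_download(source.as_uri(), target)
print(ok)
```

`HTTPError` is listed first because it is a subclass of `URLError`; the other way around, the more specific branch would never run.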

We use a pool of workers, which feeds tasks in as workers become available. Note that ThreadPool, imported from multiprocessing.dummy, exposes the same API as multiprocessing’s Pool but runs threads rather than processes, which suits downloads that spend most of their time waiting on the network.
My Mac has 4 cores, so I’m going to choose 4 workers, but it has worked for me with other values, so the number is somewhat arbitrary.

pool = ThreadPool(processes=4)

Here is more documentation on multiprocessing and pool.

pool.map(download_breed_images, new_rabbitlist)
pool.close()
pool.join()
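To try the pool pattern without hitting Petfinder, here is a small offline sketch. `fake_download` is a hypothetical stand-in that just sleeps and returns a file name, and the records are invented.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(record):
    """Stand-in for a network call: wait briefly, then report the file name."""
    pet_id, breed = record
    time.sleep(0.1)  # simulates waiting on a remote server
    return f"{breed}{pet_id}.jpg"


# Invented (id, breed) pairs for illustration.
records = [(1, "Lop"), (2, "Dutch"), (3, "Rex"), (4, "Lop")]

# The with-block shuts the pool down once map has returned.
with ThreadPool(processes=4) as pool:
    names = pool.map(fake_download, records)

print(names)
```

`map` returns results in the same order as the input list, even though the workers may finish in a different order.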

Mine ran really fast since we only had 5 rabbits. I now have directories inside my rabbit_breeds directory containing all the breed names of the rabbits we scraped. Here is one cutie I got from this example.

Of course, this can be done with any animal type that Petfinder has up for adoption. I chose bunnies because they are fuzzy and cute.

I couldn’t have done this without Aaron Schlegel’s instructions on how to download 45,000 adoptable cat images with petpy.
