How to scrape thousands of adoptable pet photos from Petfinder using Petpy API: Part 1 — Gathering pet data

Jenny James
Published in Analytics Vidhya · 3 min read · Sep 19, 2020

While working on my Capstone, I used the Petpy API and followed along with Aaron Schlegel’s instructions on how to download 45,000 adoptable cat images with petpy. I ran into a few trouble spots when trying to use some of the attributes: they had changed, and updated information was very difficult to find. I was able to figure it out, and along the way I even found faster, shorter ways to do things, so I wanted to share them with you.

The first thing you will want to do is install petpy, either in the terminal:

pip install petpy

or in a Jupyter notebook:

!pip install petpy

Next, you will need to go to the Petfinder Developers page and request an API key and secret.

Now, let’s get started. You will want to import petpy and initialize the client with your API key and secret:

import petpy
pf = petpy.Petfinder('your key', 'your secret')
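A quick aside: hard-coding the key and secret means anyone you share the notebook with can see them. One alternative (just a sketch; the environment-variable names PETFINDER_KEY and PETFINDER_SECRET are my own choice, not anything petpy requires) is to read them from the environment:

```python
import os

# Read the credentials from environment variables rather than hard-coding
# them in the notebook. Set these in your shell before launching Jupyter;
# the names are my own convention, not something petpy requires.
key = os.environ.get('PETFINDER_KEY', 'your key')
secret = os.environ.get('PETFINDER_SECRET', 'your secret')

# pf = petpy.Petfinder(key, secret)  # uncomment once petpy is installed
```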

Now we can start searching for pets using pf.animals, but first, let's take a look at all the parameters:

animal_id, animal_type, breed, size, gender, age, color, coat, status, name, organization_id, location, distance, sort, pages, good_with_cats, good_with_children, good_with_dogs, before_date, after_date, results_per_page, return_df

For this example we will focus on just a few parameters; the full documentation for all of them can be found in the petpy docs:

animal_type - A string with the type of animal being searched for, such as 'rabbit'.

size - The options for size are 'small', 'medium', 'large' and 'xlarge'.

location - This can be in three formats: lat-long, zip code or City, State.

distance - The default is 100 miles but you can enter up to 500 miles.

pages - The number of pages you want returned.

results_per_page - The number of results you would like returned per page.

return_df - Whether or not you would like a pandas DataFrame returned.

Let’s search for some bunnies!

rabbit = pf.animals(animal_type='rabbit', size='small', location=76825, distance=500, pages=1, results_per_page=5, return_df=True)

I chose the center of Texas for a location and used the maximum distance to search for 5 small rabbits. The results returned a dataframe with 5 rows for my 5 rabbits and 48 columns of different attributes, photos and info. Here is the full list: ['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender', 'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id', 'photos', 'videos', 'status', 'status_changed_at', 'published_at', 'distance', 'breeds.primary', 'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary', 'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered', 'attributes.house_trained', 'attributes.declawed', 'attributes.special_needs', 'attributes.shots_current', 'environment.children', 'environment.dogs', 'environment.cats', 'primary_photo_cropped.small', 'primary_photo_cropped.medium', 'primary_photo_cropped.large', 'primary_photo_cropped.full', 'contact.email', 'contact.phone', 'contact.address.address1', 'contact.address.address2', 'contact.address.city', 'contact.address.state', 'contact.address.postcode', 'contact.address.country', 'animal_id', 'animal_type', 'organization_id']

We are not going to need most of these columns when scraping images so I am going to drop some to make our dataframe cleaner, but first we should save our data to a csv just in case we mess up and need to start over. We really don’t want to re-scrape the data, especially if there are thousands of results.

rabbit.to_csv('./rabbits.csv')
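One small detail worth knowing about this backup step: passing index=False to to_csv avoids an extra 'Unnamed: 0' column appearing when the csv is read back in. Here is a sketch of the round trip on a made-up stand-in dataframe (the real rabbit dataframe needs an API call, so the rows below are invented):

```python
import pandas as pd

# Toy stand-in for the rabbit dataframe, with two of the real column names.
rabbit = pd.DataFrame({'id': [101, 102],
                       'breeds.primary': ['Lionhead', 'Rex']})

# index=False keeps the row index out of the file, so no extra
# 'Unnamed: 0' column shows up when the csv is read back.
rabbit.to_csv('./rabbits.csv', index=False)

restored = pd.read_csv('./rabbits.csv')
print(restored.columns.tolist())  # ['id', 'breeds.primary']
```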

Now I am going to create a new dataframe from the rabbit dataframe that keeps only the columns whose names contain the words id, breed or photo, and call it rabbitCLEANED.

rabbitCLEANED = rabbit[rabbit.columns[rabbit.columns.str.contains('id|breed|photo')]]

There are still a few more columns we don’t need: organization_id (which appears twice in the dataframe, so listing it once drops both copies), organization_animal_id, and videos, which slipped through the filter because the word “videos” contains “id”. We can drop them, reassigning rather than calling drop with inplace=True on a sliced dataframe, which avoids a SettingWithCopyWarning:

rabbitCLEANED = rabbitCLEANED.drop(columns=['organization_id', 'organization_animal_id', 'videos'])
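To see what the filtering and dropping steps do without hitting the API, here is a sketch on a made-up dataframe that borrows a handful of the real column names (the rows themselves are invented):

```python
import pandas as pd

# Made-up row using a few of the real column names returned by pf.animals.
rabbit = pd.DataFrame([{
    'id': 101, 'url': 'https://example.org', 'age': 'Young',
    'organization_id': 'TX1', 'organization_animal_id': 'A1',
    'breeds.primary': 'Lionhead', 'videos': [],
    'primary_photo_cropped.small': 'https://example.org/p.jpg',
}])

# Keep only columns whose names contain id, breed or photo
# ('videos' survives here because "videos" contains "id").
rabbitCLEANED = rabbit[rabbit.columns[rabbit.columns.str.contains('id|breed|photo')]]

# Drop the leftovers; reassigning avoids a SettingWithCopyWarning.
rabbitCLEANED = rabbitCLEANED.drop(
    columns=['organization_id', 'organization_animal_id', 'videos'])

print(rabbitCLEANED.columns.tolist())
# ['id', 'breeds.primary', 'primary_photo_cropped.small']
```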

I am going to export my cleaned dataframe to a csv for back up.

rabbitCLEANED.to_csv('./rabbitCLEANED.csv')

Now let’s take a look at our dataframe.

The only columns we are left with are the ones containing photos, breeds and the animal’s id, which leaves us at a good stopping point. Here is part 2, where we download our images!

