Richa Vala
11 min read · Mar 21, 2019

Predicting Pet Adoption Speed Using Python — Part II

This post is the second in a three-part series describing my attempt to develop algorithms to predict the adoptability of pets, specifically how quickly a pet is adopted. You can check out my last post here.

It had been years since I'd been on a roller coaster, but the ChiPy mentorship program has felt like riding one. Some days the ride seems so uncertain and overwhelming that I want nothing more than for it to stop and let me off. Every time I feel that way, I land on my feet, thanks to all the resources at my disposal. I have learned a lot since starting the mentorship program. I still have many knowledge gaps, and I sincerely hope to narrow them through the program. I have been working on exploring the data, basic data cleaning, and regression analysis using Logistic Regression. I'm happy to share my progress in this post, along with the techniques that worked and those that didn't. So without further delay, let's continue exploring the data before preparing it for the predictive machine learning model.

Exploring the Data

There are two columns for the breed category in our dataset: BreedName_breed_1 for the primary breed and BreedName_breed_2 for pets that are not purebred.

Interestingly, 'Mixed Breed' is the most common value in both columns.

For cats, the most common values are 'Domestic Short Hair' and 'Domestic Medium Hair'.

Alternatively, we can generate a word cloud to show which strings/words appear most frequently in the feature.
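Something along these lines, using the wordcloud package, produces the plot; the DataFrame name df and the column BreedName_breed_1 are assumptions about how the cleaned data is stored:

```python
# Sketch: word cloud of primary breed names.
# Assumes a pandas DataFrame `df` with a 'BreedName_breed_1' column.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

breed_text = " ".join(df["BreedName_breed_1"].dropna().astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(breed_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```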

Awesome! We can see that 'Mixed Breed' is the most common value in the Breed 1 feature.

Moving on to the next feature. There are three columns for the color feature.

Out of all the colors, white, black, and brown are the most frequently listed. However, having one color or multiple colors does not appear to speed up the adoption of the pet.

Next, let’s look at the State feature:

Interestingly, around 85% of pet ads are from the states of Selangor and Kuala Lumpur. It's worth noting that adoption is quite slow in those two states, so the state does not seem to increase adoption speed.

The plot indicates that more photos of dogs do not mean faster adoption, but pets with no photos at all are rarely adopted. Looking at the plot, anywhere between 1 and 5 photos seems to be a better choice.

Video Amount doesn't seem to have a big impact on adoption speed. I will probably drop this column before the regression analysis, as the counts of its unique values are too low to be useful for our predictive analysis.

Let’s look at the RescuerID feature:

Only the first eight rescuers have rescued more than 100 pets; the rest of the values have very low frequency.

Let’s look at the Name feature:

As you can see, most of the pets have names in their profile data. It seems that this feature could be significant for our analysis.

Let's generate a WordCloud to identify the most frequently used names.

It's worth noticing that people often write down generic names such as Kitten, Kitty, Mimi, or White just to fill in some value in this field.

Next, let's plot a chart to compare Quantity across the different AdoptionSpeed classes:

As you can see in the image, there is no benefit in terms of adoption speed from having more pets in a listing. Overall, a quantity of 1 pet is a stronger predictor of adoption speed.

There are a few more features in the dataset. However, I'm not discussing them here because I would like to build our model with just the features mentioned above. I have not yet figured out an approach for finding the most important predictor variables for the regression analysis. Instead, I have applied subjective reasoning to choose the predictors: what information might influence a potential pet owner to adopt a pet while looking at the pet profile?

Data Cleaning: Missing Values

We need to handle the missing values in our dataset before we run the regression analysis. If the missing values are not handled properly, the accuracy of the predictive machine learning model will suffer. There are various options available for imputing missing values. We will use the .fillna() method and replace the missing values with the most frequent value in each feature, simply because the missing entries don't have enough diversity to be predictive on their own. The next step is to find out how many missing values are in each feature of our dataset. We will use the pandas .isnull() function to detect missing values:

First, let’s find out how many missing values are present in our dataset:
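A minimal way to do that, assuming the whole dataset sits in a DataFrame called df:

```python
# Count missing values in each column, most affected columns first.
print(df.isnull().sum().sort_values(ascending=False))
```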

This is how we find the most frequent value in the Breed 1 feature for the Dog category:

And replace the missing values with ‘Mixed Breed’. Let’s do it:
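A sketch of both steps; the Type column and its 'Dog' value are assumptions about how the cleaned dataset labels species:

```python
# Most frequent primary breed among dogs.
dog_mask = df["Type"] == "Dog"
print(df.loc[dog_mask, "BreedName_breed_1"].mode()[0])  # expected: 'Mixed Breed'

# Replace missing primary breed values for dogs with that most frequent value.
df.loc[dog_mask, "BreedName_breed_1"] = (
    df.loc[dog_mask, "BreedName_breed_1"].fillna("Mixed Breed")
)
```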

We will use a similar approach for the rest of the features with missing values.

Now that the data does not have any null values, we can look at options for encoding the categorical values.

Categorical Features

Before we create our machine learning model, we need to convert the categorical features. These generally encode the different categories or levels associated with an observation. Some variables in our dataset are categorical with a nominal numeric value. For example, the Gender column:

Gender — Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)

while others store text values to represent various traits. For example:

Color1 — Color 1 of pet represented by color name (white, gray, cream, black, brown etc.)

Other columns that have too many levels of values are ColorName, ColorName_2, ColorName_3, BreedName_breed_1, BreedName_breed_2, StateName, Name, Fee, Quantity, PhotoAmt, and Description.

The biggest challenge is figuring out how to convert these values to numerical or nominal categorical values for the regression analysis. We will convert the categorical values to numbers. One approach is label encoding, which maps a non-numeric categorical variable to numerical values. For example, the primary color variable ColorName has seven different values:

If we use the label encoder on this feature, it will represent each color with a numeric value in the range 0–6. The model could then misinterpret the encoding as meaning, say, that the color Cream is three times 'better' than the color Black. Another issue is that a categorical feature like BreedName_breed_2 has 134 unique values, so label encoding it produces 134 arbitrary numeric codes, and dummy-encoding it would create 134 new columns. We don't want that! This will hurt the performance of our machine learning model.

Instead, we will use dummy coding, i.e. create dummy variables: variables with only two values, 0 and 1. Basically, we create an indicator column for each category. My mentor sent me this link, which has a nice explanation of why we should use dummy variables. Here, the digits have no ordinal relationship with each other. For our ColorName variable, a dummy variable will be created for each color present: a new column Cream is created and assigned 1 when the color Cream is present and 0 when it is not. We will use the pandas .get_dummies() method, which assigns 1 (True) or 0 (False). Check out the result:
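A minimal sketch with pandas, assuming the primary color column is named ColorName:

```python
import pandas as pd

# One dummy column per color; a 'Cream' row gets ColorName_Cream = 1, others 0.
color_dummies = pd.get_dummies(df["ColorName"], prefix="ColorName")
print(color_dummies.head())
```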

Scikit-learn has a method for binary encoding of variables called one-hot encoding. This article gives an excellent explanation of one-hot encoding in Python; be sure to check it out. We will not use this method to create dummy variables, since our categorical variables have many levels. For example, our primary breed variable BreedName_breed_1 has 174 unique values. Should we create 174 dummy variables? For this reason, we will use an alternative approach to encode our variables, which I'll call custom encoding.

First, we will find the most frequently used values in the feature.

Next, we will create a new column for each of those values, indicating whether the value is present or not.

After this, we will use the pandas .fillna() method to fill 0 where 1 is not present. In other words, the absence of that particular breed will be given the value of 0.
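A rough sketch of this custom encoding, using the primary breed column as an example; how many top values to keep and the new column names are my assumptions:

```python
# Keep indicator columns only for the most frequent primary breeds.
top_breeds = df["BreedName_breed_1"].value_counts().head(5).index

for breed in top_breeds:
    col = "Breed1_" + str(breed)
    # Mark rows where this breed is present with 1 ...
    df.loc[df["BreedName_breed_1"] == breed, col] = 1
    # ... and fill the remaining (absent) rows with 0.
    df[col] = df[col].fillna(0).astype(int)
```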

The resulting dataframe looks like this (only showing a subset of columns):

We will continue doing this for all the rest of our variables. They are:

Next, we will drop the original columns, along with columns such as Dewormed, ColorName_3, Sterilized, VideoAmt, PetID, and RescuerID that I am not including in the first round of regression analysis. Since AdoptionSpeed is our target variable, we have not created dummy variables for it, and we will not drop that column either, as we need it for our machine learning model.
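A sketch of that cleanup; the exact list of columns below is illustrative rather than definitive:

```python
# Drop the original categorical columns plus the features excluded from round one.
cols_to_drop = [
    "BreedName_breed_1", "BreedName_breed_2", "ColorName", "ColorName_2",
    "ColorName_3", "StateName", "Name", "Description",
    "Dewormed", "Sterilized", "VideoAmt", "PetID", "RescuerID",
]
df = df.drop(columns=cols_to_drop)
```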

The final check is to see which columns we now have after the custom encoding approach:
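A one-liner does it, assuming the encoded DataFrame is still called df:

```python
# List the columns remaining after custom encoding.
print(df.columns.tolist())
```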

So far, we have observed, analyzed, visualized, and cleaned the data. Now we need to choose an algorithm to make predictions. There are many classification methods, and we need a classifier that can handle multinomial classification. We'll use Logistic Regression to build our machine learning model. It is a classification algorithm used to predict the probability of a categorical dependent variable, so it requires the dependent variable to be categorical. Another assumption of this technique is that the independent variables should be independent of each other; it also needs a large sample size, which we have. Now, let's build a logistic regression model, split the data into train and test sets, make predictions, and finally evaluate it.

Import Libraries
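Roughly this set of imports covers the modelling steps below (a reconstruction, not the exact cell from my notebook):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
```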

Selecting Features
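Here AdoptionSpeed is the target and every other remaining column is treated as a predictor; a sketch:

```python
# Predictors X and target y.
X = df.drop(columns=["AdoptionSpeed"])
y = df["AdoptionSpeed"]
```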

Let’s create our object:
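Presumably something like the following; the solver and iteration settings are assumptions rather than the exact ones I used:

```python
# Logistic regression classifier object.
logreg = LogisticRegression(solver="lbfgs", max_iter=1000)
```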

Split Data Using Train_Test_Split

We need to split our dataset into two sets: a training set and a test set. We will train our machine learning model on the training set and then evaluate it on the test set to check how accurately it can predict. A test_size of 0.25 means that 75% of the dataset is allocated to the training set and the remaining 25% to the test set.
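A sketch of the split; the random_state is only for reproducibility and is an assumption:

```python
# 75% training data, 25% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```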

Fit the model using the training data
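Fitting is a single call on the training split, continuing with the names from the sketches above:

```python
# Train the classifier on the training set.
logreg.fit(X_train, y_train)
```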

Our model is trained and ready for the next step. We can now test our model using the testing data.

Testing the model using the testing data
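Prediction on the held-out test set, again reusing the names from the earlier sketches:

```python
# Predict adoption speed classes for the test set.
y_pred = logreg.predict(X_test)
```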

Let’s check the accuracy of our model

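A minimal version, using scikit-learn's accuracy_score on the predictions from the previous step:

```python
# Fraction of test pets whose adoption speed class was predicted correctly.
print("Accuracy:", accuracy_score(y_test, y_pred))
```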

Our model achieved an accuracy of about 37%, which is quite low. This means that our model is not able to predict the target variable reliably. The low accuracy score can be due to many factors, such as how the data was preprocessed, the choice of model, or the feature selection.

Let's present the performance of the model in a table.

Confusion Matrix
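The raw matrix from scikit-learn looks roughly like this; rows are true classes and columns are predicted classes:

```python
# Confusion matrix of true vs. predicted adoption speed classes.
print(confusion_matrix(y_test, y_pred))
```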

A much better presentation can be achieved through this code:
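I have not reproduced the original cell here, but a pandas crosstab rendered as a seaborn heatmap gives a similarly labelled view (a reconstruction under those assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Rows (y-axis) are the true AdoptionSpeed classes, columns (x-axis) the predictions.
cm = pd.crosstab(y_test, y_pred, rownames=["True"], colnames=["Predicted"])

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.show()
```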

The values on the x-axis represent predictions, whereas the values on the y-axis are the true values from our test data. The sets of true and predicted classes should match. However, looking at the table above, we can easily see that our model has not performed well. The predicted classes on the x-axis range from 1 to 4, whereas they should range from 0 to 4, since the AdoptionSpeed target variable ranges from 0 to 4; we can see that full range on the y-axis. Our model was not able to predict any cases of adoption speed = 0, most likely because it is a minority class.

If you want to learn more about the confusion matrix, check out this link.

I will keep tuning the parameters to improve the prediction results, and I will try to estimate the regression coefficients to examine the importance of the predictor variables. Hopefully, my last blog post in the series will have better accuracy!