An End-to-End Approach Leveraging Computer Vision and NLP to Enable Better Pet Adoption Matching

Guanhua Shu
Institute for Applied Computational Science
Aug 29, 2023 · 7 min read

This article was produced as part of the final project for Harvard’s AC215 Fall 2021.

Authors: Benjamin Liu, Ivan Shu, Shang Gao, Xiang Bai, Yuxin Xu

GitHub Repository: Link here

Background

The premise of our project was to design an app that harnesses both computer vision and natural language processing to improve the capabilities of pet adoption websites. Austin Pets Alive (APA) is an animal welfare nonprofit dedicated to pet adoption and fostering, with a database of available pet photos. Our aim, therefore, was to build a reusable, scalable application, design, and framework that any animal welfare nonprofit could implement to connect future pet owners with pets.

Specifically, our team focused on creating a comprehensive tool to match potential dog-loving adopters to dogs available for adoption. The core business problem was to help future dog owners find a dog that would be a good fit for their lifestyle and family environment. Our solution, first, lets the user search for dogs using filters for size, color, and breed, with the option to accurately match dog profiles to a user-uploaded image. Second, it connects a selected dog with the user by letting the user chat with a persona of the dog. The user can ask this virtual dog any question about itself, such as its breed characteristics, or general questions about puppies and dogs.

Regarding the Data and Data Logistics

All the data is stored on Google Cloud Platform (GCP), and models are trained in Colab. We plan to containerize both the front-end and back-end applications using Docker and to deploy the app with Kubernetes. To accelerate the training process, we resized the images, created TFRecords, generated image embeddings, trained the language models, and stored them in our GCP bucket.
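
To illustrate the TFRecord step, here is a minimal sketch of serializing resized images into a TFRecord file. The file names, labels, and 224-pixel target size are assumptions for the example, not values from our pipeline.

```python
import tensorflow as tf

IMG_SIZE = 224  # illustrative target size, not our actual setting

def make_example(image_path, label):
    # Read, decode, and resize the image, then re-encode it as JPEG bytes.
    img = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    img = tf.image.resize(img, [IMG_SIZE, IMG_SIZE])
    img_bytes = tf.io.encode_jpeg(tf.cast(img, tf.uint8)).numpy()
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.encode()])),
    }))

# Write all examples into one TFRecord file (hypothetical paths and labels).
with tf.io.TFRecordWriter("dogs.tfrecord") as writer:
    for path, label in [("dog1.jpg", "labrador"), ("dog2.jpg", "beagle")]:
        writer.write(make_example(path, label).SerializeToString())
```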

APA makes available a repository of its animals: roughly 17k dog records, and close to 40k records if you include cats, with about 140k photos across those ~40k pets. Our dataset consists of several CSV files [Dataset Link], including dog metadata, dog images, website memos, and dog question-and-answer data.

Figure 1: Sample photos from Austin Pets Alive repository.

App Component I: Computer Vision

To support the matching process, we leveraged several computer vision models for a few specific tasks. When an adoption center takes in new dogs from users, the app provides functions for pre-processing the uploaded images:

  • Remove the noisy background from the uploaded dog picture using DeepLabv3+.
  • Allow users to choose and add new backgrounds or effects.
  • Enhance the image if the resolution of the uploaded picture is not ideal.

Figure 2: Image segmentation using DeepLabv3+ with background replacement

The leftmost profile displays the original image, and the second shows the result after background removal by DeepLabv3+. The third and fourth images show different backgrounds that users or adoption sites can choose when donating or adopting new dogs.
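
To illustrate the masking step behind Figure 2, here is a minimal sketch using torchvision's pretrained DeepLabv3 (torchvision ships the plain DeepLabv3 rather than the "+" variant, but the compositing logic is the same). The helper name replace_background is hypothetical, and we rely on Pascal VOC class index 12 being "dog".

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained DeepLabv3 segmentation model (21 Pascal VOC classes).
model = models.segmentation.deeplabv3_resnet50(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def replace_background(img: Image.Image, bg: Image.Image) -> Image.Image:
    img = img.convert("RGB")
    # Predict a per-pixel class map, then keep only "dog" pixels (class 12).
    with torch.no_grad():
        out = model(preprocess(img).unsqueeze(0))["out"][0]  # (21, H, W)
    mask = out.argmax(0).numpy() == 12
    bg = bg.resize(img.size)
    composite = np.where(mask[..., None], np.array(img), np.array(bg))
    return Image.fromarray(composite.astype(np.uint8))
```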

On the other hand, to let users search for the dogs they are interested in, a few more tasks need to be handled:

  • Create embeddings using EfficientNet for all the dog images in the dataset.
  • Search for the top 8 similar images using Facebook AI Similarity Search (FAISS).
  • Segment and remove the background before the embedding search using DeepLabv3+.

One way to find images similar to an input is to create image embeddings and search for nearest neighbors in that space. To generate embeddings, we used the EfficientNet-B0 model [1], which has been shown to achieve strong performance with fewer parameters. We then used FAISS, an embedding-search library developed by Facebook, to find the top 5 similar images [2]. See Figure 3 for results.

Figure 3. Example Matched Images Using EfficientNet and FAISS Embedding Search. The top image shows the input photo, and the second row shows the top 5 matched images found by FAISS.
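
A minimal sketch of the embedding-and-search step is below. It assumes dataset_images and query_image are preloaded uint8 arrays at 224x224; the exact preprocessing in our pipeline may differ.

```python
import faiss
import numpy as np
import tensorflow as tf

# EfficientNet-B0 without the classifier head; global average pooling
# turns each image into a single embedding vector (1280-d for B0).
embedder = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg")

def embed(images: np.ndarray) -> np.ndarray:
    # images: (N, 224, 224, 3) uint8 batch
    x = tf.keras.applications.efficientnet.preprocess_input(images)
    return embedder.predict(x)

# Build an exact L2 index over the dataset embeddings, then query it.
db_vecs = embed(dataset_images).astype(np.float32)   # assumed preloaded
index = faiss.IndexFlatL2(db_vecs.shape[1])
index.add(db_vecs)
dists, ids = index.search(embed(query_image[None]).astype(np.float32), 5)
```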

However, one problem we encountered is that when the input image contains noise or features unrelated to dogs, the matched images inherit those irrelevant features, as evidenced in Figure 4.

Note: the main steps of model training followed Rashmi's notebook from the 2021 ComputeFest.

Figure 4. Example Input Images That Contain Dog-Irrelevant Features

Here we can see that when the input image has a large dog-irrelevant feature (a tree), the matched images contain dog-irrelevant features as well, such as arms, chairs, or humans. The queried pictures therefore carry less information about the dog, which led to poor results. To bypass this constraint, we again used the trained DeepLabv3+ model to pre-segment the image before the embedding search [3].

Figure 5. Pre-segmenting Input Images Before the Embedding Search

After the image was segmented and the background removed, the matched images contained more dog-related features and information, such as age or fur color. One caveat, however, is that the matched images may contain multiple dogs because of the segmentation.
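
Putting the pieces together, the pre-segmentation search can be expressed as a small wrapper around the two sketches above. replace_background, embed, and index are the hypothetical helpers defined earlier, and the plain white fill-in background is an assumption.

```python
import numpy as np
from PIL import Image

def search_with_segmentation(img: Image.Image, k: int = 5):
    # Strip the background first so only dog pixels drive the embedding.
    white = Image.new("RGB", img.size, (255, 255, 255))
    dog_only = replace_background(img, white)
    # Resize to the embedding model's expected input and query the index.
    resized = np.array(dog_only.resize((224, 224)), dtype=np.uint8)
    query = embed(resized[None]).astype(np.float32)
    return index.search(query, k)
```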

App Component II: Natural Language Processing

The next task was to create a persona for the dog that the user has searched for and selected, letting users ask for more information about the dog by chatting directly with the persona. As a baseline model, we used a BERT question-answering model trained on the Stanford Question Answering Dataset (SQuAD) [4]. We provided sample questions and reference text to the model and let it predict the start and end tokens of a span of text, which served as the output answer.

Note: the main steps followed Shiva's notebook from the 2021 Harvard IACS ComputeFest [5] and a tutorial by Chris McCormick [6].

Figure 6. Probability scores for start and end tokens predicted by BERT with an example question

In one example, we asked the question "what's Emma's breed" and fed it to the model together with a reference text. The model correctly predicted the answer span in the reference, "retriever, yellow labrador / mix". To visualize the probabilities of each start and end token, we plotted the results in Figure 6. "retriever" and "mix" received the highest probabilities for the start and end tokens, respectively.
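
A minimal sketch of this span-prediction step with Hugging Face transformers is below, using the SQuAD-fine-tuned checkpoint from McCormick's tutorial [6]; the question and reference text are illustrative stand-ins for our dataset.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# BERT fine-tuned on SQuAD for extractive question answering.
name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "What's Emma's breed?"  # illustrative inputs
reference = "Emma is a retriever, yellow labrador / mix who loves to play."

inputs = tokenizer(question, reference, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# The answer is the span between the highest-scoring start and end tokens.
start = out.start_logits.argmax()
end = out.end_logits.argmax()
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)  # e.g. the breed span extracted from the reference text
```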

However, one major drawback of BERT is that it can only answer questions by selecting a span from the reference text, which prevents the natural, human-to-human style of conversation we wanted between users and the app. To circumvent this problem, we decided to try a generative language model, GPT-2 in this case, for the question-answering task. Despite being able to generate new text, vanilla GPT-2 does not fully meet the requirements of question answering. We therefore used GPT-2 with double heads, where the additional head chooses the best answer to the input question. We fine-tuned the model in two steps to achieve this.

First, we trained it on the Persona-Chat dataset from Facebook [8] to build a baseline conversational ability into the model. Then we fine-tuned it a second time to tailor the output to our dog dataset. To make sure the model output is not static and more closely resembles human-to-human conversation, we ran the second fine-tuning with several versions of data generation. Some of the model's behavior can be seen in the back-and-forth dialogue in Figure 7.
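
A minimal sketch of the double-heads setup with Hugging Face transformers is below. The candidate replies are illustrative, and both fine-tuning passes (Persona-Chat, then our dog data) are omitted here.

```python
import torch
from transformers import GPT2DoubleHeadsModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# A classification token marks where the multiple-choice head reads from;
# GPT-2 has no pad token by default, so reuse EOS for batching.
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))

# Two candidate replies to the same question; the MC head scores each one.
choices = [
    "What's your breed? I am a yellow labrador mix. [CLS]",
    "What's your breed? I like tennis balls. [CLS]",
]
enc = tokenizer(choices, return_tensors="pt", padding=True)
input_ids = enc["input_ids"].unsqueeze(0)  # (1, n_choices, seq_len)
# Position of [CLS] in each candidate, where the MC head is evaluated.
mc_token_ids = (input_ids == tokenizer.cls_token_id).int().argmax(-1)

with torch.no_grad():
    out = model(input_ids, mc_token_ids=mc_token_ids)
best = out.mc_logits.argmax(-1)  # index of the preferred candidate reply
```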

Figure 7. Example output from the fine-tuned GPT-2 Double Heads model for the dog "Larry"

Other Notable App Features

In addition to the AI modeling in our application, we included a "rehome" option in our web app that gives users the opportunity to put stray dogs up for adoption by submitting basic information and images.

Figure 8: Rehome option page for filling in pet information

Future Work

As a next step, we plan to add more components to our front-end design. For example, we want a feature that lets users change the background of uploaded images when rehoming stray dogs. We also want to create a database using PostgreSQL or MongoDB to help developers store and query data more conveniently.

Acknowledgment

Much of the model training in this project was based on materials from the 2021 Harvard IACS ComputeFest [7], which we used for learning purposes. We want to thank Shivas and Rashmi for their example code on the NLP and CV tasks. In particular, we thank our project Teaching Fellow/Tech Lead, Shivas, for his time, help, and guidance throughout this project.

References

1. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Mingxing Tan et al.

2. FAISS, Facebook AI Similarity Search.

3. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, Liang-Chieh Chen et al.

4. SQuAD, Stanford Question Answering Dataset.

5. 2021 Harvard IACS ComputeFest Computer Vision Task Notebook.

6. Question Answering with a Fine-Tuned BERT, Chris McCormick.

7. 2021 Harvard IACS ComputeFest GitHub Repository.

8. Personalizing Dialogue Agents: I have a dog, do you have pets too?, Saizheng Zhang et al.
