Spotted! An app to help lost dogs get home.
by Olga Leushina and Alex Coward
This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.
Project GitHub Repo — https://github.com/ol500/AC215_spotted
Video — https://youtu.be/zoyyRo06kFg
Table of Contents
- Introduction
- Machine Learning Components
- Frontend
- API
- Deployment
- Next steps
- Acknowledgements
- References
Introduction
The goal of our project was to create an app that could use computer vision to help pet owners track down their pet if it goes missing. The idea came from seeing a lost pet flier on a lamp-post in the park and realizing that technology could provide a better solution. We decided to focus on dogs as a first task, though the app could be expanded to cover cats and other pets in the future.
The app has two primary intended users: dog owners who have lost their dog, and people who find or spot a lost dog and want to inform the owner of its whereabouts. Pet owners upload images of their lost dog, and pet finders upload images of the dog they have found. If there is a match with previously uploaded pictures, the app provides an image of the matching dog along with contact info for the person who uploaded it, so the user can get in touch.
Machine Learning Components
We explored two potential paths for matching lost and found dogs from images.
The first was to use a machine learning classifier to predict the breed of the dog in each lost and found image and check whether the two are classified as the same breed. This result could then be combined with geo-location data to determine whether there is a high likelihood of a match.
Our second path was to use vector embedding representations of the images and the similarity between those vectors to determine the degree of similarity between the images and, by extension, the dogs. After experimenting with both approaches, we opted for image vector embeddings.
Vector Embedding Overview
A vector embedding is the vector representation of a given input produced by a particular machine learning model. For example, an image classifier is trained to output the probability that an input image belongs to each of the predetermined classes that form the model’s final output. Along the way, the model produces a vector representation of the image, and this vector embedding is said to have semantic meaning: a particular vector represents a particular image, and two similar images, which have a high likelihood of being assigned to the same class by the classifier, should have similar vector embeddings. The size of an image vector embedding depends on the architecture of the model that produces it, and the representation will be more or less rich depending on how well the model captures the specific features of the input image.
We decided to use the Vision Transformer model introduced in the paper “An Image is Worth 16x16 Words” (https://arxiv.org/abs/2010.11929) to generate our vector embeddings. Our plan was to start from a pre-trained model and fine-tune it to give it more exposure to dog images before using it to generate embeddings.
We opted for pre-trained Google Vision Transformer (ViT) models available via Hugging Face that were pre-trained on ImageNet-21k (14 million images, 21,843 classes). We experimented with three versions: a base model (google/vit-base-patch16-224-in21k), a large model (google/vit-large-patch16-224-in21k) and a huge model (google/vit-huge-patch14-224-in21k). The huge model provided the best results from the pre-trained checkpoint but was too large for us to fine-tune, though we did experiment with fine-tuning the large and base models to see if we could get similar results from a smaller model. Unfortunately, the embeddings generated from the fine-tuned models did not perform as well as those from the pre-trained huge model, so we opted to use the huge pre-trained model. We used it to generate embeddings for the Austin Pets Alive dataset of ~66k images so we could experiment with vector embeddings in preparation for building the feature into our app. Once embeddings are generated, we need a way to store them and to search for similarity among vectors.
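As an illustration of that step, the sketch below pulls an embedding from the huge checkpoint with the transformers library; the `embed_image` helper and the pooling choice (the [CLS] token of the last hidden state) are illustrative rather than a description of our exact pipeline.

```python
# Sketch: generating an image embedding from a pre-trained ViT checkpoint.
# The pooling choice ([CLS] token of the last hidden state) is illustrative.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

MODEL_NAME = "google/vit-huge-patch14-224-in21k"
processor = ViTImageProcessor.from_pretrained(MODEL_NAME)
model = ViTModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_image(path: str) -> torch.Tensor:
    """Return a single embedding vector for the image at `path`."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token embedding

embedding = embed_image("lost_dog.jpg")  # hypothetical image path
print(embedding.shape)                   # torch.Size([1280]) for the huge model
```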
FAISS for vector embedding similarity search
To store and search our vector embeddings, we opted for the FAISS (Facebook AI Similarity Search) Python library. FAISS provides tools for saving vector embeddings in an index that allows fast similarity search among vectors. There are various FAISS index types, based on different search methods and different ways of compressing vectors to save space. Given that both the embeddings we generated for experimentation and the embeddings the app will store during its use are relatively small in number, we opted for a flat index that stores the vectors without compression. We chose the IndexFlatIP index, which computes the inner product between a query vector and the vectors in the index and returns the k largest inner products as matches. We normalized our vectors before storing them in the index and before querying, so that the inner product is the cosine similarity, which gives an intuitive similarity value between 0 and 1.
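The pattern is straightforward; the sketch below shows it with randomly generated stand-in vectors and assumes the 1280-dimensional embeddings produced by the ViT-huge model.

```python
# Sketch: a flat inner-product index over L2-normalized vectors, so that the
# inner product equals cosine similarity. The arrays are stand-ins for real embeddings.
import faiss
import numpy as np

dim = 1280                                             # embedding size of the ViT-huge model
index = faiss.IndexFlatIP(dim)                         # flat index, no compression

stored = np.random.rand(1000, dim).astype("float32")   # previously uploaded dog embeddings
faiss.normalize_L2(stored)                             # in-place L2 normalization
index.add(stored)

query = np.random.rand(1, dim).astype("float32")       # embedding of a new upload
faiss.normalize_L2(query)
similarities, ids = index.search(query, 5)             # top-5 cosine similarities and positions
```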
Object Detection Layer
In our experiments, we discovered a problem with this approach: the vector embeddings we generate are representations of the image we pass to the model, not of the dog in that image. One experiment produced the following result for the 10 closest matches to the dog on the left:
The presence of a pumpkin in each image made the images similar, even though the dogs themselves did not match closely at all. This led us to the idea of using an object detection model as a pre-processing layer prior to generating embeddings. We opted for the DEtection TRansformer (DETR) model, an encoder-decoder transformer with a convolutional neural network backbone; specifically, we used the pre-trained facebook/detr-resnet-50 model available from Hugging Face. We then wrote code that crops each image at the bounding box with the highest probability of containing a dog.
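The snippet below is a minimal sketch of that pre-processing step using the pre-trained DETR checkpoint; the `crop_to_dog` helper, its fallback to the full image, and the 0.9 confidence threshold (discussed further in the API section) are illustrative, and the exact code in the repo may differ.

```python
# Sketch: detect dogs with DETR and crop to the highest-confidence "dog" box.
# The helper name and fallback behaviour are illustrative.
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

detr_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr_model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
detr_model.eval()

def crop_to_dog(image: Image.Image, threshold: float = 0.9) -> Image.Image:
    """Crop to the most confident dog detection, or return the original image."""
    inputs = detr_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detr_model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = detr_processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    best_box, best_score = None, 0.0
    for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
        if detr_model.config.id2label[label.item()] == "dog" and score.item() > best_score:
            best_box, best_score = box.tolist(), score.item()

    if best_box is None:
        return image  # no confident dog detection; keep the full image
    left, top, right, bottom = (int(round(v)) for v in best_box)
    return image.crop((left, top, right, bottom))
```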
Frontend
We built a React frontend where users can upload an image of a dog, select whether they are uploading a lost or found dog, and provide their contact info so that someone who later uploads a matching dog can reach out to them.
In the MVP developed for the class, users are limited to uploading one dog image at a time, and they only receive a response with the contact info for previously posted images that closely match the image they are uploading at that moment. They can repeat the process to upload several pictures of the same dog, but there is no batch upload option. In the future, the app could support user accounts with the ability to upload and manage multiple images and contact info, along with an in-app messaging service to make it easier to reunite lost pets with their owners.
If there are any matches (we explore the matching criteria further in the API section), the user gets the following info for up to the top 5 matches:
- The image of the potential matching dog
- The similarity, as determined by cosine similarity of vector embeddings, to the dog image they have uploaded
- The name of the person who uploaded the matching dog image
- The email address for the person who uploaded the matching dog image
- The phone number for the person who uploaded the matching dog image
API
We implemented a FastAPI server, run with Uvicorn, to handle the interaction between the frontend and the backend. The API receives the user’s input and then has two main tasks: returning data for potential matches and saving the uploaded data for use in future searches.
API Image Matching Task
Given the large size of the ViT model (2.5GB), loading it every time a user checks for a match would be prohibitively slow. Therefore, we load the ViT model used to generate embeddings and the DETR object detection model when the API server starts up.
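A minimal sketch of that startup pattern is shown below, assuming a FastAPI lifespan handler and a module-level dictionary to hold the models; an equivalent startup event handler would work just as well.

```python
# Sketch: load the heavy models once at server startup so every request reuses
# them. The lifespan handler and the `models` dict are illustrative choices.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import (
    DetrForObjectDetection,
    DetrImageProcessor,
    ViTImageProcessor,
    ViTModel,
)

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Loaded once here; request handlers read from `models` instead of reloading.
    models["vit_processor"] = ViTImageProcessor.from_pretrained("google/vit-huge-patch14-224-in21k")
    models["vit"] = ViTModel.from_pretrained("google/vit-huge-patch14-224-in21k")
    models["detr_processor"] = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    models["detr"] = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
    yield
    models.clear()  # release references on shutdown

app = FastAPI(lifespan=lifespan)
```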
When a user uploads an image, it is passed to the object detection model, which determines bounding boxes for the class “dog”. If any of the “dog” bounding boxes has a probability of at least 0.9 of containing a dog, we crop the image at the highest-probability dog bounding box, and the model then returns both the original image and the cropped image. If no bounding box meets the threshold, only the original, un-cropped image is returned.
The ViT model then generates embeddings for the image, which are passed on to the FAISS part of our code. We create one FAISS index of lost dogs and another of found dogs and perform the search against the appropriate index depending on whether the dog was listed as lost or found. We normalize the vector so that the inner product search is a cosine similarity search, then query for the closest 5 matches (k=5 nearest neighbors), considering any vector in the index with a cosine similarity of at least 0.7 to be a match for the uploaded image.
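As a rough sketch, the query and threshold filter might look like the following; the function name, the `TOP_K`/`MATCH_THRESHOLD` constants, and the returned fields are illustrative stand-ins rather than our exact implementation.

```python
# Sketch: querying a FAISS index for the 5 nearest neighbours and keeping only
# those above the 0.7 cosine-similarity cutoff. Names here are illustrative.
import faiss
import numpy as np

TOP_K = 5
MATCH_THRESHOLD = 0.7

def find_matches(query_embedding: np.ndarray, index: faiss.Index) -> list[dict]:
    """Return up to TOP_K entries whose cosine similarity is at least MATCH_THRESHOLD."""
    query = query_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)                      # normalize so inner product == cosine similarity
    similarities, ids = index.search(query, TOP_K)
    return [
        {"index_id": int(i), "similarity": float(s)}
        for s, i in zip(similarities[0], ids[0])
        if i != -1 and s >= MATCH_THRESHOLD        # -1 means fewer than TOP_K vectors are stored
    ]
```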
At this point, the matching dog information is returned to the end user so they can get in touch and try to reunite the lost dog with its owner. The data they receive comes from prior users’ uploads of images and contact info, which the app has saved to disk in the appropriate CSV file, one for lost dog uploads and one for found dog uploads.
API Data Saving Task
So as not to slow down the return of information to the frontend, we first return the data for any matches, or the response “No Matching Dogs Found”, and then save the current vector embedding, image, and contact info using FastAPI’s BackgroundTasks, which lets tasks run in the background so they don’t delay the API’s response to the frontend.
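A minimal sketch of that pattern is shown below; the endpoint path, form fields, and `save_upload` helper are hypothetical stand-ins, and the embedding and search steps are elided since they were sketched earlier.

```python
# Sketch: respond to the frontend first, persist the upload afterwards.
# The endpoint path, form fields and save_upload helper are hypothetical.
from fastapi import BackgroundTasks, FastAPI, File, Form, UploadFile

app = FastAPI()

def save_upload(image_bytes: bytes, name: str, email: str, phone: str) -> None:
    # Persist the image, its embedding and the contact info, e.g. append a row
    # to the lost- or found-dog CSV and add the vector to the matching FAISS index.
    ...

@app.post("/upload")
async def upload_dog(
    background_tasks: BackgroundTasks,
    image: UploadFile = File(...),
    name: str = Form(...),
    email: str = Form(...),
    phone: str = Form(...),
):
    image_bytes = await image.read()
    matches: list[dict] = []  # placeholder for the embedding + FAISS search sketched above
    # Schedule persistence to run after the response has been sent.
    background_tasks.add_task(save_upload, image_bytes, name, email, phone)
    return {"matches": matches} if matches else {"message": "No Matching Dogs Found"}
```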
Deployment
For deployment, we used Ansible playbooks to automate deployment of the app both to a single Google Cloud Platform (GCP) VM and to a Kubernetes cluster within GCP. While a single VM is sufficient for light use of the app, the Kubernetes cluster ensures the app can handle heavier loads.
Our application is fully containerized and the containers can be run:
- Locally in development mode;
- On a VM that’s manually started in GCE (other cloud providers can be used if credentials and params are changed);
- On a VM that’s started using Ansible playbooks from the deployment container;
- On a K8s cluster that’s started using Ansible playbooks from the deployment container.
We have also implemented a CI/CD workflow using GitHub Actions: if a push contains a deployment trigger string, the workflow re-uploads the images to Google Cloud Container Registry and deploys them to the K8s cluster.
Our CI/CD pipeline works in a similar way to support the data export-transformation-preprocessing-processing flow.
Next Steps
Our app can be used as-is to help dog owners and concerned spotters reunite lost dogs with their owners. With that said, in working on the project we have come across several areas for improvement that would help make the app more reliable and impactful.
- Ability for users to have accounts where they can manage a library of images for each of their dogs and bulk upload images of a dog they have spotted, to aid in matching
- A messaging service so that both the dog owner and the person who has seen the dog are notified when there is a possible match, with a way for them to get in touch through the app
- Datetime information so the app only attempts to match dogs whose found date is after the lost date, to increase accuracy
- Geolocation data to weight the probability of a match based on how likely it is that a dog lost in location A could be found in location B
- Model improvements for vector embedding generation to make the similarity search more accurate
- Use semantic segmentation instead of object detection to eliminate background items that can throw off the vector embedding generation
- Expand the app to other pets, especially cats, which our research shows are significantly more likely than dogs to never be returned to their owner once lost (46% vs 3%)
Acknowledgements
We are grateful to Jonathan Sessa and Sunil Chomal who contributed to both the code and to the ideas behind the project. Their contributions are greatly valued.
We want to thank Rashmi Banthia, our advisor for the project, for her support and advice during the course.
We also want to thank Shivas Jayaram for providing us with the Austin Pets Alive dataset that was extremely helpful for us during our experiments for the project.
References
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., … & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929v2. Retrieved from https://arxiv.org/abs/2010.11929v2
- Weiss, E., Slater, M., & Lord, L. (2012). Frequency of Lost Dogs and Cats in the United States and the Methods Used to Locate Them. Animals : an open access journal from MDPI, 2(2), 301–315. https://doi.org/10.3390/ani2020301