Working with Large-Scale Object Detection Datasets in Computer Vision
We live in the age of big data. New datasets are released regularly, and they keep growing in size. While this is a dream come true for practicing researchers, it brings some inherent challenges too.
Datasets differ from one another in several respects. They may
— differ in the storage format of their annotations and images.
— differ in the nature of metadata provided for each image or annotation.
Most datasets come with their own supporting code for parsing them. Unfortunately, this code is often inflexible, which leads to long and tedious rounds of pre-processing and data handling. Let me give a specific example from my own experience to reinforce this point:
I work in the area of pedestrian detection. During experimentation, I often want to select a subset of annotations that satisfy certain criteria (e.g., pedestrians within a certain height and occlusion range). Many times, I also want to mix subsets of annotations from different datasets. No single parser is flexible enough to meet all of these requirements. When I experiment with more general object categories, selecting a very specific subset of annotations can get unwieldy as well.
I faced this problem so many times that I decided to solve it once and for all. I decided to separate annotations from images altogether and to stop maintaining and using a large number of different parsers. This led me to modern database systems for storing and querying the relevant annotations. I could finally handle multiple big datasets without worrying about their individual formats all the time. I would like to share my solution with you.
The database system I finally selected is MongoDB, a NoSQL database system. There is no need to design an elaborate schema before storing anything, and there is no need to homogenize the data. This is especially important because datasets vary widely, and it is practically impossible to come up with a single schema that suits all of them. MongoDB stores collections of documents, and the documents in a collection need not be homogeneous. MongoDB offers fast writes and reads and is easy to use from the Mongo shell as well as from Python (using PyMongo). The approach is as follows:
For each dataset I do the following:
a) I decide what information I need to store for each annotation. This is easy once you properly understand your use case. If you are not sure, you can go ahead and store every piece of information provided in the annotation. Remember, MongoDB handles heterogeneous data very well.
b) I create a collection for the dataset. If you are familiar with traditional databases, a collection is the analogue of a table.
c) I store each annotation as a document in that collection. Documents are the analogues of rows in an RDBMS table.
d) I write the code to parse the data and write it to MongoDB once, and my job is finished.
So, in simple words, I take the pain of understanding the annotation format of a dataset only once and never again. After that, I can focus on the actual research.
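To make the idea of heterogeneous documents concrete, here is a minimal sketch of how annotations from two different datasets might look as Python dictionaries before they are written to MongoDB. All field names (`image`, `bbox`, `height`, `occlusion`, `category`, `is_crowd`) and file names are hypothetical and chosen only for illustration. The point is that the two documents share some fields but not others, and no common schema is required.

```python
# Hypothetical annotation documents; every field name here is an
# illustrative assumption, not a fixed format.

# A pedestrian annotation carrying height and occlusion metadata.
pedestrian_doc = {
    "image": "ped_0001.jpg",
    "bbox": [205, 151, 28, 64],   # x, y, width, height in pixels
    "height": 64,
    "occlusion": 0.2,             # fraction of the box that is occluded
}

# A general-object annotation with a different set of fields.
object_doc = {
    "image": "obj_0001.jpg",
    "bbox": [10, 20, 100, 50],
    "category": "bicycle",
    "is_crowd": False,
}


def shared_and_unique_fields(doc_a, doc_b):
    """Return (fields common to both documents, fields present in only one)."""
    keys_a, keys_b = set(doc_a), set(doc_b)
    return keys_a & keys_b, keys_a ^ keys_b


shared, unique = shared_and_unique_fields(pedestrian_doc, object_doc)
print("shared:", sorted(shared))
print("only in one dataset:", sorted(unique))
```

Both dictionaries could be stored as they are, even in the same collection, because MongoDB does not force documents to have identical fields.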
In the rest of this article I will describe the architecture of the codebase I wrote for this purpose. If you are impatient, you can check out the code right here.
I used Python for this purpose. I started by creating an abstract base class (ABC), which makes it easier for other people to extend the code to their own datasets. The ABC offers a template from which new classes can be created. In my case it is really simple and spans only a small number of lines:
from abc import ABC, abstractmethod
from utils.gen_utils import *  # presumably provides connect2mongo and getdb
from pymongo.errors import BulkWriteError


class BaseClass(ABC):  # class name assumed from the docstring below
    def __init__(self, name):
        """
        Initializer for the BaseClass
        :param name: Name of the dataset
        """
        if not name:
            raise ValueError('The name of the dataset must be provided.')
        self._name = name

    @abstractmethod
    def collect_annotations(self):
        """
        An abstract method. This method should collect all the annotations of
        interest into a list of dictionaries. Each dictionary represents a
        document and the list as a whole represents a collection.
        :return: A list of python dictionaries
        """

    def write2mongo(self, hostname='localhost', port=27017):
        """
        Writes a collection of documents to MongoDB.
        :param hostname: Name of the host where the mongod server is running
        :param port: Port number.
        """
        info = self.collect_annotations()
        client = connect2mongo(hostname, port)
        db = getdb(client, 'CVDatasets')
        collection = db[self._name]
        try:
            # Assumed bulk insert; insert_many raises BulkWriteError on failure.
            collection.insert_many(info)
        except BulkWriteError as err:
            # err.details holds the write errors reported by the server.
            print('There were problems writing the documents:', err.details)
            raise
So, someone who wants to extend the code to a new dataset needs to implement only one function in their class: collect_annotations(). The docstring is quite clear, so I will not bore you with a further elaboration of the codebase. Just one reminder:
The code makes use of some utility functions, which you can see in the repository here. They are also fully documented and easy to understand.
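As an illustration of what extending the base class could look like, here is a minimal sketch for a made-up dataset whose annotations are stored as CSV text. The class names `DatasetBase` and `ToyPedestrians`, the CSV columns, and the document fields are all hypothetical. `DatasetBase` is only a reduced stand-in for the abstract base class, so that the example runs on its own without MongoDB.

```python
import csv
import io
from abc import ABC, abstractmethod


class DatasetBase(ABC):
    """Reduced stand-in for the abstract base class (no MongoDB code)."""

    def __init__(self, name):
        if not name:
            raise ValueError('The name of the dataset must be provided.')
        self._name = name

    @abstractmethod
    def collect_annotations(self):
        """Return a list of dictionaries, one per annotation."""


class ToyPedestrians(DatasetBase):
    """Hypothetical dataset whose annotations are given as CSV text."""

    def __init__(self, name, csv_text):
        super().__init__(name)
        self._csv_text = csv_text

    def collect_annotations(self):
        # Each CSV row becomes one document (a plain Python dictionary).
        reader = csv.DictReader(io.StringIO(self._csv_text))
        docs = []
        for row in reader:
            docs.append({
                "image": row["image"],
                "bbox": [int(row[k]) for k in ("x", "y", "w", "h")],
                "height": int(row["h"]),
                "occlusion": float(row["occlusion"]),
            })
        return docs


CSV_TEXT = """image,x,y,w,h,occlusion
img_0001.jpg,10,20,30,80,0.0
img_0002.jpg,50,60,20,45,0.4
"""

dataset = ToyPedestrians("ToyPedestrians", CSV_TEXT)
docs = dataset.collect_annotations()
print(docs[0])
```

With the real base class, the parsing logic in collect_annotations() would be the only dataset-specific part; writing the resulting list to MongoDB is handled by write2mongo().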
I have not talked about MongoDB in detail here. It is actually quite simple to install. Before you use MongoDB, you need to start the MongoDB server (mongod), which runs continuously in the background. If you run it locally on the same machine, then the hostname for the MongoDB server is localhost. If not, you will need to provide a
Once you have added the data to MongoDB, you can very easily use it in your training or testing script through PyMongo. All you need to provide is a PyMongo query, for which there are plenty of good examples out there.
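For instance, selecting pedestrians within a certain height and occlusion range could be sketched as follows. This assumes that each document has numeric `height` and `occlusion` fields, which are hypothetical names; the helper functions are likewise only for illustration. The filter uses MongoDB's standard comparison operators `$gte` and `$lte`.

```python
def height_occlusion_query(min_height, max_height, max_occlusion):
    """Build a MongoDB filter for boxes in a height range with limited occlusion."""
    return {
        "height": {"$gte": min_height, "$lte": max_height},
        "occlusion": {"$lte": max_occlusion},
    }


def select_annotations(collection, query):
    """Run the filter on a PyMongo collection and return the matches as a list."""
    return list(collection.find(query))


query = height_occlusion_query(50, 80, 0.35)
print(query)

# With a running mongod server this could be used roughly as follows
# (the collection name 'ToyPedestrians' is hypothetical):
#
#   from pymongo import MongoClient
#   client = MongoClient('localhost', 27017)
#   collection = client['CVDatasets']['ToyPedestrians']
#   annotations = select_annotations(collection, query)
```

Mixing subsets from several datasets then amounts to running such a query on each collection and concatenating the resulting lists.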
So, once you do this, you save yourself a mountain of time in handling annotations. You do it once and forget about annotation logistics for good.
Isn’t that a better approach?
NOTE: This is my first post. Please do not hesitate to point out better practices, or to let me know if there is any additional information I should share here.