Need for speed with mongoDB and Python

xkcd 1701

I came across a very interesting problem when I was trying to learn mongoDB, and it involved removing images that were not part of any album. There were two collections in the database, and here is how the dataset looked:

albums collection:
{
    "_id" : 0,
    "images" : [
        2433,
        2753,
        2983,
        6510,
        11375,
        12974,
        15344,
        16952,
        19722,
        23077,
        24772,
        31401,
        32579,
        ...
    ]
}
images collection:
{
    "_id" : 1,
    "height" : 480,
    "width" : 640,
    "tags" : [
        "cats",
        "sunrises",
        "kittens",
        "travel",
        "vacation",
        "work"
    ]
}

There were a total of 100,000 documents in the images collection and 1,000 documents in the albums collection.

I used the pymongo Python driver to solve the problem, but surprisingly it took a long time to find the orphaned images and delete them. The following piece of Python code tells you whether an image is orphaned:

def is_image_orphan(image_id):
    """Method to check if the image is an orphan."""
    return albums.find({"images": image_id}).count() == 0

The method simply used mongoDB's find command: a count of zero means the image is not part of any album.

I wrote another method to get a set of all image_ids in the images collection:

def get_all_images():
    """Method to return the _id of every image as a set."""
    return {doc["_id"] for doc in images.find({}, {"_id": 1})}

And a method that takes an image_id as input and removes the image with that _id from the images collection:

def remove_image_from_db(image_id):
    """Method to remove the image from the collection."""
    images.delete_one({"_id": image_id})

Now, without any user-defined index, let's try to remove all the orphaned images from the images dataset. Spoiler alert: there are approximately 11,000 orphaned images.

>>> import timeit
>>> start_time = timeit.default_timer()
>>> all_images = get_all_images()
>>> for image in all_images:
...     if is_image_orphan(image):
...         remove_image_from_db(image)
...
>>> elapsed = timeit.default_timer() - start_time
>>> print "elapsed_time: {}".format(elapsed)
elapsed_time: 481.696001053

Whoa! That's horrible. It took a whopping 482 seconds to find the orphaned images and delete them from the collection. That was my first, and worst, solution, and I was definitely not happy with it. It was pretty clear that the query to find each image in the albums collection was taking a long time. I had already heard about the benefits of indexes while learning mongoDB, so I decided to create an index on the images array of the albums collection.

On the mongo shell:

> db.albums.createIndex({“images”:1})

And running the same code:

elapsed_time: 95.6661539078

This time the same code took around 96 seconds, which is better but still far from fast.

Finally, I decided to build a set of all the un-orphaned images and take a set difference between all_images and unorphaned_images. I used aggregation with the $unwind operator, which expands the `images` array into one document per element. Here is the description of the operator from the mongoDB documentation:

Deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.
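To make that concrete, here is a plain-Python sketch of what $unwind emits, using a couple of made-up album documents (no MongoDB connection needed):

```python
def unwind(documents, field):
    """Simulate mongoDB's $unwind: yield one copy of each document
    per element of the named array field, with the array replaced
    by that single element."""
    for doc in documents:
        for element in doc[field]:
            out = dict(doc)
            out[field] = element
            yield out

# Made-up data in the shape of the albums collection.
albums = [{"_id": 0, "images": [2433, 2753]},
          {"_id": 1, "images": [2753, 6510]}]

unwound = list(unwind(albums, "images"))
# First output document: {"_id": 0, "images": 2433}

# The un-orphaned set is then just the "images" value of each document.
unorphaned = {doc["images"] for doc in unwound}
# unorphaned == {2433, 2753, 6510}
```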

The code to get all the un-orphaned images as a set is as follows:

def get_unorphaned_images():
    """Method to return the unorphaned images as a set."""
    cursor = albums.aggregate([{"$unwind": "$images"}])
    return {document["images"] for document in cursor}

The final code to use the new `get_unorphaned_images()` is as follows:

>>> import timeit
>>> start_time = timeit.default_timer()
>>> all_images = get_all_images()
>>> unorphaned_images = get_unorphaned_images()
>>> for image_id in all_images.difference(unorphaned_images):
...     remove_image_from_db(image_id)
...
>>> elapsed = timeit.default_timer() - start_time
>>> print "elapsed_time: {}".format(elapsed)

Running the code without any index, the elapsed time was:

elapsed_time: 13.6858859062

Creating the index again and running the same code, I got a slightly better result:

elapsed_time: 9.25210785866

So finally, from 482 seconds, I was able to solve the problem in around 9 seconds, all with the help of a better read strategy from Python and the use of indexes. Now I will be able to sleep in peace!
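If you want one more squeeze, a natural next step (not measured here, so treat it as a sketch) is to send the whole orphan set to the server in a single delete_many call with an $in filter, instead of one delete_one per image. On toy data, with hypothetical ids, the idea looks like this:

```python
# Stand-ins for the real sets computed above.
all_images = {1, 2, 3, 4, 5}
unorphaned_images = {1, 3, 5}

orphans = all_images.difference(unorphaned_images)

# With a real pymongo collection, this replaces the whole loop
# with a single server round trip:
# images.delete_many({"_id": {"$in": list(orphans)}})

print(sorted(orphans))  # [2, 4]
```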

Thank you for reading. If you enjoyed it, do press the heart and spread some love. If you think there is any way to make it even faster, hit the comments section.