Case study for programmers / Checklist for ninja developers
Let me walk you through a task that was assigned to me recently. I had a list of 250K images that had to be migrated from one storage provider to another.
The steps can be broken down into:
- Request image
- Save locally
- Upload to new server
- Delete the local file
Pretty straightforward. I wrote a first version that uploaded the images sequentially and profiled it.
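A minimal sketch of what such a sequential first version could look like is below. The `download` and `upload` callables, the function names and the `.img` file naming are assumptions made for illustration; the real storage providers' APIs are not part of this write-up.

```python
import os
import tempfile
import time


def migrate_one(image_id, download, upload, workdir):
    """Run the four steps for one image and return the elapsed seconds."""
    started = time.perf_counter()
    local_path = os.path.join(workdir, f"{image_id}.img")
    data = download(image_id)              # 1. request image
    with open(local_path, "wb") as fh:     # 2. save locally
        fh.write(data)
    try:
        upload(image_id, local_path)       # 3. upload to new server
    finally:
        os.remove(local_path)              # 4. delete the local file
    return time.perf_counter() - started


def migrate_all(image_ids, download, upload):
    """Migrate images one after another and return the per-image timings."""
    timings = []
    with tempfile.TemporaryDirectory() as workdir:
        for image_id in image_ids:
            timings.append(migrate_one(image_id, download, upload, workdir))
    return timings
```

Averaging the returned timings over a small sample is one simple way to get a per-image figure before committing to the full run.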
I had a hunch that it WOULD take a lot of time. A single end-to-end image migration took around 1.3 seconds, which means the total time required is 1.3 s x 250K = 325,000 s, or a little over 3.7 days. Wait, what? Obviously, I can’t wait more than 3 days for the work to get done. So,
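The back-of-the-envelope estimate can be reproduced in a couple of lines. The helper name is hypothetical; the numbers are the ones measured above.

```python
def total_time_seconds(seconds_per_image, image_count):
    """Estimate the total run time for a strictly sequential migration."""
    return seconds_per_image * image_count


IMAGE_COUNT = 250_000
sequential = total_time_seconds(1.3, IMAGE_COUNT)

# 86,400 seconds in a day
print(f"{sequential:,.0f} s = {sequential / 86_400:.2f} days")
# prints: 325,000 s = 3.76 days
```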
Improvisation #1: Wrap It In Threads
If the process is long and cannot be shortened, then make it run in parallel. Not so difficult.
Multithreading shortened the wait. Now 5 files were processed per second on average (0.2 seconds per image), which brings the total time down to about 14 hours. But I was still curious whether it could be brought down even further.
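One way to wrap the per-image work in threads is the standard library's `concurrent.futures`. This is a sketch, not the code I actually ran: the `migrate_in_parallel` name, the `migrate_one` callable and the `max_workers=8` default are assumptions, and the right pool size depends on the network and the servers involved.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate_in_parallel(image_ids, migrate_one, max_workers=8):
    """Run migrate_one(image_id) across a pool of worker threads.

    Returns a dict mapping each image id to whatever migrate_one returned.
    """
    image_ids = list(image_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with image_ids.
        results = list(pool.map(migrate_one, image_ids))
    return dict(zip(image_ids, results))
```

Threads fit here because each worker spends most of its time waiting on the network rather than using the CPU.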
Improvisation #2: Understanding Bottlenecks
This is not a CPU-intensive task; it is network-intensive. If I visualise the data flow, the data is first downloaded to my PC from a remote server hosted in my country, and then re-uploaded from my PC to another remote server hosted in the same country. Every packet flows through my ISP to my machine and back again, limited by latency and connection speed. What if the whole thing ran on an independent remote server instead? I spun up an AWS EC2 instance and ran the operation there. Now 50 images were uploaded per second, and the total time slid down to about 85 minutes. Yippee!
Something Is Still Not Right…
The throughput I observed was 50 images per second at the start. It dropped to 40… 30… 15… 5 once enough images had been uploaded. I inspected the code and found that the collection of products was a “list”. Past 30–40K images, this list becomes huge, and every operation that has to scan it, such as checking whether a product has already been handled, gets more expensive as it grows (appending to a Python list stays cheap on its own). Therefore,
Improvisation #3: Understand Data Structures (List vs Map vs Set)
Prefer maps (dictionaries in Python) over lists or arrays when you need fast lookups. Since I didn’t need to store any additional data, I could go with a “set” of completed product ids.
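To make the difference concrete, here is a small comparison of a membership check against a list and against a set of the same ids. The collection size of 40,000 mirrors the point where I saw the slowdown; the variable names are made up for the example, and the absolute timings will vary from machine to machine.

```python
import timeit

ids = list(range(40_000))
as_list = ids
as_set = set(ids)
missing = -1  # worst case for the list: every element gets inspected

# A list check walks the elements one by one (O(n));
# a set check is a hash lookup (O(1) on average).
list_time = timeit.timeit(lambda: missing in as_list, number=200)
set_time = timeit.timeit(lambda: missing in as_set, number=200)

print(f"list: {list_time:.4f} s   set: {set_time:.6f} s")
```

The gap widens as the collection grows, which matches the throughput falling off as more images were completed.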
Improvisation #4: Handling Exceptions
Out of all these images, there may be a few cases where the link is broken, the image is corrupt, or the server throws a 500 at you. These have to be handled gracefully. I don’t want a situation where I am at 99% completion and an ugly exception brings everything down.
Keep track of the images that hit errors and inspect them later.
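A sketch of that idea: catch the failure per image, record it, and move on. The function name and the `migrate_one` callable are hypothetical, and catching the broad `Exception` is a deliberate choice for a one-off batch job rather than a general recommendation.

```python
def migrate_with_error_tracking(image_ids, migrate_one):
    """Migrate every image; collect failures instead of stopping on them."""
    completed = set()
    failed = {}  # image id -> short error description
    for image_id in image_ids:
        try:
            migrate_one(image_id)
        except Exception as exc:  # broad on purpose: one bad image must not stop the run
            failed[image_id] = f"{type(exc).__name__}: {exc}"
        else:
            completed.add(image_id)
    return completed, failed
```

After the run, `failed` is the list to inspect or retry.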
But wait, it is not over yet…
The Uploader class has a method start, which starts uploading the images, and a method dump_file, which dumps the result into a file. After the upload process completes, I have to call dump_file manually to finish the task. Seriously, even I don’t know why I didn’t call it automatically once the uploads were done. The problem it caused: after I called start, the internet connection went for a toss and I couldn’t access my session after reconnecting to the remote instance.
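In hindsight, start could have called dump_file by itself. The sketch below shows one way to do that with `try`/`finally`. Only the names `Uploader`, `start` and `dump_file` come from my code; the constructor arguments, the attributes and the JSON format are assumptions for illustration.

```python
import json


class Uploader:
    def __init__(self, image_ids, migrate_one, result_path):
        self.image_ids = list(image_ids)
        self.migrate_one = migrate_one    # hypothetical per-image callable
        self.result_path = result_path    # hypothetical output file
        self.completed = set()

    def start(self):
        try:
            for image_id in self.image_ids:
                self.migrate_one(image_id)
                self.completed.add(image_id)
        finally:
            # Runs whether the loop finished normally or was cut short
            # by an exception, so the progress so far is always written.
            self.dump_file()

    def dump_file(self):
        with open(self.result_path, "w") as fh:
            json.dump({"completed": sorted(self.completed)}, fh)
```

This does not help if the process itself is killed, so it complements a detachable session rather than replacing it.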
Improvisation #5: Use Detachable Session If Needed (In Remote Connection)
Next time, I start a new virtual session with screen, start the upload, detach and exit.
Return after a short stroll, connect and resume the session.
Improvisation #6: Keep Important Data Always Accessible
Consider this piece of code, which can be executed from the command line:
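The original snippet is not reproduced here, so what follows is a hypothetical reconstruction of a `counter.py` script that keeps its progress in a plain variable and fails partway through. The loop bounds and the failure point are assumptions, and the work is wrapped in a `main()` function so the block can be pasted or imported without crashing, which means its traceback would show an extra `main` frame.

```python
# counter.py (hypothetical reconstruction)

def main():
    count = 0  # lives only while main() is running
    for number in range(10):
        if number == 5:
            raise ValueError('Something went wrong')
        count += 1


# The script version would end with a bare call to main(); it is left
# commented out here so that pasting this block does not raise.
# main()
```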
Invoke the script from terminal
python counter.py
Traceback (most recent call last):
  File "counter.py", line 6, in <module>
    raise ValueError('Something went wrong')
ValueError: Something went wrong
What if I need the residual value of count after the crash?
Now, consider this design
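Again as a hypothetical reconstruction: the names `Counter`, `start_counting` and `count` appear in the session output for this design, while the loop body and the failure point are assumptions.

```python
# counter.py (hypothetical reconstruction)

class Counter:
    def __init__(self):
        self.count = 0  # progress lives on the object, not in a local variable

    def start_counting(self):
        for number in range(10):
            if number == 5:
                raise ValueError('Something went wrong')
            self.count += 1
```

Because the object outlives the failed method call, an interactive session can still read `counter.count` after the exception. With this particular sketch it would hold 5.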
python
>>> from counter import Counter
>>> counter = Counter()
>>> counter.start_counting()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/raman/workspace/counter.py", line 10, in start_counting
    raise ValueError('Something went wrong')
ValueError: Something went wrong
>>> counter.count
This is the reason I had encapsulated everything inside a class.
Exception handling generally takes care of these scenarios, but preparing for the worst case always helps!
Some of you may argue that a class-less script can also be run from the Python console, with its global variables still accessible, but personally, I don’t like that design.
Improvisation #7: Credentials / Secret Keys Should Be Separate
Secrets should be kept secret, right?
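One common way to keep them out of the code is to read them from environment variables at startup. This is a sketch of that approach, not necessarily what I used: the variable names `STORAGE_ACCESS_KEY` and `STORAGE_SECRET_KEY` and the `load_credentials` helper are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Read storage credentials from environment variables.

    The variable names below are placeholders for illustration.
    """
    required = ("STORAGE_ACCESS_KEY", "STORAGE_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early, before any upload starts, and never print the values.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

The script can then be committed or shared freely, while the keys stay on the machine that runs it.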
Well, the task finally came to an end and I was able to wrap it up within a day, but the process taught me a lot. I thought it would be good to document it well so that someone else could make use of it.