Regrettable Code, Episode 1
It’s easy to talk about pretty code. Let’s talk about nasty code and see what we can learn from it.
The product I was working on has a concept of an
Account. On a given
Account there may be one-to-many
Teams to which users belong. There was a RESTful API that allowed querying various settings of
Team objects of the form
/api/v1/features/team-21 would return all features of team 21. There was also a form of the endpoint that would return all teams and their respective settings:
In a nutshell, on some accounts with many teams, querying for
team-21 was slow. Querying for
team-* was exceptionally slow — for an account with 500ish teams, it would easily take ten to fifteen seconds to return a response.
My initial gut was: someone probably stuck a database query in a loop or something silly like that, and I’d fix it with a simple change to make a single database call. To confirm this hypothesis, I did a quick profile.
This code was in python, so
cProfile was a quick way to get some timings against a production account. A call to
cProfile.run('features = _inner_fetch_all_features("team-*", account_id)') gave me results. And the results were very, very unexpected.
The first profile returned in 11.935 seconds. The first thing that leapt out was this code was spending an extraordinary amount of time in some class called
ncalls tottime percall cumtime percall filename:lineno(function)...11825 0.049 0.000 0.060 0.000 attrdict.py:26(__getattr__)
384912/432 1.762 0.000 7.012 0.016 attrdict.py:39(deepcopy)
454032/64800 0.650 0.000 6.943 0.000 attrdict.py:49(<genexpr>)
23650 0.023 0.000 0.023 0.000 attrdict.py:58(__setstate__)
…roughly 59% of the entire time spent was in
AttrDict. Whoa. Did not expect. The other item that was fishy involved a library called
46859 0.178 0.000 0.347 0.000 pickle.py:1033(load_dict)
3 0.000 0.000 0.000 0.000 pickle.py:1093(load_global)
3 0.000 0.000 0.000 0.000 pickle.py:1128(find_class)
11825 0.032 0.000 0.088 0.000 pickle.py:1135(load_reduce)
225226 0.497 0.000 0.797 0.000 pickle.py:1156(load_get)
106928 0.180 0.000 0.258 0.000 pickle.py:1170(load_put)
175038 0.393 0.000 0.626 0.000 pickle.py:1199(load_setitem)
11825 0.091 0.000 0.271 0.000 pickle.py:1217(load_build)
58684 0.081 0.000 0.117 0.000 pickle.py:1256(load_mark)
1 0.000 0.000 0.000 0.000 pickle.py:1260(load_stop)
1 0.011 0.011 4.788 4.788 pickle.py:1386(loads)
1 0.000 0.000 0.000 0.000 pickle.py:83(__init__)
1 0.000 0.000 0.000 0.000 pickle.py:835(__init__)
1 1.183 1.183 4.777 4.777 pickle.py:851(load)
58684 0.094 0.000 0.132 0.000 pickle.py:876(marker)
88873 0.230 0.000 0.348 0.000 pickle.py:917(load_int)
2 0.000 0.000 0.000 0.000 pickle.py:967(load_string)
36414 0.116 0.000 0.168 0.000 pickle.py:985(load_unicode)
11825 0.027 0.000 0.056 0.000 pickle.py:999(load_tuple)
The remaining 40% was spent in
pickle , which was some library for serializing/deserializing python objects.
I was a bit stunned at this point; not a single slow spot involved database code, which meant this was purely some algorithmic problem in python. And for some reason, it was slow when there were more teams.
Those Still Waters Run Deep
I decided the first thing I’d go after was
AttrDict.deepcopy , since that was the slow spot. This also turned out to be a poor assumption, because I didn’t fully understand what this code was doing and why, but we’ll get to that shortly.
AttrDict looked like so (I’ve removed fluff):
Dict that allows attribute access
def __getattr__(self, name):
return super(AttrDict, self).__getattribute__(name)
def __setattr__(self, name, value):
self[name] = value
def deepcopy(cls, other):
if isinstance(other, basestring):
elif isinstance(other, collections.Sequence):
assert isinstance(other, list)
return [AttrDict.deepcopy(item) for item in other]
elif isinstance(other, collections.Mapping):
assert isinstance(other, (AttrDict, dict))
for key, value in other.iteritems()
AttrDictwas basically a python
some_attrdict['blah'], you could write
some_attrdict.blah. This code was likely ripped off from stack overflow somewhere, since whoever wrote this wasn’t the first person to desire this functionality.
The second part was
deepcopy, the offending code. At first glance, it took some object called
other and would recursively create a deep copy of that. One interesting side-effect of this was supplied
dict objects would be turned into
The other interesting part was: this code was using
isinstance, and abusively.
isinstance is pricy, because it needs to check an object’s inheritance hierarchy to see if the supplied object — or some child class — is of the supplied type.
There was also some pointless asserting (footnote: our asserts run in production. None of these had fired as far back as I had log history for, so they are worthless), and a nonsensical if/else structure.
So win. I can rewrite this to improve performance and that may help. I changed to:
def deepcopy(cls, other):
other_type = type(other)
if other_type in [dict, AttrDict]:
for key, value in list(other.items())
elif other_type is list:
return [AttrDict.deepcopy(item) for item in other]
…there was no need to check for string types because they followed the same path as the
And rather than use
isinstance each time, use
type once and check for specific types I knew were supported. This change was not without risk: it meant I could no longer properly handle inherited objects like the code previously did. To mitigate this, I confirmed this updated code delivered the same results in production.
After configuring prod to re-run this code so I could get a clean profile, the bulk of
AttrDict now consumed roughly two seconds. Not great, but acceptable. Given this was less than the other offending chunk of code, I decided to move on to
This Must Be The Pickle
First, I found where we were using pickle. There was a chunk of code that would fetch account features, that cached them to avoid regenerating:
cache_key = _get_cache_key(account_id)
account_features = cache.get(cache_key)
account_features = AttrDict(pickle.loads(zlib.decompress(account_features)))
account_features = _split_features(_query_features(account_id))
A basic overview:
- Try to fetch features out of a cache.
- If they exist, decompress with zlib, hydrate them with pickle, and return those.
- If they don’t exist, re-generate them, dehydrate with pickle, compress with zlib, save to the cache and return them.
After reading this code, I had more questions than answers. A few thoughts/ideas that went through my head:
- The whole reason we were using
picklewas definitely because of
AttrDict. At some point, someone decided we needed to cache this stuff. But because we were using
AttrDict, we needed a way to serialize/deserialize python objects.
- The use of zlib was odd. After consulting a coworker, I learned this was because a cache object had a 1 MB limit, and for accounts with many teams, we had exceeded that limit. So zlib compression was added as a quick stop-gap measure.
The second part kinda floored me; what the heck was pickle persisting that would bump into a 1 MB limit? Also, if all we were ultimately storing was dict/list objects, what did it matter if we used
pickle? We could serialize with
I did some quick terminal hacking to see how JSON would compare:
In : json_compressed = zlib.compress(json.dumps(account_features)) In : pickle_compressed = zlib.compress(pickle.dumps(account_features))In : len(pickle_compressed)
Out: 642794 In : len(json_compressed)
…good lord. The compressed payload with
pickle is twelve times the size. But it gets worse:
In : start = time.clock()
...: json_compressed = zlib.compress(json.dumps(account_features))
...: print time.clock() - start
0.035571 In : start = time.clock()
...: pickle_compressed = zlib.compress(pickle.dumps(account_features))
...: print time.clock() - start
…working with a JSON payload was exceptionally faster — roughly 35x for hydrating/dehydrating the payload. Swapping from
json alone would solve this problem.
But one major snag: when taking an
AttrDict , dumping to JSON, compressing, decompressing and then loading with JSON, we would lose
AttrDict entirely — it’d come back as
AttrDict. And updating vast swaths of our codebase that used
AttrDict was out of the question. I’d never make it through QA in a reasonable amount of time, nor would I be able to have confidence in such a massive change.
And then I remembered
deepcopy. I could do this:
account_features = AttrDict.deepcopy(json.loads(zlib.decompress(account_features)))
…when rehydrating, and that would automatically convert
AttrDict. Problem solved. And adding one additional
deepcopy operation was irrelevant, even though that was the previous offending code. We were already doing hundreds of them (see profile above) and this codepath was called only once.
Edit: a redditor pointed out rather than using
deepcopy, it would be preferable to use
json.loads( zlib.decompress(account_features), object_hook=AttrDict)
…this was roughly 5x better than the
deepcopy strategy. Thank you!
Except Not Solved :/
At this point, it was lightning fast for accounts with small team counts, and 2–3 seconds for accounts with 500ish teams. So I created an account with a thousand teams to get a sense of performance.
I was right back at ten seconds for
It was this moment that I put it together; my problem with this endpoint wasn’t bad python code, although that was a contributing factor. It was that I had a big-O problem. If 500 took me 2.5 seconds, and a 1000 took me 10 seconds, that mapped almost perfectly to
O(n^2). The underlying algorithmic problem still existed, and it grew with the addition of teams.
Unfortunately, at this point I had to acknowledge one other frustrating reality of engineering. My improvements were already good enough to get this work through QA. We didn’t have any accounts with more than 500 teams. I asked our QA people to create a new defect for 1000 teams to track the issue moving forward, and closed out the bug ticket.
As much as I wanted to figure it out and fix it, perfect is the enemy of good enough. Adios, amigo.
Git History, A Few Take-Aways, A Few Random Musings
I did, however, dig through the git history to figure out what the heck the deal was with
- It was written five years prior by a junior developer. The intent was definitely just convenience — the ability to do
some_attr_dict['blah']. The bad if/else logic and nonsensical asserting also made more sense given a junior developer. At this time, however, the codebase was small and there probably was little consequence to
- A few days later, a more senior developer added the ability to
picklethis object so we could start caching it. The caching was added to address performance issues. It’s unclear what those performance issues were; there was no clear history on why this was done.
- At some point, when the cache exploded as we gained accounts with more and more teams, some other developer added
zlibcompression to the caching. I can’t fault this decision; it was an emergency fix when stuff started exploding in production.
My takeaways from all of this:
- I think adding a wrapper around basic python types like
listshould be avoided. That decision had serious consequences down the line, for what appears to be nothing more than syntactic sugar.
- It is unfortunate the senior developer didn’t push the junior developer towards simplicity for the sake of simplicity. Although, that brings up:
- It is unfortunate our code review process did not exist at the time
AttrDictwas written. The original code is the sort of stuff I would have asked be rewritten or justified.
- It is unfortunate the performance issues that warranted
pickleweren’t clearly communicated in git history, or some linked defect in an issue tracker. It makes me understand why having clear guidelines around commit messages is so vitally important for large, long-term codebases.
- This is an interesting study in the long-term consequences of a programming choice. It is clear to me this is one of those places where “clever” code had serious consequences and the business would have been better off without it.
- It is clear to me as a senior developer, it is important to mentor junior developers towards the simplest thing that could possibly work.
And some other random, final musings:
- It turns out there are heaps of language SNOOTS out there who have some pretty involved discussions about
isinstance; Bjarne Stroustrup even argued against including such a feature in C++, which is kinda mind-blowing since that language can’t seem to find a bell or whistle that shouldn’t be included.
pepguidelines implicitly push people towards
isinstance, and that is arguably not a great thing.
isinstanceisn’t particularly performant, and it absolutely is not the same as
type. In digging through git history, I found a merge request where a senior developer went through our codebase mindlessly replacing
type, and string handling suck due to python 2 -> python 3 conversion issues. For a while I was banging my head on
deepcopystring handling until I realized I didn’t need to do it at all.