Regrettable Code, Episode 1

It’s easy to talk about pretty code. Let’s talk about nasty code and see what we can learn from it.


The product I was working on has a concept of an . On a given there may be one-to-many to which users belong. There was a RESTful API that allowed querying various settings of and objects of the form


For example, would return all features of team 21. There was also a form of the endpoint that would return all teams and their respective settings:

The Problem

In a nutshell, on some accounts with many teams, querying for was slow. Querying for was exceptionally slow — for an account with 500ish teams, it would easily take ten to fifteen seconds to return a response.

Let’s Profile!

My initial gut was: someone probably stuck a database query in a loop or something silly like that, and I’d fix it with a simple change to make a single database call. To confirm this hypothesis, I did a quick profile.

This code was in python, so was a quick way to get some timings against a production account. A call to gave me results. And the results were very, very unexpected.

The first profile returned in 11.935 seconds. The first thing that leapt out was this code was spending an extraordinary amount of time in some class called :

ncalls  tottime  percall  cumtime  percall filename:lineno(function)...11825    0.049    0.000    0.060    0.000
384912/432 1.762 0.000 7.012 0.016
454032/64800 0.650 0.000 6.943 0.000<genexpr>)
23650 0.023 0.000 0.023 0.000

roughly 59% of the entire time spent was in . Whoa. Did not expect. The other item that was fishy involved a library called :

46859    0.178    0.000    0.347    0.000
3 0.000 0.000 0.000 0.000
3 0.000 0.000 0.000 0.000
11825 0.032 0.000 0.088 0.000
225226 0.497 0.000 0.797 0.000
106928 0.180 0.000 0.258 0.000
175038 0.393 0.000 0.626 0.000
11825 0.091 0.000 0.271 0.000
58684 0.081 0.000 0.117 0.000
1 0.000 0.000 0.000 0.000
1 0.011 0.011 4.788 4.788
1 0.000 0.000 0.000 0.000
1 0.000 0.000 0.000 0.000
1 1.183 1.183 4.777 4.777
58684 0.094 0.000 0.132 0.000
88873 0.230 0.000 0.348 0.000
2 0.000 0.000 0.000 0.000
36414 0.116 0.000 0.168 0.000
11825 0.027 0.000 0.056 0.000

The remaining 40% was spent in , which was some library for serializing/deserializing python objects.

I was a bit stunned at this point; not a single slow spot involved database code, which meant this was purely some algorithmic problem in python. And for some reason, it was slow when there were more teams.

Those Still Waters Run Deep

I decided the first thing I’d go after was , since that was the slow spot. This also turned out to be a poor assumption, because I didn’t fully understand what this code was doing and why, but we’ll get to that shortly.

looked like so (I’ve removed fluff):

class AttrDict(dict):
Dict that allows attribute access
def __getattr__(self, name):
return self[name]
except KeyError:
return super(AttrDict, self).__getattribute__(name)

def __setattr__(self, name, value):
self[name] = value

def deepcopy(cls, other):
if isinstance(other, basestring):
return other
elif isinstance(other, collections.Sequence):
assert isinstance(other, list)
return [AttrDict.deepcopy(item) for item in other]
elif isinstance(other, collections.Mapping):
assert isinstance(other, (AttrDict, dict))
return AttrDict(
(key, AttrDict.deepcopy(value))
for key, value in other.iteritems()
return other

So was basically a python with some flair. One piece of flair was the ability to access the dictionary members with a dot syntax similar to javascript. So instead of , you could write . This code was likely ripped off from stack overflow somewhere, since whoever wrote this wasn’t the first person to desire this functionality.

The second part was , the offending code. At first glance, it took some object called and would recursively create a deep copy of that. One interesting side-effect of this was supplied objects would be turned into .

The other interesting part was: this code was using , and abusively. is pricy, because it needs to check an object’s inheritance hierarchy to see if the supplied object — or some child class — is of the supplied type.

There was also some pointless asserting (footnote: our asserts run in production. None of these had fired as far back as I had log history for, so they are worthless), and a nonsensical if/else structure.

So win. I can rewrite this to improve performance and that may help. I changed to:

def deepcopy(cls, other):
other_type = type(other)
if other_type in [dict, AttrDict]:
return AttrDict(
(key, AttrDict.deepcopy(value))
for key, value in list(other.items())
elif other_type is list:
return [AttrDict.deepcopy(item) for item in other]
return other

…there was no need to check for string types because they followed the same path as the path: .

And rather than use each time, use once and check for specific types I knew were supported. This change was not without risk: it meant I could no longer properly handle inherited objects like the code previously did. To mitigate this, I confirmed this updated code delivered the same results in production.

After configuring prod to re-run this code so I could get a clean profile, the bulk of now consumed roughly two seconds. Not great, but acceptable. Given this was less than the other offending chunk of code, I decided to move on to .

This Must Be The Pickle

First, I found where we were using pickle. There was a chunk of code that would fetch account features, that cached them to avoid regenerating:

def _all_features(account_id):
cache_key = _get_cache_key(account_id)
account_features = cache.get(cache_key)
if account_features:
account_features = AttrDict(pickle.loads(zlib.decompress(account_features)))
except (...):
return account_features

account_features = _split_features(_query_features(account_id))
cache.set(cache_key, zlib.compress(pickle.dumps(account_features)))
return account_features

A basic overview:

  1. Try to fetch features out of a cache.
  2. If they exist, decompress with zlib, hydrate them with pickle, and return those.
  3. If they don’t exist, re-generate them, dehydrate with pickle, compress with zlib, save to the cache and return them.

After reading this code, I had more questions than answers. A few thoughts/ideas that went through my head:

  1. The whole reason we were using was definitely because of . At some point, someone decided we needed to cache this stuff. But because we were using , we needed a way to serialize/deserialize python objects.
  2. The use of zlib was odd. After consulting a coworker, I learned this was because a cache object had a 1 MB limit, and for accounts with many teams, we had exceeded that limit. So zlib compression was added as a quick stop-gap measure.

The second part kinda floored me; what the heck was pickle persisting that would bump into a 1 MB limit? Also, if all we were ultimately storing was dict/list objects, what did it matter if we used ? We could serialize with .

I did some quick terminal hacking to see how JSON would compare:

In [35]: json_compressed = zlib.compress(json.dumps(account_features))  In [36]: pickle_compressed = zlib.compress(pickle.dumps(account_features))In [37]: len(pickle_compressed) 
Out[37]: 642794
In [38]: len(json_compressed)
Out[38]: 50585

…goodness gracious. The compressed payload with is twelve times the size. But it gets worse:

In [41]: start = time.clock()
...: json_compressed = zlib.compress(json.dumps(account_features))
...: print time.clock() - start
In [42]: start = time.clock()
...: pickle_compressed = zlib.compress(pickle.dumps(account_features))
...: print time.clock() - start

…working with a JSON payload was exceptionally faster — roughly 35x for hydrating/dehydrating the payload. Swapping from to alone would solve this problem.

But one major snag: when taking an , dumping to JSON, compressing, decompressing and then loading with JSON, we would lose entirely — it’d come back as , not . And updating vast swaths of our codebase that used was out of the question. I’d never make it through QA in a reasonable amount of time, nor would I be able to have confidence in such a massive change.

And then I remembered . I could do this:

account_features = AttrDict.deepcopy(json.loads(zlib.decompress(account_features)))

…when rehydrating, and that would automatically convert to . Problem solved. And adding one additional operation was irrelevant, even though that was the previous offending code. We were already doing hundreds of them (see profile above) and this codepath was called only once.

Edit: it was pointed out that rather than using , it would be preferable to use 's feature:

json.loads( zlib.decompress(account_features), object_hook=AttrDict)

…this was roughly 5x better than the strategy. Thank you!

Except Not Solved :/

At this point, it was lightning fast for accounts with small team counts, and 2–3 seconds for accounts with 500ish teams. So I created an account with a thousand teams to get a sense of performance.

I was right back at ten seconds for .

It was this moment that I put it together; my problem with this endpoint wasn’t bad python code, although that was a contributing factor. It was that I had a big-O problem. If 500 took me 2.5 seconds, and a 1000 took me 10 seconds, that mapped almost perfectly to . The underlying algorithmic problem still existed, and it grew with the addition of teams.

Unfortunately, at this point I had to acknowledge one other frustrating reality of engineering. My improvements were already good enough to get this work through QA. We didn’t have any accounts with more than 500 teams. I asked our QA people to create a new defect for 1000 teams to track the issue moving forward, and closed out the bug ticket.

As much as I wanted to figure it out and fix it, perfect is the enemy of good enough. Adios, amigo.

Git History, A Few Take-Aways, A Few Random Musings

I did, however, dig through the git history to figure out what the heck the deal was with and :

  1. It was written five years prior by a junior developer. The intent was definitely just convenience — the ability to do instead of . The bad if/else logic and nonsensical asserting also made more sense given a junior developer. At this time, however, the codebase was small and there probably was little consequence to
  2. A few days later, a more senior developer added the ability to this object so we could start caching it. The caching was added to address performance issues. It’s unclear what those performance issues were; there was no clear history on why this was done.
  3. At some point, when the cache exploded as we gained accounts with more and more teams, some other developer added compression to the caching. I can’t fault this decision; it was an emergency fix when stuff started exploding in production.

My takeaways from all of this:

  1. I think adding a wrapper around basic python types like and should be avoided. That decision had serious consequences down the line, for what appears to be nothing more than syntactic sugar.
  2. It is unfortunate the senior developer didn’t push the junior developer towards simplicity for the sake of simplicity. Although, that brings up:
  3. It is unfortunate our code review process did not exist at the time was written. The original code is the sort of stuff I would have asked be rewritten or justified.
  4. It is unfortunate the performance issues that warranted weren’t clearly communicated in git history, or some linked defect in an issue tracker. It makes me understand why having clear guidelines around commit messages is so vitally important for large, long-term codebases.
  5. This is an interesting study in the long-term consequences of a programming choice. It is clear to me this is one of those places where “clever” code had serious consequences and the business would have been better off without it.
  6. It is clear to me as a senior developer, it is important to mentor junior developers towards the simplest thing that could possibly work.

And some other random, final musings:

  1. It turns out there are heaps of language SNOOTS out there who have some pretty involved discussions about ; Bjarne Stroustrup even argued against including such a feature in C++, which is kinda mind-blowing since that language can’t seem to find a bell or whistle that shouldn’t be included.
  2. guidelines implicitly push people towards , and that is arguably not a great thing. isn’t particularly performant, and it absolutely is not the same as . In digging through git history, I found a merge request where a senior developer went through our codebase mindlessly replacing with because said so.
  3. , , and string handling suck due to python 2 -> python 3 conversion issues. For a while I was banging my head on string handling until I realized I didn’t need to do it at all.

There are no long moves.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store