Note: Now find the make_flash() function in im-util, repo is here: https://github.com/emlynoregan/im_util

Flash! Calculating a Function/Lambda Hash in Python

In a previous article I presented a @debouncedtask:

You use @debouncedtask to debounce a function (ie: basically add a smart delay to cope with it being called too often), and run it as a separate task on App Engine. The code might look like this:

# I call this a lot!
def useful_but_messy():
# useful but messy stuff
cleanup()
# hmm, but I shouldn't call this so often
@debouncedtask(initsec=10, repeatsec=60
def cleanup():
# clean up that mess

That code makes sure that cleanup is only called once every 60 seconds, and is delayed until 10 seconds after the most recent previous invocation.

What I explained briefly, but didn’t go into, was how @debouncedtask knows what to debounce. For instance, say I have two debounced functions:

@debouncedtask
def f1():
# ...

@debouncedtask
def f2():
# ...

or what if I want to debounce the same function separately for every value of an argument:

def useful_but_messy(userid):
# useful but messy stuff about a user
cleanup(userid) # we want to do this separately
# per user because reasons
@debouncedtask
def cleanup(userid):
# clean up that user's mess

or even if I want to debounce it separately based on a variable in the closure?

def useful_but_messy(userid):
@debouncedtask
def cleanup():
# clean up that mess
user = getuserbyid(userid)

# useful but messy stuff about a user
cleanup() # don't need to pass the userid

@debouncedtask handles all of this transparently. To do this, it needs to generate an id that uniquely identifies a combination of function + closure + args, which it then uses to store timing data in memcache.

It has access to the function object (which includes closure info) and the args & kwargs. We could use the function name and module to identify the function (f.__name__ etc), but this isn’t reliable; the __name__ can be missing or wrong or misleading for many reasons.

We can do something more cunning to identify the function + closure + args, which is to use pickling (ie: serialisation). Recall from @task that we serialise & deserialise these entities to get them from one process to another. @debouncedtask needs to do this too, but it also uses them to calculate the id, like this:

def GenerateStableId(instring):
return hashlib.md5(instring).hexdigest()
def debouncedtask(f=None, initsec = 0, repeatsec = 10, debouncename = None, **taskkwargs):
...
def rundebouncedtask(*args, **kwargs):
...
cachekey = "dt%s" % (debouncename if debouncename else GenerateStableId(yccloudpickle.dumps((f, args, kwargs))))
...

return rundebouncetask

I’ve left out plenty of @debouncedtask (you can see the full code here). But notice that if the caller doesn’t provide a debouncename to use as the cachekey (which in turn is used to store information in memcache), then one is generated by pickling the function and arguments, and passing the resulting string to GenerateStableId() . That function in turn generates a cryptographic hash of the string, and returns a string form of that hash.

That is, we take a function (which includes its closure) and arguments, and do the following steps:

  • Pickle it (ie: serialise it)
  • Cryptographically hash it

This technique looks to be useful enough that I want to extract it, name it, and beat it a bit with a stick.

Naming-wise, after some consultation amongst the few and the proud on google+, I’ve chosen the name Flash. It’s a contraction of Function/Lambda Hash, suggested by Kryptyk Physh.

Let’s write a more concise function for making a Flash:

def make_flash(f, *args, **kwargs):
return hashlib.md5(
cloudpickle.dumps((f, args, kwargs))
).hexdigest()

Pickling with cloudpickle

@debouncedtask uses yccloudpickle, so why am I using cloudpickle here?

I’ve written plenty of articles about function serialisation in Python. I ended up having to make a variant of the cloudpickle library, called yccloudpickle:

Since I wrote that article, cloudpickle has implemented code to cover the cases that yccloudpickle was dealing with (serialisation of recursive inner functions), rendering yccloudpickle obsolete, I think. So one of my jobs in the near future is to replace my use of yccloudpickle with cloudpickle in my other libraries. (note: done)

So @debouncedtask will be changed in the future to use cloudpickle, and in the rest of this article I’ll use it.

Note: an assumption I’m making about cloudpickle is that input pickles deterministically; ie: pass the same thing to cloudpickle.dumps() twice, you get the same result. This isn’t necessarily true. It seems to be true, but it needs checking via a deep code read, or disproof via counter-example.

Hashing with md5

For hashing, I wanted something fast, that spread well across the same sized keyspace as a uuid. I’ve settled on md5.

I know that md5 has been found to be weak against attacks, but here we’re not using it in that fashion. The data we are hashing should always originate from our own codebase, so malicious functions or pickles shouldn’t be a problem.

Beware of malicious pickles!

A cryptographic hash function requires five properties:

  • it is deterministic so the same message always results in the same hash
  • it is quick to compute the hash value for any given message
  • it is infeasible to generate a message from its hash value except by trying all possible messages
  • a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value
  • it is infeasible to find two different messages with the same hash value

But in this use case, we’re not doing cryptography; we only require the first, second and fourth property: determinism, speed, and a good spread across the keyspace.

As far as I’m aware md5 is still good for these three properties (yell if you think I’m wrong about this!).

Some quick examples

Here are some quick examples of applying make_flash() to functions. I’m using a helper function print_make_flash() to print the results of make_flash():

import cloudpickle
import hashlib
def make_flash(f, *args, **kwargs):
return hashlib.md5(
cloudpickle.dumps((f, args, kwargs))
).hexdigest()
def print_make_flash(desc, f, *args, **kwargs):
print ("%s: %s" % (desc, make_flash(f, *args, **kwargs)))

Ok, let’s try some stuff. First, let’s try the flash of a trivial functioncleanup().

def cleanup():
pass
>>> print_make_flash("cleanup a", cleanup)
cleanup a: 0d313035eb3dd7fbf4466dd843306649
>>> print_make_flash("cleanup b", cleanup)
cleanup b: 0d313035eb3dd7fbf4466dd843306649

The flashes are the same, as we’d expect. Ok. How about a function which is identical to cleanup() except for name?

def cleanup2():
pass
>>> print_make_flash("cleanup2", cleanup2)
cleanup2: 4bea3fd104557bf1670b814bf6b47e2a

Oh, it’s not the same! Let’s look at the pickle created by cloudpickle:

>>> cloudpickle.dumps(cleanup)
'\x80\x02ccloudpickle.cloudpickle\n_fill_function\nq\x00(ccloudpickle.cloudpickle\n_make_skel_func\nq\x01ccloudpickle.cloudpickle\n_builtin_type\nq\x02U\x08CodeTypeq\x03\x85q\x04Rq\x05(K\x00K\x00K\x01KCU\x04d\x00\x00Sq\x06N\x85q\x07))U\x07<stdin>q\x08U\x07cleanupq\tK\x01U\x02\x00\x01q\n))tq\x0bRq\x0cJ\xff\xff\xff\xff}q\r\x87q\x0eRq\x0f}q\x10N}q\x11NtR.'
>>> cloudpickle.dumps(cleanup2)
'\x80\x02ccloudpickle.cloudpickle\n_fill_function\nq\x00(ccloudpickle.cloudpickle\n_make_skel_func\nq\x01ccloudpickle.cloudpickle\n_builtin_type\nq\x02U\x08CodeTypeq\x03\x85q\x04Rq\x05(K\x00K\x00K\x01KCU\x04d\x00\x00Sq\x06N\x85q\x07))U\x07<stdin>q\x08U\x08cleanup2q\tK\x01U\x02\x00\x01q\n))tq\x0bRq\x0cJ\xff\xff\xff\xff}q\r\x87q\x0eRq\x0f}q\x10N}q\x11NtR.'

See the function name inside the pickle? We’re not just serialising function code + closure + arguments, we’ve also got the function name.

I think that’s probably a good thing. We don’t really want two separately defined functions to flash to the same thing, even if their definitions are identical.

How about the same function with differing arguments?

def inc(a):
return a+1
>>> print_make_flash("inc(5)", inc, 5)
inc(5): 33408568041f4a20fd9e235c8da6d8e9
>>> print_make_flash("inc(6)", inc, 6)
inc(6): 638892bd205173093e1d7f2cdcb521b6
>>> print_make_flash("inc(5)", inc, 5)
inc(5): 33408568041f4a20fd9e235c8da6d8e9

Ok, that makes sense. Pass the same argument, get the same result. Pass a different argument, get a different result.

What about functions that only differ by values in their closure?

def make_printer(a):
def printer():
print a
return printer
>>> printer5 = make_printer(5)
>>> printer5()
5
>>> print_make_flash("printer5", printer5)
printer5: 74b7eff773a40f5723bf168d6eda6691
>>> printer6 = make_printer(6)
>>> printer6()
6
>>> print_make_flash("printer6", printer6)
printer6: 0778c7ee5bb7eb52d4cd0d3a6d693423
>>> printer5b = make_printer(5)
>>> printer5b()
5
>>> print_make_flash("printer5b", printer5b)
printer5b: 74b7eff773a40f5723bf168d6eda6691

Success! You can see that printer5() and printer6() flash to different values, but printer5() and printer5b() are the same.

What about performance?

Here are some quick profiles I’ve done.

First, comparing an md5 of a string to randomly generating a uuid():

import cProfile
import uuid
import hashlib
def genuuids(amount):
for i in xrange(amount):
x = uuid.uuid4()
def genmd5s(amount):
for i in xrange(amount):
x = int(hashlib.md5(str(i)).hexdigest(), 16)
def profile(cmd):
print "*************"
print cmd
print "*************"
print
cProfile.run(cmd)
profile("genuuids(100000)")
profile("genmd5s(100000)")

Here’s the run:

*************
genuuids(100000)
*************
600003 function calls in 0.881 seconds
*************
genmd5s(100000)
*************
200003 function calls in 0.135 seconds

Calculating md5 is much faster than generating random uuids. Unexpected, but cool.

But how fast is pickling?

def picklefuncs(amount):
def f(a):
if a <= 0:
return 1
else:
return f(a-1)+1
for i in xrange(amount):
x = cloudpickle.dumps(f)
>>> profile("picklefuncs(100000)")
*************
picklefuncs(100000)
*************
54100003 function calls (49400003 primitive calls) in 19.291 seconds

Oh. Waaaaaay sloooooower. As you might well expect.

All is not lost, though. We’re still doing 100,000 picklings in 20 seconds. That’s 200 nanoseconds per pickle. That’s ok!

How about a larger function?

def biggerpicklefuncs(amount):
def f(g, a):
if a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
elif a <= 0:
return 1
else:
return f(g, a-1)+1
def h(a):
return f(f(f(f(f(f(f(f, a),a), a), a), a), a), a)

for i in xrange(amount):
x = cloudpickle.dumps(h)

Ok, it’s not a very good function. But how’s it go?

>>> profile("biggerpicklefuncs(100000)")
*************
biggerpicklefuncs(100000)
*************
92900015 function calls (84100015 primitive calls) in 33.077 seconds

Slower. Clearly as the content to be serialised gets bigger, pickling takes longer.

But we’re still in usable territory. That example was still 300 nanoseconds per flash. We could go much further than that and still be happy. What’s a few milliseconds between friends?

Wrapping up

So we’ve got the ability to generate a flash for a function + closure + args.

It’s got potential performance issues with larger data, but whether this matters depends on context.

In the context of Functional Distributed Programming, we’re serialising this stuff and sending it over the wire anyway, running it in a separate process. So the type of overhead we’re talking about probably isn’t much in the scheme of things.

If you want to flash in a more conventional context, and you’re touching the network and/or storage, you’re probably still fine; those things are much slower.

If you’re doing something like rate limiting, this might have more impact, especially as you really get hit (which is when you care). Ideally you want to do the minimum work possible on every rate limited call.

If this becomes a real problem, there’s a way to make flashing much faster, and that’s to cache the beejezus out of cloudpickle. The goal would be to make multiple dumps of the same structure (or different structures aggregating some shared parts) not involve redoing work. It’d require a serious rewrite of that whole library to cache intermediate products in memory or memcache or whatnot. It seems like something which would be possible, but also contain hidden traps, and be quite difficult. So for now I’ll file that under stuff to try if the wheels really come off.

Note: I haven’t added this to the taskutils library yet, but will do in the near future. I’ll replace yccloudpickle with cloudpickle at the same time.