Python gotchas: hash()

Erick Mendonça
Jul 30, 2017 · 3 min read

Or: WAT?

Python is an amazing programming language. I love it. A lot of people love it too. Everyone should love it!

Scene from the movie Léon: The Professional (1994)

Er… okay, sorry for my enthusiasm. But hey, I just love this goddamn language. But we need to talk about some… things. Things that may hurt you while writing amazing Python code. You know, those things may happen in any language.

I'll try to talk about some of those things, like the hash function or mutable default arguments, one article at a time. Let's start with…

hash()

The hash function is amazing: it’s really important for Python sets and dictionaries, so pay respect. It returns an integer that represents whatever hashable object you pass to it. For custom objects, the default is a value derived from id(), unless you override that behavior with a custom __hash__() method. For other objects, like numbers or strings, the hash is computed from the value, so equal values get equal hashes within a process. That is what lets the interpreter find a key quickly when it is doing a dictionary lookup.
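To make that concrete, here is a minimal sketch of both cases. The Point and Plain classes are hypothetical names made up for this illustration; only hash(), __eq__() and __hash__() come from Python itself.

```python
class Point:
    """Hypothetical class, used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must produce equal hashes, so hash the same
        # fields that __eq__ compares.
        return hash((self.x, self.y))


class Plain:
    """No __eq__ or __hash__ defined: uses the identity-based default."""


# Value-based hash: two equal points behave as the same dictionary key.
lookup = {Point(1, 2): "found"}
print(lookup[Point(1, 2)])

# Identity-based hash: each instance is its own key, even if they look alike.
a, b = Plain(), Plain()
registry = {a: "first", b: "second"}
print(len(registry))
```

The rule of thumb shown here: if you define __eq__(), define __hash__() over the same fields, otherwise dictionary and set lookups will not behave the way you expect.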

Imagine that you have a huge amount of input data that you need to process and validate before inserting it into your production database. To do that, you might create some tasks that run distributed across processes or servers. But you would need to fetch and sync data between those tasks, so you might try to be a little clever and start using hash() to create unique (or almost unique) keys for saving your data in a cache, for example.

Let cache be something like Django cache
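The original snippet is not reproduced here, so what follows is only a sketch of the idea, not the author's code. It assumes a plain dict standing in for the shared cache, and the names cache_key, save_partial and load_partial are hypothetical.

```python
# Stand-in for a shared cache such as Django's. A plain dict is an
# assumption made so that this sketch runs on its own.
cache = {}


def cache_key(record_id):
    # hash() of a str is salted per process, so this key is only
    # meaningful inside the process that produced it.
    return "record:%d" % hash(record_id)


def save_partial(record_id, data):
    # Producer task: stash semi-processed data under the hash-based key.
    cache[cache_key(record_id)] = data


def load_partial(record_id):
    # Consumer task: recompute the key and look the data up.
    return cache.get(cache_key(record_id))


save_partial("user-123", {"status": "validated"})
print(load_partial("user-123"))
```

Inside a single process this works fine, which is exactly why the problem is easy to miss.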

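But beware! Starting with CPython 3.3, the hashes of certain objects, such as strings and bytes, are salted with a random value chosen when the interpreter starts, so they change between different processes. You can disable this behaviour by setting the PYTHONHASHSEED environment variable to 0 (or pin the salt by setting it to a fixed number), but randomization is turned on by default, which may cause some surprises. Bad ones.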
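You can watch this happen by asking fresh interpreters for the same hash. This is a small sketch, assuming Python 3.7+ (for capture_output); the helper name hash_in_new_process is made up for the example.

```python
import os
import subprocess
import sys


def hash_in_new_process(text, seed=None):
    """Return hash(text) as computed by a freshly started interpreter."""
    env = dict(os.environ)
    # Start from the default behaviour, then pin the seed only if asked to.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Default settings: each process picks its own salt, so these two
# numbers will almost certainly differ.
print(hash_in_new_process("spam"), hash_in_new_process("spam"))

# PYTHONHASHSEED=0 disables the salt, so both processes agree.
print(hash_in_new_process("spam", seed=0) == hash_in_new_process("spam", seed=0))
```

Integers are not salted, so the same experiment with a number would not show the difference; it is strings and bytes that bite.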

A ~happy~ user

Getting back to the previous scenario, let's imagine that you use hash()-based keys like that to cache some semi-processed data while another task (which may or may not be in the same process, or on the same server…) fetches this data to aggregate it or save it in the database.

Just imagine what could happen if you test your code in an environment that has hash randomization disabled, but deploy it to an environment with the default behaviour! On your laptop, you might not have the same number of workers and the same parallelism that you have in production, so everything might run in the same process and all is good. But as soon as you push your code to production, all hell breaks loose: a lot of exceptions, invalid lookups, processing jobs that don't finish and bad user experiences.

TL;DR

So, the main takeaway is: the result of hash() may change between processes, runs or different computers. Don't take it for granted that it will be the same! If you need stable hashes, you'll have to enforce that in all environments, or just use something else instead: search for a unique value, build a unique string and so on.
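One way to "use something else" is a digest from the standard hashlib module, which does not depend on the process. This is just a sketch of that option, not the only one; stable_key and the "record" prefix are hypothetical names.

```python
import hashlib


def stable_key(value, prefix="record"):
    # SHA-256 of the UTF-8 bytes is the same in every process, on every
    # machine, regardless of PYTHONHASHSEED.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "%s:%s" % (prefix, digest)


print(stable_key("user-123"))
```

If the value is already unique and short, like a database ID, you may not need to hash it at all: a key built directly from it is simpler to debug.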

On a sidenote, if you want to read some more about hash(), I suggest this one to start.

