ADVANCED PYTHON PROGRAMMING
The Ones That Got Away
This time, we’ll talk about object-oriented miscellanea that didn’t fit anywhere else: hashing, subclassing native types, and exceptions.
We’re nearing the end of our journey: from stuff as basic as scopes, conditions and loops, through objects, classes and metaclasses — we’re ready to talk about our final topic: modules and packages. Before we do, I’d like to take a moment to address all the object-oriented stuff that I skipped, whether because it didn’t fit the narrative, or I just didn’t want to get bogged down by (even more) details.
Hash, Little Baby
There’s a very important special method I skipped: __hash__. It’s pretty basic on one hand, and a bit complicated on the other, so it didn’t fit in any of the sections. To understand its background, consider this:
>>> class A:
...     pass

>>> a1 = A()
>>> a2 = A()
>>> d = {a1: 1, a2: 2}
Everything’s dandy, until one day you decide to add an __eq__ method to A:
>>> class A:
...     def __init__(self, x):
...         self.x = x
...     def __eq__(self, other):
...         return isinstance(other, A) and self.x == other.x

>>> a1 = A(1)
>>> a2 = A(2)
>>> d = {a1: 1, a2: 2}
Traceback (most recent call last):
...
TypeError: unhashable type: 'A'
That’s odd. It stems from the way Python implements dictionaries, in a data structure known as a hash table. How would you implement a dictionary in a low-level language like C? You could store all the key-value pairs in an array, but then looking for a particular key will require iterating over all of them, and will take a long time for large dictionaries. A hash table optimizes this lookup by allocating a large array, and when the dictionary is indexed by some key, reducing that key to an index. This way, every object can be mapped to some number i, and the corresponding value can be stored in, and retrieved from, slot i % len(array) instantly.
But wait: what if there’s a collision, when two different keys get reduced to the same index? I guess we’re gonna have lists of key-value pairs after all, and iterate over them linearly; except, we’d only do that for collisions, which don’t happen that often, so we’ll still reap most of the benefits of our optimization. Incidentally, that’s roughly what Python does (CPython probes other slots of the same array instead of keeping a list per slot, but the principle is the same). When you define an object, its default __hash__ implementation, which can be revealed using the built-in hash function, reduces the object by shuffling some bits of its ID; and since IDs are unique, and the shuffling does some clever math, a collision is pretty improbable.
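To make the idea concrete, here’s a minimal sketch of such a table. The ToyHashTable name and its fixed size of 8 are made up for illustration, and it keeps a list per slot as described above, which is simpler than what CPython’s real dict does:

```python
class ToyHashTable:
    """A toy hash table with per-slot collision lists (illustrative only)."""

    def __init__(self, size=8):
        # Each slot holds the (key, value) pairs that were reduced to its index.
        self.slots = [[] for _ in range(size)]

    def _bucket(self, key):
        # Reduce the key to a number, then to an index into the array.
        return self.slots[hash(key) % len(self.slots)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:  # Equal key: overwrite its value.
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        # Only the pairs that collided in this one slot are scanned linearly.
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)


table = ToyHashTable()
table['x'] = 1
table[1] = 'one'
table[9] = 'nine'  # hash(1) % 8 == hash(9) % 8, so these two collide
table[1] = 'uno'   # Overwrites the existing pair instead of adding another
print(table[1], table[9], table['x'])  # uno nine 1
```

Note how both hash (to find the slot) and == (to find the pair within the slot) take part in every lookup; that’s why the two have to agree.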
But then we up and override __eq__. That opaque error message was Python’s gentle way of telling us that if we choose to override default behavior, we need to be thorough, and implement __hash__, too. After all, if two objects are equal, they should be reduced to the same index and overwrite each other, much like numbers and strings do; and if we have custom comparison logic based on the object’s state, how can Python derive an appropriate hash on its own?
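If you’re curious how Python “removes” the default hash, here’s a small sketch you can poke at. The class names Plain and WithEq are mine; the behavior shown (a class that defines __eq__ without __hash__ gets its __hash__ set to None) is standard:

```python
class Plain:
    pass


class WithEq:
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, WithEq) and self.x == other.x


# Without __eq__, the class simply inherits object's identity-based hash.
print(Plain.__hash__ is object.__hash__)  # True

# Defining __eq__ alone makes Python set __hash__ to None on the class.
print(WithEq.__hash__)  # None

try:
    hash(WithEq(1))
except TypeError as error:
    print(f'TypeError: {error}')
```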
Implementing a hashing algorithm might sound intimidating: are we supposed to do maths? Java programmers will probably be familiar with the 7/31 formula, which is a popular “good enough” implementation for Java’s hashCode. The idea is to start from 7, and for every field that matters, multiply the result so far by 31 and add the field to it. This yields a unique enough value on one hand, and ensures that objects with the same fields have the same hash on the other:
>>> class Point:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...     def __eq__(self, other):
...         return (
...             isinstance(other, Point)
...             and self.x == other.x
...             and self.y == other.y
...         )
...     def __hash__(self):
...         h = 7
...         h = 31 * h + self.x
...         h = 31 * h + self.y
...         return h

>>> p1a = Point(1, 1)
>>> p1b = Point(1, 1)
>>> p1a == p1b
True
>>> hash(p1a)
6759
>>> hash(p1b)
6759

>>> p2 = Point(2, 2)
>>> p1a == p2
False
>>> hash(p2)
6791
But that’s a lot of hassle; why don’t we use Python’s built-in hash function instead? We can just pack “all the fields that matter” in a tuple and hand it over, like so:
>>> class Point:
...     ... # Same as before
...     def __hash__(self):
...         return hash((self.x, self.y))

>>> p1 = Point(1, 1)
>>> p2 = Point(2, 2)
>>> hash(p1)
8389048192121911274
>>> hash(p2)
1901736143494378007
Looks pretty random to me. And pretty short.
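And just to see the payoff, here’s a self-contained version of that Point (same idea, with the comparison also written in terms of tuples, which is my own shortcut), used as a dictionary key and a set member. Equal points now land in the same slot and overwrite each other, just like numbers and strings do:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares.
        return hash((self.x, self.y))


names = {Point(1, 1): 'a'}
names[Point(1, 1)] = 'b'   # An equal key, so it overwrites the first entry
print(len(names))          # 1
print(names[Point(1, 1)])  # b

unique = {Point(1, 1), Point(1, 1), Point(2, 2)}
print(len(unique))         # 2
```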
Some More Funny Methods
Another method I haven’t mentioned is __format__. We talked about __str__, which returns a human-readable representation, and __repr__, which provides more context for developers. However, an object can support a more “parameterized” formatting; you know, like what we use to pad floats and strings, or format datetimes:
>>> print(f'{1/3:.2f}')
0.33

>>> print(f'{"Title":^20}'); print('='*20)
       Title
====================

>>> import datetime as dt
>>> print(f'{dt.datetime.now():%d/%m/%Y %H:%M:%S}')
24/04/2020 16:26:38
Whatever’s after the : gets passed to __format__’s argument, fmt. So:
>>> class Point:
...     ... # Both cartesian and polar properties
...     def __format__(self, fmt=None):
...         if fmt == 'polar':
...             return f'({self.r}, {self.t:.2f})'
...         return f'({self.x}, {self.y})'

>>> p = Point(0, 1)
>>> print(f'{p}')
(0, 1)
>>> print(f'{p:polar}')
(1.0, 1.57)
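Since the snippet above skips the properties, here’s one way the whole thing might look. The r and t properties are my guess at the “cartesian and polar properties” (radius and angle), and forwarding any other format spec to the coordinates is an extra I added for illustration:

```python
import math


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    @property
    def r(self):
        # Distance from the origin (assumed meaning of "r").
        return math.hypot(self.x, self.y)

    @property
    def t(self):
        # Angle in radians (assumed meaning of "t").
        return math.atan2(self.y, self.x)

    def __format__(self, fmt):
        if fmt == 'polar':
            return f'({self.r}, {self.t:.2f})'
        if fmt:
            # Any other spec is handed down to the coordinates themselves.
            return f'({self.x:{fmt}}, {self.y:{fmt}})'
        return f'({self.x}, {self.y})'


p = Point(0, 1)
print(f'{p}')              # (0, 1)
print(f'{p:polar}')        # (1.0, 1.57)
print(f'{p:.1f}')          # (0.0, 1.0)
print(format(p, 'polar'))  # (1.0, 1.57)
```

The built-in format function and str.format go through the same method, so all three spellings behave alike.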
There’s also a nuance about iteration I haven’t told you about. Remember how __iter__ should return an iterator with a __next__ method? Well, if you’d like to be able to iterate over your object in reverse, you’d have to implement it yourself in __reversed__, since there’s no __prev__ method, and no default way for Python to play an iterator in reverse; primarily because an iterator can be generative, and not really know when it ends until you play it all out. This is not enough:
>>> class A:
...     def __init__(self, x):
...         self.x = x
...     def __iter__(self):
...         for i in range(self.x):
...             yield i

>>> a = A(3)
>>> for i in a:
...     print(i)
0
1
2
>>> for i in reversed(a):
...     print(i)
Traceback (most recent call last):
...
TypeError: 'A' object is not reversible
But this is:
>>> class A:
...     ... # Same as before
...     def __reversed__(self):
...         for i in range(self.x):
...             yield self.x - i - 1

>>> a = A(3)
>>> for i in reversed(a):
...     print(i)
2
1
0
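For completeness, there’s one case where reversed works without __reversed__: objects that behave like sequences, with both __len__ and __getitem__. Here’s a small sketch of both routes; the Letters class is a made-up example, and the A below is a shorter take on the one above that leans on range being reversible:

```python
class A:
    def __init__(self, x):
        self.x = x

    def __iter__(self):
        return iter(range(self.x))

    def __reversed__(self):
        # A range knows its own length, so it can be reversed directly.
        return reversed(range(self.x))


class Letters:
    """Sequence-like: has __len__ and __getitem__, but no __reversed__."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        return self._text[index]


print(list(reversed(A(3))))            # [2, 1, 0]
print(list(reversed(Letters('abc'))))  # ['c', 'b', 'a']
```

In the second case Python knows the length up front, so it can simply index from the end backwards.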
Another funny method we haven’t discussed is __missing__. You can add it to dictionary subclasses, and if a key is missing, your method will be invoked to provide some default value for it:
>>> class D(dict):
...     def __missing__(self, key):
...         return key

>>> d = D(x=1, y=2)
>>> d['x']
1
>>> d['y']
2
>>> d['z']
'z'
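The hook can also store the default it comes up with, which is roughly the trick behind collections.defaultdict. Here’s a sketch of that flavor; the ListDict name is made up for illustration:

```python
class ListDict(dict):
    def __missing__(self, key):
        # Create the default, remember it under the key, and return it.
        value = self[key] = []
        return value


groups = ListDict()
groups['even'].append(2)
groups['odd'].append(1)
groups['even'].append(4)
print(groups)               # {'even': [2, 4], 'odd': [1]}

# Only d[key] lookups trigger __missing__; get() and "in" do not.
print(groups.get('prime'))  # None
print('prime' in groups)    # False
```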
It’s a pretty silly method, since it’s easy enough to override __getitem__:
class D(dict):
    def __getitem__(self, key):
        if key not in self:
            return key
        return super().__getitem__(key)
But for some reason they decided to add it; maybe it’s an optimization. In any case, the interesting part about this example is not the method, but the fact that I’ve subclassed a native type. I’ve actually done it before, when we were playing with metaclasses and passing our own dictionaries to __prepare__; but it’s alright if you were too overwhelmed to notice. Which brings us to our next topic:
Imperialism
Python actually lets us subclass almost all native types, and to great acclaim. We can make dictionaries work with dot notation, like so:
>>> class D(dict):
...     def __getattr__(self, key):
...         return self[key]
...     def __setattr__(self, key, value):
...         self[key] = value
...     def __delattr__(self, key):
...         del self[key]

>>> d = D(x=1, y=2)
>>> d.x
1
>>> d.x = 2
>>> del d.x
Or, let’s say we’re writing a Log class, which filters messages based on their levels (DEBUG, INFO, WARN, ERROR, etc.). These levels are integers; but it’d be nice to have their name attached. If we do this:
>>> class Level:
...     def __init__(self, name, value):
...         self.name = name
...         self.value = value
It’s alright, except now we have to implement all the comparison operators, because our code is littered with if level > INFO checks and such. What if we could do this instead:
>>> class Level(int):
...     def __new__(cls, name, value):
...         instance = super().__new__(cls, value)
...         instance.name = name
...         return instance

>>> level = Level('DEBUG', 1)
>>> level
1
>>> level > 2
False
>>> level.name
'DEBUG'
So, an integer for all intents and purposes, but one with a name. It’s a bit unfortunate we have to implement it in __new__, but if we’d only change the __init__, Python would default to int’s __new__, which gets the very same arguments and has no idea what to do with a name.
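Here’s a short sketch of what that buys us, along with the __init__-only version failing. The level values (10, 20, 30) and the BrokenLevel name are arbitrary choices of mine:

```python
class Level(int):
    def __new__(cls, name, value):
        instance = super().__new__(cls, value)
        instance.name = name
        return instance


DEBUG, INFO, WARN = Level('DEBUG', 10), Level('INFO', 20), Level('WARN', 30)

# Comparisons, sorting and friends come for free from int.
print(max(DEBUG, WARN, INFO).name)  # WARN
print(sorted([WARN, DEBUG, INFO]))  # [10, 20, 30]

# Arithmetic hands back a plain int, so the name doesn't survive it.
print(type(INFO + 1))               # <class 'int'>


class BrokenLevel(int):
    def __init__(self, name, value):
        self.name = name


try:
    # int's own __new__ receives ('DEBUG', 10) and can't make sense of it.
    BrokenLevel('DEBUG', 10)
    broken_failed = False
except (TypeError, ValueError) as error:
    broken_failed = True
    print(f'failed: {error}')
```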
Similarly, let’s say we’re implementing a Path class with a bunch of methods, like read and write, but we’d like it to work with os.path.dirname and the like, all of which only accept strings. We could do this:
>>> import os
>>> class Path(str):
...     def read(self):
...         with open(self) as fp:
...             return fp.read()
...     def write(self, data):
...         with open(self, 'w') as fp:
...             fp.write(data)

>>> p = Path('/tmp/file.txt')
>>> p.write('Hello, world!')
>>> p.read()
'Hello, world!'
>>> os.path.dirname(p)
'/tmp'
(As an aside, we shouldn’t; while os.path is pretty annoying to work with, pathlib is an excellent module, and most of the standard library works well with it, so there’s no need to reinvent the wheel.)
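For comparison, here’s roughly the same session with pathlib. It’s a sketch that writes into a temporary directory (my choice, so it doesn’t litter /tmp):

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    p = Path(directory) / 'file.txt'
    p.write_text('Hello, world!')
    content = p.read_text()
    print(content)  # Hello, world!

    # os.path functions accept Path objects too, and return plain strings.
    same_parent = Path(os.path.dirname(p)) == p.parent
    print(same_parent)  # True
```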
The Exception to the Rule
One last thing I’d like to talk about before we proceed is exceptions—the object-oriented error code. It’s a pretty interesting control flow mechanism that, generally speaking, has this structure:
>>> try:
...     raise RuntimeError()
... except RuntimeError:
...     ... # Handle error
Except exceptions propagate, so an object can declare them as part of its interface: I return so and so, but if something goes wrong, I raise this and this error — catch it and handle it, if you want. It adds another degree of freedom to the language, even though some people would argue it also adds a lot of unpredictability (not to speak of overhead) to the code.
To those people, I’d like to tell a story. Once, I really wanted a bike. Every day, I’d pray for God to give me a bike, but nothing happened. Eventually, I realized it doesn’t work like that— so I stole a bike, and prayed for God to forgive me. This story isn’t actually true, but it exemplifies an idiom that’s pretty popular with the Python community: it’s better to ask forgiveness than to ask permission — and I have to say, as an Israeli, I can appreciate the merits of chutzpah, so I tend to agree. It means that instead of writing this:
if key in d:
    print(d[key])
We should write this:
try:
    print(d[key])
except KeyError:
    pass
It’s more cumbersome, true; but technically, it’s also more correct. In the first case, we check that the key is present in the dictionary before we access it. That’s lovely, as long as our code is single-threaded. Otherwise, some thread might snatch it right under our nose, a moment after the if key in d but a moment before the d[key]; so effectively, we’d end up with a KeyError anyway, and we might as well drop that “ask for permission” altogether.
Clauses Except Except
Exceptions in Python are actually much more powerful than just try and except. You can add a finally clause, which happens regardless of whether an exception was raised or not:
try:
    ...
except RuntimeError:
    ...
finally:
    print('This happens anyway')
And you can even drop the except:
try:
    ...
finally:
    print('This happens anyway')
In which case the exception will be propagated, but not before the finally clause is executed. Great way to handle stuff like closing files, releasing locks, or implementing context managers that do it for you ;)
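Speaking of context managers, here’s a minimal sketch of one built on try/finally, using contextlib.contextmanager from the standard library. The resource function and the events list are made up for illustration; a real one would acquire and release an actual lock or file:

```python
from contextlib import contextmanager

events = []


@contextmanager
def resource(name):
    events.append(f'acquire {name}')
    try:
        yield name
    finally:
        # Runs whether the with-block finished normally or raised.
        events.append(f'release {name}')


try:
    with resource('lock'):
        raise RuntimeError('oops')
except RuntimeError:
    events.append('handled')

print(events)  # ['acquire lock', 'release lock', 'handled']
```

The release happens before the error reaches our except clause, which is exactly the “propagated, but not before” order described above.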
Similarly, exceptions support an else clause, which happens only if no exception was raised. Pretty poor naming, if you ask me: noerror would’ve been better, just like nobreak would’ve been better for loops. Anyway:
try:
    ...
except RuntimeError:
    print('failure')
else:
    print('success')
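To see how all the clauses play together, here’s a small sketch that records which ones ran. The parse function is a made-up example:

```python
def parse(text):
    steps = []
    try:
        number = int(text)
    except ValueError:
        steps.append('except')
        number = None
    else:
        # Only reached when the try body raised nothing.
        steps.append('else')
    finally:
        # Reached in both cases, and always last.
        steps.append('finally')
    return number, steps


print(parse('42'))   # (42, ['else', 'finally'])
print(parse('abc'))  # (None, ['except', 'finally'])
```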
But let’s talk a bit more about the except clause, since at the end of the day, it’s what it’s all about. We can catch one exception class, or several, grouped in parentheses:
try:
    ...
except (NameError, TypeError, ValueError):
    ...
And we can bind it (or them) to a name that will be available in the body of the except clause, like so:
try:
    ...
except RuntimeError as error:
    print(f'failure: {error}')
For multiple classes, the name goes after the parentheses:
try:
    ...
except (NameError, TypeError, ValueError) as error:
    print(f'failure: {error}')
Alternatively, we could drop the exception classes altogether, and just do:
try:
    ...
except:
    ...
Which catches any and all exceptions. Generally, this is frowned upon, because stuff like KeyboardInterrupt (raised by a SIGINT signal, which is sent by pressing CTRL+C) and SystemExit (the exception raised by exit() to terminate the program) are exceptions, too, and we probably didn’t mean to swallow a user interrupt or a program exit. What we actually meant was:
try:
    ...
except Exception:
    ...
Which catches any Exception; that is, regular errors (like NameError, TypeError, ValueError or RuntimeError), all of which subclass it. In any case, once inside the except clause, we can either handle that error, raise a different one, or re-raise the same. To re-raise, simply:
try:
    ...
except Exception as error:
    print(f'error: {error}')
    raise
And Python will know what you mean. This lets you inject some code, like logging the exception, without really interfering with its propagation.
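Here’s how that might look with the standard logging module, whose logging.exception call records the message along with the current traceback. The risky and logged functions are made-up names:

```python
import logging


def risky():
    return 1 / 0


def logged():
    try:
        return risky()
    except Exception:
        # Log the error with its traceback, then let it keep propagating.
        logging.exception('risky() failed')
        raise


try:
    logged()
except ZeroDivisionError as error:
    outcome = type(error).__name__

print(outcome)  # ZeroDivisionError
```

The caller still gets the original ZeroDivisionError, as if the logging code weren’t there.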
An Exceptional Family
Like most classes, exceptions can be subclassed, and some projects define their own exception hierarchy, because they want their users to be able to catch particular errors, while letting generic stuff (like a ZeroDivisionError) propagate farther, as it probably indicates a more fundamental flaw in the system. An example of this is SQLAlchemy, which provides all sorts of custom exceptions for different errors, like NoSuchTableError for when a certain table doesn’t exist in the database, or IntegrityError for when you’re trying to add a value that should be unique, but already exists.
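A hierarchy like that usually boils down to one base class and a few subclasses. Here’s a sketch with entirely made-up names (StorageError, MissingTableError, DuplicateValueError and the insert function are not SQLAlchemy’s API, just stand-ins for the idea):

```python
class StorageError(Exception):
    """Base class for every error this hypothetical library raises."""


class MissingTableError(StorageError):
    pass


class DuplicateValueError(StorageError):
    pass


def insert(tables, table, value):
    if table not in tables:
        raise MissingTableError(table)
    if value in tables[table]:
        raise DuplicateValueError(value)
    tables[table].append(value)


tables = {'users': ['alice']}
messages = []
for table, value in [('users', 'bob'), ('users', 'alice'), ('orders', 1)]:
    try:
        insert(tables, table, value)
    except DuplicateValueError as error:
        # Users can catch one particular error...
        messages.append(f'duplicate: {error}')
    except StorageError as error:
        # ...or anything the library raises, via the base class.
        messages.append(f'storage problem: {type(error).__name__}')

print(messages)  # ['duplicate: alice', 'storage problem: MissingTableError']
```

Anything that isn’t a StorageError, like a stray ZeroDivisionError, would sail right past both clauses.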
Yet other frameworks provide custom exceptions for you to throw, rather than catch: in Werkzeug, the HTTP engine behind Flask, you can raise a NotFound exception, which will result in the server returning an HTTP response with 404 NOT FOUND; or a BadRequest exception, which returns 400 BAD REQUEST.
In either case, you should think twice before implementing your own exception hierarchy: it adds cognitive load, and for most cases, a standard error is enough. Just stick to ValueError for bad arguments, TypeError for improper usage, KeyError for missing keys, AttributeError for missing attributes, and when in doubt, a RuntimeError with an informative error message.
These distinctions might not sound very important: why not always raise an Exception with a description? As it turns out, the type of the exception you raise actually matters, a lot. Take our über-dictionary from before, for example:
>>> class D(dict):
...     def __getattr__(self, key):
...         return self[key]

>>> d = D(x=1, y=2)
>>> d.x
1
>>> d.y
2
This is all well and good, until for some reason, someone tries to access your dictionary with getattr: having implemented __getattr__, it should support it, right? Well, here’s how getattr works:
>>> o = object()
>>> getattr(o, 'x')
Traceback (most recent call last):
...
AttributeError: 'object' object has no attribute 'x'
>>> getattr(o, 'x', 1)
1
As you can see, we can provide a default value instead of that nasty AttributeError. As for our dictionary:
>>> getattr(d, 'x')
1
Seems to work. But wait…
>>> getattr(d, 'z')
Traceback (most recent call last):
...
KeyError: 'z'
>>> getattr(d, 'z', 1)
Traceback (most recent call last):
...
KeyError: 'z'
Which is surprising, since we’ve specifically provided a default value. The key to that mystery is the error raised: it’s not an AttributeError, which is what getattr expects and replaces with the default, but a KeyError, raised because the key wasn’t in the dictionary, and __getattr__ simply delegated the work to __getitem__. Let’s fix that:
class D(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)
And now:
>>> d = D(x=1, y=2)
>>> getattr(d, 'z', 1)
1
It works! Except when an error is raised, it’s much uglier:
>>> getattr(d, 'z')
Traceback (most recent call last):
...
KeyError: 'z'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
AttributeError: z
Whoa. That happens because Python chains exceptions: whenever an exception originates in an except clause, it keeps a pointer to the previous exception, which is currently being handled, and Python lays out this entire history. You can chain exceptions yourself, by using raise with a from clause, and in this case, we’ll do just that to unchain them:
class D(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None
The exception is explicitly raised from nothing, so we get a clean slate.
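If you’d like to see what from actually changes, the chain lives in two attributes on the exception: __cause__ (set by from) and __context__ (the pointer Python keeps on its own). A sketch with made-up lookup functions:

```python
def lookup(config, key):
    try:
        return config[key]
    except KeyError as error:
        # Explicit chaining: the KeyError becomes the documented cause.
        raise ValueError(f'bad setting: {key}') from error


def quiet_lookup(config, key):
    try:
        return config[key]
    except KeyError:
        # Explicit un-chaining: no cause, and the context is hidden.
        raise ValueError(f'bad setting: {key}') from None


results = {}
for function in (lookup, quiet_lookup):
    try:
        function({}, 'port')
    except ValueError as error:
        results[function.__name__] = (error.__cause__, error.__context__)
        print(function.__name__, repr(error.__cause__), repr(error.__context__))

# lookup KeyError('port') KeyError('port')
# quiet_lookup None KeyError('port')
```

So from None doesn’t erase the history; it just tells Python not to print it.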
Another tricky situation to look out for is AttributeErrors in properties. Check out this code:
class A:
    @property
    def p(self):
        return self.x
    def __getattr__(self, key):
        if key.startswith('_'):
            return 1
        raise AttributeError(key)
This class has a buggy property, and a silly __getattr__, which returns 1 for any dynamic attribute starting with _, and raises the customary AttributeError otherwise. But then, this happens:
>>> a = A()
>>> a.p
Traceback (most recent call last):
...
AttributeError: p
This is really weird, because I’d expect the attribute error to be about x, not about p; and when debugging real code that’s not so obviously broken, trying to figure it out can drive a person mad. What actually happens is that Python tries to resolve p, which tries to resolve x, which raises an AttributeError, as you might expect; but that’s the default behavior for any non-existing attribute, so Python decides that since p is clearly missing, it should call __getattr__('p'), and it’s actually that scoundrel that raises the error. You can validate it with a quick print:
>>> class A:
...     @property
...     def p(self):
...         return self.x
...     def __getattr__(self, key):
...         print(f'getting {key}') # Added this!
...         if key.startswith('_'):
...             return 1
...         raise AttributeError(key)

>>> a = A()
>>> a.p
getting x
getting p
Traceback (most recent call last):
...
AttributeError: p
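One way to keep such bugs visible, sketched here as a suggestion rather than a standard recipe, is to catch the AttributeError inside the property and re-raise it as a different type; anything that isn’t an AttributeError doesn’t trigger the __getattr__ fallback:

```python
class A:
    @property
    def p(self):
        try:
            return self.x
        except AttributeError as error:
            # A RuntimeError won't be mistaken for "p is missing".
            raise RuntimeError(f'bug in property p: {error}') from error

    def __getattr__(self, key):
        if key.startswith('_'):
            return 1
        raise AttributeError(key)


try:
    A().p
except RuntimeError as error:
    message = str(error)

print(message)  # bug in property p: x
```

Now the error points at x, which is where the problem actually is.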
So yeah — know thy exceptions.
Conclusion
Now we really covered Python’s data model. Well, not really: some methods, like __aiter__ and __aenter__, we’ll discuss when we learn about asynchronous programming; and some, like __slots__ and __length_hint__, when we learn about performance. My point is, I knight thee a champion of the object-oriented arts, and on we move to our next challenge: modules and packages!
The Advanced Python Programming series includes the following articles:
- A Value by Any Other Name
- To Be, or Not to Be
- Loopin’ Around
- Functions at Last
- To Functions, and Beyond!
- Function Internals 1
- Function Internals 2
- Next Generation
- Objects — Objects Everywhere
- Objects Incarnate
- Meddling with Primal Forces
- Descriptors Aplenty
- Death and Taxes
- Metaphysics
- The Ones that Got Away
- International Trade