Using collections in Python

The collections module in Python provides solutions for multiple scenarios that would otherwise be tricky to implement.

Let’s see how the abstractions this module provides come in handy when dealing with different kinds of problems.


Extending built-in Types

Sometimes you need an object that behaves exactly like a built-in Python type, but with some extended behaviour. The most common approach is to subclass the type directly.

For example, let’s imagine a fictional event system in which we model our events as dictionaries, and we’d like to add some extra metadata to the events. Something like the following code might be our first approach:

class Event(dict):
    def __getitem__(self, key):
        value = super().__getitem__(key)
        return f"[{self.__class__.__name__}]: {value}"

If we try to use this code, we’ll find that it covers the basic cases as we might expect:

event = Event(user="user", event_type="login", date="2017-07-11")
print(event['user']) # [Event]: user

However, we want it to be completely compatible with a dictionary, meaning it actually has to be a dictionary. Here is where the tricky part and the surprises begin. Let’s iterate over its keys and values to see what’s in there:

for key, value in event.items():
    print(f"{key} = {value}")

Here we would expect the values with the transformation we defined (with the class of the event as a prefix), but instead we get the underlying values of the dictionary, ignoring our implementation of __getitem__(). This code will print the following [3]:

user = user
event_type = login
date = 2017-07-11

What’s happening here is that the items() method doesn’t call our implementation of __getitem__(). Instead, it uses the underlying C implementation, which doesn’t look up the rest of the methods defined on the same object.

One more example: let’s say that we now want to log our events in a custom fashion. For that we create a function like the following one, and we want to pass keyword arguments from our dictionary:

def log_event_type(event_type, date, **kwargs):
    print(f"{date} - {event_type}")

log_event_type(**event)   # 2017-07-11 - login

Again, we want our object to be a dictionary, so we’d expect the ** unpacking to work normally, but once more it doesn’t call our method.

All of these problems are solved if we subclass collections.UserDict instead. It exists precisely for the purpose of creating dictionary-like objects in our code.

import collections

class Event(collections.UserDict):
    def __getitem__(self, key):
        value = super().__getitem__(key)
        return f"[{self.__class__.__name__}]: {value}"

With just this definition, all the aforementioned examples work the way we expect (the items, the keyword arguments, etc.). As an implementation detail you don’t need to worry about, the object is a wrapper around a dictionary (called data), and the methods we override are applied over that wrapped data. You don’t have to access the data attribute at all; the object itself behaves as a dictionary.
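A quick check (reusing the same Event data from the earlier examples) shows that both iteration and ** unpacking now go through our __getitem__():

```python
import collections

class Event(collections.UserDict):
    def __getitem__(self, key):
        value = super().__getitem__(key)
        return f"[{self.__class__.__name__}]: {value}"

event = Event(user="user", event_type="login", date="2017-07-11")

# items() now calls our __getitem__()
for key, value in event.items():
    print(f"{key} = {value}")
# user = [Event]: user
# event_type = [Event]: login
# date = [Event]: 2017-07-11

# ** unpacking also goes through __getitem__()
unpacked = dict(**event)
print(unpacked['user'])  # [Event]: user
```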

The same criterion applies not only to dictionaries, but also to lists and strings. In each case, when you need to subclass, consider using collections.UserDict, collections.UserList, or collections.UserString, respectively.


Returning Multiple Values From a Function

Functions return one thing. When we return multiple values, we’re actually creating a tuple and returning it (again, one value).

As the number of values to return grows, the code becomes harder to read. In particular, my rule of thumb is that for more than three values, a namedtuple should always be used instead.

Consider the differences in readability between these two approaches:

def fetch_data():
    return 42, 'jsmith', 'John', 'Smith'

Or:

UserRecord = collections.namedtuple(
    "UserRecord", ('user_id', 'username', 'first_name', 'last_name')
)

def fetch_data():
    return UserRecord(42, 'jsmith', 'John', 'Smith')

In the first example, whoever calls the function in order to get the data either has to guess [1], or know for sure, which values are returned and in which order, just to be able to unpack them at the call site (it would be even worse to receive the tuple and have to know that username == row[1]).

With the named tuple, it’s clear that the function returns a particular type of object, and exploring the definition of that object is enough to know what it contains and how the data can be accessed.
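For instance, the returned record can be accessed by field name, while still behaving as a plain tuple (a short sketch using the UserRecord definition above):

```python
import collections

UserRecord = collections.namedtuple(
    "UserRecord", ('user_id', 'username', 'first_name', 'last_name')
)

record = UserRecord(42, 'jsmith', 'John', 'Smith')
print(record.username)    # jsmith -- access by name, no index guessing
print(record[0])          # 42 -- still works as a regular tuple
print(record._asdict())   # a mapping of field names to values
```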


More Efficient and Compact Computations

A common idiom is to count the occurrences of elements in some data source. The preferred data structure for this mapping is a dictionary. The pseudo-code would look like:

occurrences = {}
for element in source:
    occurrences[element] = occurrences.get(element, 0) + 1

While correct, this is not Pythonic. A more Pythonic implementation takes advantage of the standard library:

occurrences = collections.Counter(source)

This will give us a dictionary-like object ready to be used.

It receives any iterable and works by looping over it, mapping each distinct element in that iterable to the number of times it appears.

For example, we could pass an inline generator to it in order to count the events per type, as they occurred:

>>> collections.Counter(e['event_type'] for e in event_stream)
Counter({'login': 8, 'logout': 2})
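A Counter also supports useful queries beyond plain lookups, such as most_common(), and missing keys count as zero instead of raising KeyError (a sketch with hypothetical data matching the counts above):

```python
import collections

# hypothetical stream of event types matching the counts shown above
event_types = ['login'] * 8 + ['logout'] * 2
occurrences = collections.Counter(event_types)

print(occurrences['login'])        # 8
print(occurrences.most_common(1))  # [('login', 8)]
print(occurrences['unknown'])      # 0 -- missing keys default to zero
```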

The first, naive implementation of the previous example used a dictionary with default values, something similar to calling occurrences.setdefault(element, 0) before each update. There is a better and more efficient way of having defaults in mappings: collections.defaultdict.

Created with a callable, a default dictionary will invoke that callable whenever a key is not found. It’s more compact and efficient than setting the default every time.

So for example, if we were to group the events by type, we could create this sort of mapping:

events_map = collections.defaultdict(list)
for event in event_stream:
    events_map[event['event_type']].append(event)
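With a hypothetical event_stream, the grouping would behave like this (note that merely accessing a missing key creates it with the default):

```python
import collections

# hypothetical event data
event_stream = [
    {'event_type': 'login', 'user': 'user'},
    {'event_type': 'logout', 'user': 'user'},
    {'event_type': 'login', 'user': 'admin'},
]

events_map = collections.defaultdict(list)
for event in event_stream:
    events_map[event['event_type']].append(event)

print(len(events_map['login']))   # 2
print(events_map['signup'])       # [] -- the default is created on access
```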

Stacks and queues are other data structures that come in handy for solving many problems. Lists are a very versatile built-in data structure in Python, and they even work well as stacks (appending and popping from the end). But using lists as queues, operating on the front of the list, is an anti-pattern in Python. Don’t do this:

my_list.insert(0, element)
first = my_list.pop(0)

The reason is that lists compute these operations in O(n), and we can do much better.

Instead, collections.deque provides the right support for these use cases, with operations like .append(element) and .popleft() that run in O(1). The code is not only more idiomatic, it’s also more efficient [2].
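A minimal sketch of a queue built on deque:

```python
import collections

queue = collections.deque()
queue.append('first')    # enqueue at the right end, O(1)
queue.append('second')
print(queue.popleft())   # first -- dequeue from the left end, O(1)
print(queue.popleft())   # second

# deque also accepts a maxlen, handy for keeping "the last N items"
recent = collections.deque(range(10), maxlen=3)
print(list(recent))      # [7, 8, 9]
```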

Other Helper Classes

There are two last classes that are arguably no longer indispensable. Still, they are worth exploring.

The first one is ChainMap. This creates a new mapping from several maps passed as parameters. When the original maps change their values, those changes are reflected in the chained map. One could argue that, since Python 3.5 lets you create a new dictionary by merging several with the syntax new_dict = {**d1, **d2}, this is no longer needed. However, there are some differences.

The syntax introduced in Python 3.5 solves only the problem of creating a dictionary from other dictionaries; ChainMap does much more. Let’s see.

context = {"system": "PROD1", "status": "up"}
event = {'date': '2017-07-11', 'event_type': 'login', 'user': 'user'}
enriched_event = {**event, **context}

Here, we created the enriched event with the keys of both the event and the context. This iterates over all the dictionaries (in order) and copies each key with its value into the new dictionary. If we then change one of the source dictionaries, that change won’t be reflected in the enriched event (it was already created; it’s a new object). With a ChainMap, however, it would be.
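A quick check of the copy semantics (with fresh names, to keep the example self-contained):

```python
d1 = {"status": "up"}
merged = {**d1}

d1["status"] = "down"
print(merged["status"])  # up -- the merged dict is a snapshot, not a view
```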

enriched_event = collections.ChainMap(event, context)
enriched_event['system'] # PROD1
context['system'] = "PROD2"
enriched_event['system'] # PROD2
# it also works the other way around
enriched_event['user'] = 'another_user'
event['user'] # another_user

Think of it as a view over multiple other mappings.

The second class is collections.OrderedDict, widely used to have a dictionary that preserves the order of its keys. Originally it was created by passing a list of two-element tuples, like [(key, value), ...]. However, since Python 3.6, dictionaries preserve insertion order (and from Python 3.7 this is guaranteed by the language), so you can create it as a regular dict and the order will also be maintained. This is the reason some might consider it a deprecated feature: the fact that keyword arguments preserve their order is a consequence of Python dictionaries now maintaining insertion order. In other words, dicts are now OrderedDicts (sort of).
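That said, OrderedDict still offers a couple of things a plain dict doesn’t, such as reordering keys in place and order-sensitive equality (a short sketch):

```python
import collections

od = collections.OrderedDict(a=1, b=2, c=3)
od.move_to_end('a')          # reorder in place; plain dicts can't do this
print(list(od))              # ['b', 'c', 'a']

# equality between OrderedDicts is order-sensitive
x = collections.OrderedDict(a=1, b=2)
y = collections.OrderedDict(b=2, a=1)
print(x == y)                # False -- same items, different order
print(dict(x) == dict(y))    # True -- plain dicts ignore order
```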


Conclusions

A summary of the main points discussed about collections:

  • Don’t subclass built-in types directly: use UserDict , UserList , or UserString when required. Subclassing the built-in types directly will yield non-obvious errors that are hard to debug and identify at first glance.
  • Use namedtuple for grouping multiple values, or when a function has to return multiple values.
  • Take full advantage of features like Counter , and defaultdict for solving common problems.
  • Don’t overuse lists, and don’t try to use them as something else. For stack or queue operations, use collections.deque.
  • OrderedDicts are dead [4]. ChainMaps, not so much.

Collections also includes a module for dealing with types through abstract base classes: collections.abc. A similar consideration to the one stated in the first section applies: prefer these interfaces when checking for types. For example, use isinstance(my_dict, collections.abc.MutableMapping) instead of isinstance(my_dict, dict) . The reason is that if my_dict is not exactly a dict, but a dict-like object (again, maybe an instance of a class that derives from collections.UserDict, or a ChainMap, etc.), then the former will work whilst the latter won’t.
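For example, with a UserDict subclass like the Event class from the first section:

```python
import collections
import collections.abc

class Event(collections.UserDict):
    pass

event = Event(user="user")
print(isinstance(event, dict))                            # False
print(isinstance(event, collections.abc.MutableMapping))  # True
```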

All in all, the collections module is a great source of utilities that help us write more Pythonic and efficient code.


References and Notes

Notes:

  1. “In the face of ambiguity, refuse the temptation to guess.” — The zen of Python.
  2. https://wiki.python.org/moin/TimeComplexity
  3. This is in CPython. In pypy, the implementation should work as you would expect. However, this is not a solution: The code has to be standard Python code, able to work on most (if not all) implementations, correctly.
  4. Raymond Hettinger on “OrderedDicts are dead”: https://twitter.com/raymondh/status/773978885092323328