Python: From Dictionaries to Data Classes

Dale Seo
Prodigy Engineering
9 min read · Jun 16, 2021
When developing data-driven applications, we need to handle data in memory before persisting it in a database and after retrieving it from the data store. It is crucial to handle data in memory both accurately and efficiently to ensure the system’s reliability.

As the payments platform at Prodigy Education has scaled over the last couple of years, our strategy to handle data in memory in Python has gradually evolved. In this blog post, I talk about the lessons learned taking various approaches from dictionaries to data classes, and how we’ve embraced type hinting to build modern large-scale applications along the way.

What are Dictionaries in Python?

Dictionaries are one of the most popular built-in data structures in Python. Using dictionaries, you can store key-value pairs and access values very quickly with keys. They are so common in Python that you can see them used almost everywhere whether it’s a data science project, a web application, or even your first Python 101 course.

It’s very easy to use dictionaries. All you need to create a dictionary is a pair of curly braces as follows.

user = {"id": 1, "name": "John Doe", "admin": False}

When we started to build payment systems at Prodigy, most of the team members were new to Python and we reached for dictionaries since they were so widely seen in other projects out there. However, as the team was constantly learning more about Python, we started to realize that it can be quite expensive to maintain code written with dictionaries for several reasons.

First of all, dictionaries are prone to human error. If you are not familiar with the codebase, you might make a silly mistake like this while trying to make a user an admin:

>>> user["is_admin"] = True
>>> user
{'id': 1, 'name': 'John Doe', 'admin': False, 'is_admin': True}

As another example, someone new to your team might introduce a bug like this, which is not easy for others to catch in code review:

>>> user["user_id"]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    user["user_id"]
KeyError: 'user_id'

The main drawback of dictionaries was that they could be mutated in any way, no matter the consequences, while they were in memory:

>>> user["id"] = 2
>>> user
{'id': 2, 'name': 'John Doe', 'admin': False}
>>> del user["name"]
>>> user
{'id': 2, 'admin': False}

This example is contrived to be extreme but it does give you an idea of how your data can get out of hand in a dictionary.
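To make the risk a little more concrete, here is a minimal sketch of how a dictionary passed to a helper can be changed behind the caller’s back. The `anonymize` function is a hypothetical helper invented for this illustration, not code from our platform.

```python
def anonymize(user: dict) -> dict:
    """Hypothetical helper that is meant to return a scrubbed copy for logging."""
    user["name"] = "REDACTED"  # Bug: this mutates the caller's dictionary in place.
    return user


user = {"id": 1, "name": "John Doe", "admin": False}
log_entry = anonymize(user)

# Both names point at the same object, so the original record has changed too.
print(log_entry is user)  # True
print(user["name"])       # REDACTED
```

Nothing in the function signature warns the caller that the original record will be altered.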

Although dictionaries allowed us to move very quickly in the early days of the applications, they felt too flexible and somewhat risky. Dictionaries started to scare away many developers on the team.

Named Tuples in Python

Named tuples are an immutable alternative to dictionaries. They are not as commonly used as dictionaries but you can easily come across them in projects where immutability must be guaranteed.

Named tuples are part of the standard library of Python. You can create a named tuple using the namedtuple function of the collections module as follows.

>>> from collections import namedtuple
>>> User = namedtuple("User", ["id", "name", "admin"])
>>> user = User(id=1, name="John Doe", admin=False)
>>> user
User(id=1, name='John Doe', admin=False)

The biggest difference between dictionaries and named tuples is that values stored in a named tuple are not allowed to be updated.

>>> user.id = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
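When a changed value is genuinely needed, a named tuple can produce a modified copy instead of being updated in place. The sketch below uses the standard `_replace` method; the variable names are only for illustration.

```python
from collections import namedtuple

User = namedtuple("User", ["id", "name", "admin"])
user = User(id=1, name="John Doe", admin=False)

# _replace returns a new tuple and leaves the original untouched.
promoted = user._replace(admin=True)

print(user)      # User(id=1, name='John Doe', admin=False)
print(promoted)  # User(id=1, name='John Doe', admin=True)
```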

In addition, you can only access the fields that were defined, which reduces the chance of a typo going unnoticed.

>>> user.user_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'User' object has no attribute 'user_id'

When one of my teammates suggested using named tuples instead of dictionaries, developers on the team immediately liked the solution, especially since immutability made so much sense in the payments domain, where slight changes in critical data such as financial and personal information can have far-reaching consequences for customers.

We were happy with named tuples for a while, but they exposed one critical flaw later on: they don’t care about data types, which can lead to a subtle bug like this:

>>> user = User(id=1, name="John Doe", admin="True")
>>> user.admin == True
False

Please notice that it is possible to assign a string value to the admin field which is expected to be a boolean value.
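The consequence is easy to underestimate, because any non-empty string is truthy in Python. Here is a minimal sketch, assuming a hypothetical `can_refund` permission check that is not taken from our codebase:

```python
from collections import namedtuple

User = namedtuple("User", ["id", "name", "admin"])


def can_refund(user) -> bool:
    """Hypothetical permission check that relies on truthiness."""
    return bool(user.admin)


# Imagine the value arrived as text, e.g. from a form field.
user = User(id=1, name="John Doe", admin="False")

print(user.admin == False)  # False, because a str never equals a bool
print(can_refund(user))     # True, because a non-empty string is truthy
```

A user whose admin flag reads "False" would pass the check, and nothing in the named tuple would complain.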

The immutable nature of named tuples had a lot of value for the team, but the lack of data types left something to be desired. It was clear that we needed a better option to handle data in memory in a type-safe way.

Trying Plain Classes in Python

Leveraging value objects or data holder objects is one of the best practices in OOP (Object-Oriented Programming). This concept is often called POPO (Plain Old Python Object) in Python and is typically implemented using custom classes with no dependency on third-party libraries or frameworks.

For example, you can create a class that represents a user of your system as follows. Please notice that type hints are used to indicate the data type for each field.

class User:
    def __init__(self, id: int, name: str, admin: bool = False):
        self.id = id
        self.name = name
        self.admin = admin

Using classes with type hints was a pivotal decision for the team because we were able to do type-checking using a static type checker like Mypy.

If one tried to set the admin field to a string value instead of a boolean value like this,

User(id=1, name="John Doe", admin="True")

Mypy would catch this type error and give them feedback.

$ mypy user.py
user.py:8: error: Argument "admin" to "User" has incompatible type "str"; expected "bool"
Found 1 error in 1 file (checked 1 source file)

One caveat of plain classes, though, was that they didn’t come with useful, value-based implementations of dunder methods such as __repr__, __eq__ and __hash__. The defaults inherited from object print a memory address and compare by identity.
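Equality is a good illustration of this caveat. In the short sketch below, which repeats the User class from above so that it runs on its own, two instances holding identical values are still considered different:

```python
class User:
    def __init__(self, id: int, name: str, admin: bool = False):
        self.id = id
        self.name = name
        self.admin = admin


first = User(id=1, name="John Doe")
second = User(id=1, name="John Doe")

# The inherited object.__eq__ only checks whether both sides are the same object.
print(first == second)  # False
print(first == first)   # True
```

The string representation has the same problem: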

>>> User(id=1, name="John Doe")
<__main__.User object at 0x10c2e7cd0>

It was tedious to implement them on our own, resulting in boilerplate code like this:

class User:
    def __init__(self, id: int, name: str, admin: bool = False):
        self.id = id
        self.name = name
        self.admin = admin

    def __repr__(self):
        return (
            self.__class__.__name__ + f"(id={self.id!r}, name={self.name!r}, admin={self.admin!r})"
        )

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.name, self.admin) == (
                other.id,
                other.name,
                other.admin,
            )
        return NotImplemented
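One more subtlety is worth knowing here: once a class defines __eq__ without __hash__, Python sets its __hash__ to None, so instances can no longer be placed in a set or used as dictionary keys. The sketch below shows this on a trimmed-down version of the class, followed by one possible __hash__ that hashes the same fields that __eq__ compares. `HashableUser` is a name made up for this example, and hashing mutable attributes is only reasonable if they are never reassigned.

```python
class User:
    def __init__(self, id: int, name: str, admin: bool = False):
        self.id = id
        self.name = name
        self.admin = admin

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.name, self.admin) == (other.id, other.name, other.admin)
        return NotImplemented


unhashable = False
try:
    {User(id=1, name="John Doe")}  # building a set requires hashing the element
except TypeError:
    unhashable = True

print(unhashable)  # True


class HashableUser(User):
    # Hash the same fields that __eq__ compares, so equal users share a hash.
    def __hash__(self):
        return hash((self.id, self.name, self.admin))


print(len({HashableUser(1, "John Doe"), HashableUser(1, "John Doe")}))  # 1
```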

While plain classes solved most issues with dictionaries and named tuples, this solution didn’t quite resonate with the team since it came at the expense of writing boilerplate code.

Using Data Classes in Python

Data classes (PEP 557) are one of the cool features that were added in Python 3.7 and have since been gaining popularity in the Python community. Data classes simplify the process of writing classes by generating the boilerplate code for you.

You just need to annotate your class with the @dataclass decorator imported from the dataclasses module. Here’s what a typical data class looks like.

from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str
    admin: bool = False

Please note that the syntax is slightly different from that of regular classes. You declare fields as annotated class-level attributes instead of assigning instance variables in __init__.

The obvious benefit of using data classes over plain classes is that it’s no longer required to implement dunder methods manually to provide a nice string representation or customize equality logic. This had a positive impact on the development velocity of our team, which needed to maintain a lot of value classes.

>>> User(id=1, name='John Doe')
User(id=1, name='John Doe', admin=False)
>>> User(id=1, name='John Doe') == User(id=1, name='John Doe')
True
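Since this post is about moving away from dictionaries, it is worth noting that converting between the two stays cheap, for example at the boundary with a data store. Here is a minimal sketch using the standard `dataclasses.asdict` helper. The `row` dictionary stands in for whatever a database driver might return and is purely illustrative.

```python
from dataclasses import asdict, dataclass


@dataclass
class User:
    id: int
    name: str
    admin: bool = False


# Hypothetical record, shaped like something read from a data store.
row = {"id": 1, "name": "John Doe", "admin": False}

user = User(**row)      # dict -> data class; unexpected keys raise TypeError
payload = asdict(user)  # data class -> dict, ready to serialize or persist

print(user)     # User(id=1, name='John Doe', admin=False)
print(payload)  # {'id': 1, 'name': 'John Doe', 'admin': False}
```

A mistyped key such as "is_admin" in `row` would fail loudly when the instance is constructed instead of being stored silently.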

Additionally, immutability could be easily achieved by the frozen option.

from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    name: str
    admin: bool = False

Please note that the error message is more explicit than the one raised by named tuples.

>>> user = User(id=1, name="John Doe")
>>> user.admin = True
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'admin'
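A frozen instance can still be “changed” by building a modified copy. Here is a minimal sketch using the standard `dataclasses.replace` function, with variable names chosen only for illustration:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    id: int
    name: str
    admin: bool = False


user = User(id=1, name="John Doe")
promoted = replace(user, admin=True)  # new instance; the original is untouched

print(user)      # User(id=1, name='John Doe', admin=False)
print(promoted)  # User(id=1, name='John Doe', admin=True)
```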

This was just the tip of the iceberg. Data classes come with lots of options, not only at the class level but also at the field level.

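from dataclasses import dataclass, field
from typing import List

# Membership is assumed to be another class defined elsewhere in the codebase.


@dataclass(order=True, unsafe_hash=True)
class User:
    id: int
    name: str
    admin: bool = False
    # hash=False keeps the list, which is unhashable, out of the generated __hash__.
    memberships: List[Membership] = field(default_factory=list, hash=False)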

The code above makes instances of the User data class orderable and hashable through two options of the @dataclass decorator. The memberships field will also be assigned a fresh list unless one is passed in.

>>> user1 = User(id=1, name="John Doe")
>>> user2 = User(id=2, name="Jane Doe")
>>> user1 < user2
True
>>> sorted([user2, user1])
[User(id=1, name='John Doe', admin=False, memberships=[]), User(id=2, name='Jane Doe', admin=False, memberships=[])]
>>> user3 = User(id=1, name="John Doe")
>>> user4 = User(id=2, name="Jane Doe")
>>> set([user1, user2, user3, user4])
{User(id=1, name='John Doe', admin=False, memberships=[]), User(id=2, name='Jane Doe', admin=False, memberships=[])}
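The reason `default_factory` is needed is that a plain mutable default such as `[]` would be shared by every instance, so the dataclasses module rejects it outright. Here is a small sketch that uses `List[str]` in place of the Membership type to stay self-contained; `BrokenUser` is a made-up name for the failing case.

```python
from dataclasses import dataclass, field
from typing import List

rejected = False
try:
    @dataclass
    class BrokenUser:
        id: int
        memberships: List[str] = []  # mutable default: rejected at class creation
except ValueError:
    rejected = True

print(rejected)  # True


@dataclass
class User:
    id: int
    memberships: List[str] = field(default_factory=list)


user1 = User(id=1)
user2 = User(id=2)
user1.memberships.append("premium")

# Each instance received its own list, so user2 is unaffected.
print(user1.memberships)  # ['premium']
print(user2.memberships)  # []
```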

When I discovered data classes and shared them with the team, everyone was excited about being able to get type safety with much less code. Data classes seemed like a very balanced solution, and the team organically adopted them, progressively refactoring our code.

Finding Success with Python Typing

Python is considered a dynamically-typed language, and this flexibility contributed to its early adoption at Prodigy. However, the addition of type hints in recent versions of Python opened up great opportunities for static typing as well.

One of the biggest wins of using data classes with type hints for our team was that it helped us design better APIs.

For example, when we used dictionaries or named tuples, our API looked like this with type hints.

def do_something_with_user(user: dict) -> dict:
    ...

You can technically call this function with any dictionary and it’s not clear that the returned dictionary contains user information. It turns out that type hints don’t add a lot of value with dictionaries and named tuples.

On the other hand, with data classes, we now write the same function as follows.

from app.users import User


def do_something_with_user(user: User) -> User:
    ...

If you pass something other than an instance of the User class, your type checker will notice it and tell you to fix it. This function is way safer and easier to work with.
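To make this concrete, here is a hedged sketch of such a function. `promote_to_admin` and the inline `User` class are assumptions made for illustration (in the snippet above, `User` is imported from `app.users`). The commented-out call at the end is the kind of mistake that a static type checker such as Mypy is expected to flag before the code ever runs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    id: int
    name: str
    admin: bool = False


def promote_to_admin(user: User) -> User:
    """Return a copy of the user with admin rights (hypothetical example)."""
    return replace(user, admin=True)


admin = promote_to_admin(User(id=1, name="John Doe"))
print(admin)  # User(id=1, name='John Doe', admin=True)

# A type checker should reject this call, because a dict is not a User:
# promote_to_admin({"id": 1, "name": "John Doe"})
```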

Being able to catch type errors before shipping new code was a game-changer from a maintenance perspective. We saw more and more bugs identified at build time that our customers would’ve encountered in production if it had not been for type-checking.

Python typing also made a huge difference in our code quality. We were able to quickly figure out what the function does by glancing at the function signature because the type hints served as great documentation. We no longer need to add a verbose docstring to explain what the function accepts and returns and it’s also easier to write tests when you have a better idea of what is expected.

Last but not least, our overall developer experience has improved significantly. We integrated Mypy into our CI so that no one accidentally merges PRs with type errors. Developers felt more confident about making code changes with a pre-commit hook for Mypy running locally. Our code editors supported much better autocomplete and IntelliSense with type hints, making development a breeze. It had a positive impact on developer productivity and happiness overall.

Python: The Importance of Using the Right Tool

How would you handle data in memory in Python? Each approach that I’ve covered here in this article has its own set of pros and cons and I think the best solution for your application depends on what phase of growth it is in.

Dictionaries should still be a good option in scenarios where flexibility is needed more than rigidity. Data classes might not make as much sense for small prototypes or standalone scripts. There, type checking could simply add complexity to your tooling and slow your team down due to the learning curve.

Having said that, if your service deals with mission-critical data and needs to specify rich data structures like our payments platform at Prodigy, data classes can go a long way towards building robust applications with fewer runtime errors.

It has been almost a year since I started using data classes. They have really grown on me and now, using them is second nature to me. Personally, I suggest data classes for pretty much every project that would live long enough to realize all the benefits.

Type hinting appears to be a relatively new concept in Python, but this whole trend toward typing reminds me of how popular TypeScript has become in the JS scene. I believe type hints will become an essential part of every Python programmer’s toolkit in the near future.

I hope you have enjoyed our journey all the way from dictionaries to data classes and that it helps you choose your own solution for your next Python project.

If you’re interested in joining our team, check out our open positions here
