Why I love Python Data Classes, and why you should, too

Jose Haro Peralta
Python Geek
Published in
6 min readNov 3, 2019

Available in the standard Python library since Python 3.7, Data Classes are a bit of a game changer. They make so many things so much easier, and they open lots of possibilities in your code.

What are data classes good for? Typically, for everything that has to do with carrying data around and making transformations on that data. True, you can just use dictionaries and functions to do that, and perhaps add named tuples into the mixture. This option is actually quite OK when the data that you’re handling requires few transformations, but it gets messy otherwise. Then, you could say, you can use just normal classes. And that’s totally fine, only that for something which is meant to just carry data, the traditional class is a bit cumbersome.

There ought to be a middle point between bare bones data structures such as dictionaries, and the full-blown state management machinery that classes provide. And that’s where data classes come in. Don’t get me wrong, data classes are just normal classes. But both their design and the more simplified interface that they offer make them a more qualified candidate for the task of carrying data.

Imagine we are trying to work with data about historical prices of properties in a certain area. We can organise our data in different ways, but for the sake of argument, let’s say we are going to organise our data on a per property basis. The most basic data structure will require the following attributes:

  • Address
  • Array of previous sold prices for the property. Each element in the array will have the following attributes: date sold and price sold
  • Type of property: residential or commercial
  • Type of residential property: flat, bungalow, terraced house, detached house, semi-detached house
  • Number of rooms
  • Unique resource identifier (could be an ID or link to the property in some website)
  • Whether it has a garden
  • Whether it has a balcony
  • Type of tenancy: freehold or leasehold

We could list many more attributes, but for the purposes of this example, this should be enough. For the purposes of this article, let’s assume that the data comes from a JSON file.

If everything was straightforward, a dictionary would do the job for this pretty well. It might look like this for a specific property:

describing properties with dictionaries

If everything was perfect, this data structure might be just fine, and you could access data fields through keywords, such as property['has_garden']. What happens if one of the fields is missing? Sure you can do property.get('has_garden'). However, what happens if you want to check the third sold price of the property? In this case, we only have 2 sold prices available, so that won’t be possible. We would need a function that first checks whether we have historical prices for the property, and then checks whether we have at least three records.

We could have many such functions that allow us to inspect other complex attributes. However, too many of those functions hanging around makes for a piece of code which is increasingly difficult to manage. It also kind of places the burden of managing access to this data in the wrong side: the consumer. If a certain property may or may not be a freehold is something that the property object itself should be able to say. The consumer of the data shouldn’t have to determine whether such a field exists and whether its value is correct. All of this should be handled by the data provider.

You could parse the data coming from your source (let’s say a JSON file as I suggested before) and build a more complete representation of it, with all the fields that the consumer expects. You could assign default values such as Noneto fields which don’t exist in the source.

This is good, but if you’re doing all this parsing to end up with just another dictionary, you’re still going to face problems specific to accessing data from dictionaries. Accessing data with indexes represented by strings is not really the best way to manage our code. What if someone makes a typo in a string? You could have a collection of enums that point to the string value of the fields. That would also protect developers against changes to the names of the fields. However, in that case you have you have to maintain the collection of enums, and although perfectly possible, it feels a bit unnatural to declare your keys with syntax like the following:

property[PropertyEnums.has_garden.value] = False

But perhaps the most annoying thing about keeping our data in that state is that, as you can see, even at this small scale the size of our data model is already looking overwhelming. Add a few more fields and you’ll quickly lose track of what’s going on. Got a missing field? All you’ll get is an IndexErrorwhen trying to retrieve it. Your editors won’t help, since dictionary values are not enforceable, and therefore won’t be able to trace a missing key. However, compulsory fields for the initialization of a class can be enforced and can traced. You’ll possibly be warned by your editor that a certain field is missing before you have a chance to run your code.

What would the previous property object look like if we implemented it as a data class? It could be something like the following:

first attempt at Property data class

To declare a dataclass, we need to use the @dataclass decorator from the dataclasses module to decorate our class. This decorator inspects the class attributes listed within our class declaration and uses them to build an object using the __init__() method behind the scenes. We need to specify the type of each attribute. The intialiser won’t enforce the types, but our text editor of choice might use them as hints to reveal inconsistencies in our code if we end up assigning them the wrong types. Compare this simplicity (and dare I say, elegance) of this data model declaration to the mess of the dictionary that we showed before.

This data class mirrors very closely the declaration of our previous dictionary. But once in the realm of classes, we can do better than that. Let’s refactor this a bit so that it becomes more readable and maintainable:

a better shot at data classes

This is looking much better. We can now smoothly access fields with dot notation. For example, the path to the property’s country becomes Property.address.country. As they stand now, all fields are compulsory when building the object, so the consumer of this data can rest assure that all attributes will be accessible.

Making all fields compulsory is nice for the consumer, but it can become cumbersome for the supplier of this data. Ideally, we could have a situation where the class itself knows which fields are available and return their values, and when they’re not, return sensible defaults. A pattern that I find very useful is to pass the raw data directly to the class, and let the class set the attributes’ values correctly. In our case, it might look like this:

a better way of managing properties data

Now this looks much better. Notice how each class is only expecting raw data, and then uses the __post_init__() method available to data classes to set all necessary attributes after the initialization. The Property class is doing the job of parsing each bit of data from the original raw dictionary, and making sure that the other classes get the right type of data that they’re expecting. When the original dictionary doesn’t contain values, the Property class sets a sensible default value for them.

And since we are using classes, we can leverage all the capabilities of Python classes to handle our data attributes in a robust manner. For example, using class properties we can make sure that the Sale class is able to process the date field and return the right object, or a default value of None if no date is available. If we needed to, we could also have a base class that performs some basic common checks on the attributes, or we could have mixins that provide collections of methods that useful for certain classes.

In addition to that, debugging becomes also a lot easier in this context. If we try to access an invalid attribute, we’ll get a precise error telling us which consumer has tried to access what attribute on an object from which class. Or if an error occurs during the initialization of an object, we’ll know at which point exactly the problem takes place, and we will be able to put a breakpoint somewhere in the class that allows us to inspect the state of the object and the data which is giving us trouble.

If all those reasons weren’t enough, using data classes will make your code look cleaner, more elegant, and dare I say, professional. It will definitely make it more maintainable. And altogether will make you look like an expert.

--

--