Python Dataclasses With Properties and Pandas

Sebastian Ahmed
Published in The Startup
Feb 10, 2021

At some point in my Python OOP journey, I came across dataclasses. I vaguely recall this coming about while I was writing seemingly repetitive boilerplate code for a class that did little more than contain a rather lengthy data record. Hoping there was a better way to do this led me to discover dataclasses, which were introduced in Python 3.7 (via PEP 557).

In a nutshell, dataclasses make developing and maintaining data-centric classes simpler.

In this post I want to share some discoveries about more interesting class implementations, and about how such “interesting” dataclasses can be easily leveraged by the Python data-analysis module pandas.

An “Interesting” Data-Class

The following list constitutes what I would consider “interesting” in the sense of what might happen in real-life when creating a dataclass:

  • Properties which generate attribute values — perhaps with no underlying class attribute (i.e. a pure property). These can be thought of as data-processing layers with attribute semantics (vs methods)
  • Read-only attributes
  • Initializer-only fields
  • Properties which generate attribute values based on initializer-only fields
  • Attributes initialized with an optional list or dict type
  • Attributes which we do not wish to export to a pandas DataFrame

Below is how such a class might look. Note that DataClass_Modern implements all the “interesting” features listed above:

  • attr5 is a pure, read-only property that derives its value from the initializer-only field attr0. No stored attribute holds the value of attr5. Note the dummy setter, which the dataclasses module required so that the field appears settable. We also had to declare it with the init=False field-control attribute to exclude it from the __init__() method generation
  • attr4 is initialized with an optional list, which requires a default_factory that returns a new empty list object when no list is provided
  • _attrHidden is a copy of the initializer-only attr0 and is intended to be hidden. We perform this copy in the __post_init__() method
  • rand_factory is a classmethod with the sole purpose of generating randomized objects of this class. More on that below as we discuss pandas
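
The original listing is not reproduced here, so what follows is only a minimal sketch of a class matching the description above. The field types, default values, and the square-root calculation behind attr5 are assumptions made for illustration and may differ from the author's code.

```python
import math
import random
from dataclasses import InitVar, dataclass, field, fields
from typing import List


@dataclass
class DataClass_Modern:
    attr0: InitVar[int] = 81          # initializer-only: handed to __post_init__, not a field
    attr1: int = 0
    attr2: float = 0.0
    attr3: str = "undefined"
    attr4: List[int] = field(default_factory=list)   # new empty list per instance
    attr5: float = field(init=False)  # declared as a field; the property below supplies its value

    def __post_init__(self, attr0: int) -> None:
        # Not annotated in the class body, so this is a plain instance attribute, not a field.
        self._attrHidden = attr0

    @property
    def attr5(self) -> float:
        # Assumed calculation, for illustration only.
        return math.sqrt(abs(self._attrHidden))

    @attr5.setter
    def attr5(self, _value) -> None:
        # Dummy setter: assignments are silently ignored, so attr5 stays effectively read-only.
        pass

    @classmethod
    def rand_factory(cls) -> "DataClass_Modern":
        # Build one instance with randomized values (ranges are arbitrary).
        return cls(
            attr0=random.randint(0, 100),
            attr1=random.randint(-100, 100),
            attr2=random.random(),
            attr3=random.choice(["a", "b", "c"]),
            attr4=[random.randint(0, 9) for _ in range(3)],
        )


obj = DataClass_Modern(attr0=16, attr1=7)
print(obj)                             # generated repr: shows attr5, omits attr0 and _attrHidden
print([f.name for f in fields(obj)])   # → ['attr1', 'attr2', 'attr3', 'attr4', 'attr5']
```

The call to fields() lists what the dataclasses module treats as the record: the initializer-only attr0 and the unannotated _attrHidden are absent, while attr5 is present.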

Constructing a pandas DataFrame

Now that we have defined our dataclass with the various “interesting” features, let’s see how seamless it is to get a pandas DataFrame constructed from a list of random instances of DataClass_Modern.

Inspecting the code above, one can immediately see that it is very simple. First we generate a list of 100 random objects using our rand_factory() method. Then we simply pass this list directly to the default DataFrame constructor.
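
As a rough sketch of those two steps, the snippet below uses a hypothetical, trimmed-down stand-in class named Sample rather than the article's DataClass_Modern. The guarded pandas import is an assumption added only so the sketch also runs where pandas is not installed.

```python
import random
from dataclasses import dataclass, field
from typing import List

try:
    import pandas as pd
except ImportError:      # guard is an assumption of this sketch, not part of the article
    pd = None


@dataclass
class Sample:
    # Hypothetical stand-in for DataClass_Modern, reduced to two fields.
    attr1: int = 0
    attr4: List[int] = field(default_factory=list)

    @classmethod
    def rand_factory(cls) -> "Sample":
        return cls(attr1=random.randint(0, 99),
                   attr4=[random.randint(0, 9) for _ in range(3)])


# Step 1: a list of 100 random objects.
objs = [Sample.rand_factory() for _ in range(100)]

# Step 2: pass the list straight to the default DataFrame constructor.
if pd is not None:
    df = pd.DataFrame(objs)
    print(df.head())
```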

Too easy!

But what about the following?

  • Does attr5 get exported to the DataFrame like we expect?
  • Do attr0 and _attrHidden get suppressed from the DataFrame as we might expect?

The answer to both questions is a big YES. See the output below:

pandas DataFrame construction with dataclasses objects
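
One plausible explanation for this result, assuming a recent pandas version that recognizes dataclass instances and converts them with dataclasses.asdict() before building the frame, is that asdict() walks only the declared fields and reads each one through normal attribute access. The standard-library sketch below shows what that conversion yields for a hypothetical trimmed-down class; Rec and its calculation are stand-ins, not the article's code.

```python
import math
from dataclasses import InitVar, asdict, dataclass, field, fields


@dataclass
class Rec:
    attr0: InitVar[int] = 81
    attr1: int = 0
    attr5: float = field(init=False)

    def __post_init__(self, attr0: int) -> None:
        self._attrHidden = attr0      # plain attribute, never declared as a field

    @property
    def attr5(self) -> float:
        return math.sqrt(abs(self._attrHidden))

    @attr5.setter
    def attr5(self, _value) -> None:
        pass                          # dummy setter, assignments are ignored


rec = Rec(attr0=16, attr1=3)
print([f.name for f in fields(rec)])  # → ['attr1', 'attr5']
print(asdict(rec))                    # → {'attr1': 3, 'attr5': 4.0}
print(vars(rec))                      # → {'attr1': 3, '_attrHidden': 16}
```

Comparing the last two lines shows the difference: asdict() follows the field declarations, so it picks up the property and skips the hidden copy, whereas the instance __dict__ does the opposite.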

Should this be a surprise? Somewhat. How would this have worked if we had not used dataclasses?

Constructing a pandas DataFrame with a “classic” class

The code snippet below is an equivalent data-centric class definition with all the “interesting” features we wanted using a standard Python user-defined class structure. Now, granted there are a few ways this could have been done, but here we’ll keep things simple and make it as close to a basic user-defined class as possible.

At first glance it is clear that this took a bit more work (and it is in fact incomplete, as we’ll see shortly). Note how we had to treat the list initializer and how we had to provide our own __str__() method. We didn’t, however, need to provide the dummy setter for attr5.

In the code snippet above, note that we are using the special from_dict constructor method. This is because we rely on the built-in __dict__ attribute of the DataClass_Classic objects to get their attributes. This looks simple enough, but what does the DataFrame actually look like using this similarly terse approach?

pandas DataFrame construction with “classic” objects

Two problems:

  1. _attrHidden is being exported. We didn’t want that. This is not surprising, because __dict__ does include “hidden” attributes (i.e. those starting with an underscore). In case you are wondering, a dunder (double-underscore) prefix doesn’t help. In fact, the name-mangled version (with the class-name prefix) is exported to the DataFrame
  2. Where is attr5? Nowhere to be found, it seems. This is because an object’s __dict__ holds only its instance attributes, whereas attr5 is a property defined on the class. The attr5 attribute is in fact an illusion: it is simply a getter method that the property decorator exposes with attribute syntax. Note that adding a dummy setter method does not help. Creating attributes this way is a very common pattern in Python because it can provide a data-processing layer with attribute (vs method-calling) semantics
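
Both problems can be reproduced without pandas by inspecting the instance __dict__ directly. The class below is a hypothetical, trimmed-down stand-in for DataClass_Classic, and the __secret attribute is added purely to illustrate name mangling.

```python
import math


class Classic:
    # Hypothetical stand-in for DataClass_Classic.
    def __init__(self, attr0=81, attr1=0, attr4=None):
        self.attr1 = attr1
        self.attr4 = [] if attr4 is None else attr4   # avoid a shared mutable default
        self._attrHidden = attr0
        self.__secret = attr0     # stored under the mangled name _Classic__secret

    @property
    def attr5(self):
        return math.sqrt(abs(self._attrHidden))


obj = Classic(attr0=16, attr1=3)

# vars(obj) returns obj.__dict__: the underscore-prefixed names are all there.
print(sorted(vars(obj)))   # → ['_Classic__secret', '_attrHidden', 'attr1', 'attr4']

# The property lives in the class namespace, not in the instance.
print("attr5" in vars(obj), "attr5" in vars(Classic))   # → False True
```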

It turns out that in order to fix the above two issues, we actually need to write a custom method that exports a controlled dict instead of relying on the built-in __dict__. This is not a huge deal, but now we need to customize and maintain this method, something that was totally unnecessary when using dataclasses. The full example showing the custom method is contained in the following project:

https://github.com/sebastian-ahmed/python-etc/blob/main/dataclasses_pandas/example.py
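
For readers who do not want to open the project, here is a minimal sketch of what such an export method could look like. The names to_dict and _EXPORT are hypothetical and are not taken from the linked example, which may be structured differently.

```python
import math


class Classic:
    # Hypothetical whitelist of attribute names to export, in column order.
    _EXPORT = ("attr1", "attr4", "attr5")

    def __init__(self, attr0=81, attr1=0, attr4=None):
        self.attr1 = attr1
        self.attr4 = [] if attr4 is None else attr4
        self._attrHidden = attr0

    @property
    def attr5(self):
        return math.sqrt(abs(self._attrHidden))

    def to_dict(self):
        # getattr() uses normal attribute lookup, so the attr5 property is evaluated.
        return {name: getattr(self, name) for name in self._EXPORT}


rows = [Classic(attr0=n * n, attr1=n).to_dict() for n in range(3)]
print(rows[2])   # → {'attr1': 2, 'attr4': [], 'attr5': 2.0}
```

A list of dicts like rows can then be handed to the DataFrame constructor. The cost is that the whitelist has to be kept in sync with the class by hand.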

Having said this, in order to have very advanced control of exporting a user-defined class to a DataFrame, the classic approach is still more powerful if you want to do things like control which attributes are not included during construction at run-time or provide object handles as a pandas column (but this is beyond the scope of this article).

Observations

I would typically mark the last section of any blog post as “Conclusions”, but that would be misleading in this case. Instead we have simply made some observations:

  • Using dataclasses reduces repetitive and boilerplate code. For simple data-centric classes, the dataclasses-based implementation will generally be much more succinct (and thus easier to read and maintain). dataclasses also have many features that were not discussed in this article, such as generated object comparisons, which again make them very suitable for data-centric applications
  • It is possible to implement more advanced attribute behaviors with dataclasses such as read-only generated attributes, initialization-only fields, and factory-based initializers (which can also employ lambda functions)
  • dataclasses provide a very seamless interface to generation of pandas DataFrames. Surprisingly, the construction followed the semantic intent of hidden attributes and pure property-based attributes
  • More advanced construction of DataFrames may require the use of standard Python classes. The test of this is usually evidenced by having to contort a dataclass implementation to achieve something that would otherwise be simple with a regular class
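
Two of the features mentioned in passing above, generated comparisons and lambda-based factories, can be sketched briefly. The Reading class and its fields are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Reading:
    # order=True generates <, <=, >, >= in addition to the default ==.
    channel: int = 0
    value: float = 0.0
    # A lambda works as a default_factory; compare=False keeps tags out of comparisons.
    tags: List[str] = field(default_factory=lambda: ["raw"], compare=False)


a = Reading(1, 2.5)
b = Reading(1, 2.5, tags=["filtered"])
c = Reading(2, 0.1)

print(a == b)                     # → True  (tags are excluded from the comparison)
print(a < c)                      # → True  (ordered like the tuple (channel, value))
print(a.tags is Reading().tags)   # → False (each instance gets its own list)
```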
