Better Python Programming for Data Scientists, Part 4: Introduction to Object-Oriented Programming

Published in

The Data Nerd

8 min read · Jul 16, 2022

The 4 pillars of object-oriented programming

I will never forget the day in grad school when a friend admitted she was completely stuck on a homework assignment for our intermediate data science course. She had been at the top of the class in the introductory class, so this came as a surprise to me. The assignment involved working with a starter notebook that included a custom class, and my friend had never seen that style of programming before and had no idea how to use the code.

With the beginner-friendly documentation for libraries like scikit-learn and the wealth of tutorials available, it is very easy to work on data science projects for a long time without ever learning object-oriented programming. But eventually the day comes when you can’t avoid it any longer, and if you have been treating these library APIs as something akin to magic, it can be deeply confusing.

In this series we’ve looked at Python fundamentals (parts 1 and 2) and data structures. Now it’s time to leap into the world of object-oriented programming (or OOP).

Classes and Objects

Object-oriented programming is a paradigm that centers data rather than functions in its structure. What does that mean?

It means that if you want to, for example, create an address book app, your code is centered around containers for the necessary data, which in this case would be people, businesses, and other entities you might want to include in your address book.

This design involves two core components:

Classes: the abstract entity that will own data, have common attributes, and perform functions. In our example, people and businesses would be classes. They have names, addresses, phone numbers, and functionality might include retrieving (“getting”) or updating (“setting”) a phone number.
Objects: the specific instances of classes (when you initialize a variable as an object, this is known as instantiation), that have data values associated with them. In our example, each individual person or business record is an object. You can get or set the name, phone number and address of each individual.

If we look at a Python library like scikit-learn, each algorithm is a class. When you instantiate a model, you are creating an object of that class. If we wanted to create our own class, it would look something like the below code. In this example, we can create a classifier that assigns classes at random.
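Here is a minimal sketch of what such a random classifier might look like. The names RandomClassifier, seed and classes_ are my own choices for illustration, and the fit/predict methods only imitate the scikit-learn convention; nothing here is taken from scikit-learn itself.

```python
import random


class RandomClassifier:
    """Toy classifier that ignores the features and guesses a label at random."""

    def __init__(self, seed=None):
        # Attributes set in the constructor belong to each individual object
        self.seed = seed
        self.classes_ = None              # filled in by fit()
        self._rng = random.Random(seed)   # per-object random number generator

    def fit(self, X, y):
        # "Training" only records which labels exist
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # One random label per row of X
        return [self._rng.choice(self.classes_) for _ in X]


# Instantiation: clf is an object (an instance) of the RandomClassifier class
clf = RandomClassifier(seed=42)
clf.fit([[0.1], [0.7], [0.3]], ["cat", "dog", "cat"])
predictions = clf.predict([[0.5], [0.9]])

print(clf.classes_)        # ['cat', 'dog']
print(len(predictions))    # 2
```

The class is the blueprint; clf is one object built from it, with its own seed and its own learned classes_.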

When working with object-oriented programming, you will often see property and attribute used interchangeably. They do have different meanings in Python, though. An attribute is any variable that belongs to the class or object. A property is a special type of attribute whose reads, writes, and deletions are handled by functions (through the __get__, __set__ and __delete__ methods of the property object), and it is created by applying the @property decorator to a function. Although properties are backed by functions, they are accessed in the same way as attributes (that is, as object.property). This means a property does not need to be declared in the constructor with an initial value stored; its value can be computed when it is accessed. It is uncommon to need to create properties.
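To make the difference concrete, here is a small sketch using the address book idea. The Contact class and its _phone_digits attribute are hypothetical names made up for this example.

```python
class Contact:
    def __init__(self, name, phone):
        self.name = name      # a plain attribute, stored on the object
        self.phone = phone    # goes through the property setter below

    @property
    def phone(self):
        # Computed on every access; nothing named "phone" is stored
        d = self._phone_digits
        return f"({d[:3]}) {d[3:6]}-{d[6:]}"

    @phone.setter
    def phone(self, value):
        # Keep only the digits, whatever format the caller used
        self._phone_digits = "".join(ch for ch in value if ch.isdigit())


contact = Contact("Ada", "555-867-5309")
print(contact.phone)             # (555) 867-5309

contact.phone = "555.123.4567"   # looks like attribute assignment
print(contact.phone)             # (555) 123-4567
print(sorted(vars(contact)))     # ['_phone_digits', 'name']
```

From the outside, name and phone are used in exactly the same way, even though only one of them is actually stored on the object.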

In my own work, I had to use properties in the MABWiser library when our use of nested named tuples for two of our attributes prevented us from pickling our models, because the pickle protocol doesn’t support nested classes. I changed the two attributes to properties and wrote logic in the property functions to recreate the named tuples when accessed. This meant the named tuples were no longer stored, and we could safely pickle our models without losing functionality. You can see how this was implemented in the GitHub repository.

Data Abstraction and Inheritance

Data abstraction comes from the idea that there is some functionality that an individual class shouldn’t need to deal with. This can be functionality that is common to multiple classes, or it might be the case that there are two or more separate entities that comprise the larger whole which has its own emergent functionality. Examples of this are the wheels, doors, and engine that come together as parts of a car, generic functionality that a classifier needs to have regardless of specific algorithm, or the individual decision trees that are components of a random forest.

Sometimes you’ll see this phrased as there being irrelevant or unimportant things the user doesn’t need to know or care about. In this case, “user” means the person writing a class and using other classes in the process, not an end-user of the library or API. The person writing our hypothetical car class cares that the wheel class has turn and roll functions but doesn’t care how they are implemented.

Data abstraction is the separation of functionality into other classes, “abstracting away” the details you don’t need to worry about in order to keep your implementation simple and efficient.
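The car example can be sketched in a few lines. The Wheel and Car classes below, and the steer and drive method names, are hypothetical; the point is only that Car relies on turn and roll without knowing how they work.

```python
class Wheel:
    def __init__(self):
        self.angle = 0
        self.rotations = 0

    def turn(self, degrees):
        self.angle += degrees

    def roll(self, rotations):
        self.rotations += rotations


class Car:
    def __init__(self):
        # The car is composed of wheel objects
        self.wheels = [Wheel() for _ in range(4)]

    def steer(self, degrees):
        # Only the two front wheels turn; how a wheel turns is Wheel's business
        for wheel in self.wheels[:2]:
            wheel.turn(degrees)

    def drive(self, rotations):
        for wheel in self.wheels:
            wheel.roll(rotations)


car = Car()
car.steer(15)
car.drive(10)
print([w.angle for w in car.wheels])      # [15, 15, 0, 0]
print([w.rotations for w in car.wheels])  # [10, 10, 10, 10]
```

If the internals of Wheel change later, Car keeps working as long as turn and roll still exist.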

Often the details that can be abstracted away are things that are common to multiple classes. In our address book app example, we might have an abstraction of Entry that has attributes such as name, address, and phone number, while a Business has hours and a contact person, and a Person has a birth date. Likewise, if we want to add parallelization to our RandomClassifier, that is probably something we will want to add to other classifiers too, through a generic Classifier class.

This leads us to the idea of inheritance, the arrangement of a hierarchy in which child classes or subclasses inherit behavior and properties from parent or base classes. Inheritance only goes in one direction, so anything defined in the child class is not available in the parent class. Child classes can also override the values and functions implemented in the parent class without impacting the parent or other child classes. If a function is implemented in the parent class, that parent functionality can still be accessed internally by calling super().function() within the overriding function of the child class.
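Here is how the address book hierarchy might be sketched. The summary method and the sample values are made up for illustration.

```python
class Entry:
    def __init__(self, name, address, phone):
        self.name = name
        self.address = address
        self.phone = phone

    def summary(self):
        return f"{self.name}, {self.phone}"


class Person(Entry):
    def __init__(self, name, address, phone, birth_date):
        super().__init__(name, address, phone)   # reuse the parent constructor
        self.birth_date = birth_date

    def summary(self):
        # Override, but still build on the parent's implementation
        return f"{super().summary()} (born {self.birth_date})"


class Business(Entry):
    def __init__(self, name, address, phone, hours, contact_person):
        super().__init__(name, address, phone)
        self.hours = hours
        self.contact_person = contact_person
    # summary() is not overridden, so Entry.summary() is inherited as-is


person = Person("Ada Lovelace", "12 Example Square", "555-0100", "1815-12-10")
shop = Business("Corner Bakery", "1 Main St", "555-0199", "7am-3pm", "Sam")

print(person.summary())   # Ada Lovelace, 555-0100 (born 1815-12-10)
print(shop.summary())     # Corner Bakery, 555-0199
print(hasattr(Entry("x", "y", "z"), "birth_date"))   # False
```

Overriding summary in Person changes nothing for Entry or Business, and the parent never gains the child's birth_date.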

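In addition to implemented behavior and properties, API definitions can be inherited as abstract methods. That is, all child classes must implement a function with a predefined function signature. Python itself only checks that each abstract method has been implemented, not that the arguments match, so keeping the signature identical is up to the author of the child class. Done consistently, this ensures for example that all of your classifiers have fit and predict functions that take the same arguments. We can see an example of this below: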
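A minimal sketch using the standard library's abc module might look like this. The Classifier base class, the IncompleteClassifier name, and the seed argument are assumptions made for the example.

```python
import random
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Abstract parent: defines the API, implements nothing."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class RandomClassifier(Classifier):
    def __init__(self, seed=None):
        self.classes_ = None
        self._rng = random.Random(seed)

    def fit(self, X, y):
        # X is accepted to keep the shared signature, but it is not used
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        return [self._rng.choice(self.classes_) for _ in X]


class IncompleteClassifier(Classifier):
    def fit(self, X, y):
        return self
    # predict() is missing on purpose


clf = RandomClassifier(seed=0).fit([[1], [2]], ["a", "b"])
print(clf.classes_)   # ['a', 'b']

try:
    IncompleteClassifier()
except TypeError as err:
    print("Cannot instantiate:", err)
```

The check happens at instantiation time: a child class that leaves an abstract method unimplemented cannot be turned into an object.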

Our example of RandomClassifier highlights a common challenge in inheritance — sometimes there are arguments required for certain child classes but not others. This is especially true for data science libraries where we want to have standardized APIs where algorithms can be used interchangeably in code without requiring updating your calls to fit and predict. Accepting the arguments and not using them is a simple solution that keeps the API consistent.

Another potential solution is related to overloading, which specifically refers to having two or more functions with the same name but different function signatures. Python does not choose between functions based on their signatures: when the same name is defined more than once, only the last definition is used. So if a child class implements the abstract methods of the parent class but omits the unneeded arguments, the child’s signature is the one used for objects of that class rather than the abstract one. As with overriding, the parent function can still be accessed from inside the child class function by calling super().function(). Likewise, if two versions of the same function are implemented within a single class, only the last one will be used. The trade-off is that the child class can no longer be called with exactly the same arguments as its siblings.
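A quick way to see the "last definition wins" behavior for yourself is the sketch below. The Greeter class is a made-up example.

```python
class Greeter:
    def greet(self):
        return "hello"

    def greet(self, name):   # same name: this replaces the definition above
        return f"hello, {name}"


g = Greeter()
print(g.greet("Ada"))   # hello, Ada

try:
    g.greet()            # the zero-argument version no longer exists
except TypeError as err:
    print("TypeError:", err)
```

When you really do want one function that accepts different sets of arguments, the usual Python approach is a single definition with default argument values, such as def greet(self, name=None).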

For a real-world example of abstraction and inheritance in action, check out scikit-learn’s source code for decision trees. There is a BaseDecisionTree class that DecisionTreeClassifier and DecisionTreeRegressor inherit from. The generic decision tree functionality resides in the parent class, while the child classes deal with the functionality unique to each of them.

Encapsulation

Encapsulation actually means two closely related things in object-oriented programming:

The grouping of data with the methods that operate on that data; and
The restriction of direct access to some components of the data and methods.

If we think of a decision tree, the class for that will contain the tree data structure, all of the attributes of the tree such as the maximum depth and the number of leaves, and the hyperparameters that were provided, and it has the methods to compute these things and retrieve additional information such as feature importance. Everything we could possibly want to know about or do with a decision tree is encapsulated in this class.

There are some things, however, that the decision tree algorithm needs to know or needs to be able to do that the end user does not.

Every class, method and variable in Python is public. If you use from module import *, every name in the module that doesn’t start with an underscore is imported and available for use, unless the module defines __all__ to list exactly what it exports. This is why importing * generally isn’t a good idea: you may not know all of the things you are importing, and you can end up with naming conflicts, among other issues.

You may have seen methods and variables that have an underscore prefix. The single underscore (_name) means that it is protected. Protected attributes and methods are available for use in the class to which they belong and its child classes. As stated previously, everything in Python is public so it doesn’t stop you from using them external to the class, but it’s a way of signaling that you probably shouldn’t be using it.

There is also the double underscore (__name), which is used to mangle the name. Inside the class body, Python rewrites the name with a _ClassName prefix so that it becomes _ClassName__name. Anywhere outside the class, including in a child class, the method or attribute can only be reached under that mangled name. For example:

class TestClass:
  def __test_func(self):
    print('test')

test_object = TestClass()

# The function cannot be accessed externally under the original name
test_object.__test_func()
>> AttributeError: 'TestClass' object has no attribute '__test_func'

# The function can be accessed externally under the mangled name
test_object._TestClass__test_func()
>> test

The idea behind mangling is to avoid name collisions in inheritance so that parent and child can each have their own implementations or values. This goes back to the idea of overriding. Mangling makes the parent class’s functionality externally available while allowing the child class to use the same name internally for a different method or value. The child class would thus have in its externally available attributes or methods _ChildClass__name and _ParentClass__name.
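Here is a small sketch of that collision avoidance in action. The Parent and Child classes and the __token attribute are hypothetical names.

```python
class Parent:
    def __init__(self):
        self.__token = "parent"    # stored as _Parent__token

    def parent_token(self):
        return self.__token        # compiled as self._Parent__token


class Child(Parent):
    def __init__(self):
        super().__init__()
        self.__token = "child"     # stored as _Child__token, no collision

    def child_token(self):
        return self.__token        # compiled as self._Child__token


c = Child()
print(c.parent_token())   # parent
print(c.child_token())    # child
print(sorted(vars(c)))    # ['_Child__token', '_Parent__token']
```

Both values live side by side on the same object. Without the double underscore, the child's assignment would have silently replaced the parent's value.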

Some people use mangling for private attributes and methods, those only accessible within a single class. As with protected ones, Python doesn’t stop you from using these externally, but it can be a way of flagging them and the user does need to know to prefix the attribute or method with _class.

You can see all of the attributes and methods of an object, whether public, protected, or mangled, by calling dir(object).
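As a quick illustration, the sketch below uses a made-up Account class with one attribute of each kind and filters out Python's built-in double-underscore names so the output stays short.

```python
class Account:
    def __init__(self):
        self.owner = "Ada"     # public
        self._balance = 100    # "protected" by convention only
        self.__pin = "0000"    # mangled to _Account__pin


acct = Account()

# dir() returns names in sorted order; drop built-ins such as __init__ and __class__
names = [n for n in dir(acct) if not (n.startswith("__") and n.endswith("__"))]
print(names)            # ['_Account__pin', '_balance', 'owner']

# Nothing stops external access; the underscore is only a signal
print(acct._balance)    # 100
```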

Polymorphism

Our last topic in this introduction to object-oriented programming is one that has already been mentioned a few times and doesn’t require further in-depth exploration, but we haven’t put a name to it yet: polymorphism is using a single interface for multiple classes.

Data science libraries that have a standardized API for diverse algorithm classes to allow them to be used interchangeably are leveraging polymorphism. Abstract classes can be specifically defined as interfaces with no implemented methods for this purpose. Polymorphism can also include child classes that don’t implement an abstract class but do override the functionality of the parent class. The key idea is multiple classes all having the same API but their own implementations under the hood.
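The sketch below shows the idea with two toy classifiers. The class names, the evaluate helper, and the data are all invented for illustration, and the two classes deliberately share no parent class: in Python, having the same fit and predict methods is enough for them to be used interchangeably.

```python
from collections import Counter


class MajorityClassifier:
    def fit(self, X, y):
        # Remember the most frequent training label
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        return self   # nothing to learn; present only to honor the shared API

    def predict(self, X):
        return ["pos" if row[0] > self.threshold else "neg" for row in X]


def evaluate(model, X_train, y_train, X_test, y_test):
    # Works with any object that offers fit() and predict()
    predictions = model.fit(X_train, y_train).predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test)


X_train, y_train = [[0.2], [0.8], [0.9]], ["neg", "pos", "pos"]
X_test, y_test = [[0.1], [0.7]], ["neg", "pos"]

for model in (MajorityClassifier(), ThresholdClassifier()):
    print(type(model).__name__, evaluate(model, X_train, y_train, X_test, y_test))
# MajorityClassifier 0.5
# ThresholdClassifier 1.0
```

The evaluate function never checks which class it was given. That is the practical payoff of polymorphism: swapping algorithms means changing one line.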

Conclusion

Given that object-oriented programming is an entire programming paradigm, it is a big topic and we have just scratched the surface. We’ve covered the four pillars of OOP: abstraction, encapsulation, inheritance, and polymorphism. Next time we will build on these foundational ideas and take a look at some object-oriented design patterns that you may find useful.

Stay tuned for the rest of the Better Python Programming for Data Scientists series!