(Im).mutable ->(OOP)bjects and CPython

Python object handling from a C perspective

Image Credit: https://docs.sencha.com/extjs/6.0.2/guides/other_resources/images/classes_instances.png

Introduction

The Python programming language uses data structures in novel ways. CPython, the reference implementation, builds on primitive data types in C to form a struct called PyObject, which acts as the super class that all data types extend. Because of this abstraction, everything in Python is an object. This means that every value in Python is an instance of a class and can have its own methods and attributes, in other words: its own functions and variables. For example: in C, a 32-bit signed int can have a value from -2,147,483,648 to +2,147,483,647, but in Python an int can be of any value, going way beyond those limits (it is bounded only by available memory).
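As a small illustration of the integer claim above, here is a sketch that steps past the 32-bit C limit. The names INT32_MAX, beyond_c_limit and huge are made up for this example.

```python
# Largest value a 32-bit signed C int can hold.
INT32_MAX = 2**31 - 1

# Python ints grow as needed instead of overflowing.
beyond_c_limit = INT32_MAX + 1
huge = 2**100

print(INT32_MAX)           # 2147483647
print(beyond_c_limit)      # 2147483648
print(huge.bit_length())   # 101
print(type(huge))          # <class 'int'>
```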

In this article, I attempt to provide a C-level explanation of how Python handles objects and object mutability.


The Root of All Classes, the PyObject

In order to start talking about objects, the PyObject must first be described, because all other objects inherit from this struct.

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
Source code is found here: Python-3.4.3/Include/object.h from lines 105–109.

Here, _PyObject_HEAD_EXTRA is a macro that expands to

#define _PyObject_HEAD_EXTRA            \
    struct _object *_ob_next;           \
    struct _object *_ob_prev;
Source code is found here: Python-3.4.3/Include/object.h from lines 70–73

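It indicates that the PyObject struct can begin with two pointers, _ob_next and _ob_prev (both of type struct _object *), that “support a doubly-linked list of all live heap objects” (comment from object.h, line 69). Note that the macro only expands to these two pointers when CPython is compiled with Py_TRACE_REFS defined, which is a special debugging build; in a regular build it expands to nothing, and PyObject consists of just the two fields that follow. In a trace-refs build, any new object that is created has space allocated for it in the heap and is linked into this list of live objects.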

The line Py_ssize_t ob_refcnt; holds the reference count: the number of references that currently point to this live PyObject. The count is used to manage memory allocated from the heap: when it drops to zero, the object is deallocated. ob_refcnt is declared to be of type Py_ssize_t, which is used wherever a C-level signed integer is needed to index Python sequences. (See: Tim Peters on Stack Overflow)
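The reference count can be watched from Python with sys.getrefcount(). This is a minimal sketch for CPython; the variable names are illustrative, and the absolute numbers vary between versions, so only the differences matter here.

```python
import sys

data = [1, 2, 3]

# The reported count is generally one higher than expected, because
# the call itself holds a temporary reference to its argument.
before = sys.getrefcount(data)

alias = data                        # a second name bound to the same object
after_alias = sys.getrefcount(data)

del alias                           # drop that reference again
after_del = sys.getrefcount(data)

print(before, after_alias, after_del)
```

Binding a second name raises the count by one, and deleting that name lowers it again.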

The final line, struct _typeobject *ob_type;, is the meat of the struct. The struct _typeobject declares the fields that control how objects of a type are allocated and deallocated, how they are printed and documented, how their methods and attributes are looked up, etc. The ob_type pointer is the abstraction that provides the basic framework for every class of object. For brevity, I will not include the source code here, but refer to the following C file: Python-3.4.3/Doc/includes/typestruct.h.

The flexibility of having such a super class enables the language to be dynamically typed. Functions do not have to strictly define what they can accept, names can be rebound to objects of any type, and methods can be called on any object that supports them without explicit casting. Because types are only checked at run time, input checking and error handling should be performed when needed to ensure the correct type of data is used.

Here is a very simple example of how dynamically typed Python really is:

>>> a = 4 
>>> b = 5.24
>>> a
4
>>> b
5.24
>>> a = [1, 2, 3]
>>> a
[1, 2, 3]
>>> a = b
>>> a
5.24
>>> b
5.24

Here, I assign a to be a whole number, 4, and b to be a floating point number, 5.24. Then I assign a to be a list object, [1, 2, 3]. Then I assign a to be b and voilà: a now refers to a floating point number. The name a was bound to objects of 3 different types without any type casting whatsoever! This works because a name is only a reference to a PyObject, and every object carries its own type in ob_type.
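Since any name can refer to any type, the input checking mentioned earlier can be sketched as follows. The function mean() is a hypothetical example written for this article, not something from the standard library.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list or tuple of numbers."""
    if not isinstance(values, (list, tuple)):
        raise TypeError("values must be a list or tuple")
    if not values:
        raise ValueError("values must not be empty")
    for item in values:
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise TypeError(f"not a number: {item!r}")
    return sum(values) / len(values)

print(mean([2, 4, 9]))   # 5.0

try:
    mean("abc")          # a string is not an accepted container here
except TypeError as exc:
    print("rejected:", exc)
```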


Classifying Objects with id( ) and type( )

Because everything is an object, each object can provide data about itself when requested. There are many built-in functions that return data about objects; the ones covered in this article are id(), type(), and isinstance().

id() is a built-in function that returns the identity of the object passed into it, which in CPython is its memory address. It is used as follows:

>>> a_list = [1, 2, 3]
>>> id(a_list)
140603371157192
>>>

The id() function is defined as follows in the source code:

static PyObject *
builtin_id(PyObject *self, PyObject *v)
{
    return PyLong_FromVoidPtr(v);
}
PyDoc_STRVAR(id_doc,
"id(object) -> integer\n\
\n\
Return the identity of an object. This is guaranteed to be unique among\n\
simultaneously existing objects. (Hint: it's the object's memory address.)");
Source code is found here: Python-3.4.3/Python/bltinmodule.c from lines 996–1006.

The function PyLong_FromVoidPtr(v) converts the value of the pointer v (the object's memory address) into a Python int object. (The second part, starting with PyDoc_STRVAR, is the documentation shown by the help(id) function call.)
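To see that the number really is an address, here is a sketch that turns it back into the object with ctypes. This relies on CPython behaviour and is only meant as a demonstration; the variable names are illustrative.

```python
import ctypes

a_list = [1, 2, 3]
address = id(a_list)

# CPython only: treat the integer as a PyObject* and get the object back.
same_object = ctypes.cast(address, ctypes.py_object).value

print(hex(address))
print(same_object is a_list)   # True
```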

The id() function may return the same memory address for two different names in some situations, such as

>>> a = 15
>>> b = 15
>>> id(a)
8922176
>>> id(b)
8922176

This is because these two integers ARE the same object. When the interpreter is loaded into memory, before the variable a is ever assigned, an array of 262 integer objects is preallocated to speed up access to small numbers. The values in this array increase by 1 from -5 to 256. The bounds come from the macros NSMALLNEGINTS (5) and NSMALLPOSINTS (257): the array holds the 5 negative values -5 to -1, followed by the 257 values 0 to 256. The function int _PyLong_Init(void); fills the array linearly in a for loop, with error checking. This means that any integer from -5 to 256 will always have the same memory address, because it is allocated once as the interpreter is loaded into memory. Sharing these objects is safe because integers are immutable.

Source code is found here: Python-3.4.3/Objects/longobject.c from lines 5075 to 5111.
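The small integer cache can be observed with the is operator. This is a sketch for CPython; the values are built with int(str) at run time so that the compiler cannot merge identical literals into one constant, and the variable names are illustrative.

```python
# Values inside the preallocated range always come back as the same object.
cached_low = int("-5") is int("-5")
cached_high = int("256") is int("256")

# A value far outside the range is created fresh on every call.
far_outside = int("1000000") is int("1000000")

# Just past the upper bound: False on the CPython 3.4 build discussed here.
just_outside = int("257") is int("257")

print(cached_low, cached_high)   # True True
print(far_outside)               # False
print(just_outside)
```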

type() is also built in. When called with a single argument it returns the object's class (its type object), which can then be queried for more information. Here is an example of some ways to use type() to get information about an object’s class:

>>> a = [1, 2, 3] 

>>> type(a)
<class 'list'>
>>> type(a).__base__
<class 'object'>
>>> type(a).__mro__
(<class 'list'>, <class 'object'>)
>>> b = True
>>> type(b)
<class 'bool'>
>>> type(b).__base__
<class 'int'>
>>> type(b).__mro__
(<class 'bool'>, <class 'int'>, <class 'object'>)

This example also shows how type() can be used to display the hierarchy of classes that an object inherits from. Both a and b are instances of the object class, because every class ultimately inherits from object. a is a list, and list inherits directly from object, while b is a bool, and bool inherits from int, which in turn inherits from object. To see the attributes and methods that a class defines itself, try type(a).__dict__; dir(a) also lists the inherited ones.
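The hierarchy can also be walked by hand by following __base__ from class to class. The helper base_chain() below is a hypothetical function written for this illustration, and it only follows the first base of each class.

```python
def base_chain(cls):
    """Follow __base__ links from cls up to object."""
    chain = []
    while cls is not None:
        chain.append(cls.__name__)
        cls = cls.__base__   # object.__base__ is None, which ends the loop
    return chain

print(base_chain(type(True)))        # ['bool', 'int', 'object']
print(base_chain(type([1, 2, 3])))   # ['list', 'object']
```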

This is how the C-level equivalent of type(), PyObject_Type(), is implemented in the source code:

PyObject *
PyObject_Type(PyObject *o)
{
    PyObject *v;

    if (o == NULL)
        return null_error();
    v = (PyObject *)o->ob_type;
    Py_INCREF(v);
    return v;
}
Source code is found here: Python-3.4.3/Objects/abstract.c from lines 30–40.

The first few lines declare a new PyObject pointer, v, and check whether the pointer that was passed in is NULL. The next line, v = (PyObject *)o->ob_type;, reads the ob_type field of the object that was passed into the function and casts that type pointer to PyObject *, which is valid because a type is itself an object. Py_INCREF(v) then increments the reference count of the type, because the caller receives a new reference to it.

The functions type() and isinstance() may seem like they can be used in the same situations, but they do quite different things.

>>> a = [1, 2, 3]
>>> b = True
>>> type(a)
<class 'list'>
>>> isinstance(a, list)
True
>>> isinstance(a, int)
False
>>> type(b)
<class 'bool'>
>>> isinstance(b, list)
False
>>> isinstance(b, int)
True
>>> isinstance(b, bool)
True

isinstance() checks whether the class of the passed object is the given class or one of its subclasses, searching through the object's inheritance hierarchy. The actual check is defined in Objects/abstract.c in the function PyObject_IsInstance, lines 2485–2512. isinstance(True, bool) returns True because the type of True is exactly bool, and isinstance(True, int) also returns True because bool is a subclass of int. This is different from type()'s functionality, because type() only returns the most derived class of the object.
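The same difference shows up with user-defined classes. Shape and Circle below are made-up classes for this sketch.

```python
class Shape:
    pass

class Circle(Shape):
    pass

c = Circle()

print(type(c) is Circle)          # True
print(type(c) is Shape)           # False: type() reports only the most derived class
print(isinstance(c, Shape))       # True: Circle is a subclass of Shape
print(issubclass(Circle, Shape))  # True
```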

In the source code:

static PyObject *
builtin_isinstance(PyObject *self, PyObject *args)
{
    PyObject *inst;
    PyObject *cls;
    int retval;

    if (!PyArg_UnpackTuple(args, "isinstance", 2, 2, &inst, &cls))
        return NULL;

    retval = PyObject_IsInstance(inst, cls);
    if (retval < 0)
        return NULL;
    return PyBool_FromLong(retval);
}
Source code is found here: Python-3.4.3/Python/bltinmodule.c from lines 2159–2173.

Two local pointers to PyObjects are declared, inst (instance) and cls (class), along with a status value, retval. The arguments of the call arrive packed in a single Python tuple, args, so PyArg_UnpackTuple(args, "isinstance", 2, 2, &inst, &cls) is called to unpack exactly two arguments into inst and cls; if that fails, the function returns NULL to signal an error. cls may itself be a tuple of classes, as in isinstance(variable, (class1, class2, class3)), and that case is handled inside PyObject_IsInstance. A negative retval also signals an error; otherwise the result is converted into a Python bool with PyBool_FromLong.
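From the Python side, the two-argument rule and the tuple form look like this. It is a short sketch, and the names value and error_message are illustrative.

```python
value = 3.5

# cls may be a tuple: True if value is an instance of any entry.
print(isinstance(value, (int, float)))   # True
print(isinstance(value, (str, bytes)))   # False

# Exactly two arguments are required, otherwise a TypeError is raised.
try:
    isinstance(value)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```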


Putting it all together

Some immutable types are ints, floats, tuples, strings and frozensets, and some mutable types are lists, sets, dictionaries and most user-defined classes. Here, we can observe the differences when working with mutable data types and immutable data types.

Strings may seem mutable, but if a string is concatenated with another string, string += other_str, a third string is created out of the first two. To save on memory, CPython may also reuse one object for identical string literals, so two names assigned the same literal can point to the same spot in memory (this is an implementation detail, not a guarantee).

Below is this example illustrated with the id() function:

>>> a = "concat"
>>> b = "concat"
>>> c = "enate"
>>> id(a)
139904216780000
>>> id(b)
139904216780000
>>> id(c)
139904216780056
>>> print(a)
concat
>>> a += c
>>> print(a)
concatenate
>>> id(a)
139904216801712
>>> a += b
>>> print(a)
concatenateconcat
>>> id(a)
139904216784520
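The transcript above can be condensed into a sketch that keeps a second reference to the first string, which makes it easy to see that the original object is never modified. The name original is introduced for this illustration.

```python
a = "concat"
original = a          # keep a second reference to the first string
c = "enate"

a += c                # builds a new string and rebinds the name a

print(a)              # concatenate
print(original)       # concat
print(a is original)  # False

# Trying to change a string in place is rejected outright.
try:
    original[0] = "C"
except TypeError as exc:
    print("TypeError:", exc)
```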

When more than two strings are involved, it is more memory efficient to append the strings to a mutable data type (in this case, a list) and then convert that into a string.

>>> new_list=["concat"]
>>> d = "enate"
>>> id(new_list)
139904217591048
>>> id(d)
139904216780056
>>> for num in range(1,4):
...     new_list.append(d)
...
>>> print(new_list)
['concat', 'enate', 'enate', 'enate']
>>> id(new_list)
139904217591048
>>> new = "".join(new_list)
>>> print(new)
concatenateenateenate
>>> id(new)
139904176767744

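Here, instead of a new intermediate string being created at every concatenation, the list is modified in place (its id never changes) and only one new string is created by join(). When it appears that you change an immutable object, a new object is actually created and the name is rebound to it. That is: the name now refers to a different location in memory, while the original object is left unchanged.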
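The two approaches can be put side by side as small functions. build_with_join() and build_with_concat() are hypothetical helpers written for this comparison; they show the difference in approach and do not measure memory use.

```python
def build_with_join(parts):
    """Join all parts in one step, creating a single result string."""
    return "".join(parts)

def build_with_concat(parts):
    """Same result via +=, which may create a new string at every step."""
    result = ""
    for part in parts:
        result += part
    return result

pieces = ["concat"]
list_id_before = id(pieces)
for _ in range(3):
    pieces.append("enate")   # mutates the list in place

print(id(pieces) == list_id_before)                          # True
print(build_with_join(pieces))                               # concatenateenateenate
print(build_with_join(pieces) == build_with_concat(pieces))  # True
```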

In conclusion, memory addresses are displayed with the id() function, while information about different levels of class inheritance is displayed with the functions type() and isinstance(). These built-in functions work on any object because every object starts with the fields of the PyObject struct, and in trace-refs debugging builds all live heap objects are additionally linked together in a doubly linked list. Every object carrying its own type is what makes the language dynamically typed, reducing the need to explicitly typecast objects into other types. Though it may seem that Python has simpler syntax than a structural language like C, there is a lot going on underneath the hood that allows the concepts to manifest.