An in-depth look at objects and types in the C layer of the CPython interpreter
We have often heard that everything in Python is an object. This is true, but it raises a question: C is not Python and Python is not C, so how can something written in C end up being an object in Python? This is the question Guido van Rossum answered about 33 years ago.
If you have looked at the CPython source code, you will have come across lines like these:
As you may have guessed, these are functions for the datetime type. All of them return a pointer to PyObject, and their parameters are pointers to PyObject too!
These are the C API functions for list in Python:
And this is part of the most important file (in our context) and one of the oldest source files in CPython, which, as they say, is inheritance built by hand :D
In this article we will talk about:
- a C struct called PyObject
- a C struct for Python types
- the tuple object and type
A structure called PyObject
As shown above, even a cursory look at the CPython source code shows how heavily the PyObject struct is used. In fact, when the interpreter executes our code in the interpreter loop (a very large loop that executes bytecodes one by one), it treats every value on the evaluation stack as a PyObject. In other words, we can say that PyObject is the superclass of all Python objects (just as, in Python, the object class is the superclass of all our classes and they all inherit from it). Of course, values in CPython are never declared as PyObjects, but a pointer to a value *can* always be cast to a pointer to PyObject.
A word about C structs
When we say that values in Python are never declared as PyObjects, but a pointer to a value *can* always be cast to PyObject, we are referring to an implementation detail that depends on the C programming language and how it interprets data at memory locations. C structs which are used to represent Python objects are just groups of bytes which we can interpret in any manner we choose. For example, a struct, test, may be composed of 5 short values, each 2 bytes in size, summing up to 10 bytes. In C, given a reference to ten bytes, we can interpret those ten bytes as a test struct composed of 5 short values regardless of whether the 10 bytes were actually defined as a test struct; however, the output when you try to access the fields of the struct may be gibberish. This means that given n bytes of data that represent a Python object, where n is greater than the size of a PyObject, we can interpret the first n bytes as a PyObject. (Inside the Python Virtual Machine, 2018, p. 37)
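The same idea can be tried from Python with the ctypes module. This is only a sketch: Head and FloatLike are hypothetical structs invented for the illustration (they are not CPython's real definitions), and the point is just that a pointer to the larger struct can be read as a pointer to the header it starts with.

```python
import ctypes

# Hypothetical stand-in for PyObject: a reference count and a type slot.
class Head(ctypes.Structure):
    _fields_ = [("refcnt", ctypes.c_ssize_t), ("type_tag", ctypes.c_void_p)]

# Hypothetical "float-like" object: the same header first, then its own data.
class FloatLike(ctypes.Structure):
    _fields_ = [("head", Head), ("value", ctypes.c_double)]

f = FloatLike(Head(1, 0xABCD), 3.5)

# Reinterpret a pointer to the larger struct as a pointer to the header.
head_view = ctypes.cast(ctypes.pointer(f), ctypes.POINTER(Head)).contents
print(head_view.refcnt, hex(head_view.type_tag))

# Both names look at the same bytes, so a write through one is seen by the other.
head_view.refcnt += 1
print(f.head.refcnt)
```

Because the header is the first member, its fields sit at the very start of the larger struct, which is what makes the reinterpretation meaningful instead of gibberish.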
Here is PyObject:
where the macro is:
The preprocessor replaces _PyObject_HEAD_EXTRA with its definition, which is:
which defines two fields, _ob_next and _ob_prev, that point to the next and previous created objects. This forms an implicit doubly linked list, which means that all Python objects are connected to each other like a chain :D (These two fields only exist in special debug builds compiled with Py_TRACE_REFS; in a normal build the macro expands to nothing.)
The ob_refcnt field, which is simply a number, is used for memory management, and the important ob_type field is a pointer to another structure that specifies the type of that object. It is this type that determines what the data represents, what kind of data it contains and the kind of operations that can be performed on that object.
We can look at this from Python, like this:
This name refers to a string object whose type is str.
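A minimal sketch of that check, assuming the name and the 'obj' string used elsewhere in this article: type() gives the Python-level view of what ob_type points to.

```python
name = 'obj'

# type() reports the type object that the value's ob_type field points to.
print(type(name))
print(type(name) is str)

# __class__ is another way to reach the same type object.
print(name.__class__ is type(name))
```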
The million dollar question
How is this possible?!
A word about reference counting
CPython uses reference counting for memory management. The idea is simple: when a new *name* is bound to an object (like name and 'obj' in the example above), the reference count of that object is increased; conversely, when a reference to an object goes away (for example, when we use the del keyword to delete a reference), the reference count is decreased. When the reference count of an object reaches zero, the object is deallocated by the VM.
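The effect can be observed with sys.getrefcount. This is a small sketch: the absolute numbers vary between CPython versions (the call itself may hold a temporary reference), so only the differences are compared, and a fresh object() is used because some built-in constants do not have an ordinary reference count.

```python
import sys

obj = object()
before = sys.getrefcount(obj)

alias = obj                      # bind a second name to the same object
after_bind = sys.getrefcount(obj)

del alias                        # remove that reference again
after_del = sys.getrefcount(obj)

print(after_bind - before)       # change caused by the new name
print(after_del - before)        # back to where we started
```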
Very well. So far we have said that all values can be cast to PyObject, and we have discussed the implementation detail of C structs that makes this possible, but can we see it in the source code as well? Here are some examples:
What do they all have in common? Yes, exactly this macro:
This macro gives each of them an ob_base field of type PyObject as its first member, which makes it possible to cast them to PyObject (according to the above explanation about C structs).
PyVarObject, the sister of PyObject
First, let’s see the source code of the list object:
There is no PyObject_HEAD; however, there is another macro:
which gives the object an ob_base field of type PyVarObject. But what is this PyVarObject that we call PyObject’s sister?
This is it:
This struct is used for objects such as lists, which have some *notion* of length; it stores the number of items the object holds. This is why the len function in Python is O(1) for these objects 😁: it only has to read and return a value.
A structure for types
Types in CPython are defined by the _typeobject struct. This C struct is the base of all the types used in CPython, and most of its many fields are pointers to C functions that implement the functionality of that type.
which is:
All these fields are nicely documented here:
But let’s check this structure with an example.
tuple type
First of all, the tuple object:
And this is the tuple type:
- The macro at the beginning of this struct:
Do you remember we said that we can cast all objects to PyObject? And that we introduced PyVarObject, the sister of PyObject, which itself can be cast to PyObject? And do you remember which fields PyObject has?
One field was ob_refcnt and the other was ob_type, and this macro sets exactly the ob_type of this struct! This answers the million dollar question: yes, we are setting its metaclass to Python’s type type.
This will cause us to see the following output:
type is the default metaclass of all Python classes, and if we look at its source code:
This is the answer: the type of type is type!
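The chain is easy to follow from Python itself. A short sketch using only built-ins:

```python
t = (1, 2, 3)

print(type(t))        # the type of a tuple instance is tuple
print(type(tuple))    # the type of tuple is type
print(type(type))     # and the type of type is type itself

# User-defined classes get the same default metaclass.
class Point:
    pass

print(type(Point) is type)
```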
Later in the article, I will explain why PyVarObject_HEAD_INIT is used, which brings the ob_size field into this type.
tp_name field:
This field holds the name of the type, which is tuple in our example. __name__ reads this field and returns it to us:
When the type lives in a module or package, this field holds the full dotted name, including that module and package:
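A small sketch of how this looks from Python. OrderedDict is picked here only as a convenient example of a type that lives inside a module; it is not taken from the original text.

```python
from collections import OrderedDict

# A built-in type: just the bare name.
print(tuple.__name__)

# A type defined in a module: the dotted name is split into module and name.
print(OrderedDict.__module__, OrderedDict.__name__)

# repr() of the type shows the full dotted form.
print(repr(OrderedDict))
```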
tp_basicsize and tp_itemsize fields:
In the documentation:
In fact, Python uses these fields to know how much memory to allocate when creating an instance of this type (let’s say an empty tuple).
There are two cases for tp_itemsize. According to the documentation, for types whose instances are variable-size (like our example), this field must store the size of each item they hold (here every item is a pointer to PyObject, which is 4 bytes on 32-bit and 8 bytes on 64-bit systems), and the size of an instance is tp_basicsize plus N times tp_itemsize:
Remember that I promised to explain why the PyVarObject_HEAD_INIT macro is used and why the ob_size field should exist? The documentation tells us:
It says that for variable-length objects the ob_size field must exist. Why? Because to calculate the size of the object we need N, the *length* of that object, and that length is stored in ob_size. Look here:
You can see that the size increases by 8 bytes (one pointer on a 64-bit system).
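The growth can be measured with sys.getsizeof. This sketch compares differences only, because the absolute size includes bookkeeping that varies between builds; __basicsize__ and __itemsize__ are the Python-level views of tp_basicsize and tp_itemsize.

```python
import struct
import sys

item = tuple.__itemsize__                      # tp_itemsize as seen from Python
pointer_size = struct.calcsize("P")            # size of a C pointer on this machine

# Sizes of tuples holding 0, 1, 2 and 3 items.
sizes = [sys.getsizeof(tuple(range(n))) for n in range(4)]
steps = [b - a for a, b in zip(sizes, sizes[1:])]

print(tuple.__basicsize__, item, pointer_size)
print(sizes)
print(steps)                                   # each extra item costs one pointer
```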
The second case for tp_itemsize is a type whose instances have a fixed length; in that case this field must be 0, like our own type:
- The tupledealloc function for the tp_dealloc field:
This function destroys the object and frees its memory. Interesting points mentioned in the documentation:
1. This function does not have to be defined for types whose instances are never deallocated, namely the singletons None and Ellipsis!
2. For the second point, first see this code snippet from tupledealloc:
You can see a loop that runs once for each object the tuple holds (Py_SIZE returns the len of the object) and decrements the reference count of each of them. This is natural: the tuple is being removed from memory, so there is no longer a tuple to hold them.
You see, when we put s in a tuple, its reference count went up by one, and when we removed the only name that pointed to that tuple, that reference was subtracted from s again.
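Here is a sketch of that experiment, assuming s is a fresh object() and t is the only name bound to the tuple. As before, only differences in the counts are compared, since absolute values depend on the CPython version.

```python
import sys

s = object()
before = sys.getrefcount(s)

t = (s,)                          # the tuple now holds one reference to s
inside = sys.getrefcount(s)

del t                             # last reference to the tuple: it is deallocated
after = sys.getrefcount(s)

print(inside - before)            # the tuple added one reference
print(after - before)             # deallocating the tuple released it
```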
What does the documentation say:
3. But the documentation has other instructions on *how to write* this function:
All of which are followed in order in tupledealloc:
- The tuplerepr function for the tp_repr field:
The documentation:
The documentation also describes one of the best practices for writing __repr__: the function should return a str that, in the right conditions, recreates the instance when passed to the eval function. What is less often heard is that, if this is not possible, it should return a str starting with < and ending with > that tells what type the object is and what it holds.
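Both cases can be seen with built-in objects. A short sketch: a tuple of simple values round-trips through eval, while a plain object() falls back to the angle-bracket form.

```python
t = (1, 'a', 2.5)
text = repr(t)
print(text)

# For simple contents, eval(repr(x)) rebuilds an equal object.
print(eval(text) == t)

# When that is not possible, the repr is wrapped in < and >.
opaque = repr(object())
print(opaque)
```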
tuple_as_sequence values:
This field specifies the Sequence Protocol for the tuple and has these sub-slots:
This means:
- Tuples have a length and we can pass them to the len function
- We can concatenate them with +
- We can use repetition on them
- We can get their items by index
- And use the in operator
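The five capabilities in the list above, tried on a small tuple. The slot names in the comments (sq_length, sq_concat, sq_repeat, sq_item, sq_contains) are the sequence sub-slot names from the C API documentation; which slot the interpreter actually consults for each expression is my reading, not something shown in this article.

```python
t = (10, 20, 30)

print(len(t))         # length (sq_length)
print(t + (40,))      # concatenation (sq_concat)
print(t * 2)          # repetition (sq_repeat)
print(t[1])           # item access by index (sq_item is the sequence-level slot)
print(20 in t)        # membership test (sq_contains)
```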
tuple_as_mapping values:
You might be surprised to see a mapping protocol here. But we can use slices on tuples, and the function that makes subscripting tuples possible is part of this protocol:
If you pay attention, it supports slices.
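From Python, both plain indexes and slices arrive at the same subscript operation. A brief sketch:

```python
t = (10, 20, 30, 40)

print(t[1:3])                       # slicing returns a new tuple
print(t.__getitem__(slice(1, 3)))   # the same call, spelled out
print(t[::-1])                      # slices may carry a step
print(t[-1])                        # integer subscripts go the same way
```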
tuplehash function:
This function does not need much explanation:
But what about lists? If we look at its tp_hash field:
Which brings us to:
It’s a familiar error, isn’t it? 😁
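The difference shows up immediately when calling hash(). A minimal sketch:

```python
# Equal tuples hash equally, so they can be used as dict keys.
print(hash((1, 2)) == hash((1, 2)))

# Lists refuse to be hashed.
message = None
try:
    hash([1, 2])
except TypeError as exc:
    message = str(exc)
print(message)

# At the Python level, the missing hash shows up as __hash__ being None.
print(list.__hash__ is None)
```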
PyObject_GenericGetAttr function:
According to the documentation:
In short, this function implements the normal routine for getting an attribute from an object.
tp_doc field:
This field sets the value of __doc__ and is what the help function shows us:
tp_richcompare field and tuplerichcompare function:
The function placed in this field implements the comparison operations for that type:
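For tuples, the comparison is lexicographic: items are compared pairwise, and the first difference decides. A short sketch:

```python
print((1, 2, 3) < (1, 2, 4))     # decided by the last pair of items
print((1, 2) < (1, 2, 0))        # a proper prefix is smaller
print((1, 2) == (1, 2))          # equal length and equal items
print((1, 'a') != (1, 'b'))      # one differing item is enough
```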
tp_iter field and tuple_iter function:
Suppose we want to write a class that can be used in a for loop. What should we do? We have to implement the iterable protocol:
In simple language, iterable objects are objects that give us an iterator object when we pass them to the iter function! The Python documentation also says that containers that want to be iterable must implement the __iter__ method, and in the C layer the type must put the appropriate function in the tp_iter slot:
So far we have realized (and of course already knew) that tuple is an iterable container!
Now, what is an iterator object?
An object that implements these two methods is called an iterator.
Now let’s review: if we want to (for example) run a for loop over an object, that object must have __iter__. What does __iter__ return? An iterator. What does an iterator do? Every time we call the next function on it (the for loop does this by itself), it returns an item to us, and when there are no more items, it raises the StopIteration exception. Now let’s see how this happens in tuples.
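Before looking at the C side, here is the same protocol in pure Python. Countdown and CountdownIterator are hypothetical classes written for this sketch: the first is the iterable, the second is the iterator it hands out.

```python
class CountdownIterator:
    """Iterator: implements both __iter__ and __next__."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration          # signals that there are no more items
        value = self.current
        self.current -= 1
        return value


class Countdown:
    """Iterable: __iter__ returns a fresh iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


result = [n for n in Countdown(3)]       # the loop calls iter() and next() for us
print(result)

# The same steps done by hand on a tuple.
it = iter((1, 2))
first, second = next(it), next(it)
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
print(first, second, exhausted)
```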
First of all, __iter__:
which is this function:
Exactly as the documentation said, it just returns an iterator.
Now let’s see the _PyTupleIterObject type:
A PyObject_HEAD that we talked about, an index and a pointer to a PyTupleObject, that’s it.
But those of us reading this article know that this is an object, so what *must* it have? Yes, it must have a type (something must set its ob_type).
Now, what is the type of this _PyTupleIterObject:
What matters here is the tupleiter_next function. Let’s see it:
That is very simple. If you pay attention, it takes a _PyTupleIterObject, uses its index field to fetch the next item with PyTuple_GET_ITEM, and returns it.
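The logic just described can be modelled in a few lines of Python. This is a sketch of the idea, not the C source: TupleIterSketch is a hypothetical class, the attribute names it_index and it_seq are assumptions chosen to mirror the index and tuple fields mentioned above, and dropping the tuple once it is exhausted is my own addition.

```python
class TupleIterSketch:
    """Pure-Python model of a tuple iterator: an index plus the tuple itself."""

    def __init__(self, seq):
        self.it_index = 0        # assumed name for the index field
        self.it_seq = seq        # assumed name for the tuple being iterated

    def __iter__(self):
        return self

    def __next__(self):
        seq = self.it_seq
        if seq is None:
            raise StopIteration
        if self.it_index < len(seq):
            item = seq[self.it_index]    # the role played by PyTuple_GET_ITEM
            self.it_index += 1
            return item
        self.it_seq = None               # nothing left: let go of the tuple
        raise StopIteration


collected = list(TupleIterSketch(('a', 'b', 'c')))
print(collected)
```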
tp_methods field
This field stores the methods of the type. For example, tuples have the index method, which is defined here:
This macro is:
which finally reaches this function (the tuple_index function calls this function):
You can see a for loop that checks the items one by one and returns the index if the value is found.
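The same linear search, written as a Python sketch. index_sketch is a hypothetical helper, not the C function; the handling of start and stop and the error message are my own simplification, and the results are checked against the real tuple.index.

```python
def index_sketch(items, value, start=0, stop=None):
    """Return the first index of value in items[start:stop], scanning left to right."""
    n = len(items)
    if stop is None or stop > n:
        stop = n
    # Negative bounds count from the end, as with slices.
    if start < 0:
        start = max(start + n, 0)
    if stop < 0:
        stop = max(stop + n, 0)
    for i in range(start, stop):
        if items[i] is value or items[i] == value:
            return i
    raise ValueError("value not found")


t = ('a', 'b', 'c', 'b')
print(index_sketch(t, 'b'), t.index('b'))
print(index_sketch(t, 'b', 2), t.index('b', 2))
```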
tp_new field:
And finally, the function that creates a new tuple object:
This function is responsible for creating a new tuple object; it is what __new__ refers to for this type.
Relationship between types and objects
Simply writing a C struct that says which fields an object is supposed to have does not create a new object or type in Python. One of the essential fields of PyObject is ob_type, which points to the most important struct: the one that defines the object, specifies its creation and destruction, tells what protocols it supports, and so on. So we can conclude: by combining an object struct (which has PyObject_HEAD or a variant like PyVarObject_HEAD) with an ob_type field that points to a PyTypeObject struct, we get a new object or type in Python that we can use.
😂😂😂 which becomes something like this:
If you want to write your own type and use it in Python, this tutorial can help you:
I hope you enjoyed the article.