Objects, Memory, and Mutation in Python
Introduction:
Many people choose Python as their first programming language because it has a clean syntax and automatically handles details, such as memory management, that you have to worry about in lower-level languages. Nevertheless, under the hood, the way the Python interpreter manages memory is quite a bit more complicated than what you would find in an analogous program written in C. While you may be content to let the interpreter (typically CPython, the C implementation of Python) take care of all of this for you, there will inevitably come a time when you must dig a bit deeper to resolve a bug or understand your program’s behavior. Understanding objects and their capabilities will also lead to cleaner and more efficient code.
Id and Type:
Python is an object-oriented programming (OOP) language with many features that support the OOP paradigm. In contrast with a procedural language like C, Python allows you to create classes and objects, which can be given specific attributes, such as methods and properties. Without going into too much detail, objects allow you to avoid some of the pitfalls of procedural languages, such as code duplication, and provide additional checks on how code can be used through encapsulation and polymorphism. Unlike some OOP languages, Python is dynamically typed, which means that you don’t need to declare what kind of data (e.g. integer, array, etc.) a variable can hold before using it. Furthermore, you can rebind a variable to a value of a different type at any time. The reason this is possible is that, at the implementation level, everything in Python is the same type.
Let that sink in for a second.
What, then, makes an int different from a str? It is true that they have different properties and uses. However, in the implementation (I will use CPython as my reference point, as it is the canonical one), every Python object begins with the same PyObject structure, which holds a reference count and a pointer to the object’s type. While an int does differ from a str, the objects are passed around as PyObject pointers and only cast to their concrete type when needed.
The memory management system used by CPython is built on the one used by ordinary C programs. When a new object is created, the interpreter ultimately obtains memory through malloc, although small objects are usually served from pools managed by CPython’s own allocator rather than by a fresh request to the OS. Every object lives somewhere in memory (on the heap), and in CPython its location can be determined using the built-in id() function. Calling id with a variable name as an argument returns the address of the object that the name refers to:
>>> a = 'word'
>>> id(a)
140441823802288
>>> hex(id(a)) # the more familiar hex representation of the address
'0x7fbb2904f3b0'
The address of the object is where its data is stored in memory. Since all objects minimally possess all the attributes of a PyObject, I can reassign a to an object of another type without causing any issues:
>>> a = [1, 2, 3]
>>> hex(id(a))
'0x7fbb29030148'
However, as you can see, the address has now changed. The name associated with an object doesn’t matter: the assignment created a new list object and pointed a at it, rather than overwriting the string in place. Different types have different memory requirements, so it wouldn’t make sense to reuse the same location for both.
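To make the rebinding step concrete, here is a minimal sketch showing that assigning a list to a leaves the original string object untouched; only the label moves. The extra name first is introduced purely for illustration.

```python
# Rebinding a name does not modify the object it used to refer to.
a = 'word'
first = a            # a second label on the same str object

a = [1, 2, 3]        # rebind a to a brand-new list object

same_object = first is a              # the two names now refer to different objects
old_value_intact = first == 'word'    # the string was never overwritten
print(same_object, old_value_intact)  # False True
```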
The type of an object can be queried using the type() function. Calling type on an object returns the class of which that object is an instance. Since every object in Python belongs to a class, you can use type on all objects, even classes themselves:
>>> type(a)
<class 'list'>
>>> type(list)
<class 'type'>
a is an instance of list, so its class is list. list itself is a class, not an instance of list, but it is an object nonetheless: it is an instance of the type class, which is the class of all classes. Every class also inherits from object, the most basic class in Python, at some point in its class hierarchy.
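The relationships between instances, classes, type, and object can be checked directly from the interpreter. The following is a small sketch using only built-ins; the checks dictionary is just a convenient way to collect the results.

```python
a = [1, 2, 3]

# Each check is collected so it can be printed or inspected afterwards.
checks = {
    "a is an instance of list": isinstance(a, list),
    "list is an instance of type": isinstance(list, type),
    "list inherits from object": issubclass(list, object),
    "type is its own class": type(type) is type,
    "type inherits from object": issubclass(type, object),
    "object is an instance of type": isinstance(object, type),
}
for description, result in checks.items():
    print(f"{description}: {result}")   # every line ends in True
```

Note that "is an instance of" and "inherits from" are separate relationships: list is an instance of type, but its only base class is object.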
Although type() reports the class of an object, variables themselves have no type. Instead, you can think of variables as labels applied to objects. This is partly why there is no issue in reassigning a to a list after it was assigned to a str: the assignment merely adds a to the set of names that can be used to refer to that specific list instance. Since a variable name is just an identifier, it can be used to refer to any object.
Variable names refer to objects “by reference”. This means that a name always acts as a pointer to the memory location of an object. The only data that a variable name directly stores is the address returned by id. This is why it is possible to do things like this:
>>> a
[1, 2, 3]
>>> b = a
>>> b
[1, 2, 3]
>>> b.append(4)
>>> a
[1, 2, 3, 4]
By assigning a to b, we are really giving b the same address that id(a) reports, so both names refer to one list. Since list objects can be mutated in place, the effect of calling b.append is the same as though it had been called on a.
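If shared mutation is not what you want, you need a copy rather than a second name. Here is a brief sketch contrasting the two; the name c is introduced only for this example.

```python
a = [1, 2, 3]
b = a            # alias: b and a name the same list object
c = list(a)      # shallow copy: a new list holding the same elements

b.append(4)      # mutates the shared object in place

print(a)                 # [1, 2, 3, 4]
print(c)                 # [1, 2, 3]
print(a is b, a is c)    # True False
```

The is operator compares identity (the same object), while == compares values, which is why it is the right tool for checking whether two names are aliases.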
Mutable Objects:
While all objects inherit from the same base class and wear the same “hat” in the CPython implementation, each can have its own methods and behaviors. One of the most important distinctions is mutability. Certain objects can have their values changed during their lifetime. For instance, a list instance can have data appended or changed anywhere within the list without a new list object having to be created. Mutable objects support in-place operations, as in the above example.
Even though all variables are references to their objects, not all objects are mutable. The types of objects in Python that are mutable include list, dict, set, and almost any user-defined class. All of these objects support in-place operations.
There are two kinds of mutability to consider. One is the issue of in-place operations. The other is attribute assignment. In the latter sense, even built-in mutable types like list are not mutable, because they prevent dynamic attribute assignment:
>>> a = [1, 2]
>>> a.word = "word"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'word'
>>> class A(list):
...     pass
...
>>> a = A([1, 2])
>>> a
[1, 2]
>>> a.word = "word"
>>> a.word
'word'
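By creating a new class A that inherits from list, we regain the ability to add new attributes to objects of type A. Instances of built-in types such as list have no per-instance __dict__ in which to store new attributes, whereas instances of a class defined in Python get one by default (unless the class defines __slots__).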
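The difference comes down to whether an instance carries a per-instance __dict__. The sketch below checks this directly; the Slotted class is a hypothetical addition that uses __slots__ to show a subclass opting back out of dynamic attributes.

```python
class A(list):
    """Plain subclass: instances get a per-instance __dict__."""
    pass

class Slotted(list):
    """Hypothetical subclass that opts out of the per-instance __dict__."""
    __slots__ = ()

plain = [1, 2]
sub = A([1, 2])
slotted = Slotted([1, 2])

sub.word = "word"                    # stored in sub.__dict__

print(hasattr(plain, "__dict__"))    # False
print(sub.__dict__)                  # {'word': 'word'}
print(hasattr(slotted, "__dict__"))  # False

slotted_rejected = False
try:
    slotted.word = "word"            # nowhere to store the attribute
except AttributeError:
    slotted_rejected = True
print(slotted_rejected)              # True
```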
Immutable Objects:
Unlike mutable objects, immutable objects can’t be changed in place. There are many reasons for wanting this behavior, such as speed and safer data manipulation. Examples of immutable objects include tuple, int, float, frozenset, and str. The data stored by these objects is set at creation. Just because they can’t be changed, however, doesn’t mean you can’t reassign variable names to new instances of these objects. For example:
>>> a = (1, 2)
>>> id(a)
140441783708296
>>> a[0] = 9
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> a = (9, 2)
>>> id(a)
140441783708040
Another important use of immutable objects is as keys to dictionaries. dict objects support fast lookup using the hash value of an object. If an object used as a key could be mutated, its hash could change after insertion: the entry would then sit in the wrong place in the table and be unfindable later, and a changed lookup key would likewise fail to find the item. Furthermore, if you have an object that is guaranteed not to change over the course of its lifetime, it’s easier to place in memory, knowing that it will never need to be resized.
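A short sketch of what this means in practice: a tuple works as a key, while a list is rejected outright because it defines no hash. The names point and table are illustrative only, and the exact wording of the error message varies between Python versions.

```python
point = (3, 4)
table = {point: "tuple key"}       # tuples of hashable items are hashable
print(table[(3, 4)])               # tuple key (equal tuples hash the same)

list_key_error = None
try:
    table[[3, 4]] = "list key"     # lists define no hash, so this fails
except TypeError as exc:
    list_key_error = str(exc)
    print("TypeError:", list_key_error)
```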
Several types have both mutable and immutable analogues. For example, tuple is an immutable version of list, and frozenset is the immutable version of set. One reason for having both is that you often want to pass arguments to a function without having to worry about the original data being modified.
Unlike tuple and frozenset, int and float objects have no mutable counterparts. There are several reasons for this, but it is intuitive that atomic values like integers shouldn’t be mutable. The main reason that ints and floats are not mutable comes down to the unified object interface: in Python everything is an object. Even numbers have methods and attributes, which allows things like operator overloading and the use of specialized methods. Also, since Python supports arbitrarily large integers, the amount of space reserved for each instance is kept to a minimum (down to a certain size), so extremely large and small values can both be used without wasting much space.
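Both points can be observed from Python itself. The sketch below calls a few methods and attributes on integer literals and then compares the storage of a small and a very large value; the exact byte counts reported by sys.getsizeof depend on the CPython version and build, so only the comparison is meaningful.

```python
import sys

# Integers are full objects, so they carry methods and attributes.
print((255).bit_length())      # 8
print((10).__add__(5))         # 15, the method behind 10 + 5
print((7).real, (7).imag)      # 7 0

# Storage grows with magnitude (byte counts are build-specific).
small = 1
huge = 2 ** 1000
print(sys.getsizeof(small), sys.getsizeof(huge))
```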
Unless you are building an application that does a lot of computation with large numbers, most of the time you will be using numbers either to count or to index into other objects, such as lists. Because small values are used so often, CPython pre-allocates the integers -5 through 256 at startup. When you assign a number in this range to a variable name, the variable simply becomes a reference to the preexisting object:
>>> id(1)
10055552
>>> a = 1
>>> id(a)
10055552
>>> b = 1
>>> id(b)
10055552
>>> a = 257
>>> id(a)
140441784021520
>>> b = 257
>>> id(b)
140441797668816
>>> id(257)
140441784021648
As you can see, a and b both refer to the same object that represents the number 1. However, beyond 256, the object is created dynamically and is not necessarily the same each time it is used. In this build, the cached objects are also stored in a contiguous array of 32-byte C structs, so they are quickly accessible:
>>> id(10) - id(9) == id(9) - id(8) == id(8) - id(7) == 32
True
The fact remains that variables store an object reference, so if you create a new int, such as 257, assign it to one variable, and then assign that variable to a second one, you create two references to the same object:
>>> a = 257
>>> id(a)
140441784021520
>>> b = a
>>> id(b)
140441784021520
>>> b += 1
>>> b
258
>>> id(b)
140441784021648
But as you can see, if you increment b, it gets a new address, since int objects can’t be mutated: b += 1 creates a new object and rebinds b to it, while a still refers to 257.
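The same session can be written as a script that checks identity with is instead of comparing addresses by eye. This is a minimal sketch; the helper name shared_before is introduced only for the example.

```python
a = 257
b = a                    # two names, one int object
shared_before = a is b

b += 1                   # int has no in-place add: a new int is built and b is rebound

print(shared_before)     # True
print(a, b)              # 257 258
print(a is b)            # False
```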
Strings are similar to int objects in many ways. For one thing, a string is a sequence of integers. ASCII represents characters using the values 0 to 127, so at their base, ASCII strings are arrays of 1-byte integers. Python strings use Unicode, so they may require more storage for each character, but work similarly in principle. Analogous to the optimization strategy used for integers, certain str objects are stored in a table of constant strings. If two strings have both been stored there (the technical term is interning), they can be compared by address rather than character by character, since equal interned strings are the same object. Thus, things like this can happen:
>>> s1 = "A random word"
>>> s2 = "A random word"
>>> s1 is s2
False
>>> s1 = "Arandomword"
>>> s2 = "Arandomword"
>>> s1 is s2
True
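Apparently, strings without spaces are interned, but those with spaces are not. More precisely, CPython has traditionally interned string constants that look like identifiers (ASCII letters, digits, and underscores). This is an implementation detail that can vary between versions, so code should compare strings with == rather than is.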
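Interning can also be requested explicitly with sys.intern, which returns the canonical copy of a string from the interpreter’s table. The following sketch builds the strings at run time so that they are not shared compile-time constants; the names i1 and i2 are illustrative.

```python
import sys

# Build the strings at run time so they are not shared compile-time constants.
s1 = " ".join(["A", "random", "word"])
s2 = " ".join(["A", "random", "word"])

print(s1 == s2)      # True: same characters

# sys.intern returns the canonical copy from the interned-strings table.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
print(i1 is i2)      # True: both names refer to the single interned object
```

Whether s1 is s2 holds before interning is up to the implementation, which is why the sketch does not rely on it.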
Why Does the Treatment of Mutable and Immutable Objects Matter:
I have shown some of the effects that can be seen from using immutable and mutable types. But there are some more subtle differences that arise when passing objects into functions. Since arguments in Python are passed as references to objects, the function they are passed to can directly manipulate the data of its mutable arguments. That’s one reason why tuples are so useful: they act like lists but prevent the callee from accidentally changing the data in the caller’s scope. One consequence is that if you pass a mutable object into a function, you should be prepared for it to change. Since immutable types cannot be changed, you must also be mindful of when operations performed inside a function will not propagate to the caller, e.g. operations on ints. Furthermore, not all operations on mutable types have the effect of mutating them in place.
For list objects, any slice operation makes a copy of the part of the list being sliced. Adding two list objects together also creates a new list. However, augmented (inline) assignment, such as += or *=, does not create a new list:
>>> a = [1, 2]
>>> id(a)
140441783766856
>>> a = a + a
>>> id(a)
140441780552200
>>> a
[1, 2, 1, 2]
>>> a *= 2
>>> a
[1, 2, 1, 2, 1, 2, 1, 2]
>>> id(a)
140441780552200
So, inline operations are performed in place, while the longer version is not. This is particularly interesting, as the inline version is usually seen as shorthand for the longer one, but the effect is different. Thus, in a function you would have two drastically different results depending on which form you chose. In general, regular assignment (as opposed to an inline operation such as *=) binds the name to a new object unless it is purely of the form a = b.
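To see how different the two forms are inside a function, consider this sketch. The function names extend_in_place and extend_by_copy are hypothetical and exist only to contrast the two spellings.

```python
def extend_in_place(items):
    items += [0]          # list.__iadd__ mutates the caller's list
    return items

def extend_by_copy(items):
    items = items + [0]   # builds a new list; only the local name is rebound
    return items

first = [1, 2]
second = [1, 2]

result_in_place = extend_in_place(first)
result_by_copy = extend_by_copy(second)

print(first, result_in_place is first)    # [1, 2, 0] True
print(second, result_by_copy is second)   # [1, 2] False
```

The caller’s list changes only in the first case, even though both functions return a list with the same contents.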
In the case of immutable objects, every operation that appears to mutate gives rise to a new object, so if you actually want to change the value of a variable, you need to either reassign it by capturing the return value of a function or do the operation in the scope in which the variable was created.
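For an immutable argument, the pattern looks like the following sketch, where increment is a hypothetical helper. Calling it without capturing the result leaves the caller’s variable untouched.

```python
def increment(n):
    n += 1        # rebinds the local name only
    return n

count = 10
increment(count)           # return value discarded: count is unaffected
print(count)               # 10

count = increment(count)   # capture the return value to update count
print(count)               # 11
```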
One last oddity can be seen in the treatment of tuples. As described above, tuples are immutable: they do not support item assignment. However, it is possible for an element of a tuple to be a mutable type, such as a list or dict. Because the tuple is merely an ordered collection of references to other objects, as long as the identities of the objects it contains do not change, the tuple itself remains unmutated, even if the contents of those objects change. One consequence is that in order to use a tuple as a key in a dictionary, you must not use mutable objects as elements; the interpreter will raise an error if you try. Conversely, even though dictionaries are mutable, their keys must be immutable in order for hashing to work properly. So there is some overlap in these types.
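Both halves of this behavior can be demonstrated in a few lines. In the sketch below, record is an illustrative tuple holding a list; the exact wording of the error message varies between Python versions.

```python
record = ("alice", [90, 85])
ids_before = [id(item) for item in record]

record[1].append(70)             # mutate the list that the tuple refers to
ids_after = [id(item) for item in record]

print(record)                    # ('alice', [90, 85, 70])
print(ids_before == ids_after)   # True: the tuple still holds the same references

key_error = None
try:
    {record: "grades"}           # hashing the tuple hashes each element
except TypeError as exc:
    key_error = str(exc)
    print("TypeError:", key_error)
```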
Conclusion:
Knowledge of the underlying objects and how they behave is important to getting the desired output. It also helps to understand some of the language design considerations. Hopefully someone found this helpful.