Python Objects Part III: String Interning
In the first part of this Python Objects series, I gave a basic introduction to what objects are and how they work in Python. In the second part, I further fleshed out object instantiation with an examination of CPython shared objects. In this third part, I will pick up where I left off and focus on the last implementation detail needed to fully understand object instantiation in Python: string interning.
FIRST THINGS FIRST, AGAIN — A REFRESHER
Let’s review what we’ve learned up to this point:
- Everything in Python is an object, something that a variable can refer to.
- Objects are classified by their value, type, and identity (a.k.a. memory address).
- The value of an immutable (unchangeable) object is tied to its identity — if the value changes, the object changes.
- The value of a mutable (changeable) object is not tied to its identity — identity is retained across changes made to the object.
- The CPython implementation of Python pre-allocates shared objects for certain ranges of commonly used immutable values.
- When Python is instructed to instantiate a new immutable object, it first checks to see if an identical object already exists as a shared object.
Also, note that the behavior discussed in this article is specific to CPython versions 3.0 and above. You are not guaranteed the same behavior on different implementations or versions of Python.
PYTHON STRINGS — UNICODE SEQUENCES
As I briefly mentioned in my last article, string objects in Python are really sequences of Unicode characters, which is why Python’s documentation refers to them as “text” sequences. This can be seen by comparing the identities of individual characters within a string:
>>> a = "Holberton"
>>> b = "School"
>>> a[0] is b[0]
False
>>> a[1] is b[3]
True
Recall that the second is comparison returns True because of shared objects. CPython loads the Latin-1 range of characters (Unicode code points 0 to 255, inclusive) as shared objects every time Python is initialized. Any reference to a single character in this range is directed to the pre-existing object.
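The shared range can be checked directly. This is a minimal sketch that assumes a CPython interpreter; the helper name shared_code_points is my own and not part of any library. It collects every code point below a limit for which two separate chr() calls hand back the very same object.

```python
def shared_code_points(limit):
    """Return the code points below `limit` where two chr() calls yield one object."""
    shared = []
    for code_point in range(limit):
        first = chr(code_point)
        second = chr(code_point)
        # `is` compares identities (memory addresses), not values.
        if first is second:
            shared.append(code_point)
    return shared


latin_1 = shared_code_points(256)
# On CPython, every code point in the Latin-1 range is expected to appear here.
print(len(latin_1))
```

Raising the limit past 256 is a quick way to see where the shared range stops on your own interpreter.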
STRING INTERNING — WHY
Within CPython, strings of Unicode characters are stored as PyUnicodeObject instances. We can view the format of a PyUnicodeObject by looking at the source code:
As we can see, a PyUnicodeObject stores characters according to one of three different encodings. Each of these encodings takes up a different number of bytes per character: 1 byte for Latin-1, 2 bytes for UCS-2, and 4 bytes for UCS-4. This sizing is observable from Python (the subtraction is required because the total number of bytes needed to store a string is greater than the size of its characters alone):
>>> import sys
>>>
>>> string = "H"
>>> sys.getsizeof(string + '?') - sys.getsizeof(string)
1
>>>
>>> string = "Ω"  # outside the Latin-1 range, so stored as UCS-2
>>> sys.getsizeof(string + 'π') - sys.getsizeof(string)
2
>>>
>>> string = '🐍'
>>> sys.getsizeof(string + '💻') - sys.getsizeof(string)
4
The characters contained in a string determine the string’s size. For indexing purposes, each character in a string must take up an equivalent number of bytes — otherwise, operations such as slicing would be inaccurate.
If a string consists strictly of Latin-1 range characters, Python uses as little space as possible and stores every character in 1 byte. But as soon as that string contains a character that needs UCS-2, all other characters must be widened to take up 2 bytes as well, even if they only need half that space. The same goes for adding a character that needs UCS-4.
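>>> import sys
>>>
>>> s1 = "Holberton"
>>> sys.getsizeof(s1)
58
>>> s2 = s1 + "Ω"
>>> sys.getsizeof(s2)
94
>>> s3 = s2 + "💻"
>>> sys.getsizeof(s3)
120
As characters of different sizes are added to the string, the overall string size increases by more than just the size of the added character. The exact totals vary with the CPython version and platform (the figures shown match a 64-bit build of CPython 3.3 through 3.11), but the jumps are what matter.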
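The per-character width can also be measured without depending on the fixed overhead of a string object. The sketch below assumes CPython; bytes_per_character is a hypothetical helper of mine, and it relies only on sys.getsizeof. Comparing the doubled and tripled versions of a string cancels out the header, leaving just the storage used by one extra copy of the characters.

```python
import sys


def bytes_per_character(text):
    """Estimate how many bytes CPython uses for each character of `text`."""
    if not text:
        raise ValueError("need at least one character to measure")
    doubled = text * 2
    tripled = text * 3
    # Both strings use the same character width and the same header size,
    # so the difference is exactly len(text) characters' worth of storage.
    return (sys.getsizeof(tripled) - sys.getsizeof(doubled)) // len(text)


samples = (
    "Holberton",             # ASCII only
    "Holberton\u00a9",       # adds the copyright sign, still inside Latin-1
    "Holberton\u03a9",       # adds Greek capital omega, needs UCS-2
    "Holberton\U0001F4BB",   # adds the laptop emoji, needs UCS-4
)
for sample in samples:
    print(ascii(sample), bytes_per_character(sample))
```

Note that the copyright sign does not widen the string: its code point is 169, so it still fits in the 1-byte Latin-1 range.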
Now, you can probably imagine that between the proportionally growing cost of character memory and the additional information Python allocates to store strings, the strings created during a Python session can begin to take up a lot of space. To partially alleviate this issue, CPython implements a special instantiation process for string objects: string interning.
STRING INTERNING — WHAT
String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.
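To make the idea concrete, here is a toy model of an interning table in pure Python. It is only a sketch of the concept and not how CPython implements it; the names _interned and intern_string are hypothetical.

```python
# Hypothetical table: maps each string value to the one instance we keep.
_interned = {}


def intern_string(text):
    """Return the stored instance equal to `text`, storing `text` if it is new."""
    # setdefault saves `text` the first time a value is seen and
    # returns the previously saved instance every time after that.
    return _interned.setdefault(text, text)


# Two equal strings built at runtime.
first = "".join(["Hol", "berton"])
second = "".join(["Holb", "erton"])

print(first == second)                                # True: same value
print(intern_string(first) is intern_string(second))  # True: one shared instance
```

Because strings are immutable, handing out the same instance to every caller is safe: nobody can change it underneath anyone else.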
Recall the following example from part one of this series:
>>> a = "Holberton"
>>> b = "Holberton"
>>> a is b
True
With our newfound knowledge of string interning, we can flesh out this example further. In the first line, no instance of "Holberton" exists in memory yet, so the variable a is immediately assigned to refer to a new instance of "Holberton". After the execution of this line, "Holberton" is saved as an interned string. This takes place through calls to the following CPython functions:
At the second line, before creating a new instance of "Holberton", Python first checks its storage of interned strings to determine if the same string has already been instantiated.
Seeing that it has, it assigns the variable b to refer to that same instance. No new string objects are instantiated, and as a result, a is b.
String interning may sound similar to shared objects, and it should! Both the method and the idea behind interned strings run parallel to CPython’s implementation of shared objects. In fact, once a string has been interned, it is essentially equivalent to a shared object: the instance of that string is globally available to all programs executed in the given Python session. Just as with shared objects, by interning strings, Python can be more efficient in both time and memory.
STRING INTERNING — WHICH
If you’ve been reading this series in order, you probably no longer take me at my word, and for good reason. Python first checks its memory for identical immutable objects before instantiating new ones, but it really only checks shared values. Shared objects can be used to understand the instantiation process behind all immutable objects, except string objects.
Thus, it might not surprise you that string interning truly only applies to certain types of strings. At least I’m consistent? 😅
It wouldn’t make sense for Python to permanently save in memory every single string that is called — that would end up more wasteful than not. Instead, Python tries its best to exclusively intern the strings that are most likely to be reused — identifier strings. Identifier strings include the following:
- Function and class names
- Variable names
- Argument names
- Dictionary keys
- Attribute names
Note that Python is not actually detecting the above; it would not even be capable of doing so in the first place. Rather, the implementation of interned strings was written to catch identifier strings according to a standard. This standard is rigorous, and goes as follows:
1. The string must be a compile-time constant.
A string will not be interned unless it is loaded at compile time as a constant string. This includes strings defined as constant expressions; remember that such an expression is evaluated before the object is instantiated. Any string constructed at runtime (i.e., any string produced through methods, functions, etc.) will not be interned.
>>> a = "Holberton"
>>> b = "Holberton"
>>> a is b
True
>>>
>>> a = "Holberton"
>>> b = "Holb" + "erton"
>>> a is b
True
>>>
>>> a = "Holberton"
>>> b = "".join(["H", "o", "l", "b", "e", "r", "t", "o", "n"])
>>> a is b
False
The idea here is that constant strings are the most likely to be identifiers that will be used repeatedly.
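A string built at runtime can still be interned explicitly. The standard library exposes the interning table through sys.intern; the helper build_name below is just an illustrative stand-in for any runtime construction.

```python
import sys


def build_name():
    """Build "Holberton" at runtime, so it is not a compile-time constant."""
    return "".join(["H", "o", "l", "b", "e", "r", "t", "o", "n"])


plain_a, plain_b = build_name(), build_name()
interned_a, interned_b = sys.intern(build_name()), sys.intern(build_name())

print(plain_a == plain_b)        # True: the values are equal
print(interned_a is interned_b)  # True: sys.intern returns one shared instance
```

This is handy when a program builds the same strings over and over (parsed field names, for example) and wants them to share storage.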
2. The string must not be the product of constant folding, or must be no longer than 20 characters.
This one can be confusing, so let’s break it down step by step. First, constant folding. By constant folding, I mean the evaluation of constant expressions at compile time. For instance, in the example from condition one, the two constants in the expression "Holb" + "erton" are “folded” at compile time and received as just a single string, "Holberton".
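One way to observe folding, sketched here under the assumption of a CPython interpreter, is to compile a statement and inspect the constants stored on the resulting code object. compile() and the co_consts attribute are standard; the helper name constants_of is mine.

```python
def constants_of(source):
    """Compile `source` as a module and return the constants the compiler kept."""
    code = compile(source, "<example>", "exec")
    return code.co_consts


consts = constants_of('b = "Holb" + "erton"')
# If the expression was folded, the joined string is stored as a constant.
print("Holberton" in consts)
```

The joined string only shows up in co_consts because the compiler evaluated the concatenation ahead of time; a runtime call such as "".join(...) would leave only its separate pieces there.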
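Now, in general, a string longer than 20 characters will be interned (assuming it also meets condition three). However, if a string is the product of constant folding and is longer than 20 characters, it will not be interned. Note that this cutoff is tied to the CPython version: the 20-character figure describes the peephole optimizer used before CPython 3.7, and newer versions fold much longer strings, so the second comparison below may return True there.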
>>> a = "HolbertonHolbertonHolberton"
>>> b = "HolbertonHolbertonHolberton"
>>> a is b
True
>>>
>>> a = "Holberton" + "Holberton" + "Holberton"
>>> b = "Holberton" + "Holberton" + "Holberton"
>>> a is b
False
The idea here is that a string produced by constant folding and longer than 20 characters is most likely not an identifier. Most likely.
When in doubt over whether a long string will be interned, step through two questions. First, is the string longer than 20 characters? If not, proceed to condition three. If so, is the string the product of constant folding? If not, proceed to condition three. Otherwise, the string will not be interned.
3. The string consists exclusively of ASCII letters, digits, or underscores.
Nothing to misinterpret here — if a string contains any character not an ASCII letter, digit, or underscore, it will not be interned. Period.
>>> a = "Holberton School"
>>> b = "Holberton School"
>>> a is b
False
>>>
>>> a = "Holberton_School98"
>>> b = "Holberton_School98"
>>> a is b
True
>>>
>>> a = "Holberton_School98!"
>>> b = "Holberton_School98!"
>>> a is b
False
The idea here is that variable identifiers should not contain characters beyond these types.
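A word of caution if you try these comparisons in a script rather than at the interactive prompt: the prompt compiles every line on its own, whereas identical literals compiled together in one file can end up sharing a single constant, which hides the effect of interning. The sketch below mimics the prompt by compiling the same literal twice. The helper name same_object_across_compiles is hypothetical, and the result for non-identifier strings may differ between CPython versions and builds.

```python
def same_object_across_compiles(literal_source):
    """Evaluate one string literal in two separate compilations and compare identity."""
    results = []
    for label in ("<first>", "<second>"):
        namespace = {}
        # Each compile() call is its own compilation unit, like one prompt line.
        exec(compile("value = " + literal_source, label, "exec"), namespace)
        results.append(namespace["value"])
    return results[0] is results[1]


literals = ('"Holberton_School98"', '"Holberton School"', '"Holberton_School98!"')
for literal in literals:
    print(literal, same_object_across_compiles(literal))
```

Only the first literal is made up entirely of ASCII letters, digits, and underscores, so it is the one that should reliably report a shared object.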
ONE EXCEPTION — EMPTY STRINGS
The above three conditions are thorough and mandatory when it comes to interning strings. If even one of the conditions is not met, a string will not be interned.
A single exception exists that must be mentioned — empty strings. Empty strings are interned.
>>> a = ""
>>> b = ""
>>> a is b
True
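The three conditions and the empty-string exception can be summarized as a small predicate. This is a model of the rules as stated in this article, not CPython's actual code; the function name, its flags, and the constant names are hypothetical, and the caller has to say whether the string is a compile-time constant and whether it came from constant folding.

```python
import string

# Condition three: only ASCII letters, digits, and underscores are allowed.
IDENTIFIER_CHARS = frozenset(string.ascii_letters + string.digits + "_")
# Condition two: the length cutoff described in this article.
FOLDING_LIMIT = 20


def expected_to_be_interned(text, is_compile_time_constant=True, is_folded=False):
    """Apply the article's interning rules to `text` and return the prediction."""
    if text == "":
        return True  # the one exception: empty strings are interned
    if not is_compile_time_constant:
        return False  # condition one
    if is_folded and len(text) > FOLDING_LIMIT:
        return False  # condition two
    return all(char in IDENTIFIER_CHARS for char in text)  # condition three


print(expected_to_be_interned("Holberton_School98"))              # True
print(expected_to_be_interned("Holberton School"))                # False
print(expected_to_be_interned("Holberton" * 3, is_folded=True))   # False
```

The function only encodes the decision steps; whether a given interpreter agrees still has to be checked with an is comparison.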
Finally, on top of the above interning conditions, allow me to give another friendly reminder that the characters at Unicode code points 0 to 255 are loaded as shared objects every time a Python session is initialized. These single-character strings are not interned, per se, but since they are shared, they will behave identically to interned strings. This applies only to this range!
>>> a = chr(169)
>>> a
'©'
>>> b = chr(169)
>>> b
'©'
>>> a is b
True
>>>
>>> a = chr(937)
>>> a
'Ω'
>>> b = chr(937)
>>> b
'Ω'
>>> a is b
False
And with that, after much ado, we’ve closed the door on Python object instantiation. At this point, you know all there is to know about how Python instantiates objects!
To test your skills, I’d recommend opening a Python session and running through a gauntlet of examples similar to what I’ve done in these last three articles. Throw crazy combinations of objects together and see if you can predict the result of an is comparison. With your newfound knowledge of shared values and string interning, you should be able to successfully predict all of them!
In the next and final part of this series, I’ll close the circle on our examination of Python objects with a look at how objects are represented — classes.
TL;DR:
- CPython stores strings as sequences of Unicode characters.
- Unicode characters are stored with either 1, 2, or 4 bytes depending on the size of their encoding.
- The byte size of a string grows with the size of its largest character, since all characters must be the same size.
- To reduce the memory that strings can quickly consume, Python implements string interning (a.k.a. string storage).
- A string will be interned if it is a compile-time constant, is not the product of constant folding or is no longer than 20 characters, and consists exclusively of ASCII letters, digits, or underscores.
- Empty strings are interned.
Python String Interning Chart:
Completed Python Object Instantiation Chart: