Well… it might. And should you intern your strings? …maybe.
As a quick recap: interning is the process whereby Python creates only one instance of an object, making it a singleton that exists only once in your program, at one place in memory. It does this with the integers -5 to 256 inclusive on initialisation, so that there is, for example, only one object at one place in memory in your entire program that represents the integer 100. I wrote an explainer of this in my previous post (and as a disclaimer, this interning depends on the implementation and version of Python you’re running, but for our purposes we’ll assume it’s the normal CPython).
So, will it intern your string objects just the same? That is to say, if you try to create two strings with the same characters in them, will Python just create and refer to a singleton object in one place in memory? The answer is… sometimes:
>>> a = 'Here is a string'
>>> b = 'Here is a string'
>>> a == b
True
>>> a is b
False
>>> c = 'here_is_a_string'
>>> d = 'here_is_a_string'
>>> c == d
True
>>> c is d
True
As you can see, Python considers c and d to be the same object; or in other words, the variables c and d both point to the same memory address. But this isn’t the case for a and b. Why?
Well, you might note that here_is_a_string looks somewhat like a Python variable name in itself: it’s in snake case, with no spaces or non-ASCII characters. Python’s variable names and other identifiers (class names, function names etc.) are themselves interned, in order to optimise the speed at which your program can run. Since these names are effectively strings in themselves, the same rules apply to string objects under the hood.
So, in short, and to somewhat oversimplify things: any string that contains only letters, digits and underscores will generally be interned.
In practice, it’s a bit more complicated (as always) and there are a few exceptions. If you’re interested in a deep dive into what happens under the hood and the specific situations in which strings are interned by default, Brenan D Baraban has a good explainer for you.
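As a minimal sketch of one such exception, assuming CPython: a string assembled while the program is running is generally not interned automatically, even if it contains only letters, digits and underscores. The variable names below are purely illustrative.

```python
import sys

# The same identifier-like characters as before, but assembled at run time
# rather than written out as a single literal.
prefix = 'here_is'
built_1 = prefix + '_a_string'
built_2 = prefix + '_a_string'

print(built_1 == built_2)   # True: the characters match
print(built_1 is built_2)   # typically False in CPython: two separate objects

# sys.intern() returns one canonical object for any given value.
print(sys.intern(built_1) is sys.intern(built_2))   # True
```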
If you want, you can force Python to intern strings, including strings with spaces or characters like ~. Here are a couple of advantages you could get from doing so:
- Python can do comparisons much faster. If you need to do very large numbers of string comparisons, interning your strings lets you use the identity operator (is) rather than the equality operator (==): instead of comparing the two strings’ characters one by one, Python only has to compare memory addresses, because two interned strings with the same characters are the same object at the same address. This is only safe if every string involved has been interned.
- If you’re working with a dataset that includes an extended body of text in separate strings (such as if you’re doing Natural Language Processing), you can dramatically reduce the memory used if all those repeated words (a etc.) are only instantiated once. One such example used interning to reduce the number of string objects obtained from the text of Hamlet from 31,166 to 4,529…!
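To illustrate the second point on a much smaller scale, here is a rough sketch (not the Hamlet example itself) that counts how many distinct string objects back a list of words before and after interning. The helper count_objects and the sample sentence are made up for illustration, and the "before" figure assumes CPython’s behaviour.

```python
import sys

text = 'to be or not to be that is the question'

# str.split() builds a string object for each word it finds.
tokens = text.split()

# Interning maps every equal word onto one shared object.
interned_tokens = [sys.intern(token) for token in tokens]


def count_objects(strings):
    """Count distinct objects (not distinct values) in a list of strings."""
    return len({id(s) for s in strings})


print(len(tokens))                      # 10 words in total
print(count_objects(tokens))            # typically 10 in CPython
print(count_objects(interned_tokens))   # 8: one object per unique word
```

The saving grows with the amount of repetition in the text, which is why it matters most for large bodies of natural language.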
Sound good? If so, and you decide you need to intern your strings, you can do, provided you always instantiate them like this:
>>> import sys
>>> s = sys.intern('This is my new string!')
>>> t = sys.intern('This is my new string!')
>>> s is t
True
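The "always" matters. As a rough sketch, assuming CPython: an equal string that arrives from somewhere else without going through sys.intern is usually a different object, so an is comparison against it can fail even though == succeeds. The name raw is just an illustrative stand-in for such a string.

```python
import sys

s = sys.intern('This is my new string!')

# An equal string built at run time, without going through sys.intern().
raw = ' '.join(['This', 'is', 'my', 'new', 'string!'])

print(raw == s)               # True: the characters match
print(raw is s)               # typically False in CPython: raw was never interned
print(sys.intern(raw) is s)   # True: interning returns the existing object
```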
In general, though, you don’t need to deliberately intern your strings. Unless you’re aiming for a specific large-scale optimisation like in the examples above, this extra clutter in your code isn’t worth it. Keeping your code clean is so much more important!
This is the third of a series of articles I wrote on memory addresses in Python; here are the others:
III. Will Python intern my string? (this article)