Python Identifiers: What’s in a Name?

That which we type by any other language would still be Unicode!

Jon Wong
The Startup

Image courtesy of https://mamasweetie.com/baby-name-search/

Last updated: 09 Feb 2021

Python welcomes everyone!

Whatever natural language you use (English? Chinese? Japanese? Arabic? Hebrew? Others?), Python welcomes you warmly. Python 3 lets you use Unicode to specify names (identifiers), not just to type strings.

In your Python interpreter, do the following to witness Unicode in Python names:

>>> file = "MyFile.txt"
>>> folder = "MyFolder"
>>> 文件 = file
>>> 文档 = folder
>>> print(文件)
MyFile.txt
>>> print(文档)
MyFolder
>>> # Unicode in strings has been possible for a long time
>>> file = "我的文件.txt"
>>> print(file)
我的文件.txt

That means you can collaborate with your Chinese friends even though most of your software is written in English!

This wasn’t possible in Python 2:

>>> 文件 = "MyFile.txt"
  File "<stdin>", line 1
    文件 = "MyFile.txt"
    ^
SyntaxError: invalid syntax

Now that we’ve started having fun, let’s get down to understanding how to create valid identifiers/names in Python 3!

The official word on names

According to the Python docs (3.9.1), paraphrased into natural language:

A Python name consists of a starting character and zero or more continuing characters.

We then see in those docs that the starting character and the continuing characters must each belong to certain Unicode “general categories”: Lu (uppercase letters), Ll (lowercase letters), Nd (decimal numbers), and so on.

Whether a Unicode character is allowed in a Python name depends on the character’s “category”. Is the character a letter (uppercased, lowercased)? Is it a connector punctuation (eg underscore)? Is it a number?
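Before digging into categories, a quick way to ask Python itself whether a string would be accepted as a name is the built-in str.isidentifier() method. The following is only a small sketch: the helper is_usable_name and the sample strings are made up for illustration. Note that isidentifier() says yes to reserved words such as "class", so the sketch also consults keyword.iskeyword().

```python
import keyword

def is_usable_name(text):
    """Return True if text is a valid identifier and not a reserved word."""
    return text.isidentifier() and not keyword.iskeyword(text)

# Arbitrary sample strings, chosen only for illustration
for candidate in ["file", "文件", "_private", "3abc", "my-var", "class"]:
    print(candidate, is_usable_name(candidate))
```

The first three candidates pass. "3abc" starts with a digit, "my-var" contains a hyphen, and "class" is a keyword, so those three fail.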

What category is a character in?

To check the Unicode character’s category:

>>> import unicodedata as ucd
>>> ucd.category('文')
'Lo'
>>> ucd.category('A')
'Lu'
>>> ucd.category('a')
'Ll'
>>> ucd.category('_')
'Pc'
>>> ucd.category('3')
'Nd'

Equipped with that, let’s explore the types of Unicode characters that can go into constructing a Python name.

Python’s “unicodedata” module is useful for checking a Unicode character’s category.

To start a name

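The starting character can be a letter or an underscore. (Strictly speaking, the docs also admit letter numbers, category Nl, covered under Numbers below, plus a handful of characters with the Other_ID_Start property, but you will rarely need them.)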

Letters

The starting character can be a letter of these types (in decreasing order of familiarity and common use, and you’ll probably want to bother only with the first two):

  • Uppercased, lowercased (eg A and a). Category: Lu and Ll
  • Uncased. The majority of letters in Unicode have no case! Chinese characters are one of many examples, such as in the code example above. Category: Lo
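  • Title cased. Rare, such as the single-character digraph ǈ (U+01C8, used when transliterating Serbo-Croatian), whose uppercase is Ǉ (U+01C7) and lowercase is ǉ (U+01C9). Note that the Dutch IJ, as seen in “IJsselmeer”, is a different digraph, and its single-character forms Ĳ and ĳ are ordinary uppercase and lowercase letters. Category: Lt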
  • Modifier letters. These aren’t letters of any alphabet; they are typically used in phonetic transcriptions (eg ‘ː’ in ‘uː’ in the IPA for English). Category: Lm

A Python name can be started with a letter, uppercased or lowercased (or title cased in rare scenarios). In the majority of human languages (yes, English is but one!), letters are uncased; these letters can also start a Python name.
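To see the letter categories side by side, here is a small sketch that looks up one sample character per category and checks whether it can start a name. The sample characters and their labels are my own picks for illustration (U+01C8 is the title-cased L with small j, U+02D0 is the IPA length mark); appending "x" is just a way to build a two-character name to test.

```python
import unicodedata

# One hand-picked sample per letter category (illustrative only)
samples = {
    "A": "uppercase",
    "a": "lowercase",
    "文": "uncased",
    "\u01c8": "title cased (L with small j)",
    "\u02d0": "modifier (IPA length mark)",
}

letter_report = {}
for ch, label in samples.items():
    category = unicodedata.category(ch)
    can_start = (ch + "x").isidentifier()  # does a name starting with ch parse?
    letter_report[ch] = (category, can_start)
    print(f"U+{ord(ch):04X} {label}: {category}, can start a name: {can_start}")
```

Each of the five samples reports a different category (Lu, Ll, Lo, Lt, Lm), and each can start a name.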

Underscore

Python names can also start with an underscore (_), Unicode category Pc. Names that start with underscores hold special meaning for Python, so use them with care.

Here’s one example of a name started with _ having “special meaning” to Python. Inside a class body, a name that starts with two underscores (and doesn’t end with two) gets rewritten by Python, which is officially termed “name mangling”:

>>> class Foo:
...     __my_var = 5
...
>>> # dir(Foo) also lists inherited attributes, so filter for ours
>>> [name for name in dir(Foo) if "my_var" in name]
['_Foo__my_var']

A Python name can be started with an underscore, but use with care! Names started with underscore hold special meaning for Python.
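The three common underscore patterns can be sketched in one small class. The class name Account and its attributes are hypothetical, invented only for this illustration. A single leading underscore is a convention that Python does not enforce, two leading underscores trigger mangling, and double underscores on both sides mark hooks that Python itself calls.

```python
class Account:  # hypothetical class, for illustration only
    _balance_hint = "single underscore: private by convention only"
    __pin = 1234  # two leading underscores: stored as _Account__pin

    def __init__(self):  # "dunder" name: a hook that Python calls itself
        self.created = True

# Look at what the class actually stored
mangled_names = [name for name in vars(Account) if "pin" in name]
print(mangled_names)              # ['_Account__pin']
print(Account._Account__pin)      # 1234
print(hasattr(Account, "__pin"))  # False: the unmangled name does not exist
```

Mangling is meant to avoid accidental name clashes in subclasses. It is not a security feature, since the mangled name remains reachable from outside the class.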

To continue/complete a name

Everything that can be used to start a name can also be used to continue/complete a name.

A Python name can be continued/completed with letters or underscores. That is, everything that can start a name can continue/complete a name.

In addition, the following categories of characters can also be used to continue/complete a name.

More letters

Just as a Python name can be started by a letter, it can be continued/completed with letter-related characters. The following categories of non-letter characters, though only used in conjunction with letter characters, do count as being related to letters.

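  • Non-spacing marks. A non-spacing mark is placed within the horizontal confines of the previous character, either above or below in relative positioning. Eg the combining diacritics seen in á (acute), ô (circumflex), and ç (cedilla) and even ơ (where the above-right “horn” is still within the right edge of o). The category applies to the combining mark itself (eg U+0301, the combining acute accent); a precomposed á is a single lowercase letter of category Ll. Category: Mn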
  • Spacing marks. A spacing mark is placed a certain horizontal advance (distance) from the previous character, but not as much advance as for an entirely separate character. Usually vowel signs to be attached to consonants in the Brahmic scripts. Eg the Devanagari vowel sign ा (a) that marks the consonant त (t) to form ता in ममता (“mamta”, “motherly love”). Category: Mc

A Python name can be continued/completed with characters related to letters, such as diacritics that usually mark vowels, and also vowels that mark consonants (in Brahmic scripts).
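A short sketch can confirm that marks may continue a name but may not start one. The variable names and the choice of sample characters (the combining acute accent, and the Devanagari letter TA with the vowel sign AA) are mine, picked for illustration.

```python
import unicodedata

acute = "\u0301"    # COMBINING ACUTE ACCENT
aa_sign = "\u093e"  # DEVANAGARI VOWEL SIGN AA
ta = "\u0924"       # DEVANAGARI LETTER TA

print(unicodedata.category(acute))    # Mn
print(unicodedata.category(aa_sign))  # Mc

# The same two characters in both orders: the mark may only follow a letter
mark_checks = {
    "a + acute": ("a" + acute).isidentifier(),
    "acute + a": (acute + "a").isidentifier(),
    "ta + aa sign": (ta + aa_sign).isidentifier(),
    "aa sign + ta": (aa_sign + ta).isidentifier(),
}
for label, ok in mark_checks.items():
    print(label, ok)
```

The letter-first orderings are valid names, while the mark-first orderings are rejected.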

Numbers

A Python name can also be continued/completed with numbers.

  • Decimal numbers. Such as 1, 4, 7. Category: Nd
  • Letter numbers. These are usually numeral characters that aren’t Arabic numerals (decimal numbers), often from numeral systems that are rarely used for everyday arithmetic today: Roman numerals, Suzhou numerals. Category: Nl

Possibly interesting trivia: Surprisingly, everyday Chinese characters that represent numbers (一二三八九零) are categorized as Uncased letters (category Lo) instead of Decimal numbers, possibly because China has long been using Arabic numerals (decimal numbers). This is evident in Chinese addresses, such as those of famous restaurants: 成华区玉双路2号, 青羊区清江东路198号, etc.

Contrast this with Bengali numerals that are still in use today. Bengali characters that represent numbers are categorized as Decimal numbers (category Nd).
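The categories mentioned in this section can be checked directly. In the sketch below, the sample characters are my own picks: U+09E9 is the Bengali digit three, U+2167 is the Roman numeral eight, and U+3023 is the Suzhou numeral three.

```python
import unicodedata

# Hand-picked number-like characters (illustrative only)
number_samples = {
    "7": "Arabic digit",
    "\u09e9": "Bengali digit three",
    "\u2167": "Roman numeral eight",
    "\u3023": "Suzhou numeral three",
    "三": "Chinese character for three",
}

number_categories = {ch: unicodedata.category(ch) for ch in number_samples}
for ch, label in number_samples.items():
    continues = ("x" + ch).isidentifier()  # usable after the first character?
    print(f"{label}: {number_categories[ch]}, can continue a name: {continues}")

# Decimal digits cannot come first, whatever the script
print("7x".isidentifier(), ("\u09e9" + "x").isidentifier())  # False False
```

Both decimal digits report Nd, both historical numerals report Nl, and the Chinese character reports Lo, matching the trivia above.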

A Python name can also be continued/completed with numbers.
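Putting the start and continue rules together, the category checks can be sketched as a small function. This is a rough approximation and not how Python implements it: the function name looks_like_name is made up, letter numbers (Nl) are included among the starting categories because the docs list them there, and the sketch ignores both NFKC normalization and the few extra characters the docs admit through the Other_ID_Start and Other_ID_Continue properties.

```python
import unicodedata

# Categories taken from the rules described above; Other_ID_Start,
# Other_ID_Continue and NFKC normalization are deliberately ignored.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def looks_like_name(text):
    """Approximate the identifier rule using Unicode categories only."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    starts_ok = first == "_" or unicodedata.category(first) in START_CATEGORIES
    return starts_ok and all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in rest
    )

# Compare the sketch with Python's own verdict on a few samples
for candidate in ["文件", "_x1", "a\u0301", "\u0301a", "3abc", "my-var"]:
    print(repr(candidate), looks_like_name(candidate), candidate.isidentifier())
```

For real code, str.isidentifier() remains the reliable check; the sketch only makes the category rules visible.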

What is NFKC?

The Python docs on identifiers/names indicate that something called “NFKC” is involved. What is it?

NFKC (Normalization Form KC: compatibility decomposition, followed by canonical composition) is a Unicode normalization process that deals with many-to-one Unicode character mappings. Many visually recognizable glyphs have multiple alternative Unicode representations, and NFKC maps those alternatives to one form!
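The many-to-one idea is easy to observe with unicodedata.normalize(). The sample strings below are my own picks for illustration: a ligature, a fullwidth letter, a Roman numeral, and a letter followed by a combining accent.

```python
import unicodedata

# Alternative representations (illustrative picks)
variants = {
    "\ufb01": "fi ligature",
    "\uff21": "fullwidth A",
    "\u2167": "Roman numeral eight",
    "e\u0301": "e + combining acute",
}

nfkc_forms = {}
for text, label in variants.items():
    normalized = unicodedata.normalize("NFKC", text)
    nfkc_forms[text] = normalized
    print(f"{label}: {len(text)} code point(s) -> {normalized!r} "
          f"({len(normalized)} code point(s))")
```

The ligature becomes the two letters "fi", the fullwidth A becomes a plain "A", the Roman numeral becomes the four letters "VIII", and the two-code-point "e + accent" is composed into the single character "é".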

If the above summary didn’t satisfy you, we can dive deeper.

The Backus-Naur form (BNF) grammar in the docs says that a Python identifier is an xid_start followed by zero or more xid_continue.

Diving deeper into that statement shows that xid_start is the subset of characters in id_start whose NFKC form matches the pattern id_start xid_continue*. The pattern allows for more than one character because normalization can turn a single character into several. (If you know regular expressions, the pattern reads as “one id_start character followed by zero or more xid_continue characters”.)

And xid_continue is likewise the subset of characters in id_continue whose NFKC form matches id_continue*. Ignore the possibly convoluted rabbit hole in that recursive statement. We will arrive at the key takeaway soon.

The key takeaway is that Python converts every identifier to its NFKC normal form while parsing, and compares identifiers in that form. Two differently typed names that normalize to the same string are therefore the same name.
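The effect on names can be observed by running a line of source code through exec() and inspecting what got stored. This is only a sketch: the variable names are mine, and the source line types the name "file" with the single-character ligature U+FB01 in place of the letters "fi".

```python
import unicodedata

# The name on the left is typed with the U+FB01 ligature
source = "\ufb01le = 'MyFile.txt'"
namespace = {}
exec(source, namespace)

# exec() also adds '__builtins__', so filter that out
stored_names = [name for name in namespace if not name.startswith("__")]
print(stored_names)  # ['file']
print(stored_names[0] == unicodedata.normalize("NFKC", "\ufb01le"))  # True
```

The name is stored as the plain four-letter "file", so code that later refers to file finds the same variable.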

The topic of Unicode normalization is discussed in this article, if you’re up for further reading. As you go through that article, you’ll see that NFKC indeed decomposes some characters into multiple (simpler) characters before recomposing what it can.

The Python language is designed for an international audience. Python welcomes you warmly!

UPDATE 09 Feb 2021:

I had not noticed that the software community has taken note of an urgent need to advocate for equality among all peoples on Earth. Back in mid 2020, the Git community wanted to change a default label from “master” to “main”; GitHub already has. (I was about to embark on writing about Git.)

I salute, in equal measure, the people who are oppressed and the people brought up to oppress. While the oppressed diligently learned extra behavioral protocols that alleviated the oppressors’ already high tensions and fears with regard to unaccustomed circumstances, the oppressors risked lethal admonishment from their own communities as they sought to gradually normalize human interaction with the oppressed.

I dedicate this article, the first topic about Python as well as the feasible ideal of inclusivity, to everyone who works hard to build a world of love and understanding.

Jon writes technology tutorials, fantasy (a dream), linguistics (phonology, etymologies, Chinese), gaming (in-depth playthrough-based game reviews).