Quantifying Simplified Chinese

What can we say mathematically about the difference between the simplified and traditional Chinese writing systems?

Imagine that tomorrow President Obama gets on the air and announces a plan to remove all spelling ambiguities in American English. Henceforth, trough will be spelled troff and, to the horror of high school teachers everywhere, there, their, and they’re will all be spelled the same way. The rules of English phonology are notoriously hard to pin down and so, the thinking goes, removing this arbitrary barrier to literacy will let learners spend less time sounding out words and more time using them to actually convey and understand information. This proposal has been made many times, even once by President Theodore Roosevelt, but it is met time and time again with indignation and comparison to 1984's Newspeak; people tend to associate the idea of “dumbing down” a language with cultural decline (see: the battle over texting shorthand). For all its merits, a clean break from the linguistic sediment that’s been layered into English over the years would really be jarring, and is unlikely to ever gain widespread support.

That’s why, when I started taking Mandarin last year, I was surprised to learn that the 1.3 billion people living in China are using a simplified version of their 3,000-year-old script. Simplified Chinese characters, as opposed to what have become known as traditional characters, were put into use by the PRC during the 1950s and 60s, in an attempt both to increase literacy and to do away with cultural baggage. In the Chinese-speaking world, the debate over simplified characters is a highly politicized one, as Taiwan, Macau, and Hong Kong never made the switch. People from those areas are derisive and understandably suspicious of communist China’s forcible overhaul of their shared script, sometimes preferring to refer to traditional script as 正体字 (correct characters) and simplified as 残体字 (incomplete characters). Although China’s literacy rate increased markedly in the second half of the 20th century, there are so many confounding variables that the debate about simplification’s effect on literacy is unlikely to ever be settled. The usual arguments for and against are summarized briefly here.

Proponents of simplification argue the following:

  • Simplified characters are easier to read and write because they contain fewer strokes. To be precise, a stroke is one of a set of eight defined pen motions one uses to write a character; the more strokes a character has, the more complex it is. Very common characters like 聽 (hear) and 爲 (for) are quite complex relative to the amount of information they convey and by simplifying them to 听 and 为 respectively, they become much easier to write and remember.
  • Simplified characters retain most of the consistency of traditional characters. We tend to think of Chinese characters as pictograms, but the vast majority are composed quite logically of other, simpler characters which are in turn made of radicals, the irreducible pictographic building blocks of the language. A great many characters comprise one component that provides meaning and another that provides a pronunciation clue. For instance, 請 (ask) is built from 言+青. The left half means language and the right half is pronounced similarly to 請. Therefore it can have its left-言 simplified to 讠(actually a centuries-old calligraphic shorthand) without any loss of semantic information, becoming 请. Repeat this for the many hundreds of characters with a left-言 and you’re starting to get somewhere. As a more drastic example, 讓 has been simplified to 让, but it still contains both a phonetic and a semantic component, as 襄 and 上 both provide about the same amount of phonetic hinting (that is, not much).
  • Simplified characters remove a lot of cruft from the Chinese language. Long ago, the character for cloud was 云. Then it was borrowed to write a similarly pronounced word meaning “to say”, and in order to disambiguate the two, a rain radical (雨) was added to the cloud character to make 雲. As 云 no longer has multiple meanings in modern Mandarin, simplification simply restored the character to its previous form.
  • Simplified characters remove unnecessary ornamentation. Characters like 錢 (money) contain reduplicated radicals that can be taken out to make 钱 without much loss. Similarly 龍 (dragon) becomes 龙 and 龜 (turtle, and recognizably a picture of one) becomes 龟.
  • Traditional characters are extremely hard to distinguish at small sizes. When you take 書 (book) and 畫 (draw) and shrink them onto a feature-phone screen, it can be a challenge to tell which is which. Simplified, they are 书 and 画, respectively.

Opponents of simplification argue the following:

  • Simplification is not always as consistent as it may seem and can remove valuable semantic information. The examples above, 书 and 画, also serve as arguments against simplification as they no longer contain the semantic component 聿 (writing brush) to suggest meaning. The traditional 聽 (hear) has an 耳 (ear) on the left that heavily suggests its meaning. The simplified version, 听, has a 口 (mouth) on the left and a 斤 (axe) on the right, both of which do precisely nothing to indicate what the character means. Opponents argue that this kind of simplification has the opposite of the intended effect and actually makes characters harder to remember, because their components lack a logical connection to the meaning or pronunciation of the character.
  • Simplification creates lots of characters that look too similar. 龟 may be easier to remember as a character for turtle but it also resembles 电 (electricity) for no reason except that it provides a convenient base. 無 (none) is simplified to 无, which looks a lot like 天 (sky), especially in handwriting. The left-hand 讠 radical looks suspiciously like the 氵(water) radical in handwriting.
  • Simplification actually resulted in the merging of quite a few characters, for example 后/後 (queen/after) → 后 and 發/髮 (emit/hair) → 发. The most notable example of this is the 乾/幹/干 → 干 merger. The first of those traditional characters means dry and the second character, among other things, means fuck; now that they’re the same, the two translations are frequently confused with hilarious results. Less funny is the merging of surnames, which caused crises of identity like the one you would have if you found out that, by order of the government, your last name suddenly no longer existed. In 1993 the PRC officially re-introduced the previously simplified character 鎔 simply because it was in the name of former Vice Premier Zhu Rongji and he refused to stop writing it the old way.
  • One of the most common arguments against simplification is that it robs characters, for lack of a better word, of their character. Adherents to traditional point to how the character 愛 (love) lost its 心 (heart) in its simplification to 爱, and wonder if the same can’t be said of the language as a whole following simplification. This argument is softened somewhat when you consider how much written Chinese has evolved over the centuries, and that most users of simplified can read traditional characters anyway, but simplification still represents an abrupt cut in an otherwise unbroken three-millennium-long thread.
This poster asks “without a heart, where is the love?” (source)

From a learner’s perspective, I’ve always appreciated simplified characters. I think that the points I outlined above all have merit, but the one that holds the most weight for me is simply the fact that they’re easier to remember and write. Because I use them more often, the cognitive load for reading them is also lower (traditional users I know say the same about their system). And after all, I think that they’re more visually appealing — I find the ornamentation on characters like 髮 and 錢 distracting. But these are all subjective observations. Reaching for objectivity, the argument I often make is that simplified characters have a better information density in the sense that it takes fewer strokes to put down the same character. Recently, I found the tool to quantify exactly what I mean by that.

Some of the methods by which Chinese characters are simplified (source)

This semester at MIT I’m taking a class called 6.004: Computation Structures, which covers how digital systems (i.e. computers) are built from the ground up. In our first week, we covered the basics of encodings. An encoding is just a system for transmitting or storing information; this is a broad definition that suggests numerous familiar examples. Morse code encodes letters as dots and dashes. JPEG encodes photographs as numbers. The English language encodes ideas as words. Intuitively, it makes sense that a good encoding is one that makes it simpler to convey a common message than a rare one. This is why the word “is” has two letters and “dephosphorylation” has many, as well as why the letter “e”, the most common in many languages, is the simplest to tap out in Morse code. This intuition is mathematically cemented in the idea of a Huffman code, which is an ideal binary encoding for a set of symbols S and corresponding probabilities P(s). You can think of S as the set of letters in the alphabet and P(s) as the function that tells us how frequently, on a 0–1 scale, any given letter appears in text. From the probability P we can now introduce a formal definition of information:

The information carried by symbol s is defined as log2(1 / P(s))

Basically, the lower P(s) is, the more information we are given. As an example, consider a game of hangman. If we are told that the word we are guessing contains a Z, that narrows the list of possible words considerably more than if we were told it contains an E, because P(Z) is much lower than P(E) in the English language — there are fewer words that contain it.
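To make the definition concrete, here is a minimal Python sketch of the formula above. The two letter frequencies are rough, commonly quoted figures for English text that I am assuming for illustration; they are not taken from this post's datasets.

```python
import math

def information_bits(p):
    """Information, in bits, carried by a symbol that appears with probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the range (0, 1]")
    return math.log2(1 / p)

# Assumed, approximate English letter frequencies (illustrative only).
p_e = 0.127
p_z = 0.00074

print(round(information_bits(p_e), 2))  # roughly 3 bits
print(round(information_bits(p_z), 2))  # roughly 10.4 bits
```

Under those assumed frequencies, learning that a word contains a Z is worth more than three times as many bits as learning that it contains an E.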

Armed with this framework, let’s see how Huffman coding builds a binary tree out of a set of symbols. Here is the algorithm:

  • Determine a and b, the two elements in S with the lowest corresponding P(s); call these probabilities p and q. Remove these values from S.
  • Construct binary tree T with a dummy parent node and a and b as its children. Assign T a probability of p + q and add the tree to S; a and b now comprise a single element and there is one fewer item in S.
  • Repeat this process until there is one element in S: a binary tree containing all of your original symbols.
  • To extract the encoding for a symbol s, follow the tree from the root to s. If you go left, add a 0 to the encoding; otherwise, add a 1.
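The four steps above can be sketched in Python roughly as follows. This is one possible implementation, not the one behind the tree pictured in this post; the function name `huffman_codes`, the tagged-tuple tree representation, and the toy probabilities are all my own assumptions.

```python
import heapq
from itertools import count

def huffman_codes(probabilities):
    """Map each symbol to a bit string by repeatedly merging the two rarest items."""
    if not probabilities:
        return {}
    tiebreak = count()  # unique counter so the heap never has to compare trees
    # A tree is ("leaf", symbol) or ("node", left_tree, right_tree).
    heap = [(p, next(tiebreak), ("leaf", s)) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p, _, a = heapq.heappop(heap)  # lowest probability
        q, _, b = heapq.heappop(heap)  # second lowest
        # Dummy parent with probability p + q goes back into the pool.
        heapq.heappush(heap, (p + q, next(tiebreak), ("node", a, b)))
    root = heap[0][2]

    codes = {}

    def walk(tree, prefix):
        if tree[0] == "leaf":
            codes[tree[1]] = prefix or "0"  # a one-symbol alphabet still needs a bit
        else:
            walk(tree[1], prefix + "0")  # going left adds a 0
            walk(tree[2], prefix + "1")  # going right adds a 1

    walk(root, "")
    return codes

toy = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
print(huffman_codes(toy))  # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```

With these toy probabilities each code length equals the symbol's information content exactly (1, 2, 3, and 3 bits), which only happens when every probability is a power of one half.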

That was probably difficult to follow, but it suffices to see the result:

Huffman coding for the English language alphabet (source)

This is the Huffman coding of the alphabet based on letter frequencies; this tree is actually used to compress data that is expected to be English text. The cool thing about this tree is that frequent, information-sparse symbols (e.g. E) are shallowly nested and have simple encodings (11), and infrequent, information-rich symbols (e.g. Z) are deeply nested, and have complex encodings (10000111). Actually, there are lots of cool things about this tree, but we are interested in these properties:

  • Huffman codings are optimal: no other encoding that gives each symbol its own fixed, unambiguous bit string has a shorter average length per symbol.
  • Huffman codings define a partial ordering of S, the set of symbols, in such a way that if two symbols a and b have P(a) > P(b) then the length of b’s encoding is at least that of a’s encoding. In other words, if you write down all the symbols in S by increasing information on one line, and then write them down by increasing encoding complexity on the next line, then it’s always possible to make these lines identical.
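Both properties can be checked numerically on a small example. The sketch below tracks only code lengths rather than building the tree, and compares the average length against the entropy (the average information per symbol), which a Huffman code is known to stay within one bit of. The helper names and the four-symbol distribution are assumptions made up for illustration.

```python
import heapq
import math

def huffman_lengths(probabilities):
    """Return the Huffman code length of each symbol without building the tree."""
    # Heap entries: (probability, unique id, symbols contained in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probabilities, 0)
    next_id = len(heap)
    while len(heap) > 1:
        p, _, left = heapq.heappop(heap)
        q, _, right = heapq.heappop(heap)
        for s in left + right:
            lengths[s] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (p + q, next_id, left + right))
        next_id += 1
    return lengths

def respects_frequency_order(probabilities, lengths):
    """True if no symbol has a longer code than a strictly rarer symbol."""
    return all(
        lengths[a] <= lengths[b]
        for a in probabilities
        for b in probabilities
        if probabilities[a] > probabilities[b]
    )

# Hypothetical four-symbol distribution, chosen only to keep the numbers small.
probs = {"e": 0.4, "t": 0.3, "a": 0.2, "z": 0.1}
lengths = huffman_lengths(probs)
average_length = sum(p * lengths[s] for s, p in probs.items())
entropy = sum(p * math.log2(1 / p) for p in probs.values())

print(lengths)
print(entropy <= average_length < entropy + 1)
print(respects_frequency_order(probs, lengths))
```

The ordering check is the one carried over to Chinese characters in the rest of this post, with stroke count standing in for code length.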

What on earth does any of this have to do with Chinese? Well, we’ve just determined a useful property of an encoding: the order symbols appear by frequency should resemble the reverse of the order they appear by complexity of encoding. Let’s put aside the English alphabet and binary trees for now. Henceforth, our symbols are Chinese characters and our encoding mechanism is writing them down by hand. In the same way we want “is” to be short and “dephosphorylation” to be long, we also want common (information-sparse) characters to have few strokes and rare (information-rich) characters to have many. So my question is: is stroke count of Chinese characters a good predictor of information density?

My first impression is: not at all. There is a striking disparity between a list of characters sorted by stroke count vs one sorted by frequency. The simplest characters are littered with rarely used oddballs like 丐 (beggar), 孓 (mosquito larva), and 兀 (cutting off the feet), while the most common often look more like 還 (yet), 會 (will), and 就 (just). The first list betrays the original use of Chinese characters: one of the very simplest is 卜(divination), reflective of the script’s humble origin as a fortunetelling tool:

These [writings] were divinatory inscriptions on oracle bones, primarily ox scapulae and turtle shells. Characters were carved on the bones in order to frame a question; the bones were then heated over a fire and the resulting cracks were interpreted to determine the answer. — Wikipedia

As the language evolved, characters for more abstract notions were required, and these were usually built by banging two existing characters together or making a phonetic rebus, rather than thinking about how a very common word might warrant a very simple representation. Fast-forward three millennia and it takes seven strokes to write 我 (I/me) but only three to write 弋 (a kind of ancient retrievable arrow). This kind of disparity is not unique to Chinese (see tomorrow vs adz in English), nor is it really reflective of its incredible power as a communication tool, but it’s a quirk that has nonetheless always intrigued me, and I wanted to explore it some more.

An oracle bone fragment from the Shang dynasty. The inscribed symbols are the direct precursors of today’s Chinese characters. (source)

To get started, I grabbed a couple of datasets and graphed character frequency vs. character complexity for the first 5,000 characters, the number an ordinary educated Chinese person would know. I obtained the probability of each character appearing (the seven most common each make up about 1% of all Chinese text, for instance) and used the logarithm formula from earlier to extract the information content of each one:

r-squared is 0.15 for simplified and 0.07 for traditional.
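For readers who want to reproduce this kind of figure, here is a minimal sketch of how an r-squared value can be computed from paired stroke counts and information values. It is not the script behind the charts in this post, and the `sample` pairs are invented placeholders, not real character frequencies.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R² for a simple least-squares line."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)

# Made-up (stroke count, probability) pairs, for illustration only.
sample = [(1, 0.02), (4, 0.01), (8, 0.04), (7, 0.005), (12, 0.001), (15, 0.0005)]
strokes = [count for count, _ in sample]
information = [math.log2(1 / p) for _, p in sample]

print(r_squared(strokes, information))
```

A value near 1 would mean stroke count predicts information content almost perfectly; values such as 0.15 and 0.07 mean it explains only a small share of the variation.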

The correlation here is unconvincing, although there is an interesting lower bound that seems to appear for how much information a character might contain based on its stroke count. Next I compared character frequency index against stroke count. On all following graphs, a higher frequency index means the character is less common, so the most common character has frequency index 1. In this way, frequency index indicates information content of a character, because less frequent characters have lower probabilities of appearing.

r-squared is 0.14 for simplified and 0.07 for traditional.

The two methods tell us the same thing — there is, at best, a very weak correlation between how complex a Chinese character appears and how much information it carries. To compare a very appley apple to a very orangey orange, I also graphed English word length vs. frequency index and found an even more profound lack of meaningful trend.

r-squared is 0.05.

It’s clear that neither English nor Chinese resembles an optimal encoding. This is inevitable because words or characters aren’t just random jumbles of strokes or letters but are constructed with intention, so that it’s clear that hydroplane has something to do with water, and that 談 has something to do with speech. There’s a balance to be struck here: add too much complexity to a word and it becomes redundant and ornamental, but remove too much, and your language is basically a Huffman coding; every word is the optimal length for its frequency, but contains no information about its own meaning, like a string of 1's and 0's. So while this is just one measuring stick we could choose to assess Chinese with, it does make for an interesting analysis of the two character sets. Simplified’s r-squared value is still twice traditional’s. What’s with that?

To better describe how the frequency list differs from the stroke-count list, I needed a new metric to describe the “disorder” of a list of numbers — how far the list is from its ordered state. For this I turned to insertion sort, which operates by swapping adjacent elements in a list until it is ordered. While considered inefficient, this algorithm has the useful property that it takes zero swaps to sort an already sorted list, and O(n²) swaps to sort a “maximally unsorted” list — i.e. one that is sorted in reverse order! So I wrote an insertion sort that counts and returns how many swaps it performs, then defined disorder for a list as follows:

disorder(L) = swaps to sort L / swaps to sort reversed(sorted(L))

The disorder for any list is a value between 0 and 1, where 0 means the list is already in order, and 1 means the list is in reverse order.
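The definition above can be sketched in a few lines of Python. This is my own reconstruction under the stated definition, not the original script; the function names are assumptions. The sample list is the stroke counts of the ten most common characters, which are worked through by hand just below.

```python
def count_swaps(values):
    """Insertion-sort a copy of values and return how many adjacent swaps it took."""
    items = list(values)
    swaps = 0
    for i in range(1, len(items)):
        j = i
        # Bubble the new element leftwards until it is no smaller than its neighbor.
        while j > 0 and items[j - 1] > items[j]:
            items[j - 1], items[j] = items[j], items[j - 1]
            swaps += 1
            j -= 1
    return swaps

def disorder(values):
    """Swaps needed for values, divided by swaps needed for the reversed sorted list."""
    worst = count_swaps(sorted(values, reverse=True))
    if worst == 0:
        return 0.0  # fewer than two distinct values: the list is trivially ordered
    return count_swaps(values) / worst

strokes = [8, 1, 9, 4, 2, 6, 2, 6, 7, 5]
print(count_swaps(strokes))                         # 21
print(count_swaps(sorted(strokes, reverse=True)))   # 43
print(round(disorder(strokes), 3))                  # 0.488
```

Because equal neighbors are never swapped, repeated values lower the worst case: ten distinct values would need 45 swaps when reversed, while this list, with two repeated pairs, needs 43.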

Insertion sort in action (source)

Now I could take a look at the disorder in the frequency-indexed list of stroke counts of a set of characters. By way of example, the ten most common Chinese characters are:

的, 一, 是, 不, 了, 在, 人, 有, 我, 他

The corresponding stroke count list is:

8, 1, 9, 4, 2, 6, 2, 6, 7, 5

It takes 21 swaps to order this list. The same list, sorted and reversed, is:

9, 8, 7, 6, 6, 5, 4, 2, 2, 1

It takes 43 swaps to put this list in order. Therefore the disorder of the list is 21 / 43 = 0.488. Not off to a great start! If disorder is low (< 0.5), that means that there is a positive correlation between complexity and information. If it is high (> 0.5) there is a negative correlation. For the first ten characters, there doesn’t appear to be much of a correlation at all! Undaunted, I applied this procedure to several sets of characters:

  • The first 5,000 characters by frequency (simplified)
  • The first 5,000 characters by frequency (traditional)
  • The 1,676 simplified characters among the first 5,000 that actually differ from their traditional counterpart, which I refer to as diff-simplified
  • The 1,676 traditional equivalents of those actually simplified characters, which I refer to as diff-traditional

Here is the result:

Ringing in at 0.3–0.4, these disorder levels indicate the same weakly positive correlation between complexity and information we’ve already seen. Honestly, I was a little disappointed to see how little difference there was between the different character sets. Since the simplified sets display less disorder, they do in fact provide a better encoding in the information-theoretic sense, but not to the extent that I was expecting, or hoping. The script I wrote also provided some other statistics. Here are a few interesting bits that I decided didn’t merit a chart:

  • With 9,933 characters (all the ones in my frequency table), the difference between disorder in the two character sets is more pronounced, at 0.30 for simplified to 0.37 for traditional.
  • For the first 5,000 characters, the simplified set’s mean/median stroke count is 10.3/10 and the traditional set’s is 12.1/12.
  • The most complex character in my dataset is 鱺 (eel) with 30 strokes. In simplified it looks like 鲡.

Here are a few more charts I made while I was at it:

Using simplified tends to save about two strokes per character.

The stroke count of Chinese characters in this range appears to be distributed about normally…or is it binomially? Very qualitatively, this makes sense. The simplest characters only have a few strokes to combine in relatively few ways. There are many more very complex characters than very simple ones, but most of them fall right of the 5,000 mark. In the middle is a sweet spot where you have a lot of ways to combine a large pool of relatively simple characters, often just by merging two of them horizontally or vertically. As an aside, it’s hard to overstate how versatile this process is, and I recommend HanziCraft as a great way to visualize how characters are constructed. Anyway, as expected, this histogram shows simplification in action — the simplified distribution is tighter and shifted to the left.

This chart shows character frequency vs. number of strokes removed by simplification. It’s interesting because it reveals bands at 3 and 5 strokes saved containing tons of characters that were simplified by virtue of their radical. At 3 strokes saved, you have dozens of characters with a 糹(silk) on the side that is simplified to 纟, for instance 绿, 练, and 细. Similarly simplified are 金→钅(gold/metal), 貝→贝 (money/value), and 車→车 (car), all of which have a 3-stroke differential. At 5 strokes, you’ll likewise see many, many characters which contain a 言→讠 (speech), 食→饣 (food), or 門→门 (door) simplification. On the other hand, there are some characters visible below the zero line that actually gained strokes in simplification! The most common of these is 強→强 (strong), which I actually think gained some nice squareness at the cost of one stroke. Finally, with a whopping differential of 21 strokes, 廳→厅 (hall) is the most drastic change in the entire script. As 丁’s phonetic clue is almost as useful as 聽’s, I chalk this one up as a win, but I’ve seen several users of traditional react with horror to this change.
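The banding is easy to see in a small sketch that groups a handful of the pairs mentioned above by strokes saved. The stroke counts are the standard counts as I understand them, typed in by hand for illustration rather than pulled from the post's dataset, so treat them as assumptions.

```python
from collections import defaultdict

# (traditional, simplified, traditional strokes, simplified strokes)
# Stroke counts are hand-entered assumptions, not values from the dataset.
pairs = [
    ("糹", "纟", 6, 3),
    ("金", "钅", 8, 5),
    ("貝", "贝", 7, 4),
    ("車", "车", 7, 4),
    ("言", "讠", 7, 2),
    ("門", "门", 8, 3),
    ("強", "强", 11, 12),
    ("廳", "厅", 25, 4),
]

def group_by_strokes_saved(rows):
    """Group traditional→simplified pairs by the number of strokes removed."""
    bands = defaultdict(list)
    for traditional, simplified, before, after in rows:
        bands[before - after].append(f"{traditional}→{simplified}")
    return dict(sorted(bands.items()))

for saved, members in group_by_strokes_saved(pairs).items():
    print(saved, members)
```

Every character that contains one of these components inherits the same saving, which is why whole bands of characters line up at 3 and 5 on the chart.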

Like I said before, I love simplified characters. My view is usually that more simplification is better, because fewer strokes means less wrist pain for everyone. But I got a taste of what it must be like to have your entire writing system upended when I discovered the list of second-round simplifications that the Chinese government tried and failed to implement in the 1970s and 80s. Although these characters aren’t part of my native language or my heritage, I still recoiled instinctively when I saw these proposed changes:

Current version of the character on the left, proposed simplification on the right (source)

All of these simplifications are highly logical and rely mostly on homophonic substitution. But after just a few months of taking Chinese classes and getting familiar with 原, 菜, and 酒, learning about their etymology and construction, it seemed cruel to rip all that suddenly-meaningful ‘stuff’ out just to save a few strokes here and there. It’s an arbitrary line drawn in the sand — this is what Chinese characters looked like when I learned them, and it would make me sad, somehow, if they were simplified further. The rational part of me knows that if character simplification continues, it will probably be for the benefit of the billions of people who use them every day. But I’ll always have a place in my heart for 藏, 幕, 疑, and the thousands of other needlessly complicated characters whose complexity, after a fashion, invites investigation and untangling. And until Obama tells me otherwise, I’ll continue to keep my there’s, theirs, and they’res apart as well. ■

谢谢, 奥巴马 (“Thanks, Obama”) (source)

Data Sources

All of the code I used to generate this data is available on GitHub.