The Chinese Language as a Weapon: How China’s Netizens Fight Censorship

by Jeanette Si

Published in Berkman Klein Center Collection, Jul 21, 2017 (10 min read)

After Liu Xiaobo’s death on July 13, Chinese censors knew they had to work quickly. After all, Liu had been a prominent activist for democracy while alive, an integral figure in the Tiananmen Square protests — who just so happened to pass away while serving his sentence for dissenting against the Chinese Communist Party.

What’s more, it was implied that the Nobel laureate was denied treatment for his liver cancer while in custody. His wife was caught on video telling a friend the authorities “[c]an’t operate, can’t do radiotherapy, can’t do chemotherapy” on her husband, even though he was in critical condition.

His death makes Liu Xiaobo the second Nobel Peace Prize recipient to die while incarcerated, and the latest in a long line of martyrs who have perished for the cause of Chinese freedom.

In other words, it was only natural that Liu Xiaobo would become a targeted keyword.

Since July 13, Weibo has blocked all mentions of the name “Liu Xiaobo,” the phrase “R.I.P.,” and even the candle emoticon. Searches for his name on many other sites would turn up empty. But at the same time, seemingly at random, came an influx of posts mentioning a certain “Wang Xiaobo” and another “Teacher Liu” — which were both, of course, cryptic nicknames for the late Nobel laureate.

“Wang Xiaobo” simply switches out Liu’s last name. But “Teacher Liu” has a more ingenious derivation. Since Liu is the fourth most common Chinese last name and Chinese students address their teachers with the prefix “Teacher,” it’s the rough equivalent of giving someone the English pseudonym “Mr. Smith.” This poses a problem for censors, as the epithet makes it quite difficult to algorithmically discern whether a user is referring to their grade-school math teacher or the controversial Teacher Liu.
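To see why the nickname is hard to filter, consider a minimal Python sketch of a substring block list. The block list, the sample posts, and the function name are all assumptions made for illustration; the filtering rules Weibo actually uses are not public. Names are written in traditional characters to match the rest of this article.

```python
# Hypothetical block list holding only the full name (劉曉波 = Liu Xiaobo).
BLOCKED_TERMS = frozenset({"劉曉波"})


def is_blocked(post, blocked_terms=BLOCKED_TERMS):
    """Return True if the post contains any blocked term as a substring."""
    return any(term in post for term in blocked_terms)


direct = "悼念劉曉波"            # "Mourning Liu Xiaobo"
nickname = "悼念劉老師"          # "Mourning Teacher Liu"
classroom = "劉老師今天教了分數"  # "Teacher Liu taught fractions today"

print(is_blocked(direct))     # the full name is caught
print(is_blocked(nickname))   # the nickname slips through

# Blocking the nickname as well also flags the innocent classroom post.
stricter = BLOCKED_TERMS | {"劉老師"}
print(is_blocked(nickname, stricter), is_blocked(classroom, stricter))
```

In this toy setup the censor has two bad options: leave the nickname alone and miss the tribute, or block it and flag every post about an ordinary teacher named Liu.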

It may seem like a convoluted system of doublespeak to some, but for Chinese netizens, this is the norm — and always has been. Much of Chinese Internet lingo involves codewords, and the corpus of codewords is constantly changing to accommodate new topics and avoid smarter, stricter censors. It has reached the point where a simple understanding of Chinese vocabulary, syntax, and grammar is no longer enough to fully understand Chinese Internet discourse. On today’s Chinese Internet, fully comprehending the language requires a thorough knowledge of current events, a deep respect for historical implications, an agile mastery of cultural conventions, and more often than not, a healthy appreciation of topical humor.

Homophonic Codewords

For Chinese learners, “tone” is often one of the most confusing parts of the spoken language. In Mandarin, there are four different tones — or inflections — for every sound, and each sound can convey one or more different characters. In fact, 80% of all monosyllabic sounds in Chinese can be matched to multiple meanings.

It’s this system of overlapping sounds that makes Chinese a highly contextual language, where the correct meaning of any syllable must be discerned with respect to the syllables surrounding it. New learners may take a while to pick out the correct character meant by any spoken syllable, but it’s this same ambiguity in pronunciation that allows Chinese netizens impressive latitude in constructing sound-alike codewords for sensitive topics.

One of the most infamous homophonic codewords on the Chinese Internet is “river crab,” or “héxiè” (河蟹). Created as a mockery of former president Hu Jintao’s “harmonious society” initiative which sought to silence dissent, river crab is a near-homophone for harmony, or “héxié” (和諧). At first, netizens made fun of how the government censors would now “harmonize” dissidents on the Internet by taking down their content. Eventually, being “harmonized” evolved into being “river-crabbed” as a precaution against the government censors, and was quickly adopted by many Chinese online communities.

The humble mascot of the Chinese Internet resistance. (from Wikimedia Commons)

However, not all codewords of this nature are political — many of them are unrelated to the government and target the other content areas that Chinese censors block. When Chinese censors began to filter out posts for profanity, a rather strange term began floating around the Chinese Internet: “grass mud horse,” or “cǎonímǎ” (草泥馬). There is no real animal that is called a grass mud horse in Chinese, but these three characters were chosen for their sound, not their meaning. Taken together, they are a close homophone for an extremely obscene Chinese insult newly targeted by profanity censors.

Homophones have now become a weapon of the resistance, a way for context-sensitive Chinese netizens to speak about taboo content. It remains one of the most popular methods for creating codewords, as almost any netizen with an ear for recent events can sound out the words and match them to a blocked keyword.
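The river crab example can be sketched in a few lines of Python. This is only an illustration: the tiny pinyin table is hand-written for the four characters involved, and the function names are hypothetical. A real system would need a full pronunciation dictionary.

```python
# Toy pronunciation table: character -> (syllable, tone number).
# It covers only the "river crab" / "harmony" example from this article.
PINYIN = {
    "河": ("he", 2),
    "蟹": ("xie", 4),
    "和": ("he", 2),
    "諧": ("xie", 2),
}


def syllables(word, keep_tone=True):
    """Spell out a word syllable by syllable, with or without tone numbers."""
    result = []
    for ch in word:
        syllable, tone = PINYIN[ch]
        result.append(f"{syllable}{tone}" if keep_tone else syllable)
    return result


def is_near_homophone(a, b):
    """True if two different words share the same syllables once tones are ignored."""
    return a != b and syllables(a, keep_tone=False) == syllables(b, keep_tone=False)


print(syllables("河蟹"))                 # ['he2', 'xie4']
print(syllables("和諧"))                 # ['he2', 'xie2']
print(is_near_homophone("河蟹", "和諧"))  # True
```

The two words differ only in the tone of the second syllable, which is why a reader can sound one out and arrive at the other.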

Logographic Codewords

Chinese is one of the few remaining logographic languages still in wide use today, and the visual nature of the written characters presents even more possibilities for Chinese netizens seeking to evade censors. Many characters appear very similar to other characters, to the point where even native speakers often mix up characters in writing.

Many codewords take advantage of these similarities and convey banned concepts with lookalike characters. This has the added benefit of not triggering homophone censors, since these lookalike characters often do not sound alike at all. In fact, it may be especially difficult for those out of the loop to pick up on the existence of these codewords without some contextual hints.

For instance, “eye-field” is a codeword used in many circles in lieu of the word “freedom.” The connection between the two phrases may not be immediately apparent, as in Chinese, they don’t sound the same at all — “eye-field” is “mùtián,” while freedom is “zìyóu.” But the connection becomes more salient when we examine what these respective words look like: the characters for “eye-field” are 目田, while the characters for freedom are 自由. The two sets of characters look remarkably similar and are only differentiated by one stroke each.

This codeword was created by Chinese World of Warcraft players after they realized that many words had been blocked from the in-game chat, even potentially innocent ones like “freedom.” It’s also rather fitting, as many netizens say that the characters for “eye-field” look like “beheaded” versions of the characters for “freedom,” symbolizing how freedom of expression in China is still largely limited.

Though logographic codewords are a possibility, they aren’t as prevalent as other types of codewords since they require much more context to pick up on. To most people, “eye-field” would seem like gibberish if nobody explained the context behind it, since “eye-field” itself isn’t related to any existing Chinese words in meaning or sound. But because of their relative obscurity, logographic codewords may evade detection by censors longer than other codewords.
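As a rough illustration of the “eye-field” trick, the Python sketch below shows how a lookalike spelling passes an exact-match filter, and how a censor might try to catch it by mapping lookalike characters back to the ones they imitate. The two-entry lookalike table and the function names are assumptions made for this example, not a description of any real filter.

```python
# Hypothetical lookalike table: each codeword character maps to the
# character it visually imitates (目 -> 自, 田 -> 由).
LOOKALIKES = {"目": "自", "田": "由"}

BLOCKED_TERMS = ("自由",)  # "freedom"


def normalize(text):
    """Replace every known lookalike character with the character it imitates."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def contains_blocked(text, blocked_terms=BLOCKED_TERMS):
    """Exact substring match against the block list."""
    return any(term in text for term in blocked_terms)


codeword = "目田"
print(contains_blocked(codeword))             # False: exact matching misses it
print(contains_blocked(normalize(codeword)))  # True: caught after normalization
```

The catch for the censor is that a blanket mapping like this also rewrites every ordinary use of 目 and 田, so a practical filter would have to weigh the surrounding context as well.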

Allusory Codewords

Chinese is a language rich with historical context — many of Chinese’s four-character idioms, for instance, are based upon the (sometimes apocryphal) doings of the ancients. It would make sense, then, that modern Chinese net lingo often incorporates recent events, cultural allusions, and historical references to disguise sensitive terms from Internet filters.

Recent event codewords are more often than not bastardized versions of topical words and phrases. For instance, the phrase “hide-and-seek” (“duǒmāomāo,” 躲貓貓) is a euphemism for Chinese police brutality in prisons. This phrase was lifted from a 2009 police report which said a farmer jailed for unpermitted logging died in prison due to an injury sustained while “playing hide-and-seek” with the other inmates.

Cultural allusions often have to do with certain trends that occur in modern Chinese society, and codewords of this family usually take the form of ordinary Chinese phrases inserted into unconventional contexts. For example, to “check a water meter” (“chāoshuǐbiǎo,” 抄水表) can also mean that the police are paying someone a home visit. Since suspicious Chinese citizens are often reluctant to open their doors for the police, policemen often pose as water meter readers or mailmen in order to coax residents into letting them in.

Historical codewords borrow from most Chinese people’s collective historical literacy for their contrived meaning. Usually, these codewords use a historical phrase to refer to a contemporary functional equivalent, and typically draw pretty obvious parallels between the two. For example, the “imperial capital” (“dìdū,” 帝都) is an antiquated term for the city the emperor is headquartered in. In an Internet context, it is used by netizens to criticize the metonymical “Beijing” and conjures images of the head-of-state as an undemocratically appointed ruler with near-absolute power. The term’s widespread usage eventually caused Weibo to add it to its list of blockwords in 2015.

It may seem odd that understanding the Chinese of the Internet also involves understanding the Chinese of the past, but like many other languages, Chinese is highly contextual — and this context could be nearly anything, from the latest news headline to a reference to an ancient hero. And with the highly volatile atmosphere in China today, there may be many more allusion-worthy events to come.

Crippling Internet Censors: A Logical Extreme

The above taxonomy of codewords is not meant to be a definitive reference, nor are the categories mutually exclusive — one codeword may have traits from multiple categories. It’s meant simply as a way of conceptualizing how some codewords have come into being on the Chinese Internet, and to give a rough outline of the variety of different methods that Chinese netizens may use to derive their codewords. However, the efficacy of these methods, if taken to extremes, can be potentially devastating to Internet censors.

A 2015 Georgia Tech study, “Algorithmically Bypassing Censorship on Sina Weibo with Nondeterministic Homophone Substitutions,” explored the concept of homophonic codewords by creating an algorithm that generated homophones for blocked terms by choosing random characters with similar sounds. Though the authors did not use codewords pre-established by the Internet community, they wanted to see whether A) these randomly generated substitute words could bypass censors and B) native Chinese speakers would be able to pick up on the intended meaning of these substitutes anyway.

Their results showed that compared to a control group of Weibo posts that contain unaltered blockwords, their posts with homophonic substitutions were more likely to be published and stayed posted longer before being taken down. They also found that native Chinese speakers were able to discern the correct meanings of their homophonic substitutes 99.51% of the time, even though these substitutes were randomly generated by a phonetic algorithm and not part of any pre-existing lexicon.
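The general idea of random homophone substitution can be sketched in Python as follows. This is not the study’s algorithm, only a minimal illustration of the concept: the homophone table is a tiny hand-picked sample, and the function names are hypothetical.

```python
import random
import re

# Tiny hand-picked homophone table (an assumption for this sketch).
# A real tool would derive candidates from a full pronunciation dictionary.
HOMOPHONES = {
    "和": ["河", "何", "合"],  # all read "hé"
    "諧": ["鞋", "斜", "協"],  # all read "xié"
}


def substitute(term, rng):
    """Swap each character for a randomly chosen same-sounding character.

    Characters without an entry in HOMOPHONES are left unchanged.
    """
    return "".join(rng.choice(HOMOPHONES.get(ch, [ch])) for ch in term)


def disguise(post, blocked_terms, rng=None):
    """Replace every occurrence of each blocked term with a random homophone spelling."""
    rng = rng or random.Random()
    for term in blocked_terms:
        post = re.sub(re.escape(term), lambda match: substitute(term, rng), post)
    return post


post = "建設和諧社會"  # "build a harmonious society"
print(disguise(post, ["和諧"]))
```

Because the replacement is drawn at random each time, the same blocked term can surface under many different spellings, so a censor cannot simply add one new string to the block list.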

From the ratios they observed, the researchers predicted that if Weibo were to add all possible homophones for sensitive topics to its block list, 20% of all daily posts on Weibo would be flagged as false positives. If one in every five Weibo posts were falsely flagged, it would severely cripple innocuous conversations and might discourage further use of the platform.

They also posited that if Weibo employed human readers to screen posts for potential homophonic substitutions, it would have to add 15 hours of human labor per day per homophone transformation of a keyword, a substantial increase considering the sheer number of possible homophones for each keyword.
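To get a feel for how these two estimates scale, the Python sketch below turns them into simple arithmetic. Only the 20% rate and the 15 hours per day come from the study as quoted above; the post volume, keyword count, and number of homophone variants are made-up inputs chosen purely for illustration.

```python
FALSE_POSITIVE_RATE = 0.20       # share of daily posts falsely flagged (from the study)
HOURS_PER_VARIANT_PER_DAY = 15   # added human labor per homophone variant (from the study)


def falsely_flagged_posts(daily_posts):
    """Number of daily posts flagged in error if every homophone were blocked."""
    return round(daily_posts * FALSE_POSITIVE_RATE)


def extra_review_hours(keywords, variants_per_keyword):
    """Added human review hours per day across all keywords and their variants."""
    return HOURS_PER_VARIANT_PER_DAY * keywords * variants_per_keyword


# Hypothetical inputs, not figures from the study:
print(falsely_flagged_posts(1_000_000))  # 200000 posts per day
print(extra_review_hours(100, 50))       # 75000 hours per day
```

Even with modest made-up inputs, the review burden grows with the product of keywords and variants, which is the point the researchers make about the sheer number of possible homophones.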

The researchers of this study concluded that a random homophone generator may be an effective tool for bypassing China’s strict content censors, and that the technique used in their study “can likely be assembled with other tools to build a real-time system for Chinese social media users to circumvent censorship.” Though the study did not utilize a codeword lexicon, it does go to show that some of the inherent qualities of the Chinese language make it an especially difficult language to filter to any degree of accuracy.

Conclusion

China has long been known for possessing one of the world’s most advanced Internet filtering systems, but given the idiosyncrasies of the country’s primary language, it isn’t hard to see why it would need one. If the filters were not actively looking for sensitive content at each step of the content publishing process, they might overlook some controversial content. And from the looks of it, they already do.

Part of the reason for this oversight is the sheer number of codewords and euphemisms that exist for sensitive content, but another part is that the censors are starting to realize, to some extent, that full censorship is not possible (nor particularly desirable).

In an article from the IHT Magazine quoted by the New York Times, Chinese writer Yu Hua says that, with respect to the river crab phenomenon, “Officials are aware, of course, of [river crab’s] barbed meaning on the Internet, but they can hardly ban it, because to do so would outlaw the ‘harmonious society’ they are plugging. Harmony has been hijacked by the public.”

Whether or not this means China’s Internet discourse will become freer in the future, considering the current state of affairs, is hard to say. But with a language so colorfully versatile, steadily increasing Internet access, and only the world’s biggest population, the underground lexicon of China’s Internet still has much more room to grow and adapt to whatever situation the filters may throw at it next.

Jeanette Si is a summer intern with the Internet Monitor team at the Berkman Klein Center, supporting research efforts regarding freedom of expression on the Internet.