You probably have an idea of what the most popular passwords in English look like. There are lots of first names, cuss words, some ambiguous cases…
However, because much of the internet is built by and for the English-speaking world, we have a very limited understanding of the behaviors of non-Anglophone users. Since native English speakers only account for 5% of the world’s population, that’s a big chunk of the world whose user habits we’re missing. This is why I set out to find the password behaviors of non-English speakers, specifically groups whose native language is not written in a Latin-based script.
I chose Mandarin Chinese. This was for three reasons: (1) I understand the language, (2) Chinese is an excellent example of a non-Latin writing system — heck, it’s not even alphabetic — and (3) with 1 billion Mandarin speakers out there, there would definitely be a large body of research to draw from.
A little background on Mandarin Chinese
There are three things you should know about Chinese languages before we begin. The first is that they are a family of related but distinct tongues. The second is that most of them share the Chinese writing system, which has 10,000+ characters, of which only about 3,000 are in common use. The third is that they are tonal.
A tonal language is one in which different inflections of a syllable convey different meanings. For example, in Mandarin, 猪 (zhū), 竹 (zhú), 煮 (zhǔ), and 祝 (zhù) are pronounced just slightly differently from one another, but each conveys a different meaning. Moreover, two different characters can map to the same pronunciation. For example, 猪 = pig and 珠 = pearl are both pronounced “zhū”; in fact, there are 29 distinct characters that are all pronounced “zhū”.
This way of spelling Mandarin with Latin letters, by the way, is called the “pinyin system”. Under this system, “zhū” and “zhú” are considered distinct syllables. As you’ve probably noticed, many Chinese characters can map to the same pinyin, as in the case of the 29 zhū’s. If we further collapsed zhū, zhú, zhǔ, and zhù into one big, toneless group, there would be 117 zhu’s.
Typing in Chinese
If I, as a Mandarin speaker, wanted to text someone 你好 (nǐ hǎo), I’d start by typing “n-i”, and the input method would give me a list of matching characters ranked by frequency, similar to autocomplete.
You’ll notice that I typed “n-i” instead of “n-ǐ”. This is mostly for simplicity, as special characters take longer to type. In the case of ǐ, though, it’s also a necessity; as far as I’m aware, it can’t be typed using most international keyboards. I had to find it in a Unicode table and copy-paste.
I bring this up because stripping the tones from the pinyin strips away their complexity, thereby reducing the range of unique syllables. In fact, we go from 10,000+ unique Chinese characters to only 410 unique (toneless) pinyin syllables. That’s a big change, and we’ll see how that comes into play for Chinese users’ password security.
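To make the tone-stripping step concrete, here is a minimal Python sketch. It assumes pinyin written with the four standard tone diacritics and removes only those marks, so the two dots on ü survive. The helper name strip_tones is made up for illustration and is not taken from any real input method.

```python
import unicodedata

# Combining marks used for the four Mandarin tones in pinyin:
# macron (1st tone), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tones(pinyin: str) -> str:
    """Return pinyin with tone diacritics removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recombines what is left, so "u" + diaeresis becomes "ü" again.
    return unicodedata.normalize("NFC", toneless)

syllables = ["zhū", "zhú", "zhǔ", "zhù"]
# All four toned syllables collapse into a single toneless one.
print({strip_tones(s) for s in syllables})
print(strip_tones("nǐ hǎo"))
```

Four distinct syllables going in and one coming out is the same collapse, in miniature, that takes us from thousands of characters down to a few hundred typable syllables.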
So, what do Chinese passwords look like?
First, let’s take a quick look at the top 25 domestic Chinese passwords.
Just as for English-language passwords, there are a lot of numbers, abc’s, qwerty’s, and “password’s”. “Admin” is a default that definitely should’ve been changed… but what’s up with the “5201314”?
I then looked at a 2014 paper by Li Zhigong, Han Weili, and Xu Wenyuan that did an empirical analysis of leaked passwords from Chinese- vs. English-language websites, and here is what they found:
First, the authors looked at character distributions:
From a quick glance, we see that Chinese users are more likely to use numbers. We also see that certain characters — like q — are more popular with Chinese users, while other characters — like r and v — are more popular with English users. This is consistent with the character distribution of pinyin vs. English.
Li, Han, and Xu then looked at password composition — specifically, how many incorporated digits, letters, and/or symbols.
A huge chunk of Chinese users — 33%-65% — have digit-only passwords, which is dismally weak. A big chunk of English users — 35%-44% — have letter-only passwords, which is also dismal given that they’re often just dictionary words. A significant portion of both groups use letter+digit passwords, which could have decent security if implemented properly, but very few people include symbols in their passwords, so they’re still not as secure as they could be.
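If you want to run this kind of composition breakdown on a password list yourself, a rough Python sketch could look like the one below. The category labels and the sample list are my own assumptions for illustration (the sample is made up, not the leaked data), and the paper may define its categories differently.

```python
from collections import Counter

def composition(password: str) -> str:
    """Bucket a password by the kinds of characters it contains."""
    has_digit = any(c.isdigit() for c in password)
    has_letter = any(c.isalpha() for c in password)
    has_symbol = any(not c.isalnum() for c in password)
    if has_symbol:
        return "contains symbols"
    if has_digit and has_letter:
        return "letters + digits"
    if has_digit:
        return "digits only"
    if has_letter:
        return "letters only"
    return "empty"

# Made-up sample, loosely modeled on the kinds of passwords mentioned earlier.
sample = ["123456", "5201314", "password", "qwerty", "abc123", "admin", "p@ssw0rd"]

counts = Counter(composition(p) for p in sample)
for category, n in counts.most_common():
    print(f"{category:18s} {n / len(sample):.0%}")
```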
Digging deeper into password structure, Li, Han, and Xu then found that the most popular password structure in Chinese is 6 digits, while in English it’s 6 lowercase letters. They figured that, for the average Chinese user, numbers are easier to remember than Latin characters.
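One way to think about “password structure” is as a run-length summary of character classes. The sketch below uses a notation I made up for this post (D = digit, L = lowercase, U = uppercase, S = anything else), so “6 digits” becomes D6 and “6 lowercase letters” becomes L6. It is not necessarily the notation the authors used.

```python
from itertools import groupby

def char_class(c: str) -> str:
    """Map one character to a class label (labels are my own convention)."""
    if c.isdigit():
        return "D"
    if c.islower():
        return "L"
    if c.isupper():
        return "U"
    return "S"  # symbols and everything else

def structure(password: str) -> str:
    """Summarize a password as runs of character classes, e.g. 'L3D3'."""
    return "".join(
        f"{cls}{len(list(run))}"
        for cls, run in groupby(password, key=char_class)
    )

for pw in ["123456", "qwerty", "abc123", "Admin2014!"]:
    print(pw, "->", structure(pw))
```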
Next, they found that Chinese users are way more likely to use a keyboard pattern than English users, again, possibly because these are easier to remember.
Lastly, the researchers looked at sequential 8-digit-only and 6-digit-only groups within passwords and sought out whether these might be dates and, if so, which order of day, month, and year.
As it turns out, Chinese users are way more likely to go year, month, day, whereas English users are way more likely to place the year at the end. This matches the Chinese convention vs. the European & American conventions.
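Here is a rough sketch of how one might test whether an 8-digit run could be a date, and in which order. This is my own guess at the approach, not the researchers’ code. The three layouts and the 1900 to 2099 year window are assumptions I picked for illustration.

```python
from datetime import date

# Where the year, month, and day sit inside an 8-digit string for each layout.
LAYOUTS = {
    "YYYYMMDD": (slice(0, 4), slice(4, 6), slice(6, 8)),
    "MMDDYYYY": (slice(4, 8), slice(0, 2), slice(2, 4)),
    "DDMMYYYY": (slice(4, 8), slice(2, 4), slice(0, 2)),
}

def plausible_date_layouts(digits: str, min_year: int = 1900, max_year: int = 2099) -> list:
    """Return every layout under which `digits` is a valid calendar date."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        return []
    matches = []
    for name, (y, m, d) in LAYOUTS.items():
        year, month, day = int(digits[y]), int(digits[m]), int(digits[d])
        if not min_year <= year <= max_year:
            continue
        try:
            date(year, month, day)  # raises ValueError for impossible dates
        except ValueError:
            continue
        matches.append(name)
    return matches

for run in ["19900312", "25121990", "03121990", "12345678"]:
    print(run, plausible_date_layouts(run))
```

Note that a string like 03121990 is ambiguous between month-first and day-first, which is part of why telling these conventions apart in real data is tricky.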
Personally, what I found most alarming was a discovery highlighted by researchers Wang Ding and Wang Ping in a 2015 paper. Using the same hacked datasets that Li, Han, and Xu used, they found that Chinese users were significantly more likely to include Chinese (pinyin) names in their passwords:
When everything is spelled in pinyin, distinguishing between first and last names (and nouns) can be tricky, but Wang and Wang estimated that up to 14% of users on Chinese websites use a pinyin name in their passwords, while only 5% of users on an English website do (possibly because these users are Mandarin speakers). While their paper didn’t make it clear whether it was those users’ own names or somebody else’s, I still found it concerning because (1) your name is the most public-facing information about you, and (2) on some platforms, your friend lists are public-facing too.
When we in America talk about password security, we typically assume that strings of letters that aren’t English words are reasonably “random” and therefore secure. As we’ve now seen, this is not the case — pinyin words may look like gibberish to an English-speaker, but they’re still quite easily cracked. In other words, when we incorporate cultural behavioral patterns, we notice some of the design flaws that occur when we force a group of users to adapt to a tool — in this case, a Latin-letter keyboard — that is way out of their comfort zones.
Why can’t we have Unicode characters (including Chinese characters) in passwords?
Good question. I want this too. Unfortunately, most systems I’ve encountered won’t allow it.
With my Android phone, you can long press a letter to enter an accented letter, and you can swipe left and right on the space bar to change between international keyboards. However, the minute I even try to do either of these things while on my lock screen, my phone goes to sleep, which I suspect was a conscious decision to brute force users into only using ASCII characters.
I think the original idea was to ensure consistency. ASCII characters — which include the English alphabet, numbers, and a few special symbols — appear on most (if not all) international keyboards, regardless of whether you’re using the QWERTY layout, the French AZERTY layout, or the Turkish F-keyboard layout. So the idea is that, by limiting passwords to ASCII chars, you could enter your password from any terminal. On the other hand, if you allowed for Chinese characters, the user would have to have pinyin software installed first — but then what if that gets deleted?
Another challenge is encoding. Other than Unicode, the two main encodings for Chinese are Guobiao (GB) and Big5. Try to copy-paste between GB, Big5, and Unicode, and you may get some weird gibberish.
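To see why this matters for passwords specifically, here is a small Python sketch. It encodes the same two characters three ways using Python’s built-in gb2312, big5, and utf-8 codecs, then hashes each result. The plain SHA-256 is only a stand-in I chose for illustration; real systems should use a proper salted password hash.

```python
import hashlib

text = "你好"

# The same text becomes different bytes under each encoding.
encoded = {enc: text.encode(enc) for enc in ("gb2312", "big5", "utf-8")}
for enc, raw in encoded.items():
    print(f"{enc:7s} {raw.hex()}")

# Reading GB bytes as if they were Big5 does not give the original text back.
misread = encoded["gb2312"].decode("big5", errors="replace")
print("GB bytes read as Big5:", misread)

# Different bytes mean different hashes, so a server that stored the hash of
# one encoding would reject the "same" password sent in another.
digests = {enc: hashlib.sha256(raw).hexdigest() for enc, raw in encoded.items()}
print(len(set(digests.values())), "distinct hashes for one password")
```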
Lastly, Unicode is ridiculous. Its code space is fixed at 1,114,112 code points, but because combining marks can be stacked onto base characters, there is no clean answer to how many distinct characters it can represent, so I’m not sure what the security implications would be in a world of Unicode-enabled passwords.
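One concrete wrinkle, as far as I can tell: the same visible character can be stored as more than one sequence of code points. The sketch below uses the ǐ from earlier and Python’s standard unicodedata module. Whether a given login system normalizes input like this before hashing is something I’m assuming varies from system to system.

```python
import unicodedata

composed = "\u01d0"      # "ǐ" as a single code point
decomposed = "i\u030c"   # "i" followed by a combining caron

# They look identical on screen but are different strings and different bytes.
print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalizing both to the same form (NFC here) makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)
```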
Still, I think it’s worth exploring why even systems built for Chinese users don’t enable Chinese characters in passwords. As research has shown, the bulk of Chinese users are clearly unaccustomed to building strong passwords around ASCII characters, so maybe it’s time for a redesign.
From a computer security standpoint, think of it this way: We calculate entropy as H = L*log(N)/log(2), where L is the password length and N is the number of possible characters. Think of what it might be like if N=3,000 or 10,000+ instead of just the 95 printable ASCII characters. I’ll admit, implementing this in practice would mean tackling decades of tech debt on a global scale, so it isn’t exactly feasible, but wouldn’t it be nice?
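To put rough numbers on that formula, here is a quick Python sketch. It assumes every character is picked uniformly at random, which real users never do, so treat the results as upper bounds. The length of 8 is just an example I picked.

```python
import math

def entropy_bits(length: int, alphabet_size: int) -> float:
    """H = L * log2(N), assuming uniformly random characters."""
    return length * math.log2(alphabet_size)

alphabets = [
    ("digits only", 10),
    ("printable ASCII", 95),
    ("common Chinese characters", 3_000),
    ("Chinese characters (10,000+)", 10_000),
]

for label, n in alphabets:
    print(f"{label:30s} N={n:>6}  H={entropy_bits(8, n):6.1f} bits")
```

Under these assumptions, an 8-character password drawn from 3,000 characters carries far more entropy than one drawn from the 95 printable ASCII characters.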
Finally, don’t get me wrong — even with Chinese characters in passwords, I have no doubt that Chinese users could still fall prey to bad security habits. That’s just human nature. In this alternate universe, “我爱你” and “一二三四五” could very well become the next “ILoveYou” and “12345”.
PS: If you have any insight into the password habits of other non-Anglophone groups, please let me know in the comments!
PPS: I only summarized a handful of Li, Han, and Xu’s findings here, but their paper actually covers a lot more in detail. Same with Wang and Wang’s paper. I’m thinking of writing a Part 2 if there’s enough interest.