The 𝕆ᗪ⒟𝙞ȶч of Unicode Homoglyphs

Andrew Mc
Feb 21, 2019

You may have heard of the concept of homoglyphs, or confusables as ีnicode calls them. To make it simplะต just because a letter on a cฮฟmputer looks like an X does not mean it is an ฮง. In fact a few of character๊ฎช in this last paragraph are not the standard Latin characters.

Above I avoided some of the more obⅴіo∪s ones. What blends in varies highly based on the font you are using and the particular formatting around it. Overusing them can cause the ransom note effect. The more devious among you, or those who have read about this before, may have realized the problem with having identical-looking characters. wíkipedia.org is a great example. At first glance it may appear to be Wikipedia, but it's actually Desciclopédia, the Portuguese version of Uncyclopedia.

Those who followed the link may have noticed they are protected from such an attack. In the address bar the URL changes to http://xn--wkipedia-c2a.org. Different programs handle this differently. For example, Discord does the same thing, but Slack will happily show the Unicode version. The reason behind this is that the DNS expects hostnames to be in ASCII, yet domain names still need to support non-Latin characters. The solution to this is an encoding called Punycode.
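
As a rough illustration of that conversion, Python's built-in `idna` codec can turn the look-alike hostname into the ASCII form shown above. This is only a sketch: the codec follows the older IDNA 2003 rules, which is not necessarily what any given browser or chat client does, and the helper name `to_ascii_hostname` is made up for this example.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Encode each label of a hostname to its ASCII (Punycode) form."""
    # The built-in "idna" codec works label by label and adds the
    # "xn--" prefix to any label that contains non-ASCII characters.
    return hostname.encode("idna").decode("ascii")


# \u00ed is the accented i in the look-alike Wikipedia domain.
lookalike = "w\u00edkipedia.org"
print(to_ascii_hostname(lookalike))         # xn--wkipedia-c2a.org
print(to_ascii_hostname("wikipedia.org"))   # wikipedia.org (already ASCII)
```

Comparing the encoded form against the name you expected is one cheap way to spot a swapped character that the eye misses.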

Unicode is massive. As of this writing the latest version is 11.0, which contains 137,439 characters from 146 different scripts. These range from scripts in active use to things like Inscriptional Parthian and Linear B. There is even some fun stuff in there like ︘ (U+FE18), whose name contains a misspelling of bracket as brakcet. The Unicode codespace holds 1,114,112 code points (the most that UTF-16 can address), meaning that just over a tenth are used. Redundancy is therefore not a problem, which may go some way to explaining why homoglyphs are allowed to exist.
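
The arithmetic behind "just over a tenth" can be checked in a few lines. This sketch reuses the 137,439 figure quoted above as a plain constant and asks Python's `unicodedata` module for the official name of U+FE18; the variable names are only illustrative.

```python
import unicodedata

# The Unicode codespace runs from U+0000 to U+10FFFF inclusive.
total_code_points = 0x10FFFF + 1
assigned_in_11 = 137_439  # figure quoted in the text above

print(total_code_points)                             # 1114112
print(f"{assigned_in_11 / total_code_points:.1%}")   # 12.3%

# Character names are never changed once published, typos included.
print(unicodedata.name("\ufe18"))
```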

If you want to play with what I am about to talk about yourself, head over to the website I made and give it a try: https://textconstructor.y42.xyz/

Cyrillic

Cyrillic is by far one of the easiest scripts in which to find look-alikes for Latin characters. It should be obvious why it is in Unicode, given how many people write in it.
а с е о р х у - Lowercase Russian Cyrillic
a c e o p x y - Lowercase Latin
А В С Е Н І Ј К М О Р Ѕ Т Х - Uppercase Russian Cyrillic
A B C E H I J K M O P S T X - Uppercase Latin
і ј ԛ ѕ ԝ Ү Ғ Ԍ - Misc Cyrillic
i j q s w Y F G - Latin
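
One rough way to catch this kind of substitution is to look at which scripts a word draws its letters from. Python's standard library has no script property, so the sketch below leans on the first word of each character's Unicode name as a stand-in; that is a heuristic assumed for this example, and `scripts_used` is a hypothetical helper rather than an established API.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Return the first word of each letter's Unicode name as a rough script tag."""
    tags = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            tags.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return tags


print(scripts_used("paypal"))  # {'LATIN'}
# The same word with Cyrillic a (U+0430) swapped in for both Latin a's.
print(sorted(scripts_used("p\u0430yp\u0430l")))  # ['CYRILLIC', 'LATIN']
```

A word that mixes tags is not automatically malicious, but it is a reasonable thing to flag for a second look.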

Greek

Greek overlaps with Latin characters, mostly in the capitals. Interestingly, it also overlaps with Cyrillic, which I will not get into here.
Α Β Ε Η Ι Κ Μ Ν Ο Ρ Τ Χ Υ Ζ ο ν - Greek
A B E H I K M N O P T X Y Z o v - Latin

Figure: a diagram of the overlap between Latin, Cyrillic, and Greek (credit: Wikipedia)

Armenian

There is not as much overlap here, but there are still a few shared characters.
Լ Տ օ ո ս - Armenian
L S o n u - Latin

Roman Numerals

A rather odd thing to include, given that they are just Latin letters. However, they were added to Unicode to make compatibility with older character encodings easier.
Ⅰ Ⅴ Ⅹ Ⅼ Ⅽ Ⅾ Ⅿ ⅰ ⅴ ⅹ ⅼ ⅽ ⅾ ⅿ - Roman numerals
I V X L C D M i v x l c d m - Latin
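
Because these exist for compatibility, Unicode records which plain letters each one stands for, and NFKC normalization applies that mapping. A small sketch using Python's `unicodedata` follows; the sample string is made up for illustration, and note that this does nothing for true cross-script look-alikes such as Cyrillic or Greek.

```python
import unicodedata

# Roman numeral code points for M, M, X and IX (one character each).
roman = "\u216f\u216f\u2169\u2168"
folded = unicodedata.normalize("NFKC", roman)

print(folded)            # MMXIX
print(roman == folded)   # False: different code points, similar look

# A Cyrillic a has no compatibility mapping, so it passes through unchanged.
print(unicodedata.normalize("NFKC", "\u0430") == "a")  # False
```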

Bold/Italic/Sans-serif

These are all functionally the same as normal Latin characters but with some sort of formatting involved. Unicode says you should not use these for presentation markup, meaning you should not just substitute the bold set in to display a bolded Latin character. The sans-serif set will appear identical to the normal Latin set in a sans-serif font.
𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 - Bold
𝘢𝘣𝘤𝘥𝘦𝘧𝘨𝘩𝘪𝘫𝘬𝘭𝘮𝘯𝘰𝘱𝘲𝘳𝘴𝘵𝘶𝘷𝘸𝘹𝘺𝘻𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡 - Italic
𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕 - Bold Italic
𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹𝖺𝖻𝖼𝖽𝖾𝖿𝗀𝗁𝗂𝗃𝗄𝗅𝗆𝗇𝗈𝗉𝗊𝗋𝗌𝗍𝗎𝗏𝗐𝗑𝗒𝗓 - Sans Serif (these also come in bold, italic, and bold/italic variants)
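
The bold set sits in one unbroken run of code points, so substituting it is just an offset from ASCII. The sketch below assumes that layout (capitals from U+1D400, lowercase from U+1D41A) and uses a made-up helper name, `to_math_bold`. Some of the other styles have gaps where an older character elsewhere in Unicode fills the slot, so the same trick needs exceptions there.

```python
import unicodedata

BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def to_math_bold(text: str) -> str:
    """Swap ASCII letters for their Mathematical Bold counterparts."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # leave digits, spaces and punctuation alone
    return "".join(out)


bold = to_math_bold("Unicode")
print(bold)
# NFKC normalization folds the styled letters back to plain ASCII.
print(unicodedata.normalize("NFKC", bold))  # Unicode
```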

Bubble

This is a very old part of Unicode, though it has been updated and expanded since it first appeared. It was used for making things like lists. ® © ℗ are not actually part of this set; they are considered unique symbols instead.
ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ - Circled Latin
🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩 - Negative Circled Latin

Small Capitals

Like normal capital letters but smaller. These are meant for IPA transcriptions. This set is missing the x, and the f, q, and s are not available on some systems.
ᴀʙᴄᴅᴇꜰɢʜɪᴊᴋʟᴍɴᴏᴘꞯʀꜱᴛᴜᴠᴡʏᴢ - Small caps

Superscript / Subscript

The most interesting thing here is the lack of the letter q in superscript lowercase. If this post does one thing, I hope it starts momentum to fix this oversight. Both the subscript and the superscript capital sets are missing a number of characters.
ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁᵂ - Superscript capital
ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ - Superscript lowercase
ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓ - Subscript

Upside Down Text

This one is a hodgepodge of different characters. Different people implement the idea differently. Wikipedia actually has a chart comparing sites that have done this (the site I made uses the list on that page), which may be close to the most ridiculous feature comparison on Wikipedia. Most of the truly flipped characters are used in IPA.
Z⅄XMΛ∩⊥SᴚტԀONW˥ꓘſIH⅁ℲƎᗡƆᗺ∀zʎxʍʌnʇsɹbdouɯןʞɾıɥƃɟǝpɔqɐ - Upside Down Latin
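
Mechanically, most flippers boil down to a lookup table plus reversing the string, so that the text reads correctly once rotated. The sketch below is deliberately partial: the `FLIP` table covers only a handful of lowercase letters taken from the list above, and both it and the `flip` helper are illustrative rather than how any particular site does it.

```python
# Partial, illustrative table: only a handful of lowercase letters.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A
    "b": "q",
    "d": "p",
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "h": "\u0265",  # LATIN SMALL LETTER TURNED H
    "n": "u",
    "p": "d",
    "q": "b",
    "t": "\u0287",  # LATIN SMALL LETTER TURNED T
    "u": "n",
}


def flip(text: str) -> str:
    """Flip known letters and reverse the order so the result reads rotated."""
    # Characters missing from the table are passed through unchanged.
    return "".join(FLIP.get(ch, ch) for ch in reversed(text))


print(flip("bad"))
print(flip("up") == "dn")  # True
```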

Parenthesized

These are from the same block as the bubble letters above and are used for a similar purpose.
⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵ - Parenthesized Latin

Squared

These come from the Enclosed Alphanumeric Supplement, the same block as the negative circled letters above and a companion to the block that holds the parenthesized and bubble letters. This block also contains the regional indicator symbols (🇺 🇸), which are paired up to make the country flag emojis. The negative version of squared does not render the same for every character because some of them double as emoji: the blood types (A, B, O, and AB) and parking (the P). There are a few combination squares as well: 🆑🆒🆓🆔🆕🆖🆗🆘🆙🆚 🆊 🆋 🆌 🆍🆎🆏
🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉 - Squared
🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉 - Negative Squared
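
The flag trick is easy to reproduce: each letter of a two-letter country code maps to the regional indicator at the same alphabet position, and a supporting font draws the pair as a single flag. A minimal sketch, with `flag` as a hypothetical helper name:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Build a flag sequence from a two-letter country code such as "US"."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    # Each letter maps to the regional indicator at the same alphabet offset.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


us = flag("us")
print(us)                                # drawn as a flag where fonts support it
print([f"U+{ord(c):04X}" for c in us])   # ['U+1F1FA', 'U+1F1F8']
```

Whether the pair shows up as a flag or as two boxed letters depends entirely on the platform's fonts.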

Blackboard Bold

This group is used for math, specifically number sets. It is not new, and is thought to have come from a 1965 textbook on complex analysis.
𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ - Blackboard Bold

Full Width

The official text of ａｅｓｔｈｅｔｉｃ. This is a holdover from early usage of Chinese on the computer. A Chinese character is closer to a square, so it would take up two character slots on a terminal. To keep the formatting of mixed ASCII text consistent, full-width characters were added that also take up two normal ASCII character slots.
ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ - Full Width Latin
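
The full-width forms mirror printable ASCII in the same order, so converting is a fixed offset: U+FF01 lines up with U+0021, a difference of 0xFEE0. The sketch below assumes that layout and, as an extra choice of this example, swaps the plain space for the ideographic space U+3000; `to_fullwidth` is a made-up name.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 minus U+0021


def to_fullwidth(text: str) -> str:
    """Map printable ASCII to the matching full-width forms."""
    out = []
    for ch in text:
        if "!" <= ch <= "~":
            out.append(chr(ord(ch) + FULLWIDTH_OFFSET))
        elif ch == " ":
            out.append("\u3000")  # IDEOGRAPHIC SPACE, the wide space
        else:
            out.append(ch)
    return "".join(out)


wide = to_fullwidth("aesthetic")
print(wide)
# NFKC normalization folds full-width forms back to ordinary ASCII.
print(unicodedata.normalize("NFKC", wide))  # aesthetic
```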

Script

Unicode has included another set of letters for the purpose of mathematics. This is quite fancy looking and comes in normal and bold variants. This has less support than many other items on this list; notably, ChromeOS cannot display it.
𝒶𝒷𝒸𝒹𝑒𝒻𝑔𝒽𝒾𝒿𝓀𝓁𝓂𝓃𝑜𝓅𝓆𝓇𝓈𝓉𝓊𝓋𝓌𝓍𝓎𝓏𝒜𝐵𝒞𝒟𝐸𝐹𝒢𝐻𝐼𝒥𝒦𝐿𝑀𝒩𝒪𝒫𝒬𝑅𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵 - Script Latin
𝓪𝓫𝓬𝓭𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁𝔂𝔃𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩 - Bold Script Latin

Fraktur

This might be one of the weirder alphabets on this list. There are 400 years of history behind its usage. Up until the early 20th century it was the German lettering of choice. The move away from it was a dispute worthy of a Wikipedia article. It does contain the entire Latin alphabet, so it is included here.
𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ - Fraktur
𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅 - Bold Fraktur

This is not a comprehensive list. Unicode is massive and they continue to add to it every year. In the future more homoglyphs may appear. I welcome any feedback, interesting tidbits, and corrections. There will be an attempt to keep this article up to date but no guarantee. I hope this inspires you to look at the text of Unicode in a different way and make your own discoveries about the weirdness it contains.
