Confusable Character Detection in Erlang
Grindr is the world-leading social networking app that connects gay, bi, trans, queer, and other people in almost every country in the world. With millions of users opening the app every day, Grindr’s popularity makes it an attractive platform for spammers.
Grindr’s use throughout the world necessitates support for transmitting text in any language, so we support sending characters defined by the Unicode standard, using the UTF-8 encoding. However, the Unicode standard specifies over 140,000 characters, and there are groups of characters that resemble one another visually. This fact is well-known by spammers and other bad actors, who are known to craft messages and names by substituting letters in words with characters that are visually confusable with the letter being substituted. This is a common way of stepping around simple spam filters, as well as uniqueness checks for names.
Grindr is committed to protecting its users from spam, and we employ several tactics to try and prevent spam from being transmitted on our platform. Grindr’s chat system is built using Erlang. We’re excited to announce that we have open-sourced erl-tr39, an Erlang library for determining whether pieces of text are visually confusable.
What does it mean for text to be visually confusable?
People can usually tell that “𐌁𝕠𝖇” spells “Bob”. However, computers generally cannot see the resemblance, because computers usually compare strings of text by their representation as bytes, rather than by their visual presentation. Consider the codepoints that comprise “Bob” and “𐌁𝕠𝖇”:
What’s important to note here is that “𐌁𝕠𝖇” is not the result of any trick using fancy fonts; it is comprised of Unicode characters that are distinct from the characters that make up “Bob”. As a result, “𐌁𝕠𝖇” has a very different byte representation than “Bob”, and any procedure for testing string equality using byte representations will consider “𐌁𝕠𝖇” and “Bob” to be different strings.
Spammers can and do leverage visually confusable characters in order to bypass word filters and name checks. In one case, 37,000 people downloaded a fake version of the popular Chrome extension AdBlock Plus, called “АdВlосk Рluѕ”, where several of the Latin letters have been replaced with identical Cyrillic letters. When encoded with any of the valid Unicode encodings, these two strings do not compare equal!
Thankfully, the Unicode Consortium has been openly working on Technical Standard #39, which describes different ways of detecting and avoiding security issues related to the use of Unicode. One of the contributions of this report is a method for detecting whether two strings of Unicode characters are visually confusable. One of the steps is to map visually confusable characters to their prototypes. This mapping is specified in a text file that exists alongside the report. Here’s a line from confusables.txt:
1D6D9 ; 03C8 ; MA # ( 𝛙 → ψ ) MATHEMATICAL BOLD SMALL PSI → GREEK SMALL LETTER PSI #
This line states that the prototype or ‘exemplar’ character for character 0x1D6D9, or MATHEMATICAL BOLD SMALL PSI, is character 0x03C8, or GREEK SMALL LETTER PSI.
TR39 specifies a function skeleton(X), which is comprised of the following 3 steps:
- Convert the input text to Normalization Form D (NFD). This basically means to convert characters with accents, such as 0x212B (Å), into a sequence of unaccented characters and combining accents, such as 0x41 (A) and 0x30A ( ̊).
- Concatenate the prototype character of each character from step 1
- Convert to NFD once more
Two strings X and Y are confusable if skeleton(X) == skeleton(Y).
Introducing erl-tr39
erl-tr39 is the result of a short escript parsing confusables.txt and producing an Erlang module that, among other things, exports a skeleton/1 function. Here is erl-tr39 in action:
Why Developing erl-tr39 was a treat
Developing erl-tr39 was a particularly pleasant experience thanks to some great Erlang features.
Pattern matching
Pattern matching is a very pleasant feature to have in a programming language, and Erlang has LOTS of pattern matching. In Erlang, you can both construct and match integers using different radices. In confusables.txt, characters codes are written in hexadecimal. This means the generated function heads can closely resemble lines from confusables.txt.
% 0340 ; 0300 ; MA # ( ̀ → ̀ ) COMBINING GRAVE TONE MARK → COMBINING GRAVE ACCENT #prototype(16#340) -> 16#0300;
Matching and constructing UTF-8 characters in binaries
UTF-8 is the most common encoding for Unicode characters. It uses between 1 and 4 bytes to represent a character, using the most significant bits of each byte to store some metadata. Here’s a handy table that shows how to encode characters using UTF-8.
Erlang supports constructing and matching UTF-8 in binaries. If you want to construct a UTF-8 encoded binary for character 9733 (BLACK STAR), you could write <<”★”/utf8>>, and you can use the same /utf8 suffix if you want to match a UTF-8 encoded character in a binary.
Thanks to this feature, I didn’t have to spend any of my “code beauty budget” on encoding and decoding UTF-8 characters.
Bit string comprehensions and bit string generators
Now that I can match UTF-8 characters, I am able to write a recursive function to map my tr39:prototype/1 function over every UTF-8 character in a binary.
(NOTE: While this code is nice and short, it might not perform very well on large binaries due to having to reallocate binaries as they grow. It might be better to construct an iolist, or to construct many small binaries and concatenate them at the end with something like iolist_to_binary/1. Benchmarking is your friend in cases like this!)
The above code is pretty clean, but it could become even cleaner with a bit string comprehension:
Erlang list comprehensions are pretty popular, because you can compose generators and filters on the right of the “||”, but it was a pleasant surprise to remember that bit string comprehensions exist.
What’s next?
We currently use the library in production, and, for now, it is sufficient for our purposes. That said, we’re committed to continuing development in the open, and we hope that others may find this library useful.
𝖋𝒾ո
Thanks to several nice Erlang features, creating erl-tr39 was a fun and interesting process. We’re delighted to share erl-tr39 with the FLOSS community, and we hope that others find it useful. Feedback and contributions to erl-tr39 will be met with appreciation and gratitude.