Is Unicode Considered Harmful?

Published in Atlas, Dec 8, 2021

Recently, a set of source-level security issues known as “Trojan Source” vulnerabilities has been discussed (links at the bottom of this article). The Trojan Source research identified two distinct source-level attack vectors. Let’s break down what that means for us as developers, and how we can protect our source code, and our customers, from these vulnerabilities. In this article, I will be using Swift to demonstrate these issues.

Vulnerability one: RTL push/pop in Unicode

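The first vulnerability is based on injecting Unicode right-to-left display controls into otherwise innocuous-looking code. The compiler reads the characters in the order they are stored in the file. An editor or code review tool, however, displays the text left-to-right until it reaches one of these controls, renders what follows right-to-left until the matching pop, and then returns to left-to-right. The code a reviewer sees can therefore be in a different order than the code the compiler processes.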

As an example:

While the if statement starting after the comment looks odd, it’s not invalid code. However, there are some pairs of RTL control characters embedded in this function. They cause the if check to be disregarded, so the check effectively always passes. This is very easy to miss if you are merely glancing over the code.
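To see what a reviewer cannot, it helps to write the control characters as escapes. The following is a minimal sketch under my own assumptions: `hiddenLine` is a hypothetical line modeled loosely on the commenting-out pattern described on the Trojan Source site, and `describeScalars` is a helper name made up for illustration.

```swift
// Bidirectional control characters, written as escapes so they stay visible.
let rlo = "\u{202E}" // RIGHT-TO-LEFT OVERRIDE
let lri = "\u{2066}" // LEFT-TO-RIGHT ISOLATE
let pdi = "\u{2069}" // POP DIRECTIONAL ISOLATE

// Hypothetical line of source. To the compiler, everything between /* and */
// is a comment, including the text "if isAdmin".
let hiddenLine = "/*" + rlo + " } " + lri + "if isAdmin" + pdi + " " + lri + " begin admins only */"

// Hypothetical helper: keeps ASCII as-is and spells out every other scalar
// as <U+XXXX>, so invisible characters show up in the output.
func describeScalars(_ text: String) -> String {
    var output = ""
    for scalar in text.unicodeScalars {
        if scalar.isASCII {
            output.append(Character(scalar))
        } else {
            output += "<U+" + String(scalar.value, radix: 16, uppercase: true) + ">"
        }
    }
    return output
}

print(describeScalars(hiddenLine))
```

In an editor that honors these controls, a line like this can render with the if apparently sitting outside the comment, while the compiler still treats the whole line as a comment.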

Vulnerability two: homoglyphs

The second vulnerability is based on homoglyphs: characters that look similar to one another. Two variables or functions with similar-looking names are used to obscure the true behavior. As an example:

In this example, there are two functions which appear to have the same name. The second, however, uses a homoglyph in place of the ‘h’ character, which makes it hard to tell which of the two functions is actually being called. This is easy to spot if the two functions are next to each other (as in the example), but much more challenging if they are spread out throughout the code.
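The difference becomes obvious once the names are compared as data rather than by eye. This is a minimal sketch with hypothetical names: `latinName` and `lookalikeName` stand in for two identifiers, and I am assuming the lookalike uses U+04BB (CYRILLIC SMALL LETTER SHHA) in place of the Latin ‘h’. The helpers `codePoint` and `nonASCIICodePoints` are made up for illustration.

```swift
// Two names that can look alike on screen. The second starts with
// U+04BB (CYRILLIC SMALL LETTER SHHA) instead of the Latin "h".
let latinName = "hello"
let lookalikeName = "\u{04BB}ello"

// Formats a scalar as U+XXXX, zero-padded to at least four hex digits.
func codePoint(_ scalar: Unicode.Scalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padding = String(repeating: "0", count: max(0, 4 - hex.count))
    return "U+" + padding + hex
}

// Lists the code points in a name that fall outside ASCII.
func nonASCIICodePoints(in name: String) -> [String] {
    return name.unicodeScalars
        .filter { !$0.isASCII }
        .map { codePoint($0) }
}

print(latinName == lookalikeName)                // false
print(nonASCIICodePoints(in: latinName))         // []
print(nonASCIICodePoints(in: lookalikeName))     // ["U+04BB"]
```

Both names have the same length and, in many fonts, a near-identical appearance, but to the compiler they are entirely different identifiers.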

How these vulnerabilities can be exploited

The reason these attack vectors work is that they _look_ like valid source, but hide bad intent behind trickery. Reading code only goes so far if the code looks like it does what we want it to do.

Programmers often copy & paste code from elsewhere, including sites like Stack Overflow. Oftentimes this is entirely rational, because a given problem typically has one well-known solution rather than many. By copying and pasting that code, however, a programmer also copies any hidden characters it contains, and makes their project vulnerable to an exploit in that code. Programmers should be encouraged to implement these canonical solutions themselves, even when working from established patterns and sources. This prevents a vulnerability from being carried in via copy & paste.

Other code is brought into our software projects through selected dependencies; that is to say, software from external and often open-source projects. This is not only a common practice, it is part and parcel of creating modern projects. Many of our current development techniques are founded on this practice of standing on the shoulders of those who have come before us. However, the practice does bring risk, and this vulnerability is another example of how. By carefully vetting dependencies, we can protect our code and our customers. This will mean running a check for these characters on a routine basis, whenever these libraries are updated to new versions.
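Such a check does not need to be elaborate. The sketch below is one possible shape for it, assuming we only look for the embedding, override, and isolate controls (U+202A through U+202E and U+2066 through U+2069). The function names are hypothetical, and reading each dependency file into a string is left to whatever tooling the project already has.

```swift
// True for the bidi embedding, override, and isolate controls:
// U+202A...U+202E and U+2066...U+2069.
func isBidiControl(_ scalar: Unicode.Scalar) -> Bool {
    let value = scalar.value
    return (value >= 0x202A && value <= 0x202E) || (value >= 0x2066 && value <= 0x2069)
}

// Returns the 1-based numbers of the lines that contain at least one
// bidi control character.
func linesWithBidiControls(in source: String) -> [Int] {
    var flagged: [Int] = []
    // isNewline also matches "\r\n", which Swift treats as a single Character.
    let lines = source.split(omittingEmptySubsequences: false, whereSeparator: { $0.isNewline })
    for (index, line) in lines.enumerated() {
        if line.unicodeScalars.contains(where: isBidiControl) {
            flagged.append(index + 1)
        }
    }
    return flagged
}
```

Running something like this over a dependency's sources after each version bump gives a short list of lines to inspect by hand, rather than a pass/fail verdict.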

One way to protect your code

For programmers who do not work in languages expressible only via Unicode, the solution is really quite simple: disallow Unicode (that is, any non-ASCII character) within your source. This comes at a very low cost and has few real-world impacts.

For example, while performing a quick scan across multiple projects, I found only ten characters that warranted using Unicode; these were allowed as project-level exceptions. Going forward, the project teams will be reminded that Unicode isn’t allowed whenever they try to add it. This is accomplished by using a custom SwiftLint rule.
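The logic behind such a rule is small. The following is not the SwiftLint rule itself, just a plain Swift sketch of the same check: flag every non-ASCII character unless it is on a list of agreed exceptions. `NonASCIIFinding`, `nonASCIIFindings`, and the `allowing` parameter are all names I have assumed for illustration.

```swift
// One flagged character and the 1-based line it was found on.
struct NonASCIIFinding: Equatable {
    let line: Int
    let character: Character
}

// Reports every non-ASCII character in the source, skipping any character
// listed in `exceptions` (the project-level allowances).
func nonASCIIFindings(in source: String, allowing exceptions: Set<Character> = []) -> [NonASCIIFinding] {
    var findings: [NonASCIIFinding] = []
    let lines = source.split(omittingEmptySubsequences: false, whereSeparator: { $0.isNewline })
    for (index, line) in lines.enumerated() {
        for character in line where !character.isASCII && !exceptions.contains(character) {
            findings.append(NonASCIIFinding(line: index + 1, character: character))
        }
    }
    return findings
}

// Hypothetical usage: the euro sign is an agreed exception, the accented "e" is not.
let sampleSource = "let price = \"10 \u{20AC}\"\nlet name = \"caf\u{E9}\""
for finding in nonASCIIFindings(in: sampleSource, allowing: ["\u{20AC}"]) {
    print("line \(finding.line): non-ASCII character found")
}
```

Keeping the exceptions explicit means every new non-ASCII character has to be a deliberate decision rather than an accident.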

This solution won’t work for teams that need to routinely use text outside the ASCII character range for their actual work. Those teams could investigate similar rules, or they may need to use a more sophisticated approach. The approach outlined above is definitely a blunt instrument, meant to provide immediate relief and save us from needing to think too hard about the problem.

The more time we spend using our programming tools to prevent ourselves from making mistakes, the more time we can spend focusing on the actual problems we want to solve.

References

https://www.trojansource.codes

https://certitude.consulting/blog/en/invisible-backdoor/

https://unicode.org/reports/tr36/

Written by Russell Mirabelli