Just in case someone wants to point out that I'm wrong (highly possible) you can shout at me on
Friday, December 20, 2024
Wednesday, December 11, 2024
Lost in Translation: Challenges of Internationalisation
I recently presented at Hack::Soho on the subject of internationalisation and the security issues that it can present, covering the Mongolian Vowel Separator problem discussed here.
Come and see what other empires can cause your software's downfall.
Friday, December 6, 2024
Space Invaders
“Many eyeballs make all bugs shallow” – Linus’s Law
Introduction
"Many Eyes Make All Bugs Shallow" is an open-source conceit that comforts people into believing that open-source code is more secure than its closed-source counterpart. The idea is that everyone who uses the code can review the code. However, it can fail on several points.
Firstly, not everyone is looking at the code. We've seen a recent up-tick in developers and users requesting that code maintainers release the .exe or library precompiled on their GitHub pages, suggesting that the users are not interested in reading or understanding how the code works, simply that it works.
Secondly, it requires that the code reviewer can identify security issues. Many are obvious, but some (like Heartbleed for example) are subtle, subtle enough that even static code analysers could not identify the issue at the time.
These are fundamental problems, especially with the heightened attention on supply chain issues. The community recently dodged a bullet with the xz “backdoor”. Open-source code and libraries run major, nay fundamental, components on the wider internet.
So, what if, on top of the two previous issues with "Many Eyes", it is possible not just to obfuscate the security bug, but to make it "invisible"?
Unicode and Whitespaces, Oh My
Let's start by delving into the world of Unicode.
The Unicode standard updates approximately yearly. In most cases you will become aware of any changes when iPhone or Android release an update with *all new* emojis.
However, sometimes existing things are changed, and those changes can lead to funky behaviour regarding older version support: inconsistent behaviour between library versions, and/or misunderstandings when dealing with certain characters.
As we are looking for “invisible” characters, formatting characters and whitespace characters are the obvious place to start. Better still if we can find such a character that has had an interesting history within the Unicode standard…
Mongolian Vowel Separator
Introducing the Mongolian Vowel Separator (MVS) https://unicode-explorer.com/c/180E
- Codepoint - U+180E
- Block - Mongolian
- Category - Cf/Other, format
However, this was not always the case. Many moons ago the Mongolian Vowel Separator was first introduced in Unicode 3.0.0 in the Cf category. In version 4.0.0 it was moved to the Zs (Space Separator) category, before finally moving back into Cf in version 6.3.0.
The reason for these changes probably relates to the purpose of the MVS. Typographically it should only be used before the word-final vowels Mongolian Letter A (U+1820) and Mongolian Letter E (U+1821). It determines the specific form of the characters preceding it and produces a small gap in the word. But it is not a “white space” character in the traditional sense of a word terminator or separator; instead it is used within a given word to change its display and ultimately its meaning.
Zero Width Space
The Zero Width Space (ZWS) https://unicode-explorer.com/c/200B
- Codepoint - U+200B
- Block – General Punctuation
- Category - Cf/Other, format
The Zero Width Space has not had the same history of bouncing between categories, but has similar properties to the MVS in that, whilst nominally a space character, it too does not have a “size” in the typical sense. Instead it is intended for invisible word separation and line break control. Unlike the MVS, typographically, the Zero Width Space is a word-terminating or word-separating character.
Other whitespace characters are available (https://en.wikipedia.org/wiki/Whitespace_character) but we will focus on these two.
Experimenting with the Mongolian Vowel Separator
Ok, so we have some interesting characters. How, then, can they be used to hide a bug in code? I spent most of my developer life writing in C, so that’s where I started, and I came up with the following innocuous code sample:
    #include <stdio.h>
    #include <stdlib.h>

    int main (void)
    {
        int intadmin = 1;
        // clear the admin flag
        intadmin = 0;
        if(intadmin == 1)
        {
            printf("you are admin\n");
        }
        return 0;
    }
At first glance this looks innocent enough: we initialise intadmin to 1. Admittedly not an ideal or secure initialisation, but the following line looks to reset the value to 0 before it is checked to see if you are, in fact, admin. By visual inspection and walkthrough, you should not be admin.
However, on line 8, which appears to read:
intadmin = 0;
there is a Mongolian Vowel Separator between int and admin:
int[U+180E]admin = 0
Given this I expected the compiler to behave in one of three ways:
- Spit out an error and fail to compile the code.
- Ignore the MVS character and update the real intadmin to 0, therefore not granting admin access.
- Treat the MVS as a space, therefore dealing with line 8 as
int admin = 0;
leaving us with admin access as the real intadmin remains at 1.
There are two obvious compilers for C code within the Linux environment, gcc and clang.
So, testing with gcc (version: 13.2.0 (Debian 13.2.0-13)) with no flags, we get the following output when compiling the Mongolian Vowel Separator code:
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ gcc MVS_test.c
    MVS_test.c: In function ‘main’:
    MVS_test.c:10:12: error: stray ‘\341’ in program
       10 |         int<U+180E>admin = 0;
          |            ^~~~~~~~
So, result 1: an error, which obviously does not help us. Let's try with clang (version: Debian clang version 16.0.6 (19)):
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ clang MVS_test.c
    MVS_test.c:10:5: warning: treating Unicode character as whitespace [-Wunicode-whitespace]
    int<U+180E>admin = 0;
       ^~~~~~~~
    1 warning generated.
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ ./a.out
    you are admin
Result 3. Double win! Not only does it compile, it also executes as we had hoped! The code looks like one thing and compiles like something else. Additionally, clang helpfully names the warning flag, which tells us how to silence the warning that could give the game away:
clang -Wno-unicode-whitespace MVS_test.c
Works silently. So, what we see is that clang treats the MVS as a whitespace separator, ultimately treating line 8 as
int admin = 0;
and, as a result, the if comparison evaluates to true and we are, sneakily, admin.
For completeness, the same code tested in Visual Studio 2022 fails to compile.
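As an aside, the ‘\341’ in gcc's error is not arbitrary. The sketch below assumes the source file was saved as UTF-8 (the encoding is not stated above) and shows where the number comes from: U+180E encodes to the three bytes E1 A0 8E, and 0xE1 is 341 in octal. The helper name utf8_encode is made up for this illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is not a valid scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                      /* surrogates are not encodable */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Calling utf8_encode(0x180E, buf) fills buf with E1 A0 8E, and utf8_encode(0x200B, buf) gives E2 80 8B for the Zero Width Space. Neither character is a single byte on disk, which is why a compiler that does not understand the sequence complains about the first stray byte.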
Experimenting with the Zero Width Space
Now, let us see what happens if we were to replace the MVS with a Zero Width Space. Now we get the following:
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ gcc ZWS_test.c
    ZWS_test.c: In function ‘main’:
    ZWS_test.c:8:9: error: ‘intadmin’ undeclared (first use in this function); did you mean ‘intadmin’?
        8 |         intadmin = 0;
          |         ^~~~~~~~
          |         intadmin
    ZWS_test.c:8:9: note: each undeclared identifier is reported only once for each function it appears in
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ clang ZWS_test.c
    ZWS_test.c:8:5: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
    int<U+200B>admin = 0;
    ^~~~~~~~
    ZWS_test.c:8:2: error: use of undeclared identifier 'intadmin'; did you mean 'intadmin'?
    int<U+200B>admin = 0;
    ^~~~~~~~~~~~~~~~
    intadmin
    ZWS_test.c:6:6: note: 'intadmin' declared here
    int intadmin = 1;
    ^
    1 warning and 1 error generated.
Both compilers fail to compile. But looking at the error, the compiler is not so much concerned with the odd character, but rather that int<U+200B>admin is an undefined variable.
This means that the Zero Width Space is being treated as part of the identifier name, albeit an invisible part.
So now we can have visual confusion with identical looking identifiers (by visual inspection) such as:
    #include <stdio.h>
    #include <stdlib.h>

    // intadmin contains a Zero Width Space U+200B

    int main (void)
    {
        int intadmin = 1;
        int intadmin = 0;
        // clear the admin flag
        intadmin = 0;
        if(intadmin == 1)
        {
            printf("you are admin\n");
        }
        return 0;
    }
Now lines 9 and 11
int intadmin = 0;
and
intadmin = 0;
both have Zero Width Spaces at int<U+200B>admin and both clang and gcc compilers will compile and run the confusing code.
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ clang ZWS_test.c
    ZWS_test.c:9:9: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
    int int<U+200B>admin = 0;
    ^~~~~~~~
    ZWS_test.c:11:5: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
    int<U+200B>admin = 0;
    ^~~~~~~~
    2 warnings generated.
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ ./a.out
    you are admin
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ gcc ZWS_test.c
    ┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
    └─$ ./a.out
    you are admin
with clang producing the same -Wunicode-zero-width warning that we saw in the failed compile, and gcc producing no warning at all.
In summary, clang treats the Mongolian Vowel Separator as a hidden space/word separator and both gcc and clang will treat the Zero Width Space as just a character within an identifier. Almost the exact opposite behaviour of these characters in their typographical function in written words.
Obviously the Zero Width Space example shown here makes it obvious that there are two intadmin variables, but in a larger, more complex codebase, it would be trivial to hide the declaration of the second, malicious version of any given identifier.
Taking it to the Extreme
Because the Zero Width Space is treated as an identifier character, we can take this to an obvious extreme by simply using different numbers of ZWSs as our unique identifiers. We can then have code that looks like:
    #include <stdio.h>
    #include <stdlib.h>
    #define  printf

    int main (void)
    {
        int  = 1;
        int  = 0;
        int  = 2;
         = ;
        if( == )
        {
            ("what's going on?\n");
        }
        return 0;
    }
Which compiles and runs:
    ┌──(kali㉿kali)-[~/Projects/Internationalisation/Clang]
    └─$ gcc test2.c
    ┌──(kali㉿kali)-[~/Projects/Internationalisation/Clang]
    └─$ ./a.out
    what's going on?
Here we can see the actual “names” of the identifiers made up purely of larger and larger numbers of Zero Width Spaces.
This has been identified as a risk in the Unicode standard: https://www.unicode.org/reports/tr39/#Identifier_Characters states that identifier characters should have the XID_Start or XID_Continue property, which neither the Zero Width Space nor the Mongolian Vowel Separator has (https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt).
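Since the eye cannot be trusted with these characters, one option is to look at the bytes instead. The following is a rough sketch rather than a complete tool: it assumes the source is UTF-8, it only looks for the two characters discussed in this post, and it reports columns in code points (which may not match a given compiler's column numbering). The names hidden_hit and find_hidden are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record describing one hidden character found in a buffer. */
typedef struct {
    size_t line;        /* 1-based line number */
    size_t column;      /* 1-based column, counted in code points */
    unsigned codepoint; /* 0x180E (MVS) or 0x200B (ZWS) */
} hidden_hit;

/* Scan a UTF-8 buffer for U+180E and U+200B.
 * Stores up to max_hits results in hits and returns the total number found. */
size_t find_hidden(const char *buf, size_t len, hidden_hit *hits, size_t max_hits)
{
    static const unsigned char MVS[3] = {0xE1, 0xA0, 0x8E}; /* U+180E */
    static const unsigned char ZWS[3] = {0xE2, 0x80, 0x8B}; /* U+200B */
    size_t found = 0, line = 1, column = 1, i = 0;

    assert(buf != NULL);
    assert(hits != NULL || max_hits == 0);

    while (i < len) {
        unsigned cp = 0;

        if (len - i >= 3) {
            if (memcmp(buf + i, MVS, 3) == 0) {
                cp = 0x180E;
            } else if (memcmp(buf + i, ZWS, 3) == 0) {
                cp = 0x200B;
            }
        }

        if (cp != 0) {
            if (found < max_hits) {
                hits[found].line = line;
                hits[found].column = column;
                hits[found].codepoint = cp;
            }
            found++;
            column++;   /* the hidden character still occupies one column */
            i += 3;
        } else {
            unsigned char b = (unsigned char)buf[i];
            if (b == '\n') {
                line++;
                column = 1;
            } else if ((b & 0xC0) != 0x80) {
                column++;   /* skip UTF-8 continuation bytes when counting */
            }
            i++;
        }
    }
    return found;
}
```

Run over the contents of a file like MVS_test.c, a non-zero return value would be a reason to look much more closely at the reported lines. When writing test strings in C, note that a hex escape such as "\x8E" must be split from following text ("\x8E" "admin"), otherwise the letters a and d are swallowed as hex digits.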
Other Languages
I started this research using C as it’s the language I am most familiar with. However, there’s more to life and programming than C, so I started to look at other languages to see how they behave.
After extensive research (and in some cases quickly learning the syntax) I found that most languages will simply fail to compile or run when they encounter either the Zero Width Space or the Mongolian Vowel Separator. The languages tested were:
- Python
- Rust
- Go
- Perl
- Javascript (Node)
I also experimented briefly with shell scripting languages
- Bash
- Powershell
- Cmd – batch commands
All of which also failed.
So, let's see some of the more interesting results.
C#
Failed, but in an interesting manner, in that it actually ignores both the Zero Width Space and Mongolian Vowel Separator.
The code:
    internal class MVS_test
    {
        // intadmin contains a Mongolian Vowel Separator
        private static void Main(string[] args)
        {
            int intadmin = 1;
            // Clear the admin flag
            intadmin = 0;
            if (intadmin == 1)
            {
                Console.WriteLine("you are admin");
            }
        }
    }
Replicates the C example with a Mongolian Vowel Separator in the line:
intadmin = 0;
between int and admin. However, the compiler simply ignored the character and “correctly” cleared the intadmin flag, so you were not admin.
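One way to picture the C# result is a compiler that drops these characters before comparing identifiers. The sketch below is only a model of that observed outcome, written in C to match the earlier examples; it is not how the C# compiler is implemented, and the function names skip_hidden and identifiers_match_ignoring_hidden are invented for illustration. It assumes NUL-terminated UTF-8 strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Advance i past any run of U+180E (E1 A0 8E) or U+200B (E2 80 8B) in s.
 * strncmp stops at the terminating NUL, so this never reads past the string. */
static size_t skip_hidden(const char *s, size_t i)
{
    while (strncmp(s + i, "\xE1\xA0\x8E", 3) == 0 ||
           strncmp(s + i, "\xE2\x80\x8B", 3) == 0) {
        i += 3;
    }
    return i;
}

/* Return 1 if a and b are the same identifier once the two hidden
 * characters are ignored, 0 otherwise. */
int identifiers_match_ignoring_hidden(const char *a, const char *b)
{
    size_t i = 0, j = 0;

    assert(a != NULL && b != NULL);

    for (;;) {
        i = skip_hidden(a, i);
        j = skip_hidden(b, j);
        if (a[i] != b[j]) {
            return 0;   /* visible bytes differ */
        }
        if (a[i] == '\0') {
            return 1;   /* both strings ended together */
        }
        i++;
        j++;
    }
}
```

Under this model "intadmin" and "int<U+180E>admin" are the same name, so the assignment clears the real flag, which matches what was seen with C#. A plain strcmp, by contrast, sees two different byte strings, which is the behaviour gcc and clang showed with the Zero Width Space.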
Java
Java code treats both the Mongolian Vowel Separator and Zero Width Space as an identifier character, allowing for code like:
    public class MVS_Test {

        public static void main(String[] args) {
            int intadmin = 1;
            int intadmin = 0;

            // clear the intadmin flag
            intadmin = 0;
            if(intadmin == 1)
            {
                System.out.println("you are admin\n");
            }
        }
    }
Much like its C language equivalent, lines 5 and 8 can contain either the Mongolian Vowel Separator or Zero Width Space, such that compiling and running the code will produce the output
you are admin
Ruby
Ruby also treats both characters as identifier characters, and has the additional language property that you do not declare a variable before first use. This gives an attacker a degree of flexibility that makes the code look more natural:
    @admin = 1;
    # clear the admin flag
    @admin = 0;
    if(@admin == 1)
        puts("you are admin");
    end
Line 3 can contain our Zero Width Space or Mongolian Vowel Separator and thus creates a new variable @ad<U+200B>min that is not the one used in the if statement. As a result you are, despite the code's appearance, “admin”.
Swift
Swift fails to compile code containing the Mongolian Vowel Separator. However, in line with the other “working” examples, it treats the Zero Width Space as an identifier character, allowing for the following example:
    var isadmin = 1;
    var isadmin = 0;
    //clear isadmin
    isadmin = 0;
    if(isadmin == 1)
    {
        print("You are Admin")
    }
where lines 2 and 4 contain the variable
is<U+200B>admin
and as before you are admin. And again as before an attacker would have to hide the definition of the “malicious” variable, but in a large enough code-base, this could be trivial.
Language Results:
So from our experimentation with other languages, we see the following results:
1. The character in question is treated as an invisible whitespace
2. The character in question is treated as part of the identifier (e.g. makes a new identifier)
3. The code fails to compile/run
4. The characters are ignored

| Language | Version | MVS behaviour | ZWS behaviour |
|---|---|---|---|
| C (Clang) | Debian clang version 16.0.6 (19) | 1 | 2 |
| C (gcc) | gcc version 13.2.0 (Debian 13.2.0-13) | 3 | 2 |
| C (VS) | 19.39.33519 for x64 | 3 | 3 |
| C# | 4.9.0-3.24081.11 (98911739) | 4 | 4 |
| Java | JavaSE 17 | 2 | 2 |
| Python | Python 3.12.2 (main, Feb 7 2024, 20:47:03) [GCC 13.2.0] on linux | 3 | 3 |
| Ruby | ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu] | 2 | 2 |
| Rust | rustc 1.70.0 | 3 | 3 |
| Go | go version go1.21.7 linux/amd64 | 3 | 3 |
| Swift | 5.10.1 | 3 | 2 |
| Perl | v5.38.2 | 3 | 3 |
| Javascript (node) | v18.19.1 | 3 | 3 |
Editor Behaviour/Syntax highlighting
So, whilst an attacker can create malicious but innocent-looking code, the attacker has to deal with the problem of syntax-highlighting code editors. We now investigate how editors show (or do not show) these hidden characters. The editors were tested using their default syntax highlighting for the appropriate language. Switching between light/dark mode or to alternative themes was not tested, so your mileage may vary.
Visual Studio Code
Visual Studio Code does a good job of indicating that something is different with the code. Simply opening the file, we see that the Mongolian Vowel Separator is highlighted:
And can be seen when the mouse hovers over:
The same behaviour occurs when dealing with the Zero Width Space.
There is an option to disable highlighting of invisible characters, but this does not change the syntax highlighting which indicates a difference between int (blue) and admin (white/light grey)
Visual Studio
Whilst the attempts to compile the code in Visual Studio 2022 failed with error conditions, it’s still worth seeing if the syntax highlighting would spot the code. If we open the file directly within Visual Studio (not as part of an existing project) we see:
The syntax highlighting seems to differentiate between the two intadmin types, and when the file is included in a project:
It becomes more obvious that something is wrong with the code. These results are replicated when using the Zero Width Space character as well.
Notepad++
By default we see the hidden character quite obviously:
However, within Notepad++ there is an option View->Show Symbol->Show Non-printing Characters. If this is disabled we see the following:
Vi/Vim
Vi and VIM show us the Mongolian Vowel Separator and Zero Width Space as:
Emacs
Emacs, on the other hand, does not show us the Mongolian Vowel Separator or Zero Width Space character, but makes it obvious by way of syntax highlighting:
Eclipse
Eclipse is typically the domain of Java code, so looking at the working Java example within Eclipse we see:
Both the Zero Width Space and Mongolian Vowel Separator are not visible, nor is there any difference in syntax highlight to indicate that something is up with the code.
Clearly Eclipse is the editor of choice for hiding our malicious code in Java.
Code Repositories
So far we have identified languages and characters that allow us to create code that looks one way and acts another, giving a bad actor the ability to hide malicious code or a potential backdoor within a codebase. The next obvious question is: can we put our bad code somewhere where it will not be seen, but will still be used? We must therefore look at code repositories. Here we shall investigate three:
- GitHub (home of 28 million public repositories)
- GitLab
- BitBucket
GitHub
GitHub has a desktop application that allows developers to manage their repositories and push changes up to Github.com. The tool allows the user the ability to review the history of any file and the changes made to them.
Looking at our malicious example:
and zooming in to the interesting part:
The syntax highlighting here does not indicate in any way that line 10 contains our evil Mongolian Vowel Separator character. The same is true for the code with the Zero Width Space:
The Github.com website itself has a number of themes that can affect the syntax highlighting colours used, but there are two “defaults”, Light Default and Dark Default. I tend to work with the dark theme for most things, so viewing our code we can see:
There is a very subtle change on line 10 between the int (light grey) and admin (white) which is likely to go un-noticed.
The in-built editor mode, however, does not have this subtle change; it does not show any differences at all:
So, there is scope for hiding our Mongolian Vowel Separator in code stored and published on GitHub.
GitLab
GitLab uses VS Code as its web editor, so it highlights the hidden character when editing files stored in GitLab:
However the code display exhibits a similar problem to GitHub, in that the “Light” syntax highlighting themes may be too subtle to spot any oddities:
The Dark themes make the code differences more obvious. This behaviour is the same when dealing with the Zero Width Space.
When viewing the committed change, the code is syntax highlighted, but like the viewer, the highlighting is subtle, and hard to spot:
Bitbucket
During the initial research the Bitbucket Editor did not highlight the syntax in its default mode:
Making it impossible to spot the hidden character by differences in the syntax highlighting.
The viewer, however, shows a subtle difference (the int keyword is slightly bolder):
But again, this may be missed.
Since reporting this to Atlassian, they have altered the syntax highlighting in the viewer:
However there is no change in the editor.
There is a more obvious difference when using the Zero Width Space, the editor clearly shows the hidden character:
However, the viewer exhibits the same behaviour as it does when handling the Mongolian Vowel Separator.
The committed change does not have any syntax highlighting visible, so a reviewer performing a code review would likely miss the hidden character.
Results
| | C | Ruby | Swift | Java |
|---|---|---|---|---|
| GitHub Desktop App | Malicious code hidden | Malicious code hidden | Syntax highlighting is obvious | Malicious code hidden |
| GitHub | Viewer – very subtle syntax highlighting; Editor – malicious code hidden | Viewer – malicious code hidden; Editor – malicious code hidden | Viewer – malicious code hidden; Editor – syntax highlighting is obvious | Viewer – malicious code hidden in light mode, very subtle syntax highlighting in dark mode; Editor – syntax highlighting is obvious |
| GitLab | Viewer – very subtle syntax highlighting in light mode, more obvious in dark mode; Editor – inline VS Code highlights missing character | Viewer – syntax highlighting is obvious; Editor – inline VS Code highlights missing character | Viewer – very subtle syntax highlighting in light mode, more obvious in dark mode; Editor – inline VS Code highlights missing character | Viewer – very subtle syntax highlighting in light mode, more obvious in dark mode; Editor – inline VS Code highlights missing character |
| Bitbucket | Viewer – syntax highlighting is obvious; Editor – MVS hidden, ZWS highlighted | Viewer – malicious code hidden; Editor – MVS hidden, ZWS highlighted | Viewer – very subtle syntax highlighting; Editor – MVS hidden, ZWS highlighted | Viewer – malicious code hidden; Editor – MVS hidden, ZWS highlighted |
In most cases the repository viewers either highlight the difference very subtly (so it could pass a visual code inspection) or leave the character invisible to the naked eye.
When editing on the websites, only GitLab, by using Visual Studio Code, consistently shows the hidden characters.
Prior Work
This research was inspired by two previous works, firstly Trojan Source (https://trojansource.codes/) where Unicode bi-directional control characters were introduced into source code such that the code that was being read by a human (at say a code/pull request review stage) does not match the code that the compiler will ultimately compile. The classic example:
It contains strategically placed bidirectional control characters so that you are, in fact, admin.
Related work considers the use of homoglyph attacks with identifiers, where similar-looking characters are swapped in to add visual confusion, e.g. replacing Latin characters with their Cyrillic equivalents (https://www.irongeek.com/homoglyph-attack-generator.php helps with such attempts). Trojan Source also considered “invisible characters”, without specifying which “invisible characters” were used, and noted in the original paper that such attacks failed. Here, I believe we show, given specific characters, success across several languages.
The second piece of work is a blog post from 2014 (https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/) mentioning the use of the Mongolian Vowel Separator within identifiers in C# code, and how the two different compilers (csc and Roslyn) handle them. It also identified the unusual history of the Mongolian Vowel Separator as a sometimes whitespace, sometimes control character.
Conclusions
By using unusual "whitespace" characters like the Mongolian Vowel Separator and the Zero Width Space it is possible to create code that looks, by visual inspection, like one thing, but the compiler behaviour and end results are different. Enough that many eyeballs can miss the maliciously inserted characters.
The languages that are affected by this issue are:
- C
- Ruby
- Swift
- Java
As a developer, thankfully most IDEs either show "odd" syntax highlighting or highlight the "invisible" character itself. Some editors can be configured to hide this, but in most cases the highlighting is on by default. The only editor that "fails" is Eclipse when dealing with Java code.
But if an attacker can get their code uploaded to one of the main code repositories, either by a malicious pull request to an existing repo or by posting an interesting, innocent-looking library, there is every chance that, given the lack of appropriate syntax/hidden character highlighting, the malicious code will not be spotted.
So there is scope for a very subtle, but potentially devastating supply chain attack that can bypass the many eyes problem as developers look through the code in their favourite code repository.
“Sometimes, magic is just someone spending more time on something than anyone else might reasonably expect.” - Teller
Reporting Timeline
- 27/03/2024 – Report issue to GitHub, Atlassian (BitBucket), GitLab
- 27/03/2024 – GitHub issue marked as “Low Risk”
- 28/03/2024 – GitLab issue marked as duplicate, previously reported 28/07/2021
- 25/04/2024 – Atlassian response – Considering marking as Won't Fix/Informational
- 28/05/2024 – Atlassian confirmed to handle internally.
- 01/10/2024 – Verified Atlassian has resolved some instances.
Links and References
- https://en.wikipedia.org/wiki/Linus%27s_law
- https://deliciousbrains.com/how-unicode-works/
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
- https://unicode-explorer.com/c/180E (Mongolian Vowel Separator)
- https://unicode-explorer.com/c/200B (Zero Width Space)
- https://en.wikipedia.org/wiki/Whitespace_character
- https://www.unicode.org/reports/tr39/#Identifier_Characters
- https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
- https://trojansource.codes/
- https://en.wikipedia.org/wiki/Trojan_Source
- https://www.irongeek.com/homoglyph-attack-generator.php
- https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/
Code Repositories
The following code repositories show the “malicious” code in various languages, containing their appropriate whitespace character.
- https://github.com/Parttimesecguy/SpaceInvaders
- https://gitlab.com/parttimesecguy1/Internationalisation
- https://bitbucket.org/internationalspaceinvaders/internationlisation/src/main/