Part Time Security Guy

“Many eyeballs make all bugs shallow” – Linus’s Law

“A magician makes the visible invisible…” – Marcel Marceau

Introduction

The "Many Eyes Make All Bugs Shallow" idea is an open-source conceit that comforts people into believing that open-source code is more secure than its closed-source counterpart. The idea is that everyone who uses the code can review the code. However, it can fail on several points.

Firstly, not everyone is looking at the code. We've seen a recent uptick in developers and users requesting that code maintainers release a precompiled .exe or library on their GitHub pages, suggesting that these users are not interested in reading or understanding how the code works, only that it works.

Secondly, it requires that the code reviewer can identify security issues. Many are obvious, but some (like Heartbleed for example) are subtle, subtle enough that even static code analysers could not identify the issue at the time.

These are fundamental problems, especially with the heightened attention on supply chain issues. The community recently dodged a bullet with the xz “backdoor”. Open-source code and libraries run major, nay fundamental, components on the wider internet.

So, what if, on top of the two previous issues with "Many Eyes", it is possible not just to obfuscate the security bug, but to make it "invisible"?

Unicode and Whitespaces, Oh My

Let's start by delving into the world of Unicode.

The Unicode standard is updated approximately yearly. In most cases you will become aware of any changes when iPhone or Android release an update with *all new* emojis.

However, sometimes existing characters are changed. Those changes can lead to funky behaviour: older and newer library versions may treat the same character differently, and developers may misunderstand how certain characters are handled.

As we are looking for “invisible” characters, formatting characters and whitespace characters are an obvious place to start. If we can also find such a character that has had an interesting history within the Unicode standard…

Mongolian Vowel Separator

Introducing the Mongolian Vowel Separator (MVS) https://unicode-explorer.com/c/180E

Codepoint - U+180E
Block - Mongolian
Category - Cf/Other, format

However, this was not always the case. Many moons ago the Mongolian Vowel Separator was first introduced in Unicode 3.0.0 in the Cf category. In version 4.0.0 it was moved to the Zs (Space Separator) category, before finally moving back into Cf in version 6.3.0.

The reason for these changes probably relates to the purpose of the MVS. Typographically it should only be used before the word-final vowels Mongolian Letter A (U+1820) and Mongolian Letter E (U+1821), where it determines the specific form of the characters preceding it and produces a small gap in the word. But it is not a “white space” character in the traditional sense of a word terminator or separator. Instead it is used within a given word to change its display and ultimately its meaning.

For example, ᠬᠠᠷ᠎ᠠ qar a 'black' and ᠬᠠᠷᠠ qara 'to look’.

Zero Width Space

The Zero Width Space (ZWS) https://unicode-explorer.com/c/200B

Codepoint - U+200B
Block - General Punctuation
Category - Cf/Other, format

The Zero Width Space has not had a history of changing categories, but it has similar properties to the MVS: whilst being a whitespace character, it does not have a “size” in the typical sense. Instead it is intended for invisible word separation and line-break control. Unlike the MVS, typographically, the Zero Width Space is a word-terminating or separating whitespace.

Other whitespace characters are available (https://en.wikipedia.org/wiki/Whitespace_character) but we will focus on these two.

Experimenting with the Mongolian Vowel Separator

OK, so we have some interesting characters. How, then, do they tie into hiding something in code? I spent most of my developer life writing in C, so that’s where I started, and I came up with the following innocuous-looking code sample:

#include <stdio.h>
#include <stdlib.h>
int main (void)
{
int intadmin = 1;
// clear the admin flag
int᠎admin = 0;
if(intadmin == 1)
{
printf("you are admin\n");
}
return 0;
}

At first glance, this looks innocent enough: we initialise intadmin to 1. Admittedly that is not an ideal or secure initialisation, but the following line looks to reset the value to 0 before it is checked to see if you are, in fact, admin. By visual inspection and walkthrough, you should not be admin.

However, in the line which looks to read:

intadmin = 0;

There is a Mongolian Vowel Separator between int and admin:

int[U+180E]admin = 0

Given this I expected the compiler to behave in one of three ways:

1. Spit out an error and fail to compile the code.
2. Ignore the MVS character and update the real intadmin to 0, therefore not granting admin access.
3. Treat the MVS as a space, therefore dealing with the line as

int admin = 0;

leaving us with admin access, as the real intadmin remains at 1.

There are two obvious compilers for C code within the Linux environment, gcc and clang.

So, testing with gcc (version: 13.2.0 (Debian 13.2.0-13)) with no flags, we get the following output when compiling the Mongolian Vowel Separator code:

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ gcc MVS_test.c
MVS_test.c: In function ‘main’:
MVS_test.c:10:12: error: stray ‘\341’ in program
10 | int<U+180E>admin = 0;
| ^~~~~~~~

So, result 1: an error, which obviously does not help us. Let's try with clang (version: Debian clang version 16.0.6 (19)):

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ clang MVS_test.c
MVS_test.c:10:5: warning: treating Unicode character as whitespace [-Wunicode-whitespace]
int<U+180E>admin = 0;
^~~~~~~~
1 warning generated.

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ ./a.out
you are admin

Result 3. Double win! Not only does it compile, but it executes as we had hoped! The code looks like one thing and compiles like something else. Additionally, clang helpfully tells us the name of the warning, and therefore how to silence the one message that could give the game away:

clang -Wno-unicode-whitespace MVS_test.c

Works silently. So, what we see is that clang treats the MVS as a whitespace separator, ultimately treating the line as

int admin = 0;

which declares a brand new variable named admin and leaves intadmin untouched at 1. As a result, the if comparison evaluates to true and we are, sneakily, admin.

For completeness, the same code tested in Visual Studio 2022 fails to compile.

Experimenting with the Zero Width Space

Now, let us see what happens if we replace the MVS with a Zero Width Space. We get the following:

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ gcc ZWS_test.c
ZWS_test.c: In function ‘main’:
ZWS_test.c:8:9: error: ‘intadmin’ undeclared (first use in this function); did you mean ‘intadmin’?
8 | intadmin = 0;
| ^~~~~~~~
| intadmin
ZWS_test.c:8:9: note: each undeclared identifier is reported only once for each function it appears in

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ clang ZWS_test.c
ZWS_test.c:8:5: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
int<U+200B>admin = 0;
^~~~~~~~
ZWS_test.c:8:2: error: use of undeclared identifier 'intadmin'; did you mean 'intadmin'?
int<U+200B>admin = 0;
^~~~~~~~~~~~~~~~
intadmin
ZWS_test.c:6:6: note: 'intadmin' declared here
int intadmin = 1;
^
1 warning and 1 error generated.

Both compilers fail to compile. But looking at the errors, the compilers are not so much concerned with the odd character, but rather that int<U+200B>admin is an undeclared variable.

This means that the Zero Width Space is being treated as part of the identifier name, albeit an invisible part.

So now we can have visual confusion with identical looking identifiers (by visual inspection) such as:

#include <stdio.h>
#include <stdlib.h>

// intadmin contains a Zero Width Space U+200B

int main (void)
{
int intadmin = 1;
int intadmin = 0;
// clear the admin flag
intadmin = 0;
if(intadmin == 1)
{
printf("you are admin\n");
}
return 0;

}

Now lines 9 and 11

int intadmin = 0;

and

intadmin = 0;

both have Zero Width Spaces at int<U+200B>admin and both clang and gcc compilers will compile and run the confusing code.

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ clang ZWS_test.c
ZWS_test.c:9:9: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
int int<U+200B>admin = 0;
^~~~~~~~
ZWS_test.c:11:5: warning: identifier contains Unicode character <U+200B> that is invisible in some environments [-Wunicode-zero-width]
int<U+200B>admin = 0;
^~~~~~~~
2 warnings generated.
┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ ./a.out
you are admin

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ gcc ZWS_test.c

┌──(kali㉿kali)-[/home/kali/SpaceInvaders]
└─$ ./a.out
you are admin

with clang producing the same -Wunicode-zero-width warning that we saw in the failed build above.

In summary, clang treats the Mongolian Vowel Separator as a hidden space/word separator, and both gcc and clang treat the Zero Width Space as just a character within an identifier. This is almost the exact opposite of how these characters function typographically in written words.

The Zero Width Space example shown here makes it plain that there are two intadmin variables, but in a larger, more complex codebase it would be trivial to hide the declaration of the second, malicious version of any given identifier.

Taking it to the Extreme

Because the Zero Width Space is treated as an identifier character, we can take this to an obvious extreme by simply using different numbers of ZWSs as our unique identifiers. We can have code that looks like:

#include <stdio.h>
#include <stdlib.h>
#define printf
int main (void)
{
int = 1;
int = 0;
int = 2;
=;

if(==)
{
("what's going on?\n");
}
return 0;
}

Which compiles and runs:

┌──(kali㉿kali)-[~/Projects/Internationalisation/Clang]
└─$ gcc test2.c

┌──(kali㉿kali)-[~/Projects/Internationalisation/Clang]
└─$ ./a.out
what's going on?

What is going on is this:

Here we can see the actual “names” of the identifiers made up purely of larger and larger numbers of Zero Width Spaces.

This has been identified as a risk in the Unicode standard: https://www.unicode.org/reports/tr39/#Identifier_Characters states that identifiers should have the XID_Start and XID_Continue properties, which neither the Zero Width Space nor the Mongolian Vowel Separator has (https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt).

Other Languages

I started this research using C as it’s the language I am most familiar with. However, there’s more to life and programming than C, so I started to look at other languages to see how they behave.

After extensive research (and in some cases quickly learning the syntax) I found that most languages will simply fail to compile when they encounter either the Zero Width Space or the Mongolian Vowel Separator. The languages tested were:

Python
Rust
Go
Perl
JavaScript (Node)

I also experimented briefly with shell scripting languages:

Bash
PowerShell
Cmd – batch commands

All of which also failed.

So, let's see some of the more interesting results.

C#

Failed, but in an interesting manner, in that it actually ignores both the Zero Width Space and the Mongolian Vowel Separator.

The code:

internal class MVS_test
{
// intadmin contains a Mongolian Vowel Separator
private static void Main(string[] args)
{
int intadmin = 1;
// Clear the admin flag
int᠎admin = 0;
if (intadmin == 1)
{
Console.WriteLine("you are admin");
}
}
}

Replicates the C example with a Mongolian Vowel Separator in the line:

int᠎admin = 0;

between int and admin. However, the compiler simply ignored the character and “correctly” cleared the intadmin flag, so you were not admin.

Java

Java code treats both the Mongolian Vowel Separator and Zero Width Space as an identifier character, allowing for code like:

public class MVS_Test {
   public static void main(String[] args) {

      int intadmin = 1;
   int int᠎admin = 0;

   // clear the intadmin flag
   int᠎admin = 0;
   if(intadmin == 1)
   {
   System.out.println("you are admin\n");
   }
   }
}

Much like its C language equivalent, lines 5 and 8 can contain either the Mongolian Vowel Separator or the Zero Width Space, such that compiling and running the code will produce the output

you are admin

Ruby

Ruby also treats both characters as identifier characters, and has the additional language property that you do not define the type of a variable upon first use. This allows an attacker a degree of flexibility that makes the code look more natural:

@admin = 1;
# clear the adminflag
@adm᠎in= 0;
if(@admin == 1)
puts("you are admin");
end

Line 3 can contain our Zero Width Space or Mongolian Vowel Separator and thus creates a new variable @ad<U+200B>min that is not the one used in the if statement. As a result you are, despite the code's appearance, “admin”.

Swift

Swift fails to compile code that contains the Mongolian Vowel Separator. However, in line with the other “working” examples, it treats the Zero Width Space as an identifier character, allowing for the following example:

var isadmin = 1;
var isadmin = 0;
//clear isadmin
isadmin = 0;
if(isadmin == 1)
{
print("You are Admin")
}

Where lines 2 and 4 contain the variable

is<U+200B>admin

and as before you are admin. Again, an attacker would have to hide the definition of the “malicious” variable, but in a large enough codebase this could be trivial.

Language Results:

So from our experimentation with other languages, we see the following results:

1. The character in question is treated as an invisible whitespace
2. The character in question is treated as part of the identifier (e.g. makes a new identifier)
3. The code fails to compile/run
4. The characters are ignored

Language	Version	MVS behaviour	ZWS behaviour
C (Clang)	Debian clang version 16.0.6 (19)	1	2
C (gcc)	gcc version 13.2.0 (Debian 13.2.0-13)	3	2
C (VS)	19.39.33519 for x64	3	3
C#	4.9.0-3.24081.11 (98911739)	4	4
Java	JavaSE 17	2	2
Python	Python 3.12.2 (main, Feb 7 2024, 20:47:03) [GCC 13.2.0] on linux	3	3
Ruby	ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]	2	2
Rust	rustc 1.70.0	3	3
Go	go version go1.21.7 linux/amd64	3	3
Swift	5.10.1	3	2
Perl	v5.38.2	3	3
Javascript (node)	v18.19.1	3	3

This gives us options for inserting “invisible” whitespace that can change/confuse code within C (within limits), Ruby, Java, and Swift (using the Zero Width Space), giving a malicious actor the ability to create code that behaves differently from its appearance.

Editor Behaviour/Syntax highlighting

So, whilst an attacker can create malicious but innocent-looking code, the attacker has to deal with the problem of syntax-highlighting code editors. We now investigate how editors show (or do not show) these hidden characters. The editors were tested using their default syntax highlighting for the appropriate language. Changing to light/dark mode or alternative themes was not tested, so your mileage may vary.

Visual Studio Code

Visual Studio Code does a good job of indicating that something is different with the code. Simply opening the file, we see that the Mongolian Vowel Separator is highlighted:

And can be seen when the mouse hovers over:

The same behaviour occurs when dealing with the Zero Width Space.

There is an option to disable highlighting of invisible characters, but this does not change the syntax highlighting, which still indicates a difference between int (blue) and admin (white/light grey).

Visual Studio

Whilst the attempts to compile the code in Visual Studio 2022 failed with error conditions, it’s still worth seeing if the syntax highlighting would spot the code. If we open the file directly within Visual Studio (not as part of an existing project) we see:

The syntax highlighting seems to differentiate between the two intadmin types, and when the file is included in a project:

It becomes more obvious that something is wrong with the code. These results are replicated when using the Zero Width Space character as well.

Notepad++

By default we see the hidden character quite obviously:

However, within Notepad++ there is an option View->Show Symbol->Show Non-printing Characters. If this is disabled we see the following:

Vi/Vim

Vi and Vim show us the Mongolian Vowel Separator and Zero Width Space as:

Emacs

Emacs, on the other hand, does not show us the Mongolian Vowel Separator or Zero Width Space character, but makes it obvious by way of syntax highlighting:

Eclipse

Eclipse is typically the domain of Java code, so looking at the working Java example within Eclipse we see:

Neither the Zero Width Space nor the Mongolian Vowel Separator is visible, nor is there any difference in syntax highlighting to indicate that something is up with the code.

Clearly Eclipse is the editor of choice for hiding our malicious code in Java.

Code Repositories

So far we have identified languages and characters that allow us to create code that looks one way and acts another, allowing a bad actor the ability to hide malicious code or a potential backdoor within a codebase. The next obvious question is: can we put our bad code somewhere where it will not be seen, but will still be used? We must therefore look at code repositories. Here we shall investigate three:

GitHub (home of 28 million public repositories)
GitLab
BitBucket

If we can hide our code in any of these...

GitHub

GitHub has a desktop application that allows developers to manage their repositories and push changes up to Github.com. The tool allows the user the ability to review the history of any file and the changes made to them.

Looking at our malicious example:

and zooming in to the interesting part:

The syntax highlighting here does not indicate in any way that line 10 contains our evil Mongolian Vowel Separator character. The same is true for the code with the Zero Width Space:

The Github.com website itself has a number of themes that can affect the syntax highlighting colours used, but there are two “defaults”, Light Default and Dark Default. I tend to work with the dark theme for most things, so viewing our code we can see:

There is a very subtle change on line 10 between the int (light grey) and admin (white), which is likely to go unnoticed.

The in-built editor mode however

Does not have this subtle change!

Testing the default light theme:

Does not show any differences, this is replicated in the editor as well:

So, there is scope for hiding our Mongolian Vowel Separator in code stored and published on GitHub.

GitLab

GitLab uses Visual Studio Code as its web IDE, so it highlights the hidden character when editing files stored in GitLab:

However, the code display exhibits a similar problem to GitHub, in that the “Light” syntax highlighting themes may be too subtle to spot any oddities:

The Dark themes make the code differences more obvious. This behaviour is the same when dealing with the Zero Width Space.

When viewing the committed change, the code is syntax highlighted, but like the viewer, the highlighting is subtle, and hard to spot:

Bitbucket

During the initial research the Bitbucket Editor did not highlight the syntax in its default mode:

Making it impossible to spot the hidden character by differences in the syntax highlighting.

The viewer, however, shows a subtle difference (the int keyword is slightly bolder):

But again, this may be missed.

Since reporting this to Atlassian, they have altered the syntax highlighting in the viewer:

However there is no change in the editor.

There is a more obvious difference when using the Zero Width Space, the editor clearly shows the hidden character:

However, the viewer exhibits the same behaviour as it does when handling the Mongolian Vowel Separator.

The committed change does not have any syntax highlighting visible, so the hidden character would not be spotted; a reviewer performing a code review would likely miss it.

Results

	C	Ruby	Swift	Java
GitHub Desktop App	Malicious code hidden	Malicious code hidden	Syntax highlighting is obvious	Malicious code hidden
GitHub	Viewer – very subtle syntax highlighting Editor – malicious code hidden	Viewer - Malicious code hidden Editor – malicious code hidden	Viewer - Malicious code hidden Editor - Syntax highlighting is obvious	Viewer – Malicious code hidden in light mode, very subtle syntax highlighting in dark mode Editor – Syntax highlighting is obvious
GitLab	Viewer - very subtle syntax highlighting in light mode, more obvious in dark mode Editor – inline VS code highlights missing character	Viewer – Syntax highlighting is obvious. Editor – inline VS code highlights missing character	Viewer - very subtle syntax highlighting in light mode, more obvious in dark mode Editor – inline VS code highlights missing character	Viewer – very subtle syntax highlighting in light mode, more obvious in dark mode Editor – inline VS code highlights missing character
Bitbucket	Viewer – Syntax highlighting is obvious. Editor – MVS hidden, ZWS highlighted	Viewer - Malicious code hidden Editor – MVS hidden, ZWS highlighted	Viewer – very subtle syntax highlighting Editor – MVS hidden, ZWS highlighted	Viewer - Malicious code hidden Editor – MVS hidden, ZWS highlighted

In most cases the hidden characters in the repository viewers are either highlighted very subtly (and therefore could pass a visual code inspection) or invisible to the naked eye.

When editing on the websites, only GitLab, by using Visual Studio Code, consistently shows the hidden characters.

Prior Work

This research was inspired by two previous works, firstly Trojan Source (https://trojansource.codes/) where Unicode bi-directional control characters were introduced into source code such that the code that was being read by a human (at say a code/pull request review stage) does not match the code that the compiler will ultimately compile. The classic example:

Contains strategically placed bidirectional control characters so that you are, in fact, admin.

Related work considers the use of homoglyph attacks with identifiers, where characters are replaced with similar-looking ones to add visual confusion, e.g. replacing Latin characters with their Cyrillic equivalents (https://www.irongeek.com/homoglyph-attack-generator.php helps with such attempts). Trojan Source also considered “invisible characters”, without specifying which “invisible characters” were used, and noted in the original paper that such attacks failed. Here, I believe we show, given specific characters, success across several languages.

The second piece of work is a blog post from 2014 (https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/) mentioning the use of the Mongolian Vowel Separator within identifiers in C# code, and how the two different compilers (csc and Roslyn) handle them. It also identified the unusual history of the Mongolian Vowel Separator as a sometimes-whitespace, sometimes-control character.

Conclusions

By using unusual "whitespace" characters like the Mongolian Vowel Separator and the Zero Width Space it is possible to create code that looks, by visual inspection, like one thing, while the compiler behaviour and end results are different, enough so that many eyeballs can miss the maliciously inserted characters.

The languages that are affected by this issue are:

C
Ruby
Swift
Java

As a developer, thankfully most IDEs either show "odd" syntax highlighting or highlight the "invisible" character itself. Some editors can be configured to hide this, but in most cases the highlighting is on by default. The only editor that "fails" is Eclipse when dealing with Java code.

But if an attacker can get their code uploaded to one of the main code repositories, either by a malicious pull request to an existing repo or by posting an interesting, innocent-looking library, there is every chance that, given the lack of appropriate syntax/hidden-character highlighting, the malicious code will not be spotted.

So there is scope for a very subtle, but potentially devastating, supply chain attack that can slip past the many eyes as developers look through the code in their favourite code repository.

“Sometimes, magic is just someone spending more time on something than anyone else might reasonably expect.” - Teller

Reporting Timeline

27/03/2024 – Report issue to GitHub, Atlassian (BitBucket), GitLab
27/03/2024 – GitHub issue marked as “Low Risk”
28/03/2024 – GitLab issue marked as duplicate, previously reported 28/07/2021
25/04/2024 – Atlassian response – Considering marking as Won't Fix/Informational
28/05/2024 – Atlassian confirmed to handle internally.
01/10/2024 – Verified Atlassian has resolved some instances.

Links and References

Code Repositories

The following code repositories show the “malicious” code in various languages, containing their appropriate whitespace character.

GitHub is the primary repository and contains examples of all the languages tested even if they failed to compile, GitLab and Bitbucket only contain the code examples that "worked".
