Bug#1014029: invisible malicious unicode in source code - detection and prevention

To: submit@bugs.debian.org
Subject: Bug#1014029: invisible malicious unicode in source code - detection and prevention
From: Patrick Schleizer <adrelanos@whonix.org>
Date: Tue, 28 Jun 2022 21:46:12 +0000
Message-id: <ada4a6e1-a1f2-fb3a-5cef-bc3591f5ffe4@whonix.org>
Reply-to: adrelanos@whonix.org, 1014029@bugs.debian.org

Package: general
Severity: normal

Quoting https://trojansource.codes:

> Some Vulnerabilities are Invisible

> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.

> These adversarial encodings produce no visual artifacts.

> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.

> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.

> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.

> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.

> The defense

- > Compilers, interpreters, and build pipelines supporting Unicode
should throw errors or warnings for unterminated bidirectional control
characters in comments or string literals, and for identifiers with
mixed-script confusable characters.

- > Language specifications should formally disallow unterminated
bidirectional control characters in comments and string literals.

- > Code editors and repository frontends should make bidirectional
control characters and mixed-script confusable characters perceptible
with visual symbols or warnings.
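
The first and third defenses quoted above could be approximated with a
few lines of Python. This is only a rough sketch, not what any compiler
or editor actually does: the function names are made up for
illustration, the bidi check counts opening and closing characters per
line instead of implementing the full Unicode bidirectional algorithm,
and the script check guesses the script from the first word of each
character's Unicode name because the Python standard library does not
expose the script property.

```python
import unicodedata

# Bidirectional control characters that open an embedding, override, or isolate.
EMBEDDINGS_OVERRIDES = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE closes an isolate


def has_unterminated_bidi(line):
    """Return True if the line opens a bidi control that it never closes.

    Simplification (assumption): embeddings/overrides and isolates are
    counted separately, one line at a time.
    """
    open_embeddings = 0
    open_isolates = 0
    for ch in line:
        if ch in EMBEDDINGS_OVERRIDES:
            open_embeddings += 1
        elif ch in ISOLATES:
            open_isolates += 1
        elif ch == PDF and open_embeddings > 0:
            open_embeddings -= 1
        elif ch == PDI and open_isolates > 0:
            open_isolates -= 1
    return open_embeddings > 0 or open_isolates > 0


def scripts_in_identifier(identifier):
    """Heuristic: collect the first word of each letter's Unicode name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts
```

An identifier for which `scripts_in_identifier` returns more than one
entry, such as a Latin word containing a Cyrillic "а", would be a
candidate for a mixed-script warning.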

Additional ideas to protect against this:

- **check for potential existing compromises:** scan all source code for
existing Unicode.

- **educate existing and future source code reviewers:** add a source
code review policy that existing and future reviewers must acknowledge,
confirming that they understand the issue.

- **remove as much Unicode from source code as possible**: reducing
the amount of Unicode in source code makes automated audits for
malicious Unicode simpler. Where a non-ASCII character is considered
essential, it should be encoded where possible, for example as `&reg;`
instead of a literal `®`.

- **local check by reviewer:** document tools that source code reviewers
could/should use to scan future contributions for malicious unicode

- **lintian check:** a lintian test that notifies when unicode is
included in the source code.

- **build scripts / CI scripts:** should check whether there is Unicode
in any files other than the opt-in, expected files defined in a list. If
there is Unicode in any file not on that list, the build should error
out.

- **scan upstream projects source code**: check if these are compromised
by malicious unicode.

- **notify upstream projects**: these might not be aware of this issue
and might already be compromised by malicious Unicode.
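
The build script / CI idea from the list above could look roughly like
the following Python sketch. Everything about it is an assumption made
for illustration: the function names, the allowlist format (one relative
path per line), and the choice to skip `.git` and files containing NUL
bytes, which loosely mirrors the `--exclude-dir=.git` and
`--binary-files=without-match` grep options.

```python
import sys
from pathlib import Path


def find_unexpected_non_ascii(root, allowlist):
    """Return sorted relative paths of non-allowlisted files containing non-ASCII bytes."""
    root = Path(root)
    offenders = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel_path = path.relative_to(root)
        if ".git" in rel_path.parts:
            continue  # skip repository metadata
        rel = rel_path.as_posix()
        if rel in allowlist:
            continue  # opt-in file that is expected to contain Unicode
        data = path.read_bytes()
        if b"\0" in data:
            continue  # assumption: a NUL byte means a binary file
        if not data.isascii():
            offenders.append(rel)
    return sorted(offenders)


def main(argv):
    """Hypothetical entry point: argv is [ROOT_DIR, ALLOWLIST_FILE]."""
    root, allowlist_file = argv
    lines = Path(allowlist_file).read_text(encoding="utf-8").splitlines()
    allowlist = {line.strip() for line in lines if line.strip()}
    offenders = find_unexpected_non_ascii(root, allowlist)
    for rel in offenders:
        print(f"unexpected non-ASCII content: {rel}", file=sys.stderr)
    return 1 if offenders else 0  # non-zero makes the build error out
```

A build script could call `sys.exit(main(sys.argv[1:]))` so that any
file with non-ASCII content outside the allowlist fails the build.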

Example of how to check:

grep_args="--exclude=changelog.upstream --exclude-dir=.git --binary-files=without-match --recursive --color=auto -P -n"

LC_ALL=C grep $grep_args '[^\x00-\x7F]'

LC_ALL=C grep $grep_args "[^[:ascii:]]"

A few other tools might be desirable in case grep can ever be tricked
into missing anything.
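
As one possible second opinion next to grep, here is a small Python
sketch that reports every non-ASCII character together with its code
point and Unicode name, which makes invisible characters readable in the
output. The function name and the output tuple are assumptions chosen
for illustration. The text is split on "\n" only, because
`str.splitlines()` would treat characters such as U+2028 as line breaks
and silently drop them from the report.

```python
import unicodedata


def report_non_ascii(text):
    """Yield (line, column, code point, name) for each non-ASCII character."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                code_point = f"U+{ord(ch):04X}"
                # Characters without a name fall back to a placeholder.
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, col, code_point, name
```

For example, `list(report_non_ascii(source_text))` on the contents of a
file gives one entry per suspicious character, which a reviewer can then
compare with what the editor displays.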
