
Bug#1014029: invisible malicious unicode in source code - detection and prevention



Package: general
Severity: normal

Quoting https://trojansource.codes:

> Some Vulnerabilities are Invisible

> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
> 

> These adversarial encodings produce no visual artifacts.

> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.

> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.

> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.

> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.

> The defense

- > Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.

- > Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.

- > Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
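
As a rough sketch of what such a check could look like outside a compiler, the shell function below searches a tree for the bidirectional control characters themselves. It assumes GNU grep built with PCRE support (`-P`); the function name `find_bidi_controls` is made up for illustration. It flags every occurrence, not only unterminated ones, and it does not attempt to detect mixed-script confusables.

```shell
#!/bin/sh
# Sketch: list lines that contain Unicode bidirectional control characters.
# Assumption: GNU grep with PCRE support (-P).
# With LC_ALL=C, grep -P matches raw bytes, so the pattern spells out the
# UTF-8 encodings of the code points:
#   U+202A..U+202E (LRE, RLE, PDF, LRO, RLO) -> E2 80 AA .. E2 80 AE
#   U+2066..U+2069 (LRI, RLI, FSI, PDI)      -> E2 81 A6 .. E2 81 A9
# Exit status follows grep: 0 if anything was found, 1 if nothing was found.
find_bidi_controls() {
    LC_ALL=C grep --recursive --binary-files=without-match --exclude-dir=.git \
        -n -P '\xE2\x80[\xAA-\xAE]|\xE2\x81[\xA6-\xA9]' -- "$@"
}
```

For example, `find_bidi_controls src/` would print each matching line with its file name and line number. Other non-ASCII text, such as accented letters, is deliberately not reported by this pattern.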

Additional ideas to protect against this:

- **check for potential existing compromises:** scan all existing source
code for Unicode.

- **educate existing and future source code reviewers:** add a source
code review policy that existing and future reviewers must acknowledge,
confirming that they understand the issue.

- **remove as much Unicode from source code as possible**: reducing
the amount of Unicode in source code makes audits for malicious Unicode
with automated tools simpler. Where a non-ASCII character is considered
essential, it should if possible be written as an ASCII escape instead of
the literal character, for example `&reg;` instead of `®`.

- **local check by reviewer:** document tools that source code reviewers
could/should use to scan future contributions for malicious Unicode.

- **lintian check:** a lintian test that notifies when Unicode is
included in the source code.

- **build scripts / CI scripts:** should check whether there is Unicode in
any files other than the opt-in, expected files defined in a list. If any
other file contains Unicode, the build should error out.

- **scan upstream projects' source code**: check whether these are
compromised by malicious Unicode.

- **notify upstream projects**: these might not be aware of this issue
and might already be compromised by malicious Unicode.
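
The build/CI idea above could be sketched roughly as follows. This is an illustration under assumptions, not an existing Debian tool: the function name `check_unexpected_unicode` and the allowlist format (one path per line, written exactly as grep prints it relative to the source directory, e.g. `./debian/copyright`) are hypothetical, and GNU grep with PCRE support is assumed. File names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch of a CI gate: fail when any file outside an opt-in allowlist
# contains non-ASCII bytes.
# Usage: check_unexpected_unicode SOURCE_DIR ALLOWLIST_FILE
check_unexpected_unicode() {
    src_dir=$1
    allowlist=$2

    # A missing allowlist must not silently turn into "nothing found".
    if [ ! -r "$allowlist" ]; then
        printf 'allowlist not readable: %s\n' "$allowlist" >&2
        return 2
    fi

    # First grep: names of text files containing any byte outside 0x00-0x7F.
    # Second grep: drop names that appear verbatim in the allowlist.
    offenders=$( (cd "$src_dir" && LC_ALL=C grep --recursive \
            --files-with-matches --binary-files=without-match \
            --exclude-dir=.git --perl-regexp '[^\x00-\x7F]' .) \
        | LC_ALL=C grep --invert-match --line-regexp --fixed-strings \
            --file="$allowlist" )

    if [ -n "$offenders" ]; then
        printf 'Unexpected non-ASCII content in:\n%s\n' "$offenders" >&2
        return 1
    fi
    return 0
}
```

A build script could then call, for instance, `check_unexpected_unicode . debian/unicode-allowlist.txt || exit 1`, where the allowlist path is again only a placeholder.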

Example of how to check:

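# Options shared by both commands (requires GNU grep with PCRE support for -P).
grep_args="--exclude=changelog.upstream --exclude-dir=.git --binary-files=without-match --recursive --color=auto -P -n"

# $grep_args is intentionally left unquoted so that it splits into separate options.
# Variant 1: match any byte outside the ASCII range 0x00-0x7F.
LC_ALL=C grep $grep_args '[^\x00-\x7F]'

# Variant 2: the same check, expressed with the POSIX character class.
LC_ALL=C grep $grep_args "[^[:ascii:]]"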

A few other tools might be desirable in case grep can ever be tricked
into missing anything.
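
One possible independent cross-check, offered only as a sketch, avoids grep entirely: delete every ASCII byte with `tr` and count what is left. The function names `count_non_ascii_bytes` and `list_non_ascii_files` are hypothetical. Unlike the grep commands above, this does not skip binary files, so images and similar files will also be reported.

```shell
#!/bin/sh
# Sketch: count bytes outside the ASCII range without using grep.
# tr deletes every byte from 0x00 to 0x7F; wc counts the remainder.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print "COUNT<TAB>PATH" for every regular file under a directory that
# contains at least one non-ASCII byte, skipping .git directories.
list_non_ascii_files() {
    find "$1" -name .git -prune -o -type f -print | while IFS= read -r f; do
        n=$(count_non_ascii_bytes "$f")
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$n" "$f"
        fi
    done
    return 0
}
```

Running `list_non_ascii_files .` and comparing the reported files against grep's results would show whether the two methods disagree on any text file.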

