Bug#1014029: invisible malicious unicode in source code - detection and prevention
Package: general
Severity: normal
Quoting https://trojansource.codes:
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
>
> These adversarial encodings produce no visual artifacts.
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
> The defense
- > Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
- > Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
- > Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
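As a rough illustration of the first defense, here is a minimal Python sketch that flags lines which open a bidirectional embedding, override or isolate without closing it. It is not taken from the Trojan Source site. It assumes that counting per line is good enough, whereas a real compiler check would work per comment or string literal and needs a lexer for the language. The function names are made up for this example.

```python
# Bidirectional control characters relevant to the Trojan Source attack.
# Embeddings/overrides are closed by PDF (U+202C); isolates by PDI (U+2069).
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"


def unterminated_bidi(line: str) -> bool:
    """Return True if a line opens a bidi embedding or isolate it never closes.

    Simplification: embeddings and isolates are counted independently,
    which is stricter than the full Unicode bidirectional algorithm.
    """
    embeddings = isolates = 0
    for ch in line:
        if ch in EMBEDDING_OPENERS:
            embeddings += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == PDF and embeddings:
            embeddings -= 1
        elif ch == PDI and isolates:
            isolates -= 1
    return embeddings > 0 or isolates > 0


def find_unterminated(text: str) -> list[int]:
    """Return 1-based line numbers that contain unterminated bidi controls."""
    return [n for n, line in enumerate(text.split("\n"), 1)
            if unterminated_bidi(line)]
```

A reviewer could run `find_unterminated(open(path, encoding="utf-8").read())` on each changed file and look closely at any line it reports.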
Additional ideas to protect against this:
- **check for potential existing compromises:** scan all source code for
existing unicode
- **educate existing and future source code reviewers:** add a source
code review policy that existing and future reviewers must acknowledge,
confirming that they understand the issue.
- **remove as much unicode from source code as possible**: by reducing
the amount of unicode in source code, audits for malicious unicode with
automated tools get simpler. Where a unicode character is considered
essential, instead of writing `®` directly it should be encoded in
ASCII, for example as the HTML entity `&reg;`.
- **local check by reviewer:** document tools that source code reviewers
could/should use to scan future contributions for malicious unicode
- **lintian check:** a lintian test that notifies when unicode is
included in the source code.
- **build scripts / CI scripts:** should check if there is unicode in
any files except the opt-in expected files defined in a list. If there is
any unicode in a file that is not on that list, the build should error out.
- **scan upstream projects source code**: check if these are compromised
by malicious unicode.
- **notify upstream projects**: these might not be aware of this issue
and already compromised by malicious unicode.
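The build script / CI idea above could look roughly like the following Python sketch. It assumes a hard-coded opt-in list, and the paths in `ALLOWED` are hypothetical placeholders. It treats any byte above 0x7F as "unicode". Unlike the grep example it does not skip binary files, so those would also need to be on the list.

```python
import sys
from pathlib import Path

# Hypothetical opt-in list: files that are expected to contain non-ASCII bytes.
ALLOWED = {"debian/copyright", "po/de.po"}
SKIP_DIRS = {".git"}


def files_with_non_ascii(root: Path, allowed: set[str]) -> list[str]:
    """Return relative paths of files under root that contain bytes >= 0x80
    and are not on the opt-in list."""
    offenders = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or SKIP_DIRS.intersection(rel.parts):
            continue
        if rel.as_posix() in allowed:
            continue
        # bytes.isascii() is True for empty files and pure 7-bit content.
        if not path.read_bytes().isascii():
            offenders.append(rel.as_posix())
    return offenders


def main(root: str = ".") -> int:
    """Print offending files to stderr; return a non-zero status if any exist."""
    offenders = files_with_non_ascii(Path(root), ALLOWED)
    for name in offenders:
        print(f"unexpected non-ASCII content: {name}", file=sys.stderr)
    return 1 if offenders else 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:2]))
```

Run from the top of the source tree, the non-zero exit status makes the build fail as soon as a file outside the list gains non-ASCII content.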
Example of how to check:
grep_args="--exclude=changelog.upstream --exclude-dir=.git --binary-files=without-match --recursive --color=auto -P -n"
# Both patterns match any byte outside the 7-bit ASCII range; "." searches the current directory.
LC_ALL=C grep $grep_args '[^\x00-\x7F]' .
LC_ALL=C grep $grep_args "[^[:ascii:]]" .
A few other tools might be desirable in case grep can ever be tricked into
missing anything.
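One possible second tool, independent of grep, is a small Python function that works on raw bytes and names every non-ASCII character it finds, so that invisible characters become visible in the report. This is only a sketch. The function name is made up for this example, and it assumes files are normally UTF-8, falling back to raw byte values where they are not.

```python
import unicodedata


def describe_non_ascii(data: bytes) -> list[tuple[int, str]]:
    """Return (line number, description) for every non-ASCII character or byte."""
    findings = []
    # Splitting on b"\n" is safe: 0x0A never occurs inside a UTF-8 multi-byte sequence.
    for lineno, raw in enumerate(data.split(b"\n"), 1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: report the raw high bytes instead of guessing.
            for b in raw:
                if b >= 0x80:
                    findings.append((lineno, f"byte 0x{b:02X}"))
            continue
        for ch in text:
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((lineno, f"U+{ord(ch):04X} {name}"))
    return findings
```

Calling `describe_non_ascii(open(path, "rb").read())` on a file and printing the result gives output that can be compared against what grep reports for the same file.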