Bug#555922: libc6: UTF-8 decoding is not conforming to the Unicode standard
Package: libc6
Version: 2.10.1-5
Severity: normal
libc's decoding of UTF-8 does not conform to the Unicode standard. In
particular, it accepts:
* 5- and 6-byte sequences, which are not defined by the Unicode standard.
* 4-byte sequences that decode to code points above U+10FFFF.
* surrogates U+D800 .. U+DFFF.
Also, it neither replaces ill-formed sequences with replacement characters nor
reports an error to the calling program.
All these sequences are ill-formed according to the Unicode standard.
[1], pages 92-94, tables 3-6 and 3-7 define well-formed UTF-8 sequences.
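To make the byte ranges of Table 3-7 concrete, here is a minimal sketch of a checker written directly from that table. The function name utf8_wellformed_len is hypothetical and is not part of glibc; it only illustrates what a conforming decoder would have to accept and reject.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of glibc: returns the length (1..4) of the
 * well-formed UTF-8 sequence at the start of s[0..n), or 0 if the bytes
 * there are ill-formed or truncated. Byte ranges follow Table 3-7 of [1].
 */
size_t utf8_wellformed_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */
    size_t len;

    if (b0 <= 0x7F) {
        return 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;        /* excludes overlong forms */
        if (b0 == 0xED) hi = 0x9F;        /* excludes surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;        /* excludes overlong forms */
        if (b0 == 0xF4) hi = 0x8F;        /* excludes values above U+10FFFF */
    } else {
        return 0;                         /* 0x80..0xC1 and 0xF5..0xFF */
    }

    if (n < len || s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}
```

With this sketch, every string in the attached testcase yields 0 for its first byte, including the 5- and 6-byte sequences, the surrogates, and the 4-byte sequences above U+10FFFF.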
Upstream glibc does not intend to fix this. See [2].
Nevertheless, such behavior makes glibc and eglibc nonconforming with respect
to the Unicode standard. See [1], pages 59-62, conformance clauses C1, C9, C10. In
particular, C7 reads:
> All processes and higher-level protocols are required to abide by conformance
> clause C7 at a minimum.
Such nonconforming behavior directly affects all programs that link with libc
and rely on its functions. In particular:
* sed's regexps can't match overlong byte sequences, continuation bytes that
are not part of a sequence, or first bytes that are not followed by a
continuation byte.
* sed matches 5- and 6-byte sequences and surrogates in UTF-8.
$ printf 'a\xf8\x88\x80\x80\x80b' | sed -e 's/./x/g'
xxx
* the same applies to tac(1) in regexp mode:
$ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s $(printf '\xf8\x88\x80\x80\x80') | xxd -
0000000: 6262 6261 6161 f888 8080 80 bbbaaa.....
$ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s '.' | xxd -
0000000: 6262 62f8 8880 8080 6161 61 bbb.....aaa
* iconv() processes some ill-formed sequences, thus rendering it unusable for
sanitizing or validating UTF-8 input.
$ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UTF-8 | xxd -
0000000: f888 8080 80 .....
$ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UCS-4 | xxd -
0000000: 0020 0000 . ..
$ echo '<?php print iconv("UTF-8", "UTF-8", "\xf8\x88\x80\x80\x80");' | php | xxd -
0000000: f888 8080 80 .....
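The iconv checks above can also be made from C. The following is a rough sketch under the assumption that the program wants a yes/no answer from iconv(); the helper name iconv_accepts_utf8 is hypothetical. It uses only the standard iconv_open(), iconv() and iconv_close() calls.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if iconv() converts the n bytes at s from
 * UTF-8 to UTF-8 without complaint, 0 if it reports EILSEQ or EINVAL, and
 * -1 on any other failure.
 */
int iconv_accepts_utf8(const char *s, size_t n)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char out[64];
    char *in = (char *)s;           /* iconv() does not modify the input bytes */
    size_t inleft = n;
    int result = 1;

    while (inleft > 0) {
        char *outp = out;           /* output is discarded, chunk by chunk */
        size_t outleft = sizeof out;
        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;           /* output chunk full: reuse it and go on */
            result = (errno == EILSEQ || errno == EINVAL) ? 0 : -1;
            break;
        }
    }
    iconv_close(cd);
    return result;
}
```

Given the command-line results above, a call such as iconv_accepts_utf8("\xf8\x88\x80\x80\x80", 5) would presumably return 1 with the libc version reported here, which is exactly why this check cannot serve as a validator.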
The described behavior is also unsafe from a security standpoint. There are
many possible scenarios, for example:
Malicious input is processed with glibc's regexps and some ill-formed sequences
pass through. The programmer expected the output to be safe in some sense. This
result is passed to another program with a UTF-8 decoder that simply skips
ill-formed sequences (thus violating the recommendation in [3] never to delete
ill-formed sequences). This can cause strings to join unexpectedly at the
place where the ill-formed sequence was. Of course, the second program is
somewhat at fault, but it was never expected to receive ill-formed sequences
as input.
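The joining effect in this scenario can be illustrated with a small sketch of such a "second program". The function drop_illformed is hypothetical and deliberately flawed: it copies anything that looks like a 1- to 4-byte sequence and silently discards every other byte.

```c
#include <stddef.h>

/*
 * Hypothetical, deliberately unsafe decoder: copies ASCII and anything that
 * looks like a 2..4-byte sequence, and silently drops every other byte.
 * This is the behavior that [3] warns against. out must have room for n
 * bytes. Returns the number of bytes written.
 */
size_t drop_illformed(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; ) {
        unsigned char b = in[i];
        size_t len = b < 0x80 ? 1
                   : (b >= 0xC2 && b <= 0xDF) ? 2
                   : (b >= 0xE0 && b <= 0xEF) ? 3
                   : (b >= 0xF0 && b <= 0xF4) ? 4 : 0;
        int ok = len != 0 && i + len <= n;

        /* every byte after the first must be a continuation byte */
        for (size_t k = 1; ok && k < len; k++)
            ok = (in[i + k] & 0xC0) == 0x80;

        if (!ok) {
            i++;                    /* the flaw: drop the byte, leave no trace */
            continue;
        }
        for (size_t k = 0; k < len; k++)
            out[o++] = in[i + k];
        i += len;
    }
    return o;
}
```

Fed the 13 bytes "<scr" \xf8\x88\x80\x80\x80 "ipt>", this sketch emits the 8 bytes "<script>": a string that an earlier filter looking for "<script>" would not have seen in the input.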
Attached is a testcase showing the described behavior for regexps. The same
set of ill-formed strings can be used to test iconv() and all other mentioned
programs.
In order for these tests and demonstrations to work, please set LC_ALL to some
UTF-8 locale, for example:
$ export LC_ALL=ru_UA.UTF-8
In summary: if a program wants to process UTF-8 with libc and conform to the
Unicode standard, it has to supply its own sanitizing function that replaces
all ill-formed sequences in the input. Alternatively, it may be easier for the
programmer to rely on another library that does conform, for example,
libicu.
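A sanitizing function of the kind mentioned above might look roughly like the following. This is only a sketch: the name utf8_sanitize is hypothetical, and emitting one U+FFFD per offending byte is a simplifying assumption rather than the only policy the standard permits.

```c
#include <stddef.h>

/*
 * Hypothetical sanitizer, not a glibc function. Copies well-formed UTF-8
 * from in[0..n) to out and writes U+FFFD (EF BF BD) in place of each byte
 * that does not start a well-formed sequence. out needs room for 3 * n
 * bytes. Returns the number of bytes written.
 */
size_t utf8_sanitize(const unsigned char *in, size_t n, unsigned char *out)
{
    /* smallest code point that legitimately needs 1, 2, 3 or 4 bytes */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t o = 0;

    for (size_t i = 0; i < n; ) {
        unsigned char b = in[i];
        size_t len = b < 0x80 ? 1
                   : (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3
                   : (b & 0xF8) == 0xF0 ? 4 : 0;  /* 5/6-byte leads give 0 */
        unsigned long cp = len == 1 ? b : b & (0xFFu >> (len + 1));
        int ok = len != 0 && i + len <= n;

        /* accumulate payload bits while checking continuation bytes */
        for (size_t k = 1; ok && k < len; k++) {
            ok = (in[i + k] & 0xC0) == 0x80;
            cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, and values above U+10FFFF */
        if (ok && (cp < min_cp[len] || (cp >= 0xD800 && cp <= 0xDFFF)
                   || cp > 0x10FFFF))
            ok = 0;

        if (ok) {
            for (size_t k = 0; k < len; k++)
                out[o++] = in[i + k];
            i += len;
        } else {
            out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
            i++;
        }
    }
    return o;
}
```

For the input 'a' \xf8\x88\x80\x80\x80 'b' used in the sed example, this sketch writes 'a', five replacement characters and 'b' (17 bytes), so the position of the ill-formed data stays visible to later processing.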
[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
[2] http://sources.redhat.com/bugzilla/show_bug.cgi?id=2373
[3] http://unicode.org/reports/tr36/#UTF-8_Exploit
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (900, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.32-rc6-04nov2009 (SMP w/2 CPU cores)
Locale: LANG=ru_UA.UTF-8, LC_CTYPE=ru_UA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages libc6 depends on:
ii libc-bin 2.10.1-5 GNU C Library: Binaries
ii libgcc1 1:4.4.1-4 GCC support library
libc6 recommends no packages.
Versions of packages libc6 suggests:
ii debconf [debconf-2.0] 1.5.28 Debian configuration management sy
ii glibc-doc 2.10.1-5 GNU C Library: Documentation
ii locales 2.10.1-5 GNU C Library: National Language (
-- debconf information excluded
/*
* Compile with gcc -W -Wall -Werror -std=c99
*/
#define _GNU_SOURCE 1
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <mcheck.h>
static struct
{
const char *pattern;
const char *string;
} tests[] =
{
/*
* No match.
*/
{ "\\(.\\)", "\xc0\xaf" }, /* overlong 2-byte sequence for U+002F */
{ "\\(.\\)", "\xe0\x80\xaf" }, /* overlong 3-byte sequence for U+002F */
{ "\\(.\\)", "\xf0\x80\x80\xaf" }, /* overlong 4-byte sequence for U+002F */
/* continuation byte that is not part of a sequence */
{ "\\(.\\)", "\x80" },
{ "\\(.\\)", "\x90" },
{ "\\(.\\)", "\xaa" },
{ "\\(.\\)", "\xbf" },
/* first byte that is not followed by a continuation. \x61 -- 'a' */
{ "\\(.\\)", "\xc2\x61" }, /* 2-byte sequence */
{ "\\(.\\)", "\xe0\x61" }, /* 3-byte sequence */
{ "\\(.\\)", "\xf0\x61" }, /* 4-byte sequence */
/*
* Matches, but no substitution.
*/
/* UTF-8 only defines 1, 2, 3 and 4-byte sequences. */
{ "\\(.\\)", "\xf8\x88\x80\x80\x80" }, /* 5-byte sequence, U+200000 */
{ "\\(.\\)", "\xfc\x84\x80\x80\x80\x80" }, /* 6-byte sequence, U+4000000 */
{ "\\(.\\)", "\xed\xa0\x80" }, /* high surrogate, U+D800 */
{ "\\(.\\)", "\xed\xa0\x91" }, /* high surrogate, U+D811 */
{ "\\(.\\)", "\xed\xaf\xbf" }, /* high surrogate, U+DBFF */
{ "\\(.\\)", "\xed\xb0\x80" }, /* low surrogate, U+DC00 */
{ "\\(.\\)", "\xed\xb0\x91" }, /* low surrogate, U+DC11 */
{ "\\(.\\)", "\xed\xbf\xbf" }, /* low surrogate, U+DFFF */
{ "\\(.\\)", "\xed\xa0\x80\xed\xb0\x80" }, /* paired surrogates, U+D800 + U+DC00 = U+10000 */
{ "\\(.\\)", "\xf4\x90\x80\x80" }, /* 4-byte sequence, code point U+110000 > U+10FFFF */
{ "\\(.\\)", "\xf5\xa0\xa0\xa0" }, /* 4-byte sequence, code point U+160820 > U+10FFFF */
};
int main()
{
mtrace();
setlocale(LC_ALL, "ru_UA.UTF-8");
// setlocale(LC_ALL, "");
for(size_t test = 0; test < sizeof(tests) / sizeof(tests[0]); test++)
{
printf("--- test %zu\n", test);
const char *pattern = tests[test].pattern;
const char *string = tests[test].string;
struct re_pattern_buffer pattern_buffer;
pattern_buffer.buffer = NULL;
pattern_buffer.allocated = 0;
pattern_buffer.fastmap = NULL;
pattern_buffer.translate = NULL;
pattern_buffer.no_sub = 0;
re_set_syntax(RE_SYNTAX_POSIX_BASIC);
const char *error = re_compile_pattern(pattern, strlen(pattern), &pattern_buffer);
if(error)
{
printf("re_compile_pattern(): %s\n", error);
exit(1);
}
struct re_registers regs;
int errcode = re_match(&pattern_buffer, string, strlen(string), 0, &regs);
if(errcode == -1)
{
printf("no match\n");
}
else if(errcode == -2)
{
printf("internal error\n");
}
else
{
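for(size_t i = 0; i < regs.num_regs; i++)
{
/* unused registers are reported as -1; skip them */
if(regs.start[i] < 0)
continue;
size_t length = regs.end[i] - regs.start[i];
char part[length + 1]; /* room for the terminating NUL */
/* copy the matched bytes, starting at the register's own offset */
memcpy(part, string + regs.start[i], length);
part[length] = '\0';
printf("%zu: %d - %d: [%s]\n", i, regs.start[i], regs.end[i], part);
}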
}
regfree(&pattern_buffer);
}
}