Bug#555922: libc6: UTF-8 decoding is not conforming to the Unicode standard

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#555922: libc6: UTF-8 decoding is not conforming to the Unicode standard
From: Dmitri Gribenko <gribozavr@gmail.com>
Date: Thu, 12 Nov 2009 18:21:00 +0200
Message-id: <20091112162100.4168.77974.reportbug@epsilon>
Reply-to: Dmitri Gribenko <gribozavr@gmail.com>, 555922@bugs.debian.org

Package: libc6
Version: 2.10.1-5
Severity: normal


libc's UTF-8 decoding does not conform to the Unicode standard.  In
particular, it accepts:

* 5- and 6-byte sequences, which the Unicode standard does not define.
* 4-byte sequences that decode to code points above U+10FFFF.
* surrogates U+D800 .. U+DFFF.

It also neither replaces ill-formed sequences with replacement characters nor
reports an error to the calling program.

All these sequences are ill-formed according to the Unicode standard.

[1], pages 92-94, tables 3-6 and 3-7 define well-formed UTF-8 sequences.
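
For illustration, the byte ranges of table 3-7 can be checked with a few
comparisons.  The following is a minimal sketch, not glibc code; the function
names utf8_well_formed_len() and utf8_is_well_formed() are made up for this
example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at s (n bytes are
   available), or 0 if the bytes there are ill-formed per table 3-7. */
static size_t utf8_well_formed_len(const unsigned char *s, size_t n)
{
  if (n == 0)
    return 0;

  unsigned char b0 = s[0];
  if (b0 <= 0x7f)
    return 1;

  size_t len;
  unsigned char lo = 0x80, hi = 0xbf;   /* allowed range of the second byte */

  if (b0 >= 0xc2 && b0 <= 0xdf)
    len = 2;
  else if (b0 >= 0xe0 && b0 <= 0xef)
  {
    len = 3;
    if (b0 == 0xe0) lo = 0xa0;          /* excludes overlong forms */
    if (b0 == 0xed) hi = 0x9f;          /* excludes surrogates U+D800..U+DFFF */
  }
  else if (b0 >= 0xf0 && b0 <= 0xf4)
  {
    len = 4;
    if (b0 == 0xf0) lo = 0x90;          /* excludes overlong forms */
    if (b0 == 0xf4) hi = 0x8f;          /* excludes code points above U+10FFFF */
  }
  else
    return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

  if (n < len || s[1] < lo || s[1] > hi)
    return 0;
  for (size_t i = 2; i < len; i++)
    if (s[i] < 0x80 || s[i] > 0xbf)
      return 0;
  return len;
}

/* True if the whole buffer consists of well-formed sequences only. */
bool utf8_is_well_formed(const unsigned char *s, size_t n)
{
  while (n > 0)
  {
    size_t len = utf8_well_formed_len(s, n);
    if (len == 0)
      return false;
    s += len;
    n -= len;
  }
  return true;
}
```

A check of this kind rejects all three categories listed above, as well as
overlong forms, stray continuation bytes and truncated sequences.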

The glibc maintainers are not going to fix this; see [2].

Nevertheless, this behavior means that glibc and eglibc do not conform to the
Unicode standard.  See [1], pages 59-62, conformance clauses C1, C9, C10.  In
particular, C7 reads:

> All processes and higher-level protocols are required to abide by conformance
> clause C7 at a minimum.

Such non-conforming behavior directly affects all programs that link with libc
and rely on its functions.  In particular:

* sed's regexps can't match overlong byte sequences, continuation bytes that
are not part of a sequence, or lead bytes that are not followed by a
continuation byte.

* sed matches 5- and 6-byte sequences and surrogates in UTF-8.

$ printf 'a\xf8\x88\x80\x80\x80b' | sed -e 's/./x/g'
xxx

* the same applies to tac(1) in regexp mode:

$ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s $(printf '\xf8\x88\x80\x80\x80') | xxd -
0000000: 6262 6261 6161 f888 8080 80              bbbaaa.....

$ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s '.' | xxd -
0000000: 6262 62f8 8880 8080 6161 61              bbb.....aaa

* iconv() accepts some ill-formed sequences, which makes it unusable for
sanitizing or validating UTF-8 input.

$ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UTF-8 | xxd -
0000000: f888 8080 80                             .....

$ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UCS-4 | xxd -
0000000: 0020 0000                                . ..

$ echo '<?php print iconv("UTF-8", "UTF-8", "\xf8\x88\x80\x80\x80");' | php | xxd -
0000000: f888 8080 80                             .....
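
The same check can be made from C by calling iconv() directly.  The sketch
below is only an illustration; the helper name utf8_roundtrip() is made up for
this example, and what it returns for ill-formed input depends on the libc
version in use.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Converts in[0..inlen) from UTF-8 to UTF-8 through iconv().
   Returns 0 and stores the number of bytes written in *outlen on success,
   or an errno value (for example EILSEQ) on failure. */
int utf8_roundtrip(const char *in, size_t inlen,
                   char *out, size_t outcap, size_t *outlen)
{
  iconv_t cd = iconv_open("UTF-8", "UTF-8");
  if (cd == (iconv_t)-1)
    return errno;

  char *inp = (char *)in;   /* iconv() takes char **, the input is only read */
  char *outp = out;
  size_t inleft = inlen, outleft = outcap;
  int err = 0;

  if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
    err = errno;
  else if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)  /* flush */
    err = errno;

  iconv_close(cd);
  if (err == 0)
    *outlen = outcap - outleft;
  return err;
}
```

Calling utf8_roundtrip("\xf8\x88\x80\x80\x80", 5, ...) corresponds to the
first iconv(1) command above.  A conforming converter would be expected to
fail with EILSEQ, whereas the output shown above indicates that libc6 2.10.1-5
copies the five bytes through unchanged.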

The described behavior is also unsafe from a security standpoint.  There are
many possible scenarios, for example:

Malicious input is processed with glibc's regexps and some ill-formed sequences
pass through.  The programmer expects the output to be safe in some sense.  The
result is then passed to another program whose UTF-8 decoder simply skips
ill-formed sequences (thus violating the recommendation in [3] never to delete
ill-formed sequences).  This can cause strings to join unexpectedly at the
place where the ill-formed sequence was.  Of course, the second program is
somewhat at fault, but it was never expected to receive ill-formed sequences as
input.
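
The scenario can be sketched in a few lines of C.  Both functions below are
hypothetical stand-ins: looks_safe() plays the role of the first program's
filter, and naive_skip_decode() plays the role of a second program that
silently drops 5- and 6-byte forms.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical upstream filter: rejects input containing "<script". */
int looks_safe(const char *s)
{
  return strstr(s, "<script") == NULL;
}

/* Hypothetical downstream decoder: copies src to dst but silently drops
   lead bytes 0xF8..0xFD together with all continuation bytes that follow.
   dst must have room for strlen(src) + 1 bytes. */
void naive_skip_decode(const char *src, char *dst)
{
  const unsigned char *s = (const unsigned char *)src;
  size_t out = 0;

  while (*s != '\0')
  {
    if (*s >= 0xf8 && *s <= 0xfd)
    {
      s++;                                /* drop the lead byte */
      while (*s >= 0x80 && *s <= 0xbf)    /* drop the continuation bytes */
        s++;
    }
    else
      dst[out++] = (char)*s++;
  }
  dst[out] = '\0';
}
```

With the input "<scr\xf8\x88\x80\x80\x80ipt>", looks_safe() finds nothing to
reject, but after naive_skip_decode() the two halves have joined into
"<script>".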

Attached is a testcase showing the described behavior for regexps.  The same
set of ill-formed strings can be used to test iconv() and all other mentioned
programs.

In order for these tests and demonstrations to work, please set LC_ALL to some
UTF-8 locale, for example:
$ export LC_ALL=ru_UA.UTF-8

In summary: if a program wants to process UTF-8 with libc and conform to the
Unicode standard, it has to implement its own sanitizing function that replaces
all ill-formed sequences in the input.  Alternatively, it may be easier for the
programmer to rely on another library that does conform, for example
libicu.
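
A sanitizing function of that kind might look roughly as follows.  This is a
sketch under stated assumptions: the names seq_len() and utf8_sanitize() are
made up, and the policy of emitting one U+FFFD per offending byte is a choice
made for this example, not something taken from glibc or libicu.

```c
#include <stddef.h>

/* Decodes the sequence at s and returns its length (1..4) if it is
   well-formed, or 0 otherwise. */
static size_t seq_len(const unsigned char *s, size_t n)
{
  /* Smallest code point that needs a sequence of the given length. */
  static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t len;
  unsigned long cp;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if ((s[0] & 0xe0) == 0xc0) { len = 2; cp = s[0] & 0x1f; }
  else if ((s[0] & 0xf0) == 0xe0) { len = 3; cp = s[0] & 0x0f; }
  else if ((s[0] & 0xf8) == 0xf0) { len = 4; cp = s[0] & 0x07; }
  else
    return 0;               /* stray continuation byte or 5/6-byte lead */

  if (n < len)
    return 0;               /* truncated sequence */
  for (size_t i = 1; i < len; i++)
  {
    if ((s[i] & 0xc0) != 0x80)
      return 0;
    cp = (cp << 6) | (s[i] & 0x3f);
  }
  if (cp < min_cp[len])               return 0;   /* overlong form */
  if (cp >= 0xd800 && cp <= 0xdfff)   return 0;   /* surrogate */
  if (cp > 0x10ffff)                  return 0;   /* beyond Unicode */
  return len;
}

/* Copies src to dst, replacing every byte that does not start a well-formed
   sequence with U+FFFD (EF BF BD).  dst must hold up to 3 * n bytes.
   Returns the number of bytes written. */
size_t utf8_sanitize(const unsigned char *src, size_t n, unsigned char *dst)
{
  size_t out = 0;

  while (n > 0)
  {
    size_t len = seq_len(src, n);
    if (len == 0)
    {
      dst[out++] = 0xef;
      dst[out++] = 0xbf;
      dst[out++] = 0xbd;
      len = 1;              /* resynchronise at the next byte */
    }
    else
    {
      for (size_t i = 0; i < len; i++)
        dst[out++] = src[i];
    }
    src += len;
    n -= len;
  }
  return out;
}
```

Applied to the bytes 'a' F8 88 80 80 80 'b' from the sed example, this yields
'a', five replacement characters and 'b', so nothing is deleted and the
surrounding text cannot join.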

[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
[2] http://sources.redhat.com/bugzilla/show_bug.cgi?id=2373
[3] http://unicode.org/reports/tr36/#UTF-8_Exploit

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-rc6-04nov2009 (SMP w/2 CPU cores)
Locale: LANG=ru_UA.UTF-8, LC_CTYPE=ru_UA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages libc6 depends on:
ii  libc-bin                      2.10.1-5   GNU C Library: Binaries
ii  libgcc1                       1:4.4.1-4  GCC support library

libc6 recommends no packages.

Versions of packages libc6 suggests:
ii  debconf [debconf-2.0]         1.5.28     Debian configuration management sy
ii  glibc-doc                     2.10.1-5   GNU C Library: Documentation
ii  locales                       2.10.1-5   GNU C Library: National Language (

-- debconf information excluded

/*
 * Compile with gcc -W -Wall -Werror -std=c99
 */

#define _GNU_SOURCE 1

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <mcheck.h>

static struct
{
  const char *pattern;
  const char *string;
} tests[] =
{
  /*
   * No match.
   */
  { "\\(.\\)", "\xc0\xaf" },            /* overlong 2-byte sequence for U+002F */
  { "\\(.\\)", "\xe0\x80\xaf" },        /* overlong 3-byte sequence for U+002F */
  { "\\(.\\)", "\xf0\x80\x80\xaf" },    /* overlong 4-byte sequence for U+002F */

  /* continuation byte that is not part of a sequence */
  { "\\(.\\)", "\x80" },
  { "\\(.\\)", "\x90" },
  { "\\(.\\)", "\xaa" },
  { "\\(.\\)", "\xbf" },

  /* first byte that is not followed by a continuation.  \x61 -- 'a' */
  { "\\(.\\)", "\xc2\x61" },            /* 2-byte sequence */
  { "\\(.\\)", "\xe0\x61" },            /* 3-byte sequence */
  { "\\(.\\)", "\xf0\x61" },            /* 4-byte sequence */

  /*
   * Matches, but no substitution.
   */
  /* UTF-8 only defines 1, 2, 3 and 4-byte sequences. */
  { "\\(.\\)", "\xf8\x88\x80\x80\x80" },     /* 5-byte sequence, U+200000 */
  { "\\(.\\)", "\xfc\x84\x80\x80\x80\x80" }, /* 6-byte sequence, U+4000000 */

  { "\\(.\\)", "\xed\xa0\x80" },        /* high surrogate, U+D800 */
  { "\\(.\\)", "\xed\xa0\x91" },        /* high surrogate, U+D811 */
  { "\\(.\\)", "\xed\xaf\xbf" },        /* high surrogate, U+DBFF */
  { "\\(.\\)", "\xed\xb0\x80" },        /* low surrogate,  U+DC00 */
  { "\\(.\\)", "\xed\xb0\x91" },        /* low surrogate,  U+DC11 */
  { "\\(.\\)", "\xed\xbf\xbf" },        /* low surrogate,  U+DFFF */
  { "\\(.\\)", "\xed\xa0\x80\xed\xb0\x80" }, /* paired surrogates, U+D800 + U+DC00 = U+10000 */
  { "\\(.\\)", "\xf4\x90\x80\x80" },    /* 4-byte sequence, code point U+110000 > U+10FFFF */
  { "\\(.\\)", "\xf5\xa0\xa0\xa0" },    /* 4-byte sequence, code point U+160820 > U+10FFFF */
};

int main(void)
{
  mtrace();

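  /* The tests are only meaningful in a UTF-8 locale, so fail early if it is missing. */
  if(setlocale(LC_ALL, "ru_UA.UTF-8") == NULL)
  {
    printf("setlocale(): locale ru_UA.UTF-8 is not available\n");
    exit(1);
  }
//  setlocale(LC_ALL, "");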

  for(size_t test = 0; test < sizeof(tests) / sizeof(tests[0]); test++)
  {
    printf("--- test %zu\n", test);

    const char *pattern = tests[test].pattern;
    const char *string = tests[test].string;

    struct re_pattern_buffer pattern_buffer;
    pattern_buffer.buffer = NULL;
    pattern_buffer.allocated = 0;
    pattern_buffer.fastmap = NULL;
    pattern_buffer.translate = NULL;
    pattern_buffer.no_sub = 0;

    re_set_syntax(RE_SYNTAX_POSIX_BASIC);

    const char *error = re_compile_pattern(pattern, strlen(pattern), &pattern_buffer);
    if(error)
    {
      printf("re_compile_pattern(): %s\n", error);
      exit(1);
    }

    struct re_registers regs;
    int errcode = re_match(&pattern_buffer, string, strlen(string), 0, &regs);
    if(errcode == -1)
    {
      printf("no match\n");
    }
    else if(errcode == -2)
    {
      printf("internal error\n");
    }
    else
    {
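      for(size_t i = 0; i < regs.num_regs; i++)
      {
        /* Unused registers are reported as -1; skip them. */
        if(regs.start[i] < 0)
          continue;
        size_t length = regs.end[i] - regs.start[i];
        char part[length + 1];   /* one extra byte for the terminating NUL */
        /* Copy the matched part itself, not the beginning of the string. */
        memcpy(part, string + regs.start[i], length);
        part[length] = '\0';
        printf("%zu: %d - %d: [%s]\n", i, regs.start[i], regs.end[i], part);
      }
      /* re_match() allocated the register arrays with malloc(); release them. */
      free(regs.start);
      free(regs.end);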
    }

    regfree(&pattern_buffer);
  }
}
