
Bug#157086: libc6: mbrtowc bug with incomplete wide characters



Package: libc6
Version: 2.2.5-13
Severity: normal
Tags: upstream

Some characters (in the Thai character set?) cannot be resumed by mbrtowc
once they have been partially parsed.  The problem can be reproduced with the
following program (LC_ALL set to en_US.UTF-8):

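#include <locale.h>
#include <string.h>
#include <wchar.h>

char *str4 = "\xe0\xb8\xb1";

int bar(char *str)
{
  mbstate_t ps;
  wchar_t wc;
  int j;

  memset (&ps, 0, sizeof(ps));
  /* glibc-internal field: force the state seen in the failing run */
  ps.__value.__wch = 3584;
  j = mbrtowc (&wc, str, 1, &ps);     /* first byte only: returns -2 */
  j = mbrtowc (&wc, str+1, 2, &ps);   /* remaining two bytes: returns -1 */
  return j;
}

int main(int argc, char **argv, char **env)
{
  setlocale(LC_ALL, "");

  bar(str4);
  return 0;
}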

The character parses correctly from that shift state if the whole string is
given at once, but fails when the same bytes are fed in two pieces:

(gdb) p ps
$8 = {__count = 0, __value = {__wch = 3584, __wchb = "\0\016\0"}}
(gdb) p mbrtowc(&wc, str, 3, &ps)
$9 = 3
(gdb) p ps
$10 = {__count = 0, __value = {__wch = 3584, __wchb = "\0\016\0"}}
(gdb) p mbrtowc(&wc, str, 1, &ps)
$11 = -2
(gdb) p mbrtowc(&wc, str+1, 2, &ps)
$12 = -1


I don't know which input sequence leads to that shift state, but it is reached
while parsing M. Kuhn's UTF-8-demo.txt file, a standard UTF-8 test.

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux nevyn 2.4.19-pre10-ac2-drow #4 SMP Sun Jun 16 12:01:20 EDT 2002 i686
Locale: LANG=en_US, LC_CTYPE=

-- no debconf information



