--- Begin Message ---
- To: "Debian Bug Tracking System" <submit@bugs.debian.org>
- Subject: libc6: mbrtowc bug with incomplete wide characters
- From: "Daniel Jacobowitz" <dan@debian.org>
- Date: Sat, 17 Aug 2002 16:26:29 -0400
- Message-id: <E17gA9W-0002kl-00@nevyn.them.org>
Package: libc6
Version: 2.2.5-13
Severity: normal
Tags: upstream
Some characters (in the Thai character set?) cannot be resumed if they are
partially parsed. The problem can be reproduced with the following program
(with LC_ALL set to en_US.UTF-8):
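#include <locale.h>
#include <string.h>
#include <wchar.h>

/* U+0E31 (THAI CHARACTER MAI HAN-AKAT) encoded as UTF-8 */
char *str4 = "\xe0\xb8\xb1";

int bar(char *str)
{
    mbstate_t ps;
    wchar_t wc;
    int j;

    memset (&ps, 0, sizeof(ps));
    ps.__value.__wch = 3584;            /* glibc-internal field; value seen in gdb */
    j = mbrtowc (&wc, str, 1, &ps);     /* first byte only: returns -2 */
    j = mbrtowc (&wc, str+1, 2, &ps);   /* remaining bytes: returns -1 */
    return j;
}

int main(int argc, char **argv, char **env)
{
    setlocale(LC_ALL, "");
    bar(str4);
    return 0;
}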
The character parses correctly from that shift state if the whole string is
given at once:
(gdb) p ps
$8 = {__count = 0, __value = {__wch = 3584, __wchb = "\0\016\0"}}
(gdb) p mbrtowc(&wc, str, 3, &ps)
$9 = 3
(gdb) p ps
$10 = {__count = 0, __value = {__wch = 3584, __wchb = "\0\016\0"}}
(gdb) p mbrtowc(&wc, str, 1, &ps)
$11 = -2
(gdb) p mbrtowc(&wc, str+1, 2, &ps)
$12 = -1
I don't know which input sequence leads to that shift state, but it occurs
somewhere in M. Kuhn's UTF-8-demo.txt file, a standard UTF-8 test.
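
For comparison, a check that avoids writing to glibc's private mbstate_t
fields might look like the sketch below. It starts from a zeroed state, so it
does not recreate the exact shift state shown in the gdb session; it only
exercises the split-input behaviour of mbrtowc. The helper name
mbrtowc_split is hypothetical, and the caller is assumed to have selected a
UTF-8 locale beforehand.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: feed the first `split` bytes of `s` to mbrtowc,
 * then the remaining `len - split` bytes, sharing one conversion state.
 * Returns the result of the last mbrtowc call that was made.
 */
size_t mbrtowc_split(const char *s, size_t len, size_t split, wchar_t *wc)
{
    mbstate_t ps;
    size_t r;

    assert(split <= len);
    memset(&ps, 0, sizeof ps);          /* initial conversion state */

    r = mbrtowc(wc, s, split, &ps);     /* partial character expected */
    if (r != (size_t)-2)
        return r;                       /* complete, invalid, or NUL */

    /* Resume with the rest of the bytes using the saved state. */
    return mbrtowc(wc, s + split, len - split, &ps);
}
```

After setlocale(LC_ALL, "en_US.UTF-8") succeeds, a conforming
implementation should return 2 from mbrtowc_split("\xe0\xb8\xb1", 3, 1, &wc)
and store U+0E31 in wc; a return of (size_t)-1 would indicate the same kind
of failure as reported here.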
-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux nevyn 2.4.19-pre10-ac2-drow #4 SMP Sun Jun 16 12:01:20 EDT 2002 i686
Locale: LANG=en_US, LC_CTYPE=
-- no debconf information
--- End Message ---