Bug#344146: towupper(3) bug? (Re: re_search(3) dumps core)

To: 344146@bugs.debian.org
Subject: Bug#344146: towupper(3) bug? (Re: re_search(3) dumps core)
From: Fumitoshi UKAI <ukai@debian.or.jp>
Date: Sat, 24 Dec 2005 01:17:55 +0900
Message-id: <87bqz77pos.wl%ukai@debian.or.jp>
Reply-to: Fumitoshi UKAI <ukai@debian.or.jp>, 344146@bugs.debian.org
In-reply-to: <87ek447wk0.wl%ukai@debian.or.jp>
References: <87ek447wk0.wl%ukai@debian.or.jp>

At Fri, 23 Dec 2005 04:37:19 +0900,
Fumitoshi UKAI wrote:
 
> reassign 344146 libc6 2.3.5-8.1
> retitle 344146 re_search(3) dumps core
> thanks
> 
> It is a bug in libc6, not in grep.
> grep 2.3.1.ds2-4 works fine on libc6 2.3.2.ds1-22 if I rebuilt on sarge. 

> It seems some problem in posix/regex_internal.c:build_wcs_upper_buffer().
> 
> % LANG=ja_JP.EUC-JP gdb ./a.out
> GNU gdb 6.4-debian
> Copyright 2005 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i486-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
> 
> (gdb) run
> Starting program: /tmp/a.out
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> (gdb) bt
> #0  0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> #1  0xb7f4a07a in build_wcs_upper_buffer () from /lib/tls/libc.so.6
> #2  0xb7f4a335 in re_string_reconstruct () from /lib/tls/libc.so.6
> #3  0xb7f5bde7 in re_search_internal () from /lib/tls/libc.so.6
> #4  0xb7f5ea89 in re_search_stub () from /lib/tls/libc.so.6
> #5  0xb7f5ef63 in re_search () from /lib/tls/libc.so.6
> #6  0x08048618 in main (argc=1, argv=0xbffffaf4) at rtest.c:28
> (gdb)

I investigated this further:

 * The input multibyte sequence is "\x8f\xa9\xc3", which is
   LATIN SMALL LETTER ETH in the EUC-JP encoding.

 * If RE_ICASE is set in re_syntax, re_search tries to convert
   characters to upper case with build_wcs_upper_buffer().

 * When the multibyte sequence "\x8f\xa9\xc3" in EUC-JP is converted to
   a wide character, we get 0x00F0 (LATIN SMALL LETTER ETH; U00F0).

 * This wide character (LATIN SMALL LETTER ETH; U00F0) is lower case,
   so it has to be passed through towupper().

 * When towupper() is applied to this wide character (LATIN SMALL LETTER
   ETH; U00F0), we get wide character 0x00D0 (LATIN CAPITAL LETTER ETH;
   U00D0).

 * When wide character 0x00D0 (LATIN CAPITAL LETTER ETH; U00D0) is
   converted back to a multibyte sequence in EUC-JP, the conversion fails,
   so wcrtomb() returns (size_t)(-1).
   (There is no valid byte sequence to represent LATIN CAPITAL LETTER ETH;
   U00D0 in the EUC-JP encoding.)

 * However, build_wcs_upper_buffer() does not handle this case.
   It assumes that mbrtowc -> towupper -> wcrtomb always succeeds, and only
   handles the case where the lengths of the multibyte sequences differ.
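To make the missing check concrete, here is a minimal sketch of the
mbrtowc -> towupper -> wcrtomb round trip with the wcrtomb() failure
handled. This is not the glibc code and not a proposed patch; the function
name upcase_mb_char() and the fallback policy (keep the original bytes when
the upper-case form cannot be encoded) are my own assumptions for
illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not part of glibc): upper-case the first multibyte
 * character of in[0..inlen) in the current LC_CTYPE locale and store the
 * resulting bytes in out[0..outcap).
 *
 * Returns the number of bytes written, or 0 if the input cannot be
 * decoded (invalid, incomplete, or a NUL character) or out is too small.
 */
size_t
upcase_mb_char(const char *in, size_t inlen, char *out, size_t outcap)
{
    mbstate_t st;
    wchar_t wc;
    char tmp[MB_LEN_MAX];
    size_t mblen, uplen;

    assert(in != NULL && out != NULL);

    /* multibyte -> wide character */
    memset(&st, 0, sizeof(st));
    mblen = mbrtowc(&wc, in, inlen, &st);
    if (mblen == (size_t)-1 || mblen == (size_t)-2 || mblen == 0)
        return 0;

    /* wide character -> upper case -> multibyte */
    memset(&st, 0, sizeof(st));
    uplen = wcrtomb(tmp, (wchar_t)towupper((wint_t)wc), &st);
    if (uplen == (size_t)-1) {
        /*
         * The upper-case character has no encoding in this locale.
         * Assumed policy: keep the original bytes unchanged instead of
         * using the (size_t)-1 return value as a length.
         */
        if (mblen > outcap)
            return 0;
        memcpy(out, in, mblen);
        return mblen;
    }

    if (uplen > outcap)
        return 0;
    memcpy(out, tmp, uplen);
    return uplen;
}
```

With this policy, the "\x8f\xa9\xc3" input would presumably be copied
through unchanged in ja_JP.EUC-JP, since wcrtomb() fails for 0x00D0 there.
Whether keeping the lower-case bytes is the right behaviour for RE_ICASE
matching is a separate question.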

I'm not sure, but I think towupper(3) should not return a wide character
that can't be represented in the current locale's encoding.
The Single UNIX Specification, Version 2 says:

    If the argument of towupper() represents a
    lower-case wide-character code, and there exists a corresponding upper-case
    wide-character code (as defined by character type information in the
    program locale category LC_CTYPE), the result is the corresponding
    upper-case wide-character code.

http://www.opengroup.org/onlinepubs/007908799/xsh/towupper.html

In this case, 

 * the argument of towupper() represents a lower-case wide-character code,
   0x00F0 (LATIN SMALL LETTER ETH; U00F0),

 * but there DOESN'T exist a corresponding upper-case wide-character code
   (as defined by character type information in the program locale category
   LC_CTYPE). The upper-case wide-character code of LATIN SMALL LETTER ETH
   (U00F0) would be LATIN CAPITAL LETTER ETH (U00D0), but that character
   does not exist in the EUC-JP encoding.
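The behaviour argued for above can be sketched as a wrapper around
towupper() that only returns the upper-case code when it is encodable in
the current locale. Both function names below are hypothetical, and this
is only one possible reading of "there exists a corresponding upper-case
wide-character code"; it is not how glibc implements towupper().

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: nonzero if wc can be encoded as a multibyte
 * sequence in the current LC_CTYPE locale.
 */
int
wc_is_representable(wchar_t wc)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];

    memset(&st, 0, sizeof(st));
    return wcrtomb(tmp, wc, &st) != (size_t)-1;
}

/*
 * Hypothetical wrapper: like towupper(), but returns the argument
 * unchanged when the upper-case code has no encoding in the current
 * locale. Assumes wc itself came from mbrtowc() in this locale, so
 * falling back to wc is always safe to convert back.
 */
wint_t
towupper_representable(wint_t wc)
{
    wint_t up;

    if (wc == WEOF)
        return wc;
    assert(wc_is_representable((wchar_t)wc));

    up = towupper(wc);
    if (!wc_is_representable((wchar_t)up))
        up = wc;
    return up;
}
```

A caller such as build_wcs_upper_buffer() could use a check of this kind
instead of relying on towupper() alone, at the cost of one extra wcrtomb()
call per converted character.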

% cat wupper-test.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int
main(int argc, char *argv[])
{
    mbstate_t st;
    unsigned char buf[] = "\x8f\xa9\xc3";
    unsigned char obuf[10];
    wchar_t wc;
    wint_t wcu;
    size_t s;
    memset(&st, 0, sizeof(st));

    setlocale(LC_ALL, "");

    s = mbrtowc(&wc, (const char *)buf, sizeof(buf), &st);
    printf("mb:[%02x %02x %02x %02x] => len:%d wc: %04x\n",
	   buf[0], buf[1], buf[2], buf[3], (int)s, (unsigned int)wc);

    memset(obuf, 0, sizeof(obuf));
    s = wcrtomb((char *)obuf, wc, &st);
    printf("wc %04x => len:%d mb:[%02x %02x %02x %02x]\n",
	   (unsigned int)wc, (int)s, obuf[0], obuf[1], obuf[2], obuf[3]);

    wcu = towupper(wc);
    printf("wc:%04x => wcu:%04x\n", (unsigned int)wc, (unsigned int)wcu);
    memset(obuf, 0, sizeof(obuf));
    s = wcrtomb((char *)obuf, (wchar_t)wcu, &st);
    printf("wc %04x => len:%d mb:[%02x %02x %02x %02x]\n",
	   (unsigned int)wcu, (int)s, obuf[0], obuf[1], obuf[2], obuf[3]);
    
    exit(0);
}
% cc -o wupper-test wupper-test.c
% LANG=ja_JP.EUC-JP ./wupper-test
mb:[8f a9 c3 00] => len:3 wc: 00f0
wc 00f0 => len:3 mb:[8f a9 c3 00]
wc:00f0 => wcu:00d0
wc 00d0 => len:-1 mb:[00 00 00 00]

Regards,
Fumitoshi UKAI

Follow-Ups:
- Bug#344146: regex_internal.c bug (Re: Bug#344146: towupper(3) bug? (Re: re_search(3) dumps core))
  - From: Fumitoshi UKAI <ukai@debian.or.jp>
