Bug#344146: regex_internal.c bug (Re: Bug#344146: towupper(3) bug? (Re: re_search(3) dumps core))
At Sat, 24 Dec 2005 01:17:55 +0900,
Fumitoshi UKAI wrote:
> > It is a bug in libc6, not in grep.
> > grep 2.3.1.ds2-4 works fine on libc6 2.3.2.ds1-22 if I rebuilt on sarge.
>
> > It seems some problem in posix/regex_internal.c:build_wcs_upper_buffer().
> >
> > % LANG=ja_JP.EUC-JP gdb ./a.out
> > GNU gdb 6.4-debian
> > Copyright 2005 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB. Type "show warranty" for details.
> > This GDB was configured as "i486-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
> >
> > (gdb) run
> > Starting program: /tmp/a.out
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> > (gdb) bt
> > #0 0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> > #1 0xb7f4a07a in build_wcs_upper_buffer () from /lib/tls/libc.so.6
> > #2 0xb7f4a335 in re_string_reconstruct () from /lib/tls/libc.so.6
> > #3 0xb7f5bde7 in re_search_internal () from /lib/tls/libc.so.6
> > #4 0xb7f5ea89 in re_search_stub () from /lib/tls/libc.so.6
> > #5 0xb7f5ef63 in re_search () from /lib/tls/libc.so.6
> > #6 0x08048618 in main (argc=1, argv=0xbffffaf4) at rtest.c:28
> > (gdb)
>
> I investigated this further:
>
> * input multi byte sequence is "\x8f\xa9\xc3", which is
> LATIN SMALL LETTER ETH in EUC-JP encoding.
>
> * if RE_ICASE is used in re_syntax, re_search tries to convert
> characters to upper case with build_wcs_upper_buffer().
>
> * when multibyte sequence "\x8f\xa9\xc3" in EUC-JP is converted to
> wide character, we'll get 0x00F0 (LATIN SMALL LETTER ETH; U00F0).
>
> * This wide character (LATIN SMALL LETTER ETH; U00F0) is lower case,
> so we need to towupper() this.
>
> * when towupper() this wide character (LATIN SMALL LETTER ETH; U00F0),
> we'll get wide character 0x00D0 (LATIN CAPITAL LETTER ETH; U00D0).
>
> * when wide character 0x00D0 (LATIN CAPITAL LETTER ETH; U00D0) is
> converted back to a multibyte sequence in EUC-JP, it fails, so wcrtomb()
> returns (size_t)(-1). (there is no valid byte sequence to represent
> LATIN CAPITAL LETTER ETH; U00D0 in EUC-JP encoding).
>
> * however, build_wcs_upper_buffer() doesn't handle this case.
> it assumes mbrtowc -> towupper -> wcrtomb always succeeds and only handles
> the case where the lengths of the multibyte sequences differ.
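
For illustration, the round trip described above can be sketched with an explicit check on the wcrtomb() result. This is a minimal sketch, not the glibc code: upcase_one_mb() is a hypothetical helper, and falling back to the original bytes when the upper-case form cannot be encoded is an assumption about a reasonable recovery, not a description of the actual patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of glibc): upper-case the single
   multibyte character at the start of src (at most srclen bytes) and
   store the result in dst, which must have room for MB_LEN_MAX bytes.
   Returns the number of bytes written to dst, or 0 if src does not
   start with a complete, valid, non-null multibyte character.  */
size_t
upcase_one_mb (const char *src, size_t srclen, char *dst)
{
  mbstate_t in_state, out_state;
  wchar_t wc;
  char buf[MB_LEN_MAX];
  size_t inlen, outlen;

  assert (src != NULL && dst != NULL);
  memset (&in_state, 0, sizeof in_state);
  memset (&out_state, 0, sizeof out_state);

  /* multibyte -> wide character */
  inlen = mbrtowc (&wc, src, srclen, &in_state);
  if (inlen == (size_t) -1 || inlen == (size_t) -2 || inlen == 0)
    return 0;

  /* wide character -> upper case -> multibyte */
  outlen = wcrtomb (buf, (wchar_t) towupper ((wint_t) wc), &out_state);
  if (outlen == (size_t) -1)
    {
      /* The upper-case character has no representation in the current
         locale's encoding (the EUC-JP case above).  Assumed recovery:
         keep the original bytes instead of using (size_t) -1 as a
         length for memcpy.  */
      memcpy (dst, src, inlen);
      return inlen;
    }

  memcpy (dst, buf, outlen);
  return outlen;
}
```

With LANG=ja_JP.EUC-JP and setlocale(LC_ALL, "") called first, the input "\x8f\xa9\xc3" would be expected to take the fallback branch, provided that locale is installed.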
It seems this bug has been fixed in posix/regex_internal.c 1.52 (and 1.41.2.7):
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.c.diff?r1=1.51&r2=1.52&cvsroot=glibc
Regards,
Fumitoshi UKAI