
Bug#712592: Problem is strcoll()




I hope sending this along is OK; I am not sure how to proceed.
Please let me know whether I should append this to the original bug report, open a new bug, or do something else.

Further research shows that bash gives inconsistent results when matching bracket expressions.
HOWEVER, the problem apparently lies in strcoll().

I built bash with some debug printfs,
 one just before the second RANGECMP test in lib/glob/sm_loop.c (around line 467):
-----------------------------------------------------------------------
printf("enter rangecmp x 2; cstart %x, cend %x, test %x\n", cstart, cend, test);
      if (RANGECMP (test, cstart) >= 0 && RANGECMP (test, cend) <= 0) {
        goto matched;
      }
-----------------------------------------------------------------------
 and another just before the strcoll() call in rangecmp() in lib/glob/smatch.c (around line 72):
-----------------------------------------------------------------------
  s1[0] = a1;
  s2[0] = a2;
printf ("++> strcoll(a, b) a = %s (0x%x), b = %s (0x%x), result = %x\n", s1, s1[0], s2, s2[0], strcoll (s1, s2));
  if ((ret = strcoll (s1, s2)) != 0) {
    return ret;
  }
-----------------------------------------------------------------------
Running the resulting shell with LC_COLLATE=en_US.UTF-8 in the bash source root
and executing "ls [HIJKLMNO]*" from within that shell gives, in part, this output:
-----------------------------------------------------------------------
enter rangecmp x 2; cstart 48, cend 48, test 6c
++> strcoll(a, b) a = l (0x6c), b = H (0x48), result = 4
++> strcoll(a, b) a = l (0x6c), b = H (0x48), result = 4
enter rangecmp x 2; cstart 49, cend 49, test 6c
++> strcoll(a, b) a = l (0x6c), b = I (0x49), result = 3
++> strcoll(a, b) a = l (0x6c), b = I (0x49), result = 3
enter rangecmp x 2; cstart 4a, cend 4a, test 6c
++> strcoll(a, b) a = l (0x6c), b = J (0x4a), result = 2
++> strcoll(a, b) a = l (0x6c), b = J (0x4a), result = 2
enter rangecmp x 2; cstart 4b, cend 4b, test 6c
++> strcoll(a, b) a = l (0x6c), b = K (0x4b), result = 1
++> strcoll(a, b) a = l (0x6c), b = K (0x4b), result = 1
enter rangecmp x 2; cstart 4c, cend 4c, test 6c
++> strcoll(a, b) a = l (0x6c), b = L (0x4c), result = fffffff9
enter rangecmp x 2; cstart 4d, cend 4d, test 6c
++> strcoll(a, b) a = l (0x6c), b = M (0x4d), result = ffffffff
enter rangecmp x 2; cstart 4e, cend 4e, test 6c
++> strcoll(a, b) a = l (0x6c), b = N (0x4e), result = fffffffe
enter rangecmp x 2; cstart 4f, cend 4f, test 6c
++> strcoll(a, b) a = l (0x6c), b = O (0x4f), result = fffffffd
-----------------------------------------------------------------------
Note the result from strcoll() for the test of "l" against "L". I would have thought it should
return 0 (and the sequence of return values in this output seems to bear that out logically),
but it actually returns fffffff9. That value is returned consistently when a lowercase letter is
compared against its uppercase equivalent.
The ONLY time I see 0 returned from strcoll() is when the characters match EXACTLY, case included.

It's been a while since I worked with this stuff. If I can remember how to build a debug version of libc and
run some tests with it, I will.  Maybe I am doing something wrong, but the bash behavior is clearly bad.
Some quick examples of the inconsistencies (using out-of-the-box bash; the locale makes no difference unless it is C):
-----------------------------------------------------------------------
root@debmicro:/usr/local/src/bash-4.2# ls [K-M]*
list.c      lsignames.h  make_cmd.c  Makefile.in     mksignames.o
list.o      mailcheck.c  make_cmd.h  MANIFEST     mksyntax
locale.c  mailcheck.h  make_cmd.o  MANIFEST.doc  mksyntax.c
locale.o  mailcheck.o  Makefile    mksignames

lib:
glob  intl  malloc  readline  sh  termcap  tilde
root@debmicro:/usr/local/src/bash-4.2#
----------------------------------------------------------------------
root@debmicro:/usr/local/src/bash-4.2# ls [k-m]*
list.c      locale.o     mailcheck.h  make_cmd.h    mksignames.o
list.o      lsignames.h  mailcheck.o  make_cmd.o    mksyntax
locale.c  mailcheck.c  make_cmd.c   mksignames    mksyntax.c

lib:
glob  intl  malloc  readline  sh  termcap  tilde
root@debmicro:/usr/local/src/bash-4.2#
----------------------------------------------------------------------
root@debmicro:/usr/local/src/bash-4.2# ls [KLM]*
Makefile  Makefile.in  MANIFEST  MANIFEST.doc
root@debmicro:/usr/local/src/bash-4.2#
----------------------------------------------------------------------
root@debmicro:/usr/local/src/bash-4.2# ls [klm]*
list.c      locale.o     mailcheck.h  make_cmd.h    mksignames.o
list.o      lsignames.h  mailcheck.o  make_cmd.o    mksyntax
locale.c  mailcheck.c  make_cmd.c   mksignames    mksyntax.c

lib:
glob  intl  malloc  readline  sh  termcap  tilde
root@debmicro:/usr/local/src/bash-4.2#
-----------------------------------------------------------------------
Clearly, we should get the same results for [K-M]* and [KLM]*, but they differ wildly.
If the system is supposed to ignore case, the differences between [K-M]* and [k-m]* are a problem.
With lowercase patterns, the matches seem to come out as one might expect (*without* ignoring case).
This comes down to how strcoll() behaves and how the rangecmp call is set up in sm_loop.c.

I am mainly concerned because I don't know how severe the effects of this strcoll() behavior could be.

Apologies if I am being a nuisance; I am just trying to help out a bit.

Thanks and regards -
Bruce.


From: Jonathan Nieder <jrnieder@gmail.com>
To: Bruce Gayliard <brucegayliard@yahoo.com>
Cc: 712592@bugs.debian.org
Sent: Tuesday, June 18, 2013 12:38 AM
Subject: Re: Looks like the underlying issue is the default locale

reassign 712592 libc6
forcemerge 333953 712592
quit

Hi Bruce,

Bruce Gayliard wrote:

> After doing a little research on this I found that strcoll(),
> called at the end of rangecmp(), was treating lower and
> upper cases equally.
> It appears that the default locale, en_US.UTF-8, is the real
> culprit.

Thanks for investigating.  Merging with a related report.

Hope that helps,
Jonathan


