
Bug#702617: regex /./ fails to match certain characters



On Mon, Mar 11, 2013 at 06:13:07PM +0100, Joachim Breitner wrote:
> Control: reassign -1 libc6
> Control: found libc6/2.13-38
> Control: affects haskell-regex-compat
> 
> Hi,
> 
> Am Montag, den 11.03.2013, 11:53 -0400 schrieb Joey Hess:
> > Joachim Breitner wrote:
> > > I can reproduce it from within ghc’s address space using gdb:
> > > 
> > > (gdb) call malloc(32)
> > > $7 = 64943120
> > > (gdb) call regcomp(64943120, ".", 0)
> > > $8 = 0
> > > (gdb) call regexec(64943120,"\242",0,0,0)
> > > $9 = 1
> > > (gdb) call regexec(64943120,"only_ascii",0,0,0)
> > > $10 = 0
> > > 
> > > And even from gdb while debugging “sleep”. So the behaviour is already
> > > there in regexec, but for some reason it is not triggered from C code,
> > > but only via some variants of FFI (GHC’s or gdb’s).
> > > 
> > > I’ll leave it at that, as this is not really related to GHC or Haskell
> > > any more.
> > 
> > That's some deep dive!
> > 
> > Sounds like a reassign to glibc is in order?
> 
> if you think so...
> 
> @glibc maintainers, here is the short story:
> 
> This code prints, as expected 0 (for regex matches):
> 
> #include <sys/types.h>
> #include <regex.h>
> #include <stdio.h>
> 
> int main (void) {
> 	regex_t r;
> 	regcomp(&r, ".", 0);
> 	char *s = "\242";
> 	int i = regexec(&r, s, 0, NULL, 0);
> 	printf("%d\n", i);
> }
> 
> But in some circumstances, this does not work as expected. One such
> circumstance is Haskell code doing this via the FFI, but also from gdb:

What you see is very likely locale related. The "\242" byte (0xA2) is
not a valid character on its own in a unicode (UTF-8) locale. If you run
your code in a unicode locale, regcomp() and regexec() interpret the
regex and the string as unicode, so the "\242" byte is ignored.
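
To make the point about "\242" concrete, here is a small sketch that
classifies a single byte by its role in UTF-8. The helper name
utf8_byte_kind() is made up for illustration and is not part of glibc.
It only looks at the bit pattern of one byte; it does not validate whole
sequences.

```c
#include <stdio.h>

/* Hypothetical helper (not a glibc function): classify one byte by the
 * role it can play in a UTF-8 stream.
 *   0      ASCII, a complete character on its own
 *   1      continuation byte (10xxxxxx), never valid at the start
 *   2..4   lead byte of a sequence of that many bytes
 *  -1      byte that can never appear in valid UTF-8
 */
static int utf8_byte_kind(unsigned char b)
{
    if (b < 0x80) return 0;    /* 0xxxxxxx */
    if (b < 0xC0) return 1;    /* 10xxxxxx */
    if (b < 0xC2) return -1;   /* 0xC0, 0xC1: overlong encodings only */
    if (b < 0xE0) return 2;    /* 110xxxxx */
    if (b < 0xF0) return 3;    /* 1110xxxx */
    if (b < 0xF5) return 4;    /* 11110xxx, up to U+10FFFF */
    return -1;                 /* 0xF5..0xFF */
}
```

Calling utf8_byte_kind('\242') classifies 0xA2 as a continuation byte,
so a string consisting only of "\242" contains no complete UTF-8
character.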

The behavior you describe can be reproduced in your C example by adding
a call to setlocale(LC_ALL, "C.UTF-8") at the beginning of your code.
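
A minimal sketch of that experiment, wrapped in a helper so that the same
string can be tried under different locales. The function name
dot_matches_in_locale() is made up for this mail, and the sketch assumes
that the "C.UTF-8" locale is installed; if setlocale() fails, the helper
returns -1 rather than guessing.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile "." and run it against s after switching
 * to the given locale. Returns the regexec() result (0 on match,
 * REG_NOMATCH otherwise), or -1 if the locale is not available or
 * regcomp() fails. */
static int dot_matches_in_locale(const char *locale, const char *s)
{
    regex_t r;
    int rc;

    /* The locale must be set before regcomp(), since compilation
     * already depends on it. */
    if (setlocale(LC_ALL, locale) == NULL)
        return -1;
    if (regcomp(&r, ".", 0) != 0)
        return -1;

    rc = regexec(&r, s, 0, NULL, 0);
    regfree(&r);
    return rc;
}
```

Comparing dot_matches_in_locale("C", "\242") with
dot_matches_in_locale("C.UTF-8", "\242") on the affected system should
show the difference reported in this bug, with the second call behaving
like the gdb session.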

When you test with FFI or from GDB, it is very likely you have a
unicode locale defined.

> (gdb) call malloc(32)
> $1 = 6332464
> (gdb) call memset(6332464, 0, 32)
> $3 = 6332464
> (gdb) call regcomp(6332464, ".", 0)
> $4 = 0
> (gdb) call regexec(6332464, "\242",0,0,0)
> $5 = 1

"\242" is ignored as it is not a valid unicode character.

> It fails if there are no ascii characters around:
> 
> (gdb) call regexec(6332464, "\242x",0,0,0)
> $6 = 0

Here only the "x" is matched.

> (gdb) call regexec(6332464, "\242\242",0,0,0)
> $7 = 1

No valid unicode character, nothing is matched.

> (gdb) call regexec(6332464, "only_ascii",0,0,0)
> $8 = 0

Here the "o" is matched.


I am therefore tempted to reassign the bug back to
libghc-regex-compat-dev. Do you agree?

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

