
Re: gawk: Odd regexp matching problem if LANG=ja_JP



At Mon, 11 Oct 2004 23:29:15 +0900 (JST),
Tatsuya Kinoshita wrote:

> > Package: gawk
> > Version: 1:3.1.4-1
> 
> > Executing the following line in a shell:
> > 
> >    echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
> > 
> > yields not the expected two lines of output, but instead only the first one:
> > 
> >    --- orig/lisp/ChangeLog
> > 
> > 
> > If the LANG-setting portion is changed to use C, then it works as
> > expected (others such as "de" seem to work too):
> > 
> >    echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
> > 
> > yields:
> > 
> >    --- orig/lisp/ChangeLog
> >    +++ mod/lisp/ChangeLog
> > 
> > 
> > I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> > and ja_JP.eucjp all exhibit the same problem.
> 
> ko_KR, zh_CN, and zh_TW exhibit the same problem.  On CJK
> locales, this bug makes gawk scripts unusable.
> 
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
> 
> Could anyone fix this bug?

One possible workaround is to set GAWK_NO_DFA=1:

 % echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP.eucJP GAWK_NO_DFA=1 gawk '/[Cc]hangeLog/ { print }'
 --- orig/lisp/ChangeLog
 +++ mod/lisp/ChangeLog
 

I may have found the cause of this bug.  The pattern string has been
changed, but begin and end still point to the same addresses, so
mblen_buf and inputwcs are not updated.
For example, the following patch fixes the problem, but it may slow
things down, so I think a better fix should be made.

--- dfa.c~	2004-07-26 23:11:41.000000000 +0900
+++ dfa.c	2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
     {
       int remain_bytes, i;
       buf_begin -= buf_offset;
+#if 0
       if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
 	buf_offset = (unsigned char const *)begin - buf_begin;
 	buf_begin = begin;
 	buf_end = end;
 	goto go_fast;
       }
-
+#endif
       buf_offset = 0;
       buf_begin = begin;
       buf_end = end;
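
As a rough illustration of the failure mode described above, here is a toy
sketch in C.  It is not gawk's code: count_upper, rebuild_cache, upper_flag
and the trust_pointers switch are all hypothetical names, and upper_flag only
stands in for per-byte data such as mblen_buf/inputwcs.  It assumes the caller
reuses one buffer for successive strings, so a cache validated by pointers
alone can describe text that is no longer there.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_MAX 256

/* Hypothetical cache of per-byte data for the range [cached_begin, cached_end). */
static const char *cached_begin = NULL;
static const char *cached_end = NULL;
static unsigned char upper_flag[CACHE_MAX];

/* Recompute the per-byte data from the current contents of [begin, end). */
static void rebuild_cache(const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);
    assert(n <= CACHE_MAX);
    for (size_t i = 0; i < n; i++)
        upper_flag[i] = isupper((unsigned char)begin[i]) ? 1 : 0;
    cached_begin = begin;
    cached_end = end;
}

/* Count uppercase bytes in [begin, end) using the cached flags.
 * With trust_pointers != 0, the cache is reused whenever the requested
 * range lies inside the cached range.  That check looks only at addresses,
 * never at the bytes, which is the hazard being illustrated. */
int count_upper(const char *begin, const char *end, int trust_pointers)
{
    int reuse = trust_pointers
        && cached_begin != NULL
        && (uintptr_t)cached_begin <= (uintptr_t)begin
        && (uintptr_t)end <= (uintptr_t)cached_end;

    if (!reuse)
        rebuild_cache(begin, end);

    size_t offset = (size_t)(begin - cached_begin);
    size_t n = (size_t)(end - begin);
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += upper_flag[offset + i];
    return count;
}
```

If a buffer holding "ChangeLog" is scanned, then overwritten in place with
"changelog" and scanned again with trust_pointers set, the second call still
reports 2 uppercase bytes, because the addresses match and the flags are not
rebuilt.  With trust_pointers cleared, the flags are rebuilt on every call and
the answer is 0, at the cost of redoing the work each time, which is the same
trade-off as disabling the fast path with #if 0.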

Regards,
Fumitoshi UKAI <ukai@jp.hpl.hp.com> / <ukai@hp.com>
Hewlett-Packard Laboratories Japan		http://ecardfile.com/id/ukai


