
Bug#187991: bug#1536: mutt-1.5.4i: Segment fault with long lines when LANG=*.UTF-8



On Sun, 2003-04-27 at 19:23:32 +0200, Adrian Bunk wrote:
> This might be related to Debian bug #187991 (grep 2.5.1 segfaults in
> UTF-8 locale) [1]?

I am not sure whether this mutt bug is related to #187991. I have
investigated further, and the results indicate that it is instead caused
by a trailing incomplete multibyte character.

My conclusions are:

The segfault may occur when
  LC_ALL, LC_COLLATE or LANG is set to *.UTF-8
and
  the line contains more than 1022 bytes after conversion to UTF-8,
  which means 341 or more Chinese characters (3 bytes each)
and
  a color command includes a pattern with a collating range, for instance
  "color body magenta default [a-z]"

In display_line (pager.c) fill_buffer first fills the buffer, buf, which
is then passed to resolve_types, which in turn calls regexec. If buf
contains trailing incomplete multibyte characters, this may cause
regexec to segfault in find_collation_sequence_value.
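
The conditions listed above can be sketched outside mutt roughly as
follows. This is only a sketch under assumptions: the helper name
match_truncated_line and the choice of U+5E74 as the sample 3-byte
character are mine, and I have not confirmed that these conditions alone
trigger the crash without mutt. On a libc that handles the incomplete
tail correctly the call simply returns.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reproduction helper, not taken from mutt.
 * Returns -1 if the locale is unavailable, -2 if regcomp fails,
 * otherwise whatever regexec returns (0 or REG_NOMATCH). */
static int match_truncated_line(const char *locale_name)
{
    static char line[1023];     /* 1022 bytes of text plus the NUL */
    regex_t re;
    size_t i;
    int rc;

    assert(locale_name != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    /* Fill with the 3-byte sequence for U+5E74, then cut at 1022 bytes
     * so that the last character keeps only 2 of its 3 bytes. */
    for (i = 0; i + 3 <= sizeof line; i += 3)
        memcpy(line + i, "\xE5\xB9\xB4", 3);
    line[1022] = '\0';

    /* A bracket range such as [a-z] is what involves collation data. */
    if (regcomp(&re, "[a-z]", 0) != 0)
        return -2;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc;
}
```

Calling match_truncated_line("en_US.UTF-8") exercises the same three
conditions: a UTF-8 locale, a 1022-byte line and a range pattern.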

Note that long lines in Chinese are not uncommon, since one paragraph is
often written as a single line without spaces.

Note also that the problem can be avoided by replacing [a-z] with
[[:alpha:]] in regexps.

I have written a simple patch that modifies fill_buffer so that any
incomplete multibyte characters are trimmed off. The patch, which is
enclosed, is relative to mutt 1.5.4 and was produced with "diff -Nur".

Below I give some additional information on the segfault.

I am using libc6 (2.3.1-16).

After downloading libc6-dbg (2.3.1-16) I was able to produce the
following backtrace:

With mutt-utf8_1.5.4-1_i386.deb

(gdb) bt
#0  0x401ddd47 in find_collation_sequence_value (
    mbs=0x815b788 "年年\024\b\020", mbs_len=3) at ../locale/weight.h:3681
#1  0x401dd9ac in check_node_accept_bytes (preg=0x8128cf8, node_idx=981024,
    input=0xbfffd500, str_idx=0) at regexec.c:3553
#2  0x401dba79 in transit_state_mb (preg=0x8128cf8, pstate=0x81299d0,
    mctx=0xbfffd4c0) at regexec.c:2330
#3  0x401db5da in transit_state (err=0xbfffd468, preg=0x8128cf8,
    mctx=0xbfffd4c0, state=0x81299d0, fl_search=0) at regexec.c:2092
#4  0x401d9ce3 in check_matching (preg=0x8128cf8, mctx=0xbfffd4c0,
    fl_search=0, fl_longest_match=1) at regexec.c:1034
#5  0x401d97e9 in re_search_internal (preg=0x8128cf8,
    string=0xbfffd670 "�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�\212年�"..., length=1022, start=0, range=1022, stop=981024, nmatch=1,
    pmatch=0xbfffd5e0, eflags=0) at regexec.c:769
#6  0x401d8e31 in __regexec (preg=0x8128cf8,
    string=0xef888 <Address 0xef888 out of bounds>, nmatch=1,
    pmatch=0xbfffd5e0, eflags=0) at regexec.c:249
#7  0x08083d2c in mx_check_empty ()
#8  0x080851aa in mx_check_empty ()
#9  0x080857bd in mutt_pager ()
#10 0x0805e5b4 in mutt_display_message ()
#11 0x08067264 in mutt_index_menu ()
#12 0x0807a111 in main ()
#13 0x40146a51 in __libc_start_main (main=0x8079564 <main>, argc=1,
    ubp_av=0xbffff9a4, init=0x8054700 <_init>, fini=0x80bac08 <_fini>,
    rtld_fini=0x400098bc <_dl_fini>, stack_end=0xef888)
    at ../sysdeps/generic/libc-start.c:147
(gdb)

It seems that idx is running out of bounds in
find_collation_sequence_value since the extra string is a null string.
extra = _NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB);

In the call to resolve_types the argument buf is truncated, and the last
3-byte symbol has lost its last byte.
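
The truncation can also be checked without any locale support by looking
only at the UTF-8 byte patterns. The following is a sketch under
assumptions: utf8_complete_prefix is a hypothetical helper, not a mutt
or libc function, and it inspects only the tail of the buffer without
validating the rest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the longest prefix of
 * buf[0..len) that does not end in the middle of a UTF-8 sequence. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len, need, have;
    unsigned char lead;

    assert(buf != NULL);

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return len;             /* no lead byte found; leave buf alone */

    lead = buf[i - 1];
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;             /* not a lead byte; do not guess */

    have = len - (i - 1);       /* bytes present of the last sequence */
    return have < need ? i - 1 : len;
}
```

For a line of 3-byte characters cut at 1022 bytes this gives 1020, that
is, 340 complete characters with a 2-byte fragment left over.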

The cause seems to lie in fill_buffer, which may truncate the input
string so that the last UTF-8 character is incomplete:

fill_buffer (FILE *f, long *last_pos, long offset, unsigned char *buf, 
     unsigned char *fmt, size_t blen, int *buf_ready)
     {

Anders
--- pager.c.orig	Mon May 26 16:56:10 2003
+++ pager.c	Mon May 26 19:32:25 2003
@@ -971,7 +971,11 @@
 {
   unsigned char *p;
   static int b_read;
-
+  
+  size_t k, n;
+  wchar_t wc;
+  mbstate_t mbstate;
+  
   if (*buf_ready == 0)
   {
     buf[blen - 1] = 0;
@@ -986,6 +990,15 @@
     b_read = (int) (*last_pos - offset);
     *buf_ready = 1;
 
+    /* trim tail of buf so that it contains complete multibyte characters */
+    memset (&mbstate, 0, sizeof (mbstate));
+    for (n = b_read, p = buf;
+         n > 0 && (k = mbrtowc (&wc, (char *) p, n, &mbstate)); 
+         p += k, n -= k) if (k == -1) k = 1; 
+                         else if (k == -2) break;
+    b_read -= n;
+    buf[b_read] = 0; 
+    
     /* copy "buf" to "fmt", but without bold and underline controls */
     p = buf;
     while (*p)
