Bug#487230: Processed: kimiwitu

To: 487230@bugs.debian.org
Cc: Michael Piefel <piefel@debian.org>
Subject: Bug#487230: Processed: kimiwitu
From: Daniel Burrows <dburrows@debian.org>
Date: Sat, 21 Jun 2008 10:01:06 -0700
Message-id: <20080621170106.GA32514@alpaca>
Reply-to: Daniel Burrows <dburrows@debian.org>, 487230@bugs.debian.org
In-reply-to: <20080621041430.GA19008@alpaca>
References: <485BFAD7.3050107@debian.org> <handler.s.C.121398788515463.transcript@bugs.debian.org> <20080621041430.GA19008@alpaca>

On Fri, Jun 20, 2008 at 09:14:30PM -0700, Daniel Burrows <dburrows@debian.org> was heard to say:
>   While it was a good guess, I don't think this is an aptitude bug.

  I've tracked down the problem.  apt recodes Descriptions to the local
codeset when client code invokes LongDesc().  However, it uses a single
iconv() call to perform the conversion, and iconv() stops at the first
character that can't be translated, so the rest of the string is lost.
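
  To illustrate the failure mode, here is a minimal sketch (not apt's
actual code) of a conversion done with one iconv() call.  convert_once
is a hypothetical name, and the output buffer size is an assumption that
should be generous enough for conversions to ASCII.

```cpp
#include <iconv.h>
#include <string>

// Hypothetical helper: converts "in" from encoding "from" to encoding
// "to" with ONE iconv() call.  Returns whatever was produced before
// iconv() stopped; "failed" is set if the call reported an error.
std::string convert_once(const char *to, const char *from,
                         const std::string &in, bool &failed)
{
  failed = false;

  iconv_t cd = iconv_open(to, from);
  if(cd == (iconv_t)(-1))
    {
      failed = true;
      return std::string();
    }

  // Assumption: four output bytes per input byte (plus slack) is
  // enough, so running out of output space is not the error we see.
  std::string out(in.size() * 4 + 16, '\0');

  char *inptr = const_cast<char *>(in.data());
  size_t inleft = in.size();
  char *outptr = &out[0];
  size_t outleft = out.size();

  // A single call: on an encoding error it returns (size_t)-1 and
  // leaves the rest of the input unconverted.
  if(iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1))
    failed = true;

  // Keep only the bytes that were actually written.
  out.resize(out.size() - outleft);
  iconv_close(cd);
  return out;
}
```

  Calling convert_once("ASCII", "UTF-8", text, failed) on a string with
an undecodable byte in the middle returns only the part before that
byte, which is the truncation seen in this bug.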

  I suggest something like the attached patch, which imports the
transcoding routine that I use in aptitude.  When an error is
encountered, this routine will attempt to start decoding again at the
next byte, which produces reasonable results when decoding UTF-8 strings.
For instance, the description of the package in question now starts out
like this:

Description: Compiler development tool, complementary to lex and yacc
 Kimwitu (pronounced kee'mweetoo) is a system that supports
 the construction of programs that use trees or terms as
 their main data structure. It is a ???meta-tool?? in the
 development process of tools.

  which is not ideal, but is better than just truncating everything past
the first encoding error.
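
  The recovery strategy described above can be sketched in a few lines
when the target is ASCII.  to_ascii_lossy is a hypothetical helper, not
part of the attached patch.  Because ASCII has no shift states, the
sketch appends '?' directly, whereas the patch opens a second iconv
session so that the '?' is written in the target encoding.  The 256-byte
chunk size is arbitrary.

```cpp
#include <cerrno>
#include <iconv.h>
#include <string>

// Hypothetical helper: lossy conversion to ASCII.  Every input byte at
// which iconv() reports an encoding error is replaced by '?', and
// decoding resumes at the following byte.
std::string to_ascii_lossy(const char *from, const std::string &in)
{
  iconv_t cd = iconv_open("ASCII", from);
  if(cd == (iconv_t)(-1))
    return in;   // Assumption: fall back to the unconverted input.

  std::string out;
  char *inptr = const_cast<char *>(in.data());
  size_t inleft = in.size();

  while(inleft > 0)
    {
      char buf[256];
      char *outptr = buf;
      size_t outleft = sizeof(buf);

      size_t result = iconv(cd, &inptr, &inleft, &outptr, &outleft);
      int err = errno;   // save it before any other call can change it

      // Keep whatever was converted before iconv() stopped.
      out.append(buf, sizeof(buf) - outleft);

      // Either everything was converted or the chunk filled up (E2BIG);
      // in both cases the loop condition decides what happens next.
      if(result != (size_t)(-1) || err == E2BIG)
        continue;

      // Invalid or incomplete sequence: reset the conversion state,
      // emit a placeholder and skip one input byte.
      iconv(cd, NULL, NULL, NULL, NULL);
      out += '?';
      ++inptr;
      --inleft;
    }

  iconv_close(cd);
  return out;
}
```

  Since only one byte is skipped per error, the remaining bytes of a
multibyte character are normally rejected as well, which is consistent
with the runs of question marks in the description above.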

  Daniel

=== modified file 'apt-pkg/contrib/strutl.cc'
--- apt-pkg/contrib/strutl.cc	2008-04-02 16:06:49 +0000
+++ apt-pkg/contrib/strutl.cc	2008-06-21 16:56:25 +0000
@@ -36,6 +36,182 @@
 using namespace std;
 									/*}}}*/
 
+namespace
+{
+  /** Does the dirty iconv work, given that an iconv session has been
+   *  opened and we want to fully decode the "inbuf".  If the outbuf
+   *  isn't large enough, it will be repeatedly doubled.
+   *
+   *  \param state the iconv state to be used
+   *
+   *  \param outbuf the buffer to which the string should be decoded.
+   *         If \b null, a new buffer will be allocated.
+   *
+   *  \param outbufsize the initial size of "outbuf", updated if
+   *      outbuf is increased.  If this value is 0, an arbitrary small
+   *      starting value will be used.
+   *
+   *  \param inbuf the string to be decoded.
+   *
+   *  \param inbufsize the size of inbuf.
+   *
+   *  \param decoded location to write the number of bytes in the decoded string.
+   *
+   *  \param outencoding the name of the output encoding; it is used
+   *  to open a second iconv session that writes a '?' in place of
+   *  each input byte that cannot be decoded.
+   *
+   *    I originally wrote this code for cwidget, but it's also useful
+   *    in apt to prevent coding errors from truncating strings as in
+   *    bug #487230. -- dburrows
+   */
+  bool transcode_buffer(iconv_t &state,
+			char *&outbuf,
+			size_t &outbufsize,
+			const char *inbuf,
+			size_t inbufsize,
+			size_t &decoded,
+			const char *outencoding)
+  {
+    bool rval = true;
+
+    if(outbufsize == 0 || outbuf == NULL)
+      {
+	free(outbuf);
+	// arbitrary initial starting size; expected to be large enough
+	// for most "small" strings.
+	if(outbufsize == 0)
+	  outbufsize = 1024;
+	outbuf = (char *) malloc(outbufsize);
+	if(outbuf == NULL)
+	  {
+	    errno = ENOMEM;
+	    decoded = 0;
+	    return false;
+	  }
+      }
+
+    char *outbufcur = outbuf;
+
+    size_t outremaining = outbufsize;
+    size_t inremaining  = inbufsize;
+
+    while(inremaining > 0)
+      {
+	if(iconv(state,
+		 const_cast<char **>(&inbuf), &inremaining,
+		 &outbufcur, &outremaining) == ((size_t)-1))
+	  {
+	    // Some error conditions can be corrected.  There are three
+	    // reasons iconv can terminate abnormally:
+	    //
+	    //  (1) an invalid multibyte sequence occurred.  We do not
+	    //      attempt to recover in this case.
+	    //
+	    //  (2) an incomplete multibyte sequence occurred; as the
+	    //      input string is all the input we have, this reduces
+	    //      to case (1).
+	    //
+	    //  (3) no room left in the output buffer.  We respond by
+	    //      doubling the output buffer's size, or failing if
+	    //      it's doubled as far as it can go.
+	    //
+	    //  Note that by "not recovering" I mean that we reset the
+	    //  iconv state to its initial state, output a question
+	    //  mark, and try to start decoding from the next byte.
+	    //  This is an approximate solution to the problem, but
+	    //  seems to work well in practice for things like UTF8 ->
+	    //  ASCII.
+
+	    if(errno != E2BIG)
+	      {
+		rval=false;
+		// Reset the output to initial state.
+		size_t result = iconv(state, NULL, NULL, &outbufcur, &outremaining);
+
+		while(result == (size_t)(-1))
+		  {
+		    size_t idx = outbufcur-outbuf;
+		    outremaining += outbufsize;
+		    outbufsize *= 2;
+		    outbuf = (char *) realloc(outbuf,outbufsize);
+		    outbufcur = outbuf+idx;
+
+		    result = iconv(state, NULL, NULL, &outbufcur, &outremaining);
+		  }
+
+		// Open a *new* iconv to spit a '?' onto the decoded
+		// output.
+		iconv_t state2 = iconv_open(outencoding, "ASCII");
+
+		if(state2 == (iconv_t)(-1))
+		  {
+		    decoded = outbufsize-outremaining;
+		    return false;
+		  }
+
+		const char *errbuf = "?";
+		size_t errbufsize = strlen(errbuf);
+
+		result = iconv(state2, const_cast<char **>(&errbuf),
+			       &errbufsize, &outbufcur, &outremaining);
+
+
+		while(result == (size_t)(-1))
+		  {
+		    if(errno != E2BIG)
+		      {
+			decoded = outbufsize-outremaining;
+			iconv_close(state2);
+			return false;
+		      }
+
+		    size_t idx = outbufcur-outbuf;
+		    outremaining += outbufsize;
+		    outbufsize *= 2;
+		    outbuf = (char *) realloc(outbuf, outbufsize);
+		    outbufcur = outbuf+idx;
+
+		    result = iconv(state2, const_cast<char **>(&errbuf),
+				   &errbufsize, &outbufcur, &outremaining);
+		  }
+
+		// Return again to initial shift state
+		result = iconv(state2, NULL, NULL, &outbufcur, &outremaining);
+		while(result == (size_t)(-1))
+		  {
+		    size_t idx = outbufcur-outbuf;
+		    outremaining += outbufsize;
+		    outbufsize *= 2;
+		    outbuf = (char *) realloc(outbuf, outbufsize);
+		    outbufcur = outbuf+idx;
+
+		    result = iconv(state2, NULL, NULL, &outbufcur, &outremaining);
+		  }
+
+		iconv_close(state2);
+
+		// Ok, skip the bad input character.
+		++inbuf;
+		--inremaining;
+	      }
+	    else
+	      {
+		size_t idx = outbufcur-outbuf;
+		outremaining += outbufsize;
+		outbufsize *= 2;
+		outbuf = (char *) realloc(outbuf, outbufsize);
+		outbufcur = outbuf + idx;
+	      }
+	  }
+      }
+
+    decoded = outbufsize-outremaining;
+
+    return rval;
+  }
+}
+
 // UTF8ToCodeset - Convert some UTF-8 string for some codeset   	/*{{{*/
 // ---------------------------------------------------------------------
 /* This is handy to use before display some information for enduser  */
@@ -43,7 +219,7 @@
 {
   iconv_t cd;
   const char *inbuf;
-  char *inptr, *outbuf, *outptr;
+  char *outbuf;
   size_t insize, outsize;
   
   cd = iconv_open(codeset, "UTF-8");
@@ -61,17 +237,16 @@
      return false;
   }
 
-  insize = outsize = orig.size();
+  insize = outsize = orig.size() + 1;
   inbuf = orig.data();
-  inptr = (char *)inbuf;
-  outbuf = new char[insize+1];
-  outptr = outbuf;
-
-  iconv(cd, &inptr, &insize, &outptr, &outsize);
-  *outptr = '\0';
-
-  *dest = outbuf;
-  delete[] outbuf;
+  outbuf = NULL; // transcode_buffer will initialize this.
+  size_t num_decoded;
+
+  transcode_buffer(cd, outbuf, outsize, inbuf, insize,
+		   num_decoded, codeset);
+
+  dest->assign(outbuf, num_decoded);
+  free(outbuf);
   
   iconv_close(cd);

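  The patch grows the output buffer with the same sequence of statements
in several places and does not check the result of realloc().  As a
sketch of how that step could be factored out, the hypothetical helper
below (double_buffer is not a name from the patch) doubles the buffer,
keeps the write cursor valid, and reports failure instead of losing the
old buffer.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: doubles a malloc()ed buffer.  "cur" is the write
// position inside "buf" and "remaining" is the free space after it;
// both are updated to match the new buffer.  On failure nothing is
// changed and the old buffer is still valid.
bool double_buffer(char *&buf, std::size_t &bufsize,
                   char *&cur, std::size_t &remaining)
{
  // Remember the cursor as an offset: realloc() may move the block.
  std::size_t idx = cur - buf;

  std::size_t newsize = bufsize * 2;
  if(newsize <= bufsize)
    return false;   // size was zero or the multiplication overflowed

  char *newbuf = static_cast<char *>(std::realloc(buf, newsize));
  if(newbuf == NULL)
    return false;   // out of memory; "buf" has not been freed

  remaining += newsize - bufsize;
  buf = newbuf;
  bufsize = newsize;
  cur = buf + idx;
  return true;
}
```

  Each resize loop in transcode_buffer could then call this helper and
bail out with ENOMEM when it returns false.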