Re: charsets in debian/control

To: debian-devel@lists.debian.org
Subject: Re: charsets in debian/control
From: Daniel Burrows <dburrows@debian.org>
Date: Tue, 7 Dec 2004 11:34:40 -0500
Message-id: <200412071134.47679.dburrows@debian.org>
In-reply-to: <20041207154004.GA8011@fluff>
References: <20041205093921.GA29883@p12n.org> <200412071017.32119.dburrows@debian.org> <20041207154004.GA8011@fluff>

On Tuesday 07 December 2004 10:40 am, Richard Atterer wrote:
> No, you do not have to do this. You can keep working with "char", the
> changes when switching to UTF-8 will mostly have to deal with the fact that
> one Unicode character is represented by more than one char. This means that
> you need to use a different strlen function, take care only to chop strings
> of char at character boundaries, ensure that input strings are actually
> valid UTF-8, etc.

  This might work for programs that relatively blindly manipulate character 
strings and can pass them off to the terminal for processing.  In fact, 
aptitude does a *lot* of processing and formatting of strings internally.  
That means, for instance: splitting strings into words and paragraphs, 
truncating strings, finding out how wide strings are.

  More importantly, it also makes significant (and increasing) use of strings 
annotated with the terminal attributes of each character (think colors, 
bold/reverse video, etc).  Needless to say, it performs all of the above 
operations on those strings as well.

  All of these are impacted by extended charsets: for instance, you need to 
use a different function to find whitespace, and combining characters with 
their attributes requires the use of a structure where an integer previously 
sufficed.  That's not to mention finding the length of a string, which is 
necessary to perform most types of layout.

  The necessary changes include at least the following:

  The class used for formatted strings will have to be 
re-targeted to support either formatted wide strings or formatted utf8 
strings.  If wide characters are not used internally, it is also necessary 
to audit every occurrence of s.size() and check whether the length-in-memory 
or the length-in-characters of the string is being queried.  If neither wide 
characters nor a utf8-specialized basic_string are used, it is necessary to 
audit every string constructor (which might cut a substring) and make sure 
that it doesn't play havoc with utf8 codings.  Every use of isspace() and 
friends will have to be replaced with Unicode-aware equivalents.
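
  To make the s.size() and substring concerns concrete, here is a minimal 
sketch, assuming the input is already valid UTF-8.  The helper names 
utf8_length and utf8_truncate are hypothetical, and "character" here means 
a code point, which is still not the same thing as a column on the screen.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Length in code points, as opposed to s.size(), which is length in bytes.
// Assumes s is valid UTF-8; no validation is done here.
std::size_t utf8_length(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_continuation(b))
            ++n;
    return n;
}

// Longest prefix of s holding at most max_chars code points.  The cut
// always lands just before a lead byte, so no multi-byte sequence is split.
std::string utf8_truncate(const std::string &s, std::size_t max_chars) {
    std::size_t i = 0, seen = 0;
    while (i < s.size()) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == max_chars)
                break;
            ++seen;
        }
        ++i;
    }
    return s.substr(0, i);
}
```

  By contrast, a plain s.substr(0, 2) on the six-byte string "h\xC3\xA9llo" 
would cut the two-byte sequence for "é" in half.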

  And that's just the problems I can think of off the top of my head.

  It's also necessary to use a completely different set of terminal i/o 
routines, but this is pretty much expected.

  None of these problems are insurmountable, of course, and I know pretty much 
how to solve most of them.  However, it's also true that none of them exist 
*at all* when using iso-8859-1, which is why I object to the comment that 
it's no harder to handle utf8 than iso-8859-1.  (in fact, if your terminal 
speaks iso-8859-1, aptitude will handle it just fine without any changes)
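
  The difference can be seen in a small sketch (latin1_to_utf8 is a 
hypothetical helper, not something aptitude contains): in iso-8859-1 every 
byte is exactly one character, so s.size() is already the character count, 
whereas the same text in utf8 needs two bytes for anything above 0x7F.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8.  Each Latin-1 byte is the code point
// of the same value, so bytes below 0x80 are copied unchanged and the rest
// become a two-byte sequence: 110000xx 10xxxxxx.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte doubles
    for (unsigned char b : in) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```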

  Daniel

-- 
/------------------- Daniel Burrows <dburrows@debian.org> ------------------\
|              Hi, I'm a .signature virus!                                  |
|              Copy me into your .signature to help me spread!              |
\---------------- The Turtle Moves! -- http://www.lspace.org ---------------/
