Re: charsets in debian/control

On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote:
> > Defining the character set as utf-8 means that any non-unicode
> > capable application is going to have issues, yes.
> Postulate an app that is ignorant of character sets - we'll call it
> "aptitude".  Fixing it to make it accept utf-8 and spit out the correct
> encoding for its LC_CTYPE is no harder than fixing it to make it accept
> iso-8859-1 and spit out the correct encoding for its LC_CTYPE.
> And if the app already deals with charset conversions but assumes
> iso-8859-1 input, then it's trivial to fix it to assume utf-8 input.

  This is not true.

  iso-8859-1 is an 8-bit charset, so every character fits in a single char, 
while Unicode is a 32-bit [0] charset.  Storing and manipulating iso-8859-1 
strings requires no changes to internal datatypes (only conversions for 
input and output); storing and manipulating Unicode means you have to switch 
to a completely different set of string-handling functions for all internal 
operations.
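
  To make the difference concrete, here is a minimal C++ sketch (not taken 
from aptitude; the names utf8_length, latin1_cafe and utf8_cafe are made up 
for illustration).  It assumes the text is held in a plain std::string and 
shows that the byte count equals the character count for iso-8859-1 but not 
for UTF-8, so any code that equates "one char" with "one character" gives 
wrong answers once the bytes are UTF-8.

```cpp
#include <cstddef>
#include <string>

// The word "café" written out byte by byte so the example does not depend
// on the encoding of the source file.
const std::string latin1_cafe = "caf\xE9";      // iso-8859-1: 4 bytes
const std::string utf8_cafe   = "caf\xC3\xA9";  // UTF-8:      5 bytes

// Count characters (code points) in a UTF-8 string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

  Here latin1_cafe.size() and utf8_length(utf8_cafe) agree, while 
utf8_cafe.size() is one larger.  Column widths, truncation and cursor 
movement computed from size() are the sort of thing that quietly breaks.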

  In C++ you might be able to partly finesse this by creating a replacement 
string class, but if our program (call it "aptitude") is already using a 
complex replacement string class for some tasks, and this class assumes that 
characters are 8 bits wide, this might be a slightly non-trivial task, 
especially compared to handling iso-8859-1.  Hypothetically speaking. :-)

  On the other hand, once the program is using Unicode internally, taking 
iso-8859-1 as input and producing it as output should be no problem.
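
  As a rough illustration of why that direction is easy, here is a sketch 
assuming the internal representation is a std::u32string of code points (an 
assumption for this example, not a description of any real program).  The 
function names and the '?' fallback are hypothetical.  It relies on the fact 
that the 256 iso-8859-1 values are identical to the first 256 Unicode code 
points.

```cpp
#include <string>

// Input side: each iso-8859-1 byte is already the Unicode code point
// with the same numeric value (U+0000 through U+00FF).
std::u32string latin1_to_utf32(const std::string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (unsigned char byte : in) {
        out.push_back(static_cast<char32_t>(byte));
    }
    return out;
}

// Output side: code points up to U+00FF map straight back to one byte.
// Anything larger has no iso-8859-1 equivalent, so it is replaced by a
// fallback character (a hypothetical policy chosen for this sketch).
std::string utf32_to_latin1(const std::u32string& in, char fallback = '?') {
    std::string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        out.push_back(cp <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : fallback);
    }
    return out;
}
```

  Round-tripping "caf\xE9" through both functions returns the original 
bytes; only characters outside iso-8859-1 are lost on output.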


  [0] According to the libc manual, only 16 bits have been assigned, but GNU 
systems use 32-bit encoding internally if the libc transcoding functions are 

/------------------- Daniel Burrows <dburrows@debian.org> ------------------\
|                              swapon /dev/ram                              |
\--- News without the $$ -- National Public Radio -- http://www.npr.org ---/
