
Re: charsets in debian/control



Daniel Burrows <dburrows@debian.org> wrote:

>   iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset.
> Storing and manipulating iso-8859-1 strings requires no changes to internal
> datatypes (only conversions for input and output); storing and manipulating
> Unicode means you have to switch to a completely different set of
> string-handling functions for all internal operations.

UTF-8 is an 8-bit encoding of Unicode that uses variable-length characters.
Traditional string manipulation routines work fine, except where you need
to know the number of characters rather than the number of bytes. That
typically affects only a small part of the code.
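
To illustrate the byte/character distinction, here is a minimal sketch in
Python. The sample string and the helper name utf8_char_count are made up
for illustration; the helper relies only on the fact that UTF-8
continuation bytes have the bit pattern 10xxxxxx.

```python
# Sketch: byte-oriented operations on UTF-8 data, plus a character count.
# The sample string and helper name are illustrative, not from the thread.

def utf8_char_count(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

# Two non-ASCII characters, each encoded as two bytes in UTF-8.
text = "naïve café".encode("utf-8")

# Splitting and searching on bytes behaves as it would for an 8-bit charset.
words = text.split(b" ")
offset = text.find("café".encode("utf-8"))  # a byte offset, not a character index

byte_length = len(text)              # number of bytes
char_length = utf8_char_count(text)  # number of characters (code points)
```

Only the last line needs to know about UTF-8; the split and the search
treat the data as plain bytes.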

>   [0] According to the libc manual, only 16 bits have been assigned, but GNU
> systems use 32-bit encoding internally if the libc transcoding functions are
> used.

The libc manual is out of date. We've been using more than 16 bits for a
while.
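
As a quick check of the "more than 16 bits" point, the sketch below looks
at one code point above U+FFFF. The character (U+1D11E, MUSICAL SYMBOL G
CLEF) is chosen purely as an example and is not taken from the thread.

```python
# Sketch: a code point above U+FFFF does not fit in 16 bits.
clef = "\U0001D11E"                  # example character only
code_point = ord(clef)               # numeric value of the code point
needs_more_than_16_bits = code_point > 0xFFFF

utf8_bytes = clef.encode("utf-8")       # variable-width encoding
utf32_bytes = clef.encode("utf-32-be")  # fixed-width 32-bit encoding, no BOM
```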

-- 
Matthew Garrett | mjg59-chiark.mail.debian.devel@srcf.ucam.org


