
Re: charsets in debian/control

On Tue, Dec 07, 2004 at 05:56:54PM +0000, Thaddeus H. Black wrote:
> > But yes, non-ASCII Latin-1 chars should not be given
> > special status over the national chars found in other
> > languages spoken by project members.  Debian should be
> > using either ASCII, or Unicode; standardizing on
> > Latin-1 makes no sense in a global project.

> True.  Look, Steve: mild abuse aside, I agree with you
> in every particular.  Nevertheless, I would respectfully
> suggest that your criticism underscores my point, which
> regards the monstrous increase in complexity which the
> full Unicode standard represents.

Yet you had concluded this means we should use Latin-1 as an encoding for
the files.  All arguments that justify the use of Latin-1 characters in the
control file are equally applicable to any of a number of other national
character sets used by one or more developers.

> Consider.  Is it a bug if Readline cannot echo full bidirectional input?

Er, yes, sure it is, independently of what happens in debian/control.

> If Dselect does not appreciate all the non-spacing
> characters?

IFF dselect has a reason to display such characters, yes.  This may well be
the case regardless of whether debian/control ever supports non-ASCII
characters; Debian may start supporting localized Packages files via some
external mechanism, or it may provide a localized UI that requires these
characters.

> If Less does not regard Tibetan subjoined letters?  (This is my Tibetan
> straw man.)

Yes, this is also a bug.  Not one that's likely to be noticed for a while,
but a bug nevertheless.  But your example again overstates the complexity of
the task: the main responsibility of less is to figure out how many
characters to display on a line, and let the *terminal* render the glyphs.
This is code that needs to be implemented only once, and most of the work is
already done centrally for *all* apps by glibc which keeps track of the
display width of each character.
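
To make the width bookkeeping concrete, here is a rough sketch in Python. It is only an approximation of what glibc's wcwidth() does in C: the function name display_width is hypothetical, control characters are not treated specially, and the width rules come from Python's unicodedata tables rather than from the C library.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed to show text.

    Assumption: combining (non-spacing) marks take zero columns, East Asian
    wide and fullwidth characters take two, and everything else takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # non-spacing mark, drawn over the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide or fullwidth character
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ("abc", "日本語", "e\u0301"):
        print(repr(sample), display_width(sample))
```

A pager only needs this kind of column count to decide where a line wraps; drawing the glyphs is still left to the terminal.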

> Undoubtedly one might observe that the Tibetan problem
> were not really a problem with Less but rather with some
> underlying library, but this misses the point---or
> rather again it underscores the point.  Unicode solves
> what for many of us was not a problem by creating an
> entirely new class of problems.  For example, it
> requires us to be particular about how we tag our e-mail
> attachments...

Um, no.  Being part of a *global Internet* causes this problem for you.
The non-ASCII characters in your email were undefined gibberish according
to your headers; only naive (or "helpful", YMMV) mail readers would render
them at all, and only naive mail readers commanded by users using a Western
European locale would have rendered them as intended.  Actually, perhaps
even that is being too generous, as there are *different* native 8-bit
encodings used on each of Unix, Windows, and MacOS; the Unix and Windows
encodings differ on relatively few codepoints, but the Mac encoding is
widely different.
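
A small Python sketch of the same problem, assuming the three platform encodings in question are ISO-8859-1 (Unix), Windows-1252 and Mac Roman. The helper name decode_all is hypothetical; the codec names are the ones Python ships with.

```python
CODECS = ("latin-1", "cp1252", "mac_roman")


def decode_all(raw: bytes) -> dict:
    """Decode the same untagged bytes under each candidate 8-bit encoding."""
    return {codec: raw.decode(codec) for codec in CODECS}


if __name__ == "__main__":
    # 0xE9 is the same letter in Latin-1 and Windows-1252 but not in Mac Roman;
    # 0x93 is a quotation mark in Windows-1252 and a control code in Latin-1.
    for raw in (b"caf\xe9", b"\x93quoted\x94"):
        for codec, text in decode_all(raw).items():
            print(codec, repr(text))
```

Every one of these decodes succeeds without error, which is exactly why a reader with no charset header cannot tell which result was intended.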

And you think it's ok to inflict this same mess on anyone not using a
Latin-1 locale while trying to read a debian/control file?

> Am I arguing to jettison Unicode?  No; to the partial
> extent that I had been arguing it earlier in the thread,
> you, Peter, Daniel and Matthew have changed my mind.
> However, the typical roster of skills one masters in
> contributing broadly to Debian development is already
> awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS,
> Shell, Glibc, System calls, /proc, IPC, sockets, Sed,
> Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline,
> Ncurses, TeX, Postscript, Groff, XML, assembly, Flex,
> Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit,
> Debconf, ELF, etc.---not to mention the use of the
> English language at a sophisticated technical level.
> UTF-8 is neat, but I do not really like Unicode (you may
> have noticed this).  Seeking essential simplicity, I
> would prefer to keep the full hairy overgrown Unicode
> standard from the typical Debian roster of development
> skills.  Wouldn't you?

1) Sorry, modern software is a complex creature.  This is because we demand
complex things of it -- including handling all the languages that we speak.

2) Most DDs do not master all of the above skills.  *I* don't have a mastery
of all of the above skills; "contributing broadly to Debian" usually means
mastering some of these skills, and knowing where to find answers for the
rest.

3) "Mastering Unicode", for the purposes of almost anyone not working
directly on glibc or implementing a terminal, is roughly equivalent to
"making sure your application implements proper string handling for CJK".
If you do it right, the differences between UTF-8 and ISO-2022 are normally
minimal; if you do it wrong, you get bug reports from Japanese users.
However, for files for which no encoding is specified, there is no right way
to handle non-ASCII data, which is why debian/control is an issue.

4) As suggested above, for 98% of all applications on the system, the
encoding used for debian/control is *entirely irrelevant* to the question of
whether they will need to support UTF-8.  UTF-8 is already out there and in
use, for very good (though hardly universal) reasons.  Applications that
don't handle UTF-8 get bug reports from all *kinds* of users.
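
As an illustration of point 3, here is one common way string handling goes wrong for CJK text, sketched in Python. The function names are hypothetical and the byte limit is arbitrary; the assumption is simply that a program needs to shorten UTF-8 text to a fixed number of bytes.

```python
def naive_truncate(text: str, limit: int) -> str:
    """Cut the encoded bytes at limit and decode again.

    This raises UnicodeDecodeError whenever the cut lands in the middle of
    a multibyte character.
    """
    return text.encode("utf-8")[:limit].decode("utf-8")


def truncate_utf8(text: str, limit: int) -> str:
    """Shorten text to at most limit bytes of UTF-8 without splitting a character.

    The input is valid UTF-8, so the only possible damage is an incomplete
    sequence at the very end, which errors="ignore" drops.
    """
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(truncate_utf8("日本語", 4))
    try:
        naive_truncate("日本語", 4)
    except UnicodeDecodeError as exc:
        print("naive version failed:", exc)
```

Code that only ever sees ASCII never exercises the failing path, which is why this kind of bug tends to surface as a report from a Japanese user.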

Steve Langasek
postmodern programmer

