
UTF-8 in jessie



On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
> now might be the right time to start a discussion about release goals
> for jessie.

I would like to propose full UTF-8 support.  By that I don't mean full
support for all of Unicode's finer points, merely complete eradication of
mojibake.  That is, ensuring that /m.o/ matches "möo", or that "ä" sorts
as equal to "a" + combining "¨", is out of scope of this proposal.

I propose the following sub-goals:

1. all programs should, in their default configuration, accept UTF-8 input
   and pass it through uncorrupted.  Having to manually specify the encoding
   is acceptable only in a programmatic interface; GUI/std{in,out,err}/
   command line/plain files should work with nothing but LC_CTYPE.

2. all GUI/curses/etc programs should be able to display UTF-8 output where
   appropriate

3. all file names must be valid UTF-8

4. all text files should be encoded in UTF-8


This proposal doesn't call for eradication of non-UTF8 locales, even though
I think that's long overdue.  Josselin Mouette proposed that in #603914,
and I agree, but that's material for another flamewar.


Let's discuss the above points in depth: 

1. properly passing UTF-8

Text entered by a user should never get mangled.  These days, we can assume
mixed charsets are a thing of the past, so there's no need for special
handling.  Programs that can't handle UTF-8 at all are mostly gone as well --
but for historic reasons, some are still not configured to use it by default.
Thus, let's mandate that no per-program steps are needed.

An example: let's say we have an SQL table foo(a varchar(250)).  Let's run
somesqlclient -e "insert into foo values('$x'); select a from foo"
(with $x holding some non-ASCII text, and -e being whatever stands for
"execute this statement").

sqlite3: ok
p[ostgre]sql: ok
mysql: doesn't work!

But... the schema was declared as UTF-8, my locale is en_US.UTF-8, so why
doesn't it work?  It turns out mysql requires you to call it with an extra
argument, --default-character-set=utf8.  There's no binary ABI to maintain,
so compatibility with some historic behaviour makes no sense.  I can accept
having to specify the charset in, say, a DBI line, as that's what the API
wants, but on the command line... that's just wrong.  Am I supposed to wrap
everything with iconv, and suffer data loss on the way?  Setting LANG/LC_foo
should be enough.
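
To illustrate (a rough sketch; foo is the example table above, "test" is a
placeholder database name, and the exact client invocation may differ):

  x='möo'
  # default invocation: the inserted string comes back mangled
  mysql test -e "insert into foo values('$x'); select a from foo"
  # the same statement with the extra switch round-trips correctly
  mysql --default-character-set=utf8 test \
      -e "insert into foo values('$x'); select a from foo"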

Another case, perhaps more controversial, is apache.  Just take a look at
how many random Debian project pages have mangled encodings somewhere.
At a zeroth approximation, well over one third (more for text/plain, such as
logs).  And that's with users whose skills are way above average.
These days, producing text that's not in UTF-8 takes quite a bit of
effort, especially with modern GUI tools which don't even really pay lip
service to supporting ancient charsets anymore.  Thus, if someone serves
text in such a charset, he already went to some trouble merely to edit it.
One argument is that because AddDefaultCharset overrides http-equiv,
such old files would be mangled.  I'd say: they already take effort to
maintain, so let's let them rot in hell -- they are a rare case that
stands in the way of a nearly ubiquitous one working properly.  Such an
admin can always configure his server to use an ancient encoding if he
wishes to do so.
(The other argument, our own files shipped in /doc/, is dead since apache
2.2.22-4, and is a major part of part 4 of this proposal.)
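
Enabling it for a whole server is a one-liner; roughly (assuming the
wheezy-era apache2 layout -- the exact path and mechanism vary between
versions):

  # make text/plain and text/html responses default to UTF-8 server-wide
  echo 'AddDefaultCharset UTF-8' > /etc/apache2/conf.d/default-charset
  service apache2 reload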


2. GUI/curses display

With gtk, qt, and probably more, the issue is mostly moot.  Other toolkits
might require some work, but typically it boils down to encoding (part 1 of
this proposal): since characters already have different horizontal widths,
you rely on the toolkit's functions for things like line wrapping anyway.

Not so much in curses.  Here, some characters take two cells (CJK), some
take zero (zero-width spaces), and some take zero but must not be detached
from the preceding character (combining marks).  The line wrapping algorithm
is actually quite simple, but needs to be implemented in every curses
program that displays arbitrary strings.  Ouch.
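
To see why counting bytes, or even characters, is not enough, compare
(assuming GNU coreutils under a UTF-8 locale):

  printf '日本語\n' | wc -c    # 10 bytes
  printf '日本語\n' | wc -m    # 4 characters (including the newline)
  printf '日本語\n' | wc -L    # 6 display columns -- what wrapping must count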

[I've gained quite some experience fixing curses/etc programs this way, so I
pledge priority help here.  gtk/qt/fooxwidgets, not so much.]


3. all file names must be UTF-8

This is quite straightforward.  Such file names already make a package
uninstallable on filesystems that operate on characters rather than bytes.
It might be a good idea to forbid nasty stuff like newlines, tabs, etc. too.
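
Finding offenders is cheap; a quick sketch, relying on the fact that in a
UTF-8 locale grep's "." only matches valid characters:

  # list file names that are not valid UTF-8 (names containing newlines
  # will show up split across lines)
  find . -print | LC_ALL=en_US.UTF-8 grep -axv '.*'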

I propose to apply this restriction to source packages as well.  If
Contents-* files are to be believed, the only violation is in a single
binary package and in no source ones, so there'd be no extra work now, and
at most a repack if an upstream regresses.  The benefit is less clear than
for binaries, but it's trivial and would prevent unexpected breakages.


4. all shipped text files in UTF-8

We don't want mojibake in shipped documentation, config files, etc.  And
with the number of hackers around, even perl/shell/python/etc scripts in
/*/bin get read.  In short, all text files.

This could be done by a debhelper tool, possibly driven declaratively by a
file naming the encoding that detected non-UTF-8 text files should be
converted from.  If your package contains some files in an ancient
encoding, you would run:
echo "iso-8859-42 *" >debian/ancient_encoding
and the tool would then detect text files, check whether they're already
UTF-8, and if not, convert them from that iso-8859-42.  I expect 99% of
cases to use just one such encoding per package, but the above syntax
allows per-file control.

Detecting non-UTF-8 files is easy:
* false positives are impossible
* false negatives are extremely unlikely: byte sequences from a legacy
  charset that happen to form a valid UTF-8 character don't occur naturally,
  and even if one did, every single such sequence in the tested file would
  have to be valid UTF-8.

On the other hand, detecting text files is hard.  The best tool so far,
"file", makes so many errors it's useless for this purpose.  One could go by
location -- say, declaring stuff in /etc/ and /usr/share/doc/ to be text
unless proven otherwise -- but that's an incomplete hack.  Only hashbangs can
be considered reliable, but scripts are not where most documentation goes.
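
For what it's worth, that incomplete hack would look roughly like this
(purely illustrative):

  # incomplete heuristic: location plus hashbang; misses plenty of text files
  looks_like_text() {
      case "$1" in
          */usr/share/doc/*|*/etc/*) return 0 ;;
      esac
      [ "$(head -c 2 "$1" 2>/dev/null)" = '#!' ]
  }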

Also, should HTML be considered text or not?  Updating http-equiv is not
rocket surgery; detecting HTML with fancy extensions can be.

A 100% opt-in way, though, would be way too incomplete.  Ideas?


4a. perl and pod

Considering perl to be text raises one more issue: pod.  By perl's design,
pod without a specified encoding is considered to be ISO-8859-1, even if
the file contains "use utf8;".  This is surprising, and many authors use
UTF-8 just like everywhere else, leading to the obvious results ("man gdm3"
for one example).  Thus, there should be a tool (preferably the one
mentioned above) that checks perl files for pod with an undeclared encoding,
and raises an alarm if the file contains any bytes with the high bit set.
If a conversion encoding is specified, such a declaration could be added
automatically.
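
A rough sketch of such a check (the path is a placeholder, and a real tool
would want to parse the pod properly rather than grep for it):

  # flag pod with no =encoding declaration but with high-bit bytes
  for f in $(grep -rlE '^=(pod|head1)' debian/tmp 2>/dev/null); do
      grep -q '^=encoding' "$f" && continue
      LC_ALL=C grep -qP '[\x80-\xFF]' "$f" &&
          echo "undeclared pod encoding: $f" >&2
      # when a conversion encoding is declared, the tool could instead
      # insert "=encoding UTF-8" automatically
  done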



[I'm on the DebConf, so let's discuss.]

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
