
Re: UTF-8 in jessie



On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > Detecting non-UTF-8 files is easy:
> > * false positives are impossible
> > * false negatives are extremely unlikely: combinations of letters that would
> >   happen to match a valid UTF-8 character don't happen naturally, and even if
> >   they did, every single combination in the file tested would need to match
> >   valid UTF-8.
> 
> Not that unlikely, and it is rather annoying that Firefox (and
> therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
> IMHO, in case of ambiguity, UTF-8 should always be preferred by
> default (applications could have options to change the preferences).

That's the opposite of what I'm talking about: it is hard to reliably detect
ancient encodings, because they tend to assign a character to every possible
byte sequence.  On the other hand, only certain combinations of bytes with the
high bit set are valid UTF-8, and thus it is possible to detect UTF-8 with
good accuracy.  It is obviously trivial to fool such detection deliberately,
but such combinations don't happen in real languages, so if something
validates as UTF-8, it is safe to assume it indeed is.
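As a rough sketch of what I mean (Python here, and only an illustration,
not any existing tool), the whole check boils down to "does the file decode
as strict UTF-8":

    #!/usr/bin/python
    # Sketch: report whether each file's contents form valid UTF-8.
    import sys

    def is_utf8(path):
        with open(path, 'rb') as f:
            data = f.read()
        try:
            data.decode('utf-8')   # strict decoding rejects invalid sequences
            return True
        except UnicodeDecodeError:
            return False

    for path in sys.argv[1:]:
        print('%s: %s' % (path, 'UTF-8' if is_utf8(path) else 'not UTF-8'))

Any file that passes this and contains at least one high-bit byte is almost
certainly genuine UTF-8 (a pure-ASCII file trivially passes as well).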
 
> > On the other hand, detecting text files is hard.
> 
> Deciding whether a file is a text file may be hard even for a human.
> What about text files with ANSI control sequences?

Same as, say, a Word97 document: not text for my purposes.  It might be
just coloured plain text, but there is no generic way to handle that.
Binary formats fall more under subgoal 1 of my proposal: arbitrary Unicode
input that matches your syntax should be accepted, and should come out
uncorrupted (not the same as unmodified).
 
> > One could use location: like, declaring stuff in /etc/ and
> > /usr/share/doc/ to be text unless proven otherwise, but that's an
> > incomplete hack. Only hashbangs can be considered reliable, but
> > scripts are not where most documentation goes.
> > 
> > Also, should HTML be considered text or not?  Updating http-equiv is not
> > rocket surgery, detecting HTML with fancy extensions can be.
> 
> I think better questions could be: why do you want to regard a file as
> text? For what purpose(s)? For the "all shipped text files in UTF-8"
> rule only?

A shipped config file will have some settings the user may edit and comments
he may read.  Being able to see what's going on is a prerequisite here.

A perl/python/etc script is something our kind of folks often edit and/or
read.

A plain text file ships no encoding information, so it can be neither
rendered nor edited comfortably if its encoding differs from the system
one.

HTML can include http-equiv, which takes care of rendering, but editing is
still a problem.  And if you edit it, or, say, fill in some fields from a
database, you risk data loss.  If everything is UTF-8 end-to-end, this risk
goes away.  (I do care about plain text more, though.)
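(For illustration only, a crude sketch of what a build-time "make it UTF-8
end-to-end" step could look like for HTML, again in Python; the regex is a
rough approximation, not a real HTML parser, and this is not an existing
tool:

    #!/usr/bin/python
    # Sketch: re-encode an HTML file whose declared charset is a legacy
    # encoding, and update the declaration to UTF-8.
    import re
    import sys

    CHARSET_RE = re.compile(r'charset=([-A-Za-z0-9_]+)', re.I)

    def to_utf8(path):
        with open(path, 'rb') as f:
            raw = f.read()
        # Look for the charset declaration near the top of the file.
        m = CHARSET_RE.search(raw[:2048].decode('ascii', 'replace'))
        if not m or m.group(1).lower() in ('utf-8', 'utf8'):
            return                          # undeclared or already UTF-8
        text = raw.decode(m.group(1))       # decode from the declared charset
        text = CHARSET_RE.sub('charset=UTF-8', text, count=1)
        with open(path, 'wb') as f:
            f.write(text.encode('utf-8'))

    for path in sys.argv[1:]:
        to_utf8(path)
)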
 
> What about examples whose purpose is to have a file in a charset
> different from UTF-8?

Well, we don't convert those :)

I don't expect a package with a test suite that includes charset stuff to
make such an error by itself, but if there's a need, we could add a syntax
for exclusions.  For example, writing "verbatim" in the charset field.

> > 4a. perl and pod
> > 
> > Considering perl to be text raises one more issue: pod.  By perl's design,
> > pod without a specified encoding is considered to be ISO-8859-1, even if
> > the file contains "use utf8;".  This is surprising, and many authors use
> > UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
> > example).  Thus, there should be a tool (preferably the one mentioned
> > above) that checks perl files for pod with undeclared encoding, and raises
> > alarm if the file contains any bytes with high bit set.  If a conversion
> > encoding is specified, such a declaration could be added automatically.
> 
> Yes, undeclared encoding when not ASCII should be regarded as a bug.

And if it's declared but not UTF-8, I'd convert it at package build time.
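A rough sketch of the "raise alarm" part, in Python (only an illustration;
a real tool would want to parse pod properly rather than grep for it):

    #!/usr/bin/python
    # Crude sketch: flag perl files that contain pod and high-bit bytes
    # but no =encoding declaration.
    import re
    import sys

    def check(path):
        with open(path, 'rb') as f:
            data = f.read()
        has_pod = re.search(br'^=[a-z]+', data, re.M) is not None
        has_encoding = re.search(br'^=encoding\b', data, re.M) is not None
        has_high_bit = any(byte > 127 for byte in bytearray(data))
        if has_pod and has_high_bit and not has_encoding:
            print('%s: pod with high-bit bytes but no =encoding' % path)

    for path in sys.argv[1:]:
        check(path)

Adding the =encoding line automatically when a conversion encoding is known
would be a small extension of the same check.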

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

