Re: UTF-8 in jessie
On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> Detecting non-UTF files is easy:
> * false positives are impossible
> * false negatives are extremely unlikely: combinations of letters that would
> happen to match a valid utf character don't happen naturally, and even if
> they did, every single combination in the file tested would need to match
> valid utf.
Not that unlikely, and it is rather annoying that Firefox (and
therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
IMHO, in case of ambiguity, UTF-8 should always be preferred by
default (applications could have options to change the preferences).
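To make the ambiguity concrete, here is a minimal Python sketch. The
helper name is_valid_utf8 and the sample bytes are made up for
illustration, and this is not a description of how Firefox does its
detection. A strict UTF-8 decode serves as the validity test, and a short
byte sequence can pass it while also being perfectly valid TIS-620:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: two bytes that are valid in both encodings.
sample = b"\xc2\xa1"
as_utf8 = sample.decode("utf-8")    # U+00A1 INVERTED EXCLAMATION MARK
as_thai = sample.decode("tis-620")  # U+0E22 U+0E01, two Thai letters

# A lone high-bit byte, as in typical ISO-8859-1 text, is rejected.
latin1 = b"Lef\xe8vre"

print(is_valid_utf8(sample), is_valid_utf8(latin1))
```

The longer the non-ASCII content, the less likely such a coincidence
becomes, but for short files it cannot be ruled out.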
> On the other hand, detecting text files is hard.
Deciding whether a file is a text file may be hard even for a human.
What about text files with ANSI control sequences?
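As a rough illustration of the problem (a sketch only: looks_like_text
and its allow_ansi parameter are hypothetical names, and the heuristic is
deliberately naive), a check that rejects control characters will refuse
colored output unless the escape sequences are stripped first:

```python
import re

# CSI sequences as laid out in ECMA-48: ESC [ parameters intermediates final
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def looks_like_text(data: bytes, allow_ansi: bool = False) -> bool:
    """Naive heuristic: valid UTF-8 with no C0 controls except tab/LF/CR."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    if allow_ansi:
        # Drop the escape sequences before looking for control characters.
        text = CSI.sub("", text)
    return not any(ord(c) < 0x20 and c not in "\t\n\r" for c in text)


colored = b"\x1b[1;31merror:\x1b[0m disk full\n"
print(looks_like_text(colored), looks_like_text(colored, allow_ansi=True))
```

Whether allow_ansi should be on depends entirely on what the caller
wants to do with the file, which is the real question.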
> The best tool so far, "file", makes so many errors it's useless for
> this purpose.
> One could use location: like, declaring stuff in /etc/ and
> /usr/share/doc/ to be text unless proven otherwise, but that's an
> incomplete hack. Only hashbangs can be considered reliable, but
> scripts are not where most documentation goes.
> Also, should HTML be considered text or not? Updating http-equiv is not
> rocket surgery, detecting HTML with fancy extensions can be.
I think better questions could be: why do you want to regard a file as
text? For what purpose(s)? For the "all shipped text files in UTF-8"
goal? What about examples whose purpose is to have a file in a charset
different from UTF-8?
> 4a. perl and pod
> Considering perl to be text raises one more issue: pod. By perl's design,
> pod without a specified encoding is considered to be ISO-8859-1, even if
> the file contains "use utf8;". This is surprising, and many authors use
> UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
> example). Thus, there should be a tool (preferably the one mentioned
> above) that checks perl files for pod with undeclared encoding, and raises
> alarm if the file contains any bytes with high bit set. If a conversion
> encoding is specified, such a declaration could be added automatically.
Yes, undeclared encoding when not ASCII should be regarded as a bug.
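Such a check could look roughly like the following Python sketch. The
function name pod_needs_encoding is made up, and the pod parsing is
simplified: a pod block is taken to start at any line beginning with "="
plus a letter and to end at "=cut". It flags a file whose pod contains
bytes with the high bit set while no =encoding line is present:

```python
import re

# A pod command line: "=" followed by an identifier at the start of a line.
POD_COMMAND = re.compile(rb"^=([a-zA-Z]\w*)")


def pod_needs_encoding(source: bytes) -> bool:
    """Return True if pod has high-bit bytes but no =encoding declaration."""
    in_pod = False
    declared = False
    high_bit = False
    for line in source.splitlines():
        m = POD_COMMAND.match(line)
        if m:
            name = m.group(1)
            if name == b"cut":
                in_pod = False  # back to Perl code
                continue
            in_pod = True
            if name == b"encoding":
                declared = True
        if in_pod and any(b >= 0x80 for b in line):
            high_bit = True
    return high_bit and not declared


# Hypothetical sample: UTF-8 in pod, "use utf8;" present, no =encoding.
bad = b"use utf8;\n\n=head1 NAME\n\ncaf\xc3\xa9\n\n=cut\n"
good = b"=encoding UTF-8\n\n=head1 NAME\n\ncaf\xc3\xa9\n\n=cut\n"
print(pod_needs_encoding(bad), pod_needs_encoding(good))
```

High-bit bytes in the Perl code itself are ignored here, since "use utf8;"
covers those; only the pod is affected by the ISO-8859-1 default.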
Vincent Lefèvre <firstname.lastname@example.org> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)