
Re: UTF-8 in jessie



On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> Detecting non-UTF files is easy:
> * false positives are impossible
> * false negatives are extremely unlikely: combinations of letters that would
>   happen to match a valid utf character don't happen naturally, and even if
>   they did, every single combination in the file tested would need to match
>   valid utf.

Not that unlikely, and it is rather annoying that Firefox (and
therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
IMHO, in case of ambiguity, UTF-8 should always be preferred by
default (applications could have options to change the preferences).

Bug reports:
  https://bugzilla.mozilla.org/show_bug.cgi?id=760050
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719481
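As a rough illustration of the ambiguity, here is a minimal Python sketch.
It is not a model of how Firefox's detector works, and the helper names
is_valid_utf8 and prefer_utf8 are made up for this example. It shows that
the two bytes encoding "é" in UTF-8 also decode cleanly as two Thai letters
in TIS-620, and what a "prefer UTF-8 when the bytes are valid UTF-8" default
would look like:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def prefer_utf8(data: bytes, fallback: str = "tis_620") -> str:
    """Decode as UTF-8 whenever the bytes are valid UTF-8, else use fallback.

    The fallback charset is an assumption; an application could expose it
    as a user preference.
    """
    if is_valid_utf8(data):
        return data.decode("utf-8")
    return data.decode(fallback)


sample = "é".encode("utf-8")        # the two bytes 0xC3 0xA9
as_thai = sample.decode("tis_620")  # also succeeds: two Thai letters
as_utf8 = prefer_utf8(sample)       # the UTF-8 reading wins
legacy = prefer_utf8(b"\xa1\xd2")   # not valid UTF-8, so TIS-620 is used
```

A short sequence like this one is ambiguous by construction; the longer the
non-ASCII text, the less likely it is that genuine legacy-charset data will
happen to validate as UTF-8.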

> On the other hand, detecting text files is hard.

Deciding whether a file is a text file may be hard even for a human.
What about text files with ANSI control sequences?

> The best tool so far, "file", makes so many errors it's useless for
> this purpose.

Yes.

> One could use location: like, declaring stuff in /etc/ and
> /usr/share/doc/ to be text unless proven otherwise, but that's an
> incomplete hack. Only hashbangs can be considered reliable, but
> scripts are not where most documentation goes.
> 
> Also, should HTML be considered text or not?  Updating http-equiv is not
> rocket surgery, detecting HTML with fancy extensions can be.

I think better questions would be: why do you want to regard a file as
text? For what purpose(s)? Only for the "all shipped text files in UTF-8"
rule?

What about examples whose purpose is to have a file in a charset
different from UTF-8?

> 4a. perl and pod
> 
> Considering perl to be text raises one more issue: pod.  By perl's design,
> pod without a specified encoding is considered to be ISO-8859-1, even if
> the file contains "use utf8;".  This is surprising, and many authors use
> UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
> example).  Thus, there should be a tool (preferably the one mentioned
> above) that checks perl files for pod with undeclared encoding, and raises
> alarm if the file contains any bytes with high bit set.  If a conversion
> encoding is specified, such a declaration could be added automatically.

Yes, an undeclared encoding in a file that is not pure ASCII should be
regarded as a bug.
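
The check described in the quoted paragraph could be sketched as follows in
Python. This is only an illustration under simplifying assumptions, not an
existing tool: the function name pod_encoding_problem is hypothetical, any
line starting with "=" plus a letter is treated as a pod command (so a
here-document containing such a line would be misread as pod), and high-bit
bytes are looked for in the whole file rather than only inside pod blocks,
as the quoted proposal suggests.

```python
import re

# A pod command paragraph starts with "=" followed by a letter at the
# beginning of a line (e.g. "=head1", "=pod", "=cut").
POD_COMMAND = re.compile(rb"^=[A-Za-z]", re.MULTILINE)
# An explicit declaration such as "=encoding UTF-8".
POD_ENCODING = re.compile(rb"^=encoding[ \t]+\S+", re.MULTILINE)


def pod_encoding_problem(source: bytes) -> bool:
    """Return True if the file has pod, no =encoding, and non-ASCII bytes."""
    has_pod = POD_COMMAND.search(source) is not None
    declares_encoding = POD_ENCODING.search(source) is not None
    has_high_bit = any(byte >= 0x80 for byte in source)
    return has_pod and not declares_encoding and has_high_bit


declared = b"=encoding UTF-8\n\n=head1 NAME\n\ncaf\xc3\xa9\n\n=cut\n"
undeclared = b"=head1 NAME\n\ncaf\xc3\xa9\n\n=cut\n"
ascii_only = b"=head1 NAME\n\ncafe\n\n=cut\n"
no_pod = b'print "caf\xc3\xa9";\n'
```

Only the "undeclared" sample would raise the alarm here; the other three
either declare an encoding, are pure ASCII, or contain no pod at all.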

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

