
Re: UTF-8 in jessie

On 2013-08-12 15:16:59 +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> > On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > > Detecting non-UTF files is easy:
> > > * false positives are impossible
> > > * false negatives are extremely unlikely: combinations of letters that would
> > >   happen to match a valid utf character don't happen naturally, and even if
> > >   they did, every single combination in the file tested would need to match
> > >   valid utf.
> > 
> > Not that unlikely, and it is rather annoying that Firefox (and
> > therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
> > IMHO, in case of ambiguity, UTF-8 should always be preferred by
> > default (applications could have options to change the preferences).
> That's the opposite of what I'm talking about: it is hard to reliably detect
> ancient encodings, because they tend to assign a character to every possible
> bit stream.  On the other hand, only certain combinations of bytes with the
> 8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with
> good accuracy.  It is obviously trivial to fool such detection deliberately,
> but such combinations don't happen in real languages, and thus if something
> validates as UTF-8, it is safe to assume it indeed is.

I don't know exactly what makes Firefox recognize some files as TIS-620
instead of UTF-8, but it is fooled, and not deliberately.
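
To illustrate the ambiguity, here is a minimal Python sketch. It is not
how Firefox detects encodings (I don't know its heuristics); it only shows
that a byte sequence can validate as UTF-8 and also decode without error
in a legacy 8-bit encoding. The helper names is_valid_utf8 and decodes_as
are made up for this example, and it assumes Python's standard "tis_620"
codec.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# "é" encoded as UTF-8 is the two bytes C3 A9.
sample = "\u00e9".encode("utf-8")

# The same two bytes are accepted by both decoders, so a detector that
# tries the legacy encoding first (or weighs it higher) can guess wrong.
print(is_valid_utf8(sample))
print(decodes_as(sample, "tis_620"))

# The reverse direction is the easy one: a lone Latin-1 "é" (byte E9)
# is not well-formed UTF-8.
print(is_valid_utf8("\u00e9".encode("latin-1")))
```

Trying UTF-8 first and falling back only on failure would implement the
"prefer UTF-8 in case of ambiguity" default.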

> > > On the other hand, detecting text files is hard.
> > 
> > Deciding whether a file is a text file may be hard even for a human.
> > What about text files with ANSI control sequences?
> Same as, say, a Word97 document: not text for my purposes.  It might be
> just coloured plain text, but there is no generic way to handle that.

I think I have already seen such files distributed as text files
(documentation), or perhaps they just used backspace characters
to get bold (x\bx) and underline (x\b_). The less utility can
handle them.
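
As a rough sketch of what handling such files could mean, the Python
function below reduces backspace overstrikes to plain text. The name
strip_overstrike is hypothetical, and this is not how less implements it.
It accepts the underline form in either order (x\b_ and _\bx) and does
not try to cover a character overstruck more than once.

```python
import re


def strip_overstrike(text: str) -> str:
    """Reduce simple backspace overstrikes to the plain character."""
    # Bold: a character, a backspace, then the same character again.
    text = re.sub(r"(.)\x08\1", r"\1", text)
    # Underline written as underscore, backspace, character.
    text = re.sub(r"_\x08(.)", r"\1", text)
    # Underline written as character, backspace, underscore.
    text = re.sub(r"(.)\x08_", r"\1", text)
    return text


bold = "H\bHi\bi"        # "Hi" in bold
underlined = "o\b_k\b_"  # "ok" underlined
print(strip_overstrike(bold))
print(strip_overstrike(underlined))
```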

> > I think better questions could be: why do you want to regard a file as
> > text? For what purpose(s)? For the "all shipped text files in UTF-8"
> > rule only?
> A shipped config file will have some settings the user may edit and comments
> he may read.  Being able to see what's going on is a prerequisite here.

However, some config files may be byte-oriented (like procmailrc, AFAIK).

> HTML can include http-equiv which take care of rendering, but editing is
> still a problem.  And if you edit it, or, say, fill in some fields from a
> database, you risk data loss.  If everything is UTF-8 end-to-end, this risk
> goes away.  (I do care about plain text more, though.)

You may still have NFC/NFD problems (this is also true for filenames).
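
A small Python sketch of the NFC/NFD issue, using the standard
unicodedata module: both forms are valid UTF-8 and look identical when
rendered, yet they are different byte sequences, so a byte-wise
comparison (of file contents or of filenames) fails. The helper
same_text is a name made up for this example.

```python
import unicodedata

# Precomposed form: one code point, U+00E9.
nfc = unicodedata.normalize("NFC", "\u00e9")
# Decomposed form: "e" followed by U+0301 (combining acute accent).
nfd = unicodedata.normalize("NFD", "\u00e9")

print(nfc.encode("utf-8"))  # b'\xc3\xa9'
print(nfd.encode("utf-8"))  # b'e\xcc\x81'
print(nfc == nfd)           # False


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(nfc, nfd))  # True
```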

Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
