Re: UTF-8 in jessie

To: debian-devel@lists.debian.org
Subject: Re: UTF-8 in jessie
From: Vincent Lefevre <vincent@vinc17.net>
Date: Mon, 12 Aug 2013 23:42:12 +0200
Message-id: <20130812214212.GA22325@xvii.vinc17.org>
Mail-followup-to: debian-devel@lists.debian.org
In-reply-to: <20130812155820.GA31455@angband.pl>
References: <20130812005152.GA28636@angband.pl> <20130812135019.GB29418@pax.zz.de> <20130812155820.GA31455@angband.pl>

On 2013-08-12 17:58:20 +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> > 5. All programs consuming UTF-8 text must understand a BOM.
> 
> I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose
> once you standardize on UTF8.  They might help with exchange with a minority
> of Windows programs, at a cost at our side.  Windows hardly does plain text:
> most of that is MSVC/etc sources, but then, the C/C++ standards explicitly
> forbid junk in places other than comments.  Most other languages expect a
> hashbang on Unix, which makes BOMs impossible.

I think that a BOM has more drawbacks than advantages. It could
be useful only if there were an API to handle it correctly and
transparently, and if the current APIs (open(), fopen(), etc.)
were no longer used. Basically this means that one would need a
new OS. This would also mean that a BOM could be seen as some
kind of metadata used by the new API, and having the charset in
the metadata would actually make BOM completely useless.

> Other reasons:
> * concatenating files adds a misplaced BOM
> * taking stuff from the middle loses them
> * tools like grep, patch, etc pick and insert lots of individual lines
> * tools that don't care about encodings would need to learn about them
> * files that appear the same will have a different hash due to presence or
>   absence of an invisible character that can appear/disappear with no
>   explicit request on the user's part
> * with UTF-8, we're 95% there.  For BOMs, there's almost no support.

This would also affect regexps: e.g. "^foo" would no longer match on
the first line of a file that starts with a BOM.
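
A minimal sketch in Python of the "^foo" case, together with the
concatenation and hash points from the list above. The sample bytes and
variable names are made up for illustration; real tools may behave
differently depending on how they decode their input.

```python
import hashlib
import re

# The UTF-8 encoding of U+FEFF, i.e. the three bytes EF BB BF.
BOM = "\ufeff".encode("utf-8")

# Hypothetical file contents: the same text without and with a BOM.
plain = b"foo\nbar\n"
with_bom = BOM + plain

# "^foo" anchors at the start of a line. With a BOM in front,
# "foo" is no longer at the start of the first line.
pattern = re.compile(rb"^foo", re.MULTILINE)
match_plain = pattern.search(plain) is not None
match_bom = pattern.search(with_bom) is not None

# Concatenating two BOM-prefixed files (as cat would do) leaves a
# second, misplaced BOM in the middle of the result.
joined = with_bom + with_bom
stray_bom_offset = joined.find(BOM, len(BOM))

# Files that look the same on screen hash differently.
same_hash = (hashlib.sha256(plain).digest()
             == hashlib.sha256(with_bom).digest())

print(match_plain, match_bom, stray_bom_offset, same_hash)
```

Working on raw bytes here is deliberate: it mimics a tool that does not
care about encodings, which is the case the quoted list is concerned with.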

> So I'm strongly against producing BOMs.  As for accepting them, there's
> little that can break so it would be mostly ok... but certainly not as
> a "must" clause.

Agreed.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
