Re: UTF-8 in jessie

To: debian-devel@lists.debian.org
Subject: Re: UTF-8 in jessie
From: Adam Borowski <kilobyte@angband.pl>
Date: Mon, 12 Aug 2013 17:58:20 +0200
Message-id: <20130812155820.GA31455@angband.pl>
In-reply-to: <20130812135019.GB29418@pax.zz.de>
References: <20130812005152.GA28636@angband.pl> <20130812135019.GB29418@pax.zz.de>

On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> 5. All programs consuming UTF-8 text must understand a BOM.

I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose
once you standardize on UTF-8.  They might help with exchange with a minority
of Windows programs, at a cost on our side.  Windows hardly does plain text:
most of it is MSVC/etc. sources, but then, the C/C++ standards explicitly
forbid junk in places other than comments.  Most other languages expect a
hashbang as the first two bytes of the file on Unix, which makes BOMs
impossible.
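
A minimal sketch of the hashbang problem, in Python.  The helper
starts_with_hashbang is hypothetical and only mimics a loader that inspects
the first two bytes of a file; it is not any real kernel or shell interface.

```python
BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def starts_with_hashbang(data: bytes) -> bool:
    # Hypothetical helper: looks only at the first two bytes,
    # the way a "#!" check on a script would.
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
bom_script = BOM + script  # same script as saved by a BOM-writing editor

plain_ok = starts_with_hashbang(script)
bom_ok = starts_with_hashbang(bom_script)
```

With a BOM in front, the first two bytes are 0xEF 0xBB rather than "#!", so
the check fails even though the file looks identical in an editor.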

Other reasons:
* concatenating files leaves a misplaced BOM in the middle of the result
* taking a chunk from the middle of a file loses the BOM
* tools like grep, patch, etc. pick and insert lots of individual lines
* tools that don't care about encodings would need to learn about BOMs
* files that appear the same will have a different hash due to presence or
  absence of an invisible character that can appear/disappear with no
  explicit request on the user's part
* with UTF-8, we're 95% there.  For BOMs, there's almost no support.
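
The concatenation, slicing and hashing points above can be sketched in
Python.  The sample contents and variable names are made up for
illustration; plain byte operations stand in for cat and tail.

```python
import hashlib

BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

plain = "first\nsecond\n".encode("utf-8")
with_bom = BOM + plain

# Byte-wise concatenation (what cat does) keeps the second file's BOM,
# which now sits in the middle of the data.
joined = with_bom + with_bom
stray_boms = joined.count(BOM) - 1  # BOMs that are not at the very start

# Taking only the last line (roughly what tail -n 1 does) drops the BOM.
last_line = with_bom.splitlines(keepends=True)[-1]
lost_bom = not last_line.startswith(BOM)

# The text is the same once the BOM is ignored, but the hashes differ.
same_text = with_bom.decode("utf-8-sig") == plain.decode("utf-8")
same_hash = hashlib.sha256(with_bom).digest() == hashlib.sha256(plain).digest()
```

Here the "utf-8-sig" codec is Python's BOM-tolerant UTF-8 decoder; it is used
only to show that the two files carry the same visible text.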

So I'm strongly against producing BOMs.  As for accepting them, there's
little that can break so it would be mostly ok... but certainly not as
a "must" clause.
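
As a rough illustration of "accepting but not producing", a reader can drop
one leading BOM and otherwise pass bytes through untouched.  The function
name strip_leading_bom is hypothetical, and this is only a sketch of one
possible policy, not a proposal for any particular tool.

```python
import codecs


def strip_leading_bom(data: bytes) -> bytes:
    # Hypothetical helper: remove a single UTF-8 BOM at the very start.
    # A U+FEFF anywhere else is left alone.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


cleaned = strip_leading_bom(codecs.BOM_UTF8 + b"hello\n")
untouched = strip_leading_bom(b"hello\n")
```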


-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
