Re: UTF-8 in jessie (debhelper and BOM)

To: debian-devel@lists.debian.org
Subject: Re: UTF-8 in jessie (debhelper and BOM)
From: Osamu Aoki <osamu@debian.org>
Date: Tue, 13 Aug 2013 13:44:03 +0900
Message-id: <20130813044403.GB19557@goofy.localdomain>
In-reply-to: <20130812135019.GB29418@pax.zz.de>
References: <20130812005152.GA28636@angband.pl> <20130812135019.GB29418@pax.zz.de>

Hi,

UTF-8 is indeed a good goal in principle.

(I agree, but I am struggling to update package documentation, since
Japanese text is known to be tough: JIS 2022/EUCJP/SHIFT-JIS/... are all
in use, and mixed EUC/SHIFT-JIS content can easily be confused with
LATIN-1.)
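
To illustrate why such content is hard to classify, here is a minimal
sketch in Python.  The function name, the candidate list and the sample
string are all just assumptions for illustration, not part of any
existing tool.  It reports every encoding from a fixed list that decodes
the bytes without error; LATIN-1 accepts any byte sequence, so it can
never be ruled out this way.

```python
def candidate_encodings(data: bytes) -> list:
    """Return each encoding from a fixed list that decodes `data` cleanly."""
    # Hypothetical candidate list; a real tool would need more heuristics.
    candidates = ["utf-8", "iso2022_jp", "euc_jp", "shift_jis", "latin-1"]
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the byte sequence
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    sample = "日本語".encode("euc_jp")  # illustrative EUC-JP input
    # latin-1 is always in the result, because every byte maps to a character.
    print(candidate_encodings(sample))
```

A clean decode only shows that an encoding is possible, not that it is
the right one, which is why automatic conversion of old documentation
still needs a human check.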

But I do not understand goal #5.  Why "MUST"?  Do you have a rationale?

On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
> On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
> > I propose the following sub-goals:
...
> > 4. all text files should be encoded in UTF-8

Yes.  But it would be nice to have some support from dh_installdocs :-)
                                                     ^^^^^^^^^^^^^^

> 5. All programs consuming UTF8 Text must understand a BOM.
                                      ^^^^

I agree with it as a "SHOULD", but should we really state "MUST"?

After all, a BOM has no value in UTF-8 except to upset some programs.
See Wikipedia page: http://en.wikipedia.org/wiki/Byte_order_mark

 | The Unicode Standard permits the BOM in UTF-8, but does not require
 | or recommend its use. Byte order has no meaning in UTF-8 ...
    (A pointer to the Unicode document is listed there.)

If it is only about a BOM at the very beginning of a file, it is
relatively easy.  But there is text data with bogus BOMs in the middle of
the content.  Should programs understand those too, to be safe?

FYI: I recently had a problem with PO files that contained lots of BOMs
in the middle of the text, which broke a XeTeX run.  (Please note that the
TeX family of programs has more elaborate character support than
Unicode-only UTF-8.  I would rather have XeTeX ...)  To me, a program to
filter out such BOMs would be nice.  But we should not shoot a good UTF-8
program because of stupid UTF-8 data containing BOMs.
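
A minimal sketch of such a filter in Python, assuming the input is valid
UTF-8.  The function names are hypothetical; only codecs.BOM_UTF8 (the
bytes EF BB BF) comes from the standard library.  The first function
handles the easy case of a BOM at the start of the data, the second one
also drops BOMs embedded in the content.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF


def strip_leading_bom(data: bytes) -> bytes:
    """Remove a single UTF-8 BOM at the very start of `data`, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_all_boms(data: bytes) -> bytes:
    """Remove every U+FEFF, leading or embedded, from UTF-8 encoded `data`.

    Raises UnicodeDecodeError if `data` is not valid UTF-8.
    """
    text = data.decode("utf-8")
    return text.replace(BOM_CHAR, "").encode("utf-8")


if __name__ == "__main__":
    import sys

    # Use as a pipe: read stdin, write the filtered bytes to stdout.
    sys.stdout.buffer.write(strip_all_boms(sys.stdin.buffer.read()))
```

Note that U+FEFF inside text was historically also used as a zero width
no-break space, so removing embedded ones is a deliberate choice here and
not something every consumer should do silently.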

Osamu
