
Bug#865713: Please Start UTF-8 debian-policy Text Files with UTF-8 Signature



Paul,

On Sun, Jun 25, 2017 at 8:24 PM, Paul Wise <pabs@debian.org> wrote:
> On Sun, 2017-06-25 at 16:07 -0700, Paul Hardy wrote:
>
>> Earlier today, I sent the GNU less maintainer a two-line patch to the
>> "charset.c" file after my original email to him.
>
> I'm no expert on the less source code, but it seems to me that it will
> also hide U+FEFF characters after the first one.

You are correct.

> I would suggest
> updating it so that <U+FEFF> is only hidden when it is the first UTF-8
> character in the file.

Well, U+FEFF has a dual personality.  It began its life as "ZERO WIDTH
NO-BREAK SPACE (ZWNBSP)".  Then that use became deprecated; new
documents were supposed to use U+2060 ("WORD JOINER") instead.  Of
course, there might still be legacy documents around, and less might
have to display them.

The alias for U+FEFF is "BYTE ORDER MARK (BOM)", and U+FEFF can go by
either of its names.

If used as a BOM, then U+FEFF is supposed to appear at the beginning
of a document according to The Unicode Standard, and discarding all
instances of U+FEFF does not accommodate that.

But the proper handling of U+FEFF as ZWNBSP is to print it with zero
width and not to break between the surrounding characters.  You get
that same effect by not printing the character at all, which is what
the patch does.
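
To make the two policies concrete, here is a rough Python sketch.  It
is not the less patch (that is C, in "charset.c"); the function names
and the sample string are made up purely for illustration.

```python
BOM = "\ufeff"  # U+FEFF: BYTE ORDER MARK, formerly also ZWNBSP


def strip_leading_bom(text: str) -> str:
    """Hide U+FEFF only when it is the first character of the text."""
    return text[1:] if text.startswith(BOM) else text


def strip_all_feff(text: str) -> str:
    """Drop every U+FEFF, wherever it appears (what the patch amounts to)."""
    return text.replace(BOM, "")


# Hypothetical input: a signature at the start plus a legacy ZWNBSP inside.
sample = BOM + "foo" + BOM + "bar"

print(repr(strip_leading_bom(sample)))  # interior U+FEFF is still present
print(repr(strip_all_feff(sample)))     # no U+FEFF left at all
```

For a pager that only writes characters to a terminal, the second
function gives the same visible result as rendering the interior
U+FEFF with zero width.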


However, in the HTML5 link I posted earlier, it mentions that a
compliant HTML5 web browser must detect the BOM anywhere within an
HTML document and if present, treat the web page as having UTF-8
encoding (or UTF-16, depending on the BOM format encountered).  They
mention the reason for this is to allow web servers to claim that they
are serving one type of content in a generic fashion, but individual
Unicode documents that are embedded in the HTML response still should
correctly display.  The presence of the BOM anywhere in the web page
is supposed to override the HTTP header charset and any META charset
tags, if present.  That behavior requires interpreting U+FEFF anywhere
in the web page as a BOM.

This is a quote from p. 866 of the Unicode Standard 10.0.0 that goes
into some of the context-sensitive nature of U+FEFF (but note that
nobody is supposed to be using U+FEFF as a ZWNBSP any more, and that
has been the case for over a decade):

"Where the byte order is explicitly specified, such as in UTF-16BE or
UTF-16LE, then all U+FEFF characters—even at the very beginning of the
text—are to be interpreted as zero width no-break spaces. Similarly,
where Unicode text has known byte order, initial U+FEFF characters are
not required, but for backward compatibility are to be interpreted as
zero width no-break spaces. For example, for strings in an API, the
memory architecture of the processor provides the explicit byte order.
For databases and similar structures, it is much more efficient and
robust to use a uniform byte order for the same field (if not the
entire database), thereby avoiding use of the byte order mark.

"Systems that use the byte order mark must recognize when an initial
U+FEFF signals the byte order. In those cases, it is not part of the
textual content and should be removed before processing, because
otherwise it may be mistaken for a legitimate zero width no-break
space. To represent an initial U+FEFF zero width no-break space in a
UTF-16 file, use U+FEFF twice in a row. The first one is a byte order
mark; the second one is the initial zero width no-break space. See
Table 23-6 for a summary of encoding scheme signatures."
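
As a rough illustration of the quoted rules (again a Python sketch, not
anything from less), the signatures can be sniffed from the first few
bytes, and Python's own codecs happen to follow the same convention:
the generic "utf-16" codec consumes one initial U+FEFF as a byte order
mark, while "utf-16-le" keeps every U+FEFF as a character.  The helper
name sniff_signature is hypothetical.

```python
import codecs

# Byte sequences of an initial U+FEFF in each encoding scheme.
# UTF-32LE is tested before UTF-16LE because its signature (FF FE 00 00)
# begins with the UTF-16LE signature (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_signature(data: bytes):
    """Return the encoding scheme named by an initial signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


# U+FEFF twice in a row, then "a", all little-endian UTF-16.
doubled = codecs.BOM_UTF16_LE * 2 + "a".encode("utf-16-le")

print(sniff_signature(doubled))            # scheme guessed from the signature
print(repr(doubled.decode("utf-16")))      # first U+FEFF consumed as the BOM
print(repr(doubled.decode("utf-16-le")))   # explicit byte order: both kept
```

Note the sniffing is only a guess: UTF-16LE text whose first real
character is U+0000 would look like UTF-32LE to this helper.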

Yet the only "processing" that less should be doing is outputting one
line at a time.  It does not figure out line breaks dynamically the
way a WYSIWYG word processor would, for example.  If it did, then
this dual personality of U+FEFF would probably require introducing an
additional state variable into less.

So I think recognizing and discarding all occurrences of the BOM in
less produces the desired effect in all cases.


Paul

