Re: Bug#865713: Declaring a charset of UTF-8 for policy files



On Sat, Jun 24, 2017 at 8:07 PM, Paul Hardy <unifoundry@gmail.com> wrote:
> On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise <pabs@debian.org> wrote:
>> On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:
>>
>>> For what it's worth, I agree that declaring the correct charset in HTTP
>>> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
>>> (aka the "byte-order mark") in the file content.
>
> Yes, the BOM was only intended for UTF-16, which could actually have
> two different byte orders.  Because there is no such thing as "byte
> order" with UTF-8, the world wide web has rebranded the UTF-8
> three-byte version of U+FEFF as the "UTF-8 signature".  The original
> intention of The Unicode Consortium was that the sequence would never
> be used in a UTF-8 document.

Russ and I and one other person have alluded to this, but I thought
I'd give the exact text from the Unicode 10.0.0 Standard, which was
just released half a week ago.  The quote is on the bottom of page 67.
The Standard is available at

http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

Times (and attitudes towards the BOM) change, as the quote shows.  I
added the "0x" prefix for hexadecimal; the Standard uses a subscript 16:

"Unicode Signature. An initial BOM may also serve as an implicit
marker to identify a file as containing Unicode text. For UTF-16, the
sequence 0xFE 0xFF (or its byte-reversed counterpart, 0xFF 0xFE) is
exceedingly rare at the outset of text files that use other character
encodings. The corresponding UTF-8 BOM sequence, 0xEF 0xBB 0xBF, is
also exceedingly rare. In either case, it is therefore unlikely to be
confused with real text data. The same is true for both single-byte
and multibyte encodings.

"Data streams (or files) that begin with the U+FEFF byte order mark
are likely to contain Unicode characters. It is recommended that
applications sending or receiving untyped data streams of coded
characters use this signature. If other signaling methods are used,
signatures should not be employed."
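
To make the quoted byte sequences concrete, here is a minimal Python
sketch of how a receiver of an untyped stream might check for a
signature.  The function name sniff_signature and the choice to
consider only UTF-8 and UTF-16 are my own assumptions for
illustration; nothing here comes from the Standard beyond the byte
values themselves.

```python
import codecs

# Signatures to look for, paired with the encoding each one implies.
# Assumption: only UTF-8 and UTF-16 are considered. UTF-32 is ignored,
# although its little-endian BOM (FF FE 00 00) starts with the same
# two bytes as the UTF-16 little-endian BOM.
_SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_signature(data: bytes):
    """Return (encoding, payload) if data starts with a known signature.

    The payload has the signature bytes removed. If no signature is
    found, return (None, data) unchanged.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


if __name__ == "__main__":
    encoding, payload = sniff_signature(b"\xef\xbb\xbfDebian Policy")
    print(encoding, payload)
```

If the charset is already declared some other way, for example in
HTTP metadata, a check like this should not be needed, which is the
point of the last sentence of the quote.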


Paul Hardy

