
Bug#865713: Please Start UTF-8 debian-policy Text Files with UTF-8 Signature



On Sat, Jun 24, 2017 at 2:51 AM, Colin Watson <cjwatson@debian.org> wrote:
> On Fri, Jun 23, 2017 at 11:49:20PM -0700, Russ Allbery wrote:
>> I'm still a bit dubious about this, since I don't believe editors and
>> generators normally add it, but given how we generate the text versions of
>> the documents, it's relatively easy to add a leading BOM and seems
>> harmless.  I'll take a look.
>
> I share the discomfort in your previous message with using the UTF-8
> BOM.  I'd have thought that a better approach here would be to fix this
> at the HTTP layer:
> https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
> (and other text files here) should return "Content-Type: text/plain;
> charset=UTF-8", not just "Content-Type: text/plain".
>
> --
> Colin Watson                                       [cjwatson@debian.org]

If the ".txt" file were delivered with an HTTP header that includes the
"charset=UTF-8" parameter, that would hopefully fix it.  I also spent a
bit of time experimenting with the Firefox version installed in Stretch
to see whether there was a setting that would display the file correctly.
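
As a rough sketch of how one might check what the server declares, the
shell function below reads HTTP response headers on standard input and
prints the charset parameter of the Content-Type header, or "(none)" if
there is none.  The function name is hypothetical, not an existing tool.

```shell
# Hypothetical helper (not an existing tool): read HTTP response headers
# on stdin and print the charset declared in Content-Type, lowercased,
# or "(none)" when no charset parameter is present.
content_type_charset() {
    tr -d '\r' | awk '
        BEGIN { found = "(none)" }
        tolower($0) ~ /^content-type:/ {
            line = tolower($0)
            # "charset=" is 8 characters long; keep only the value.
            if (match(line, /charset=[^; ]+/)) {
                found = substr(line, RSTART + 8, RLENGTH - 8)
                gsub(/"/, "", found)   # drop optional quotes
            }
        }
        END { print found }'
}
```

Assuming curl is installed, piping the output of "curl -sI" for the
upgrading-checklist.txt URL into content_type_charset would show
whether the server sends a charset at all.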

Russ, you are correct that the Unicode standard counseled against
using the UTF-8 version of the BOM in earlier days.  That was for
standalone text files not necessarily served as pages on the web.
Now, however, HTML5 browsers are required to recognize this sequence
and so that guidance has loosened up; see

https://www.w3.org/International/questions/qa-byte-order-mark

"In HTML5 browsers are required to recognize the UTF-8 BOM and use it
to detect the encoding of the page, and recent versions of major
browsers handle the BOM as expected when used for UTF-8 encoded
pages."

The next paragraph, however, ends with: "However, bear in mind that it
is always a good idea to declare the encoding of your page using the
meta element, in addition to the BOM, so that the encoding is apparent
to people looking at the source text."  That advice implies using the
<meta charset="utf-8"> tag, which is intended for HTML documents, not
plain text files.

With the Firefox version installed in Stretch, the "Text Encoding"
button was not in the toolbar by default.  When I added it to the
toolbar and went to select "UTF-8" to try to come up with a way of
viewing the text file, it offered "Unicode" and "Western", but there
was no choice for UTF-8.  Choosing "Unicode" for that file
("upgrading-checklist.txt") did not change the appearance; I would
have expected "Unicode" to imply UTF-8 in a web browser.  Adding the
three-byte UTF-8 signature (EF BB BF) at the start of the file did.
At that point, I posted this bug report.
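
For illustration only, here is a minimal sketch of prepending that
three-byte signature with standard shell tools.  The function name and
the two-file interface are assumptions for the example, not how the
debian-policy build actually generates its text files.

```shell
# Hypothetical helper: write a copy of SRC to DST with the UTF-8
# signature (bytes EF BB BF) at the start, unless SRC already has it.
# SRC and DST must be different files.
add_utf8_bom() {
    src=$1
    dst=$2
    # Hex dump of the first three bytes, with whitespace removed.
    first3=$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        cp "$src" "$dst"
    else
        # \357\273\277 is EF BB BF in octal.
        { printf '\357\273\277'; cat "$src"; } > "$dst"
    fi
}
```

Something like "add_utf8_bom upgrading-checklist.txt with-bom.txt"
would then produce a copy for testing in the browser.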

Alternatively, if convenient, you could convert the non-breaking space
characters in that text file to plain spaces in a script.  That would
avoid the problem until the file needs some non-ASCII character other
than a non-breaking space.  With GNU sed, all of them can be converted
in one fell swoop with:

sed -i 's/\o302\o240/ /g' upgrading-checklist.txt

If that file is served as an HTML page anywhere, the UTF-8
non-breaking space could be converted to the HTML entity "&nbsp;" to
avoid non-ASCII content.
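
A possible sketch of that entity conversion follows, again assuming GNU
sed for the \oNNN escapes.  The function name is hypothetical.  Note
that "&" must be escaped in the replacement, since an unescaped "&" in
sed stands for the matched text.

```shell
# Hypothetical helper: replace each UTF-8 non-breaking space (C2 A0) on
# stdin with the HTML entity "&nbsp;".  Requires GNU sed (\oNNN escapes).
# LC_ALL=C makes sed treat the input as raw bytes.
nbsp_to_entity() {
    LC_ALL=C sed 's/\o302\o240/\&nbsp;/g'
}
```

It works as a filter, reading the file on standard input and writing
the converted text to standard output.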

Whatever route is taken (modifying the text file or making changes in
the Debian web server or something else), it would be nice if
eventually that text file rendered correctly in the Firefox browser
that is on the Stretch desktop.  (I'm using GNOME by the way.)

Of course, this whole thing is really a minor issue.


Paul Hardy

