
Bug#865713: Declaring a charset of UTF-8 for policy files



On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise <pabs@debian.org> wrote:
> On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:
>
>> For what it's worth, I agree that declaring the correct charset in HTTP
>> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
>> (aka the "byte-order mark") in the file content.

Yes, the BOM was only intended for UTF-16, which can actually have
two different byte orders.  Because there is no such thing as "byte
order" with UTF-8, the World Wide Web has rebranded the three-byte
UTF-8 encoding of U+FEFF as the "UTF-8 signature".  The original
intention of the Unicode Consortium was that the sequence would never
be used in a UTF-8 document.
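As a concrete illustration (assuming a system with iconv and od
available), U+FEFF is a single two-byte code unit in UTF-16BE, but
converting that one character to UTF-8 yields the three-byte sequence
ef bb bf -- the "UTF-8 signature":

```shell
# U+FEFF is the two bytes FE FF in UTF-16BE; converting that single
# character to UTF-8 produces the three-byte "signature" EF BB BF.
printf '\xfe\xff' | iconv -f UTF-16BE -t UTF-8 | od -A n -t x1
```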


In Firefox, if you press Ctrl+Shift+Q you will get an "Inspector".  Loading
https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
with the Network tab selected in the Inspector shows multiple tabs.
The "Console" tab gives this message, highlighted in pink [for
dramatic effect]:

"The character encoding of the plain text document was not declared.
The document will render with garbled text in some browser
configurations if the document contains characters from outside the
US-ASCII range. The character encoding of the file needs to be
declared in the transfer protocol or file needs to use a byte order
mark as an encoding signature."

So the browser is encouraging the use of this three-byte UTF-8 version
of U+FEFF, even though it was never supposed to be used in a document.
We live in an imperfect world.


Going to the Network tab, reloading the page, and clicking on "Raw
Headers" shows the following information (I just made the request
again):

Request Headers:
Host: www.debian.org
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Sat, 24 Jun 2017 20:17:13 GMT
If-None-Match: "e965-552ba67456626-gzip"
Cache-Control: max-age=0

Response Headers:
Accept-Ranges: bytes
Cache-Control: max-age=86400
Connection: Keep-Alive
Content-Encoding: gzip
Content-Length: 18592
Content-Type: text/plain
Date: Sun, 25 Jun 2017 02:10:24 GMT
Etag: "e965-552ba67456626-gzip"
Expires: Mon, 26 Jun 2017 02:10:24 GMT
Keep-Alive: timeout=5, max=100
Last-Modified: Sat, 24 Jun 2017 20:17:13 GMT
Server: Apache
Strict-Transport-Security: max-age=15552000
Vary: Accept-Encoding
X-Clacks-Overhead: GNU Terry Pratchett
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
X-XSS-Protection: 1
referrer-policy: no-referrer

So the Content-Type is "text/plain" with no charset parameter, which
is what can produce the "garbled text" that the Firefox Console warns
about.  As an aside, the Content-Encoding is "gzip", which is a good
thing.


On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise <pabs@debian.org> wrote:
> Forcing every text file to UTF-8 isn't the correct solution either,
> since it breaks text files that are not encoded in UTF-8 (such as old
> dedication texts) and does not work on Debian mirrors that are not
> controlled by us.

If using the UTF-8 signature in a document is too aesthetically
distasteful (and I don't disagree), and if setting the HTTP header to
denote a UTF-8 charset is not a universal solution because it will
only take effect on Debian's own servers, would a tool that converted
such text files to HTML documents be desirable?  Such a hypothetical
tool would insert <meta charset="UTF-8"> in the document's head.

If that is an acceptable solution, I could put together an awk script
for Debian (if it would get used) that would employ awk's BEGIN and
END sections to wrap a UTF-8 document in HTML tags, enclosing the text
itself in <PRE>...</PRE> tags.  That would mean that Debian UTF-8
documents intended to be served on the web would have to be run
through such a utility and converted into HTML pages for display.
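A minimal sketch of what such a wrapper could look like (the function
name and sample file are purely illustrative, not an existing Debian
tool):

```shell
#!/bin/sh
# Hypothetical wrapper: convert a UTF-8 text file into an HTML page
# whose head declares the encoding with <meta charset="UTF-8">.
# awk's BEGIN block emits the opening tags, END emits the closing
# ones, and the main rule escapes HTML metacharacters in the body.
wrap_txt() {
  awk 'BEGIN {
         print "<!DOCTYPE html>"
         print "<html><head><meta charset=\"UTF-8\"></head><body><pre>"
       }
       {
         gsub(/&/, "\\&amp;")   # must run first, before &lt;/&gt; are added
         gsub(/</, "\\&lt;")
         gsub(/>/, "\\&gt;")
         print
       }
       END { print "</pre></body></html>" }' "$@"
}

# Example: wrap a small sample file
printf 'Upgrading checklist\n1 < 2 & 2 > 1\n' > sample.txt
wrap_txt sample.txt
```

Escaping &, <, and > is what makes <PRE> safe here: without it, any
angle brackets in the source text would be parsed as HTML markup.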

Three possibilities seem to exist, and I am fine with any one being chosen:

1) Use the UTF-8 signature in UTF-8 text files
2) Set the HTTP headers for charset="UTF-8"
3) Convert UTF-8 text files to HTML documents for web display
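For option 2, on Debian's own web servers (Apache, per the Server
header above), the charset could be declared with the standard
AddDefaultCharset directive -- a sketch only, with an illustrative
directory path:

```apache
# Hypothetical Apache configuration fragment: AddDefaultCharset
# appends "; charset=UTF-8" to text/plain and text/html responses.
# The directory path here is illustrative, not Debian's actual layout.
<Directory "/srv/www/debian.org/doc">
    AddDefaultCharset UTF-8
</Directory>
```

As the earlier quote notes, this only helps on servers Debian
controls; mirrors with their own configuration would be unaffected.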


Paul Hardy

