Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 02 Jan 2003 17:25:15 -0500
Message-id: <1041546314.22038.9.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <1041533855.15063.19.camel@space-ghost>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost>

On Thu, 2003-01-02 at 13:57, Colin Walters wrote:

> #99933 goes a lot farther than #174982.

I have a counter-proposal to #99933, which I have attached.  I believe
it fixes the problems I raised with your proposal, and should also cover
some new areas (like filenames).  I also hopefully fixed James' issue
with the RFC link.

This patch supplants the one in #174982.  It is more ambitious than
#174982, but still does not introduce any "must"s, only "should"s or
weaker. 

Opinions?

--- policy.sgml	2003-01-01 21:59:26.000000000 -0500
+++ policy.sgml.new	2003-01-02 17:14:56.000000000 -0500
@@ -2258,10 +2258,8 @@
 	</p>
 
 	<p>
-	  The entire changelog must be encoded in the
-	  <url id="http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html" name="UTF-8">
-	  encoding of
-	  <url id="http://www.unicode.org/" name="Unicode">.
+	  The entire changelog should be encoded in UTF-8; see <ref
+	  id="unicode"> for more information.
 	</p>
 	
 	<sect1><heading>Defining alternative changelog formats</heading>
@@ -4190,6 +4188,31 @@
       <sect>
 	<heading>Filesystem hierarchy</heading>
 
+	<sect1>
+	  <heading>File Names</heading>
+
+	  <p>
+	    Files included in Debian packages or created by maintainer
+	    scripts must have names which are valid UTF-8.  Since
+	    UTF-8 is fully backwards compatible with ASCII, few
+	    packages will encounter trouble with this.
+	  </p>
+
+	  <p>
+	    Programs should expect filenames in general (whether from
+	    a Debian package or created by the user) to be encoded
+	    in UTF-8, although it is recommended that programs fall
+	    back gracefully to the current locale's encoding if
+	    decoding as UTF-8 fails.  Programs included in Debian
+	    packages should, when creating new files, encode their
+	    names in UTF-8 by default.
+	  </p>
+
+	  <p>
+	    See <ref id="unicode"> for more information on Debian and
+	    Unicode.
+	  </p>
+	</sect1>
 
 	<sect1>
 	  <heading>Filesystem Structure</heading>
@@ -5414,6 +5437,32 @@
 	</p>
       </sect>
 
+      <sect id="unicode">
+	<heading>Unicode</heading>
+
+	<p>
+	  Debian is moving towards
+	  <url id="http://www.unicode.org/" name="Unicode">,
+	  and specifically the <url id="http://www.ietf.org/rfc/rfc2279.txt" name="UTF-8">
+	  encoding of Unicode, for representation of character data.
+	  Unicode is a universal character set, able to encode all the
+	  world's languages.  Using Unicode makes internationalization
+	  much easier, since programs will have to deal with only one
+	  character set, instead of many different incompatible
+	  national variants.
+	</p>
+
+	<p>
+	  The UTF-8 encoding of Unicode is well suited to Unix-like
+	  systems such as Debian.  It is fully backwards compatible
+	  with US-ASCII, and is also safe for use in filenames, since
+	  no ASCII byte (such as '/' or NUL) appears inside a multibyte
+	  sequence.  It is highly recommended, although not yet
+	  required, for programs included in Debian to support Unicode
+	  and specifically UTF-8.
+	</p>
+      </sect>
+
       <sect>
 	<heading>Environment variables</heading>
 
@@ -7647,6 +7696,42 @@
 	</p>
 
 	<p>
+	  All documentation included in a package should be encoded in
+	  UTF-8 (see <ref id="unicode"> for more information).  If
+	  upstream documentation is in another character set, the data
+	  should be converted during the package build process.
+	  <footnote>
+	    <p>
+	      One good way to do this is to use <prgn>iconv</prgn>, like:
+<example>
+	for file in ChangeLog doc/README doc/INSTALL; do
+	  iconv -f ISO-8859-1 -t UTF-8 "$file" &gt; "$file.new" && mv "$file.new" "$file"
+	done
+</example>
+	    </p>
+	  </footnote>
+	</p>
+
+	<p>
+	  Documentation formats which include a standard means of
+	  specifying the character set of the data (such as the
+	  <tt>encoding</tt> declaration in XML) may at their option
+	  use another character set, although UTF-8 is still preferred.
+	  Additionally, it is recommended that documents in formats
+	  which can specify the character set of their data, but
+	  have no default (like HTML), do so.
+	  <footnote>
+	    <p>
+	      As an example, for HTML documents, the <tt>head</tt>
+	      section should include a header like:
+<example>
+  &lt;META content='text/html; charset=UTF-8' http-equiv='Content-Type'/&gt;
+</example>
+	    </p>
+	  </footnote>
+	</p>
+
+	<p>
 	  Other formats such as PostScript may be provided at the
 	  package maintainer's discretion.
 	</p>
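
As an illustration of the filename behaviour the "File Names" section
above recommends, a program might do something like the following.
This is only a sketch under stated assumptions: Python is an arbitrary
choice, and the names decode_filename and encode_new_filename are
hypothetical, not part of the proposal or of any Debian tool.

```python
import locale
from typing import Optional


def decode_filename(raw: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Decode a filename as UTF-8, falling back to the locale's encoding.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        # Strict decoding: any byte sequence that is not valid UTF-8 raises.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if fallback_encoding is None:
        # Assumption: the "current locale's encoding" is what Python reports here.
        fallback_encoding = locale.getpreferredencoding(False)
    # errors="replace" substitutes U+FFFD rather than failing a second time.
    return raw.decode(fallback_encoding, errors="replace")


def encode_new_filename(name: str) -> bytes:
    """Encode the name of a newly created file in UTF-8 by default."""
    return name.encode("utf-8")
```

Trying UTF-8 first is reasonable because legacy 8-bit text rarely
happens to form valid UTF-8 multibyte sequences, so a successful strict
decode is a strong hint that the name really is UTF-8.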
