
Bug#99933: third attempt at more comprehensive unicode policy



On Wed, 2003-01-15 at 06:15, Colin Watson wrote:

> I think this ought to be a reminder that taking a Debian-specific
> approach to this and reckoning that we can probably "get a fair number
> of upstreams to go along with it" is a mistake. If there isn't a
> widely-accepted standard, we will just create a mess.

I don't think it would be really Debian-specific; at least the *code*
would not be.  It would be generic in that it would give programs
Unicode and UTF-8 support, which would likely be quite easy to disable.

> Are the LSB interested in working on this?

I am a bit wary about involving them; it doesn't seem to quite fit in
with their charter.  However, I just noticed the 'Open
Internationalization Initiative', which falls under the same Free
Standards Group umbrella organization as the LSB.  Stuff like this
does seem like it would fit in with their work; charset issues and
internationalization do go hand in hand.

However, I just looked through the most recent release of their
standard, and they appear to be silent on all the issues under debate
here: what charset to use for filenames, how to handle filenames not in
UTF-8, etc.

So, while we're investigating those organizations, given that most
(basically all) of the controversy so far has focused on filenames, I
would like to introduce a revised policy proposal which basically just
drops the section on filenames created by programs.  That way we can have
a fairly strong statement of Unicode support, but leave off most of the
"bite" until later.

This should hopefully be less controversial.  Any seconds?

--- policy.sgml	2003-01-01 21:59:26.000000000 -0500
+++ policy.sgml.new	2003-01-15 21:30:02.000000000 -0500
@@ -2258,10 +2258,8 @@
 	</p>
 
 	<p>
-	  The entire changelog must be encoded in the
-	  <url id="http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html" name="UTF-8">
-	  encoding of
-	  <url id="http://www.unicode.org/" name="Unicode">.
+	  The entire changelog should be encoded in UTF-8; see <ref
+	  id="unicode"> for more information.
 	</p>
 	
 	<sect1><heading>Defining alternative changelog formats</heading>
@@ -4190,6 +4188,21 @@
       <sect>
 	<heading>Filesystem hierarchy</heading>
 
+	<sect1>
+	  <heading>File Names</heading>
+
+	  <p>
+	    Files included in Debian packages or created directly by
+	    maintainer scripts must have names which are valid UTF-8.
+	    Since UTF-8 is fully backwards compatible with ASCII, few
+	    packages will encounter trouble with this.
+	  </p>
+
+	  <p>
+	    See <ref id="unicode"> for more information on Debian and
+	    Unicode.
+	  </p>
+	</sect1>
 
 	<sect1>
 	  <heading>Filesystem Structure</heading>
@@ -5414,6 +5427,32 @@
 	</p>
       </sect>
 
+      <sect id="unicode">
+	<heading>Unicode</heading>
+
+	<p>
+	  Debian is moving towards
+	  <url id="http://www.unicode.org/" name="Unicode">,
+	  and specifically the <url id="http://www.ietf.org/rfc/rfc2279.txt" name="UTF-8">
+	  encoding of Unicode, for representation of character data.
+	  Unicode is a universal character set, able to encode all the
+	  world's languages.  Using Unicode makes internationalization
+	  much easier, since programs will have to deal with only one
+	  character set, instead of many different incompatible
+	  national variants.
+	</p>
+
+	<p>
+	  The UTF-8 encoding of Unicode is designed for Unix-like
+	  systems such as Debian.  It is fully backwards compatible
+	  with US-ASCII, and is also safe for use in filenames, since
+	  no ASCII character appears as part of a multibyte character.
+	  It is highly recommended, although not yet required, for
+	  programs included in Debian to support Unicode and
+	  specifically UTF-8.
+	</p>
+      </sect>
+
       <sect>
 	<heading>Environment variables</heading>
 
@@ -7647,6 +7686,42 @@
 	</p>
 
 	<p>
+	  All documentation included in a package should be encoded in
+	  UTF-8 (see <ref id="unicode"> for more information).  If
+	  upstream documentation is in another character set, the data
+	  should be converted during the package build process.
+	  <footnote>
+	    <p>
+	      One good way to do this is to use <prgn>iconv</prgn>, like:
+<example>
+	for file in ChangeLog doc/README doc/INSTALL; do
+	  iconv -f ISO-8859-15 -t UTF-8 "$file" &gt; "$file.new" && mv "$file.new" "$file"
+	done
+</example>
+	    </p>
+	  </footnote>
+	</p>
+
+	<p>
+	  Documentation formats which include a standard means of
+	  specifying the character set of the data (such as
+	  XML's <tt>encoding</tt> declaration) may at their option use
+	  another character set, although UTF-8 is still preferred.
+	  Additionally, it is recommended for document formats which
+	  are capable of specifying the character set of their data,
+	  and do not have a default (like HTML), to do so.
+	  <footnote>
+	    <p>
+	      As an example, for HTML documents, the <tt>head</tt>
+	      section should include a header like:
+<example>
+  &lt;META content='text/html; charset=UTF-8' http-equiv='Content-Type'/&gt;
+</example>
+	    </p>
+	  </footnote>
+	</p>
+
+	<p>
 	  Other formats such as PostScript may be provided at the
 	  package maintainer's discretion.
 	</p>
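
For anyone who wants to check a package tree against the File Names
requirement in the patch above, here is a minimal sketch in Python.  It
is only an illustration, not part of the proposal: the helper names
is_valid_utf8 and invalid_utf8_names are made up for this example, and
it assumes that "valid UTF-8" means whatever Python's strict UTF-8
decoder accepts.

```python
import os


def is_valid_utf8(name):
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def invalid_utf8_names(root):
    """Walk root (given as bytes) and collect paths whose final
    component is not valid UTF-8."""
    bad = []
    # Passing a bytes path makes os.walk yield raw bytes names, so the
    # names are checked exactly as they are stored on disk.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Calling invalid_utf8_names(b"debian/tmp") on an unpacked build tree (the
path is just an example) would list any offending names.  Pure ASCII
names always pass, since every ASCII string is also valid UTF-8.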
