
Bug#241333: Mandate UTF-8 for changelog files



Russ Allbery <rra@debian.org> writes:

> The reason why I dropped the RFC reference is that there are multiple
> references to UTF-8 all over Policy these days and I don't really want
> to footnote all of them.  I'm not sure the best way to handle this.
> Maybe we need some sort of introductory mention of UTF-8 somewhere?

Here's a revised patch that adds a definition section, currently only
defining ASCII and UTF-8.  I haven't included the iconv trick; I think
that the Developers Reference or Lintian are better places for that.  But
I don't feel strongly about that and am willing to change my mind if people
disagree.
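
For reference, the check that the iconv trick performs (decode the file
as UTF-8 and see whether that fails) can be sketched in a few lines of
Python.  This is only an illustration and not part of the patch; the
function names is_valid_utf8 and check_changelog are made up for this
example, and the default path assumes the script is run from the top of
an unpacked source tree.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        # Strict error handling is the default, so any invalid byte
        # sequence raises UnicodeDecodeError.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_changelog(path: str = "debian/changelog") -> bool:
    """Hypothetical helper: read a changelog as bytes and validate it."""
    with open(path, "rb") as fh:
        return is_valid_utf8(fh.read())
```

Pure ASCII input passes this check, since ASCII is a subset of UTF-8.
A file in a legacy 8-bit encoding such as ISO 8859-1 will usually fail
it as soon as it contains a non-ASCII character.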

diff --git a/policy.sgml b/policy.sgml
index 24c9072..219664d 100644
--- a/policy.sgml
+++ b/policy.sgml
@@ -273,6 +273,32 @@
 	</p>
       </sect>
 
+      <sect id="definitions">
+	<heading>Definitions</heading>
+
+	<p>
+	  The following terms are used in this Policy Manual:
+	  <taglist>
+	    <tag>ASCII</tag>
+	    <item>
+	      The character encoding specified by ANSI X3.4-1986 and its
+	      predecessor standards, referred to in MIME as US-ASCII, and
+	      corresponding to an encoding in eight bits per character of
+	      the first 128 <url id="http://www.unicode.org/"
+	      name="Unicode"> characters, with the eighth bit always zero.
+	    </item>
+	    <tag>UTF-8</tag>
+	    <item>
+	      The transformation format (sometimes called encoding) of
+	      <url id="http://www.unicode.org/" name="Unicode"> defined by
+	      <url id="http://www.rfc-editor.org/rfc/rfc3629.txt"
+	      name="RFC 3629">.  UTF-8 has the useful property of having
+	      ASCII as a subset, so any text encoded in ASCII is trivially
+	      also valid UTF-8.
+	    </item>
+	  </taglist>
+	</p>
+      </sect>
     </chapt>
 
 
@@ -1473,10 +1499,6 @@
 	</p>
 
         <p>
-          
-        </p>
-
-        <p>
           The format of the <file>debian/changelog</file> allows the
 	  package building tools to discover which version of the package
 	  is being built and find out other release-specific information.
@@ -1582,6 +1604,10 @@
 	</p>
 
 	<p>
+	  The entire changelog must be encoded in UTF-8.
+	</p>
+
+	<p>
 	  For more information on placement of the changelog files
 	  within binary packages, please see <ref id="changelogs">.
 	</p>
@@ -9822,36 +9848,6 @@ install-info --quiet --remove /usr/share/info/foobar.info
 	    See <ref id="dpkgchangelog">.
 	  </p>
 
-	  <p>
-	    It is recommended that the entire changelog be encoded in the
-	    <url id="http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html" name="UTF-8">
-	    encoding of
-	    <url id="http://www.unicode.org/"
-	    name="Unicode">.<footnote>
-	      <p>
-		I think it is fairly obvious that we need to
-		eventually transition to UTF-8 for our package
-		infrastructure; it is really the only sane char-set in
-		an international environment.  Now, we can't switch to
-		using UTF-8 for package control fields and the like
-		until dpkg has better support, but one thing we can
-		start doing today is requesting that Debian changelogs
-		are UTF-8 encoded. At some point in time, we can start
-		requiring them to do so. 
-	      </p>
-	      <p>
-		Checking for non-UTF8 characters in a changelog is
-		trivial.  Dump the file through 
-		<example>iconv -f utf-8 -t ucs-4</example>
-                  discard the output, and check the return
-		value.  If there are any characters in the stream
-		which are invalid UTF-8 sequences, iconv will exit
-		with an error code; and this will be the case for the
-		vast majority of other character sets.
-	      </p>
-	    </footnote>
-	  </p>
-
  	  <sect2><heading>Defining alternative changelog formats
 	    </heading>
 


-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>


