
Bug#208011: [PROPOSAL] UTF-8 encoding for debian/control



Package: debian-policy
Version: 3.6.1.0
Severity: wishlist
Tags: patch

This proposal recommends UTF-8 encoding for debian/control as well as
for debian/changelog. The patch is attached, along with a plain-text
rendering for easier reading.

Kind regards,

Martin
--- debian-policy-3.6.1.0.orig/policy.sgml	Tue Aug 19 14:32:23 2003
+++ debian-policy-3.6.1.0/policy.sgml	Sun Aug 31 13:30:14 2003
@@ -2250,6 +2250,12 @@
 	  See <ref id="substvars"> for details.
 	</p>
 
+	<p>
+	  It is recommended that the control fields be encoded in 
+	  UTF-8 encoding, see <ref id="pkg-dpkgchangelog"> for
+	  further information on this.
+	</p>
+
       </sect>
 
       <sect id="binarycontrolfiles">

5.2. Source package control files -- `debian/control'
-----------------------------------------------------

     The `debian/control' file contains the most vital (and
     version-independent) information about the source package and about
     the binary packages it creates.

     The first paragraph of the control file contains information about the
     source package in general.  The subsequent sets each describe a binary
     package that the source tree builds.

     The fields in the general paragraph (the first one, for the source
     package) are:
        * `Source' (mandatory)
        * `Maintainer' (mandatory)
        * `Section' (recommended)
        * `Priority' (recommended)
        * `Build-Depends' et al
        * `Standards-Version' (recommended)

     The fields in the binary package paragraphs are:
        * `Package' (mandatory)
        * `Architecture' (mandatory)
        * `Section' (recommended)
        * `Priority' (recommended)
        * `Essential'
        * `Depends' et al
        * `Description' (mandatory)

     The syntax and semantics of the fields are described below.

     These fields are used by `dpkg-gencontrol' to generate control files
     for binary packages (see below), by `dpkg-genchanges' to generate the
     `.changes' file to accompany the upload, and by `dpkg-source' when it
     creates the `.dsc' source control file as part of a source archive.

     The fields here may contain variable references - their values will be
     substituted by `dpkg-gencontrol', `dpkg-genchanges' or `dpkg-source'
     when they generate output control files.  See Section 4.9, `Variable
     substitutions: `debian/substvars'' for details.

     It is recommended that the control fields be encoded in UTF-8
     encoding, see Section C.2.2, ``debian/changelog'' for further
     information on this.
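
     For a control file that is still in a legacy encoding, the
     conversion could be sketched as follows.  This is only an
     illustration: `to_utf8' is a hypothetical helper name, and the
     assumption that the old encoding is ISO-8859-1 is not part of the
     proposal.  The function leaves the file alone if it already decodes
     as UTF-8, so running it twice does not double-encode anything.

```shell
# to_utf8 is a hypothetical helper, not part of dpkg or any Debian tool.
# Assumption: a file that is not valid UTF-8 is encoded in ISO-8859-1.
to_utf8() {
    src=$1

    # Already valid UTF-8 (plain ASCII included): nothing to do.
    if iconv -f utf-8 -t ucs-4 < "$src" > /dev/null 2>&1; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    if iconv -f iso-8859-1 -t utf-8 < "$src" > "$tmp"; then
        # Overwrite the contents in place so the original file keeps
        # its ownership and permissions.
        cat "$tmp" > "$src"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

     It would be called as `to_utf8 debian/control'.  Every byte
     sequence is valid ISO-8859-1, so a wrong guess about the old
     encoding goes undetected; the result should be checked by eye.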


C.2.2. `debian/changelog'
-------------------------

     See Section 4.4, `Debian changelog: `debian/changelog''.

     It is recommended that the entire changelog be encoded in the UTF-8
     (http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html) encoding of
     Unicode (http://www.unicode.org/).[1]

[1]  Support for Unicode, and specifically UTF-8, is steadily increasing
     among popular applications in Debian.  For example, in unstable, GNOME
     2 has excellent support (almost level 2) in almost all its
     applications; the big remaining one is gnome-terminal, which
     requires a development version for UTF-8 support (available in
     Debian experimental now if you want to play).  I think that by the
     time sarge is released, UTF-8 support will start to hit critical mass.

     I think it is fairly obvious that we need to eventually transition to
     UTF-8 for our package infrastructure; it is really the only sane
     charset in an international environment.  Now, we can't switch to
     using UTF-8 for package control fields and the like until dpkg has
     better support, but one thing we can start doing today is requesting
     that Debian changelogs be UTF-8 encoded.  At some point, we can
     start requiring it.

     Checking a changelog for invalid UTF-8 is trivial.  Pipe the file
     through

          iconv -f utf-8 -t ucs-4

     discard the output, and check the exit status.  If the stream
     contains any invalid UTF-8 sequences, iconv will exit with an error
     code, which will happen for text in the vast majority of other
     character sets.
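
     The check described above can be wrapped in a small shell function.
     This is a minimal sketch: `is_utf8' and `check_files' are
     hypothetical names chosen for illustration, and nothing here is an
     existing dpkg or lintian interface.

```shell
# is_utf8 is a hypothetical helper name, not part of any Debian tool.
# It succeeds (exit status 0) only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f utf-8 -t ucs-4 < "$1" > /dev/null 2>&1
}

# check_files (also hypothetical) reports every argument that is not
# valid UTF-8 and returns non-zero if at least one file failed.
check_files() {
    status=0
    for f in "$@"; do
        if ! is_utf8 "$f"; then
            echo "not valid UTF-8: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

     A typical call would be `check_files debian/changelog
     debian/control'.  Pure ASCII files pass as well, since ASCII is a
     subset of UTF-8.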
