Bug#99933: Bug#174982: [PROPOSAL]: Debian changelogs should be UTF-8 encoded

To: Denis Barbier <barbier@linuxfr.org>, 99933@bugs.debian.org
Subject: Bug#99933: Bug#174982: [PROPOSAL]: Debian changelogs should be UTF-8 encoded
From: Colin Walters <walters@debian.org>
Date: 07 Jan 2003 10:23:14 -0500
Message-id: <1041952994.30846.11.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <20030107092933.GB7260@zobe.linuxfr.org>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost> <20030103164539.GA22588@atlas15.dnp.fmph.uniba.sk> <1041618266.31344.25.camel@space-ghost> <20030107082944.GA27314@atlas15.dnp.fmph.uniba.sk> <20030107092933.GB7260@zobe.linuxfr.org>

On Tue, 2003-01-07 at 04:29, Denis Barbier wrote:

> > but unless someone starts actually _using_ UTF-8, we would never know
> > which tools are broken and which are not (I already found one bug
> > in handling of UTF-8 GPG alias - I'll file the bugreport after some more
> > testing).

Testing our tools' support for UTF-8 on your local system is perfectly
fine; I've been doing just that personally.  But, ...

> > And remember, this is debian *un*stable, so some breakage is to be
> > expected.

Uploading packages with UTF-8 control fields is not ok.  Simply put, it
will not work for anyone who's not using a UTF-8 terminal, which is
unfortunately probably most of our users at the moment.  Just Don't Do
It.

If you really want to help push UTF-8, apply my dpkg patch, help
find/fix bugs in it, then start ensuring apt-get, aptitude, etc., all
grok UTF-8.

> [Could this discussion take place on debian-i18n?]

Actually I think we should probably move to -devel, given how strongly
this affects the system in general.  Even people who maintain programs
which care little for i18n will still have to deal with UTF-8 filenames,
and should be UTF-8 aware in general.

It looks to me like at this point almost everyone agrees with the
content of my proposal in #99933, and we are discussing implementation
details.  Agreed?

If so, another second would be cool :)  And also if that is the case,
then it makes a better argument for moving to -devel.

> Mixing legacy encodings and UTF-8 looks like a bad idea, except that
> we can determine whether strings are UTF-8 encoded or not.  

Not with perfect reliability.
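
To illustrate why the check cannot be perfectly reliable, here is a
minimal Python sketch.  The helper name looks_like_utf8 and the sample
strings are made up for this example.  A strict decode rejects most
legacy-encoded text, but some ISO-8859-1 byte sequences happen to be
well-formed UTF-8 too:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "Ren\u00e9" (e with acute accent) in ISO-8859-1 ends in the single
# byte 0xE9, which is not a complete UTF-8 sequence, so it is rejected.
latin1_name = "Ren\u00e9".encode("iso-8859-1")

# The two ISO-8859-1 characters U+00C3 U+00A9 encode to the bytes
# 0xC3 0xA9, which are also the UTF-8 encoding of U+00E9.  A validity
# check alone cannot tell which reading the author intended.
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")

print(looks_like_utf8(latin1_name))  # rejected as UTF-8
print(looks_like_utf8(ambiguous))    # accepted, although it was ISO-8859-1
```

Such collisions are rare in real text, which is why the check is useful
in practice without being a guarantee.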

> The main problem with text files is that their encoding is not specified.
> All human editable text files must *explicitly* tell their encoding,
> either by their content (like XML/SGML/HTML) or by their file name
> (.txt documentation or man pages must contain their encoding in their
> full name, naming scheme must be standardized).  This allows support
> for both UTF-8 and legacy encodings.  

You mean like changelog.txt.UTF-8 or changelog.UTF-8.txt?  I am pretty
much opposed to any sort of proposal of this form.  Changing programs to
recognize our arbitrary scheme for file encodings would be a lot of
work.  Instead, we could add support to programs to autodetect the
charset semi-intelligently from file content, which is what real-world
programs like Emacs do today.
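
As a rough idea of what content-based detection could look like, here is
a small Python sketch.  It is not how Emacs does it; the function name
decode_guess and the ISO-8859-1 fallback are assumptions made for
illustration.  It tries UTF-8 first and only falls back to a legacy
charset when the bytes are not well-formed UTF-8:

```python
from typing import Tuple


def decode_guess(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode data, preferring UTF-8 over a legacy charset.

    Returns (text, encoding_used).  The fallback charset is an
    assumption; a real tool would probably take it from the locale.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so treat it as the legacy encoding.
        return data.decode(fallback), fallback


text, used = decode_guess(b"Ren\xe9")  # ISO-8859-1 bytes, not valid UTF-8
print(used)
```

No file naming scheme is needed for this to work, at the cost of an
occasional wrong guess on legacy text that happens to be valid UTF-8.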

> (To Colin: you did not notice any
> problem because ASCII text is UTF-8, but problems arise with all other
> legacy encodings).

Actually I quite frequently notice problems with European names, as well
as the copyright character.  Do not assume that because my native
language is English I do not experience charset problems :)

> A similar approach could be considered for deb control files, a new
> mandatory Encoding field must be added to debian/control (and automatically
> put in other files when needed), which tells encoding used by all control
> files.  Dpkg and friends may then perform automatic conversion (to UTF-8 or
> to current user's locale) if desired.

Ugh.  I am generally quite opposed to adding an Encoding field, and I
bet you'll find the dpkg maintainers are too.  It should just be UTF-8,
period.  If developers really want to, they can generate control from a
control.in file by using iconv or similar.
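
For the record, the conversion step is small.  Below is a Python sketch
roughly equivalent to running
"iconv -f ISO-8859-1 -t UTF-8 control.in > control".  The function name
transcode_to_utf8 is hypothetical, and the ISO-8859-1 source encoding is
only an example; a maintainer would pass whatever encoding control.in
actually uses:

```python
from pathlib import Path


def transcode_to_utf8(src: Path, dst: Path, src_encoding: str) -> None:
    """Write a UTF-8 copy of src to dst (hypothetical helper).

    Works on raw bytes so that line endings are left untouched.
    Raises UnicodeDecodeError if src is not valid in src_encoding.
    """
    text = src.read_bytes().decode(src_encoding)
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    source = Path("debian/control.in")
    if source.exists():
        # Assumed layout: control.in sits next to the generated control.
        transcode_to_utf8(source, Path("debian/control"), "iso-8859-1")
```

This could be called from debian/rules before the package is built, so
that only the UTF-8 file ever ends up in the archive.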
