
Bug#99933: Bug#174982: [PROPOSAL]: Debian changelogs should be UTF-8 encoded



On Tue, Jan 07, 2003 at 09:29:44AM +0100, Radovan Garabik wrote:
[...]
> > > > #99933 goes a lot farther than #174982.  First of all, we can't even
> > > > suggest that people use UTF-8 in package control fields until all our
> > > > tools support it.  Right now it is just plain broken to put anything but
> > > > ASCII in them.
> > > 
> > > But people are putting ISO-8859-1 there, now and then.
> > 
> > Yes, and it is fundamentally broken to do so, because our tools do not
> > support it.  Displaying it might happen to work on the maintainer's
> > machine, but it will probably fail in many more places around the world,
> > where people use terminals with a different native encoding type.
> > 
> > > And I am going to use UTF-8 for Maintainer: in my packages, once
> > > I have new stable mail address (and new UTF-8 GPG alias)
> > 
> > Please only use ASCII until the tools support it, and file bugs against
> > packages with control fields with characters not in ASCII.  Otherwise
> > you are just worsening the problem by adding yet another encoding to the
> > mix of ISO-8859-1, ISO-8859-2, and who knows what else is already there.
> 
> but unless someone starts actually _using_ UTF-8, we would never know
> which tools are broken and which are not (I already found one bug
> in handling of UTF-8 GPG alias - I'll file the bugreport after some more
> testing).
> And remember, this is debian *un*stable, so some breakage is to be
> expected.

[Could this discussion take place on debian-i18n?]

Mixing legacy encodings and UTF-8 looks like a bad idea, but we can
usually determine whether a string is UTF-8 encoded or not.  So mixing
makes automatic conversion a bit harder, but it is not a real problem.
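
A minimal sketch of that check, in Python: try a strict UTF-8 decode and
fall back to a legacy encoding when it fails.  The function name `to_text`
and the ISO-8859-1 default are assumptions for illustration only; the
legacy encoding itself still has to be known or guessed.

```python
from typing import Tuple


def to_text(raw: bytes, legacy_encoding: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode raw bytes, preferring UTF-8 over a legacy encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        # Strict decoding rejects byte sequences that are not valid UTF-8,
        # which is what most non-ASCII legacy-encoded text turns out to be.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: assume the caller-supplied legacy encoding.
        return raw.decode(legacy_encoding), legacy_encoding
```

Pure ASCII input always takes the UTF-8 branch, since ASCII text is valid
UTF-8.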

The main problem with text files is that their encoding is not specified.
All human-editable text files must *explicitly* declare their encoding,
either in their content (like XML/SGML/HTML) or in their file name
(.txt documentation and man pages must carry their encoding in their
full name, and the naming scheme must be standardized).  This allows
support for both UTF-8 and legacy encodings.  (To Colin: you did not
notice any problem because ASCII text is also valid UTF-8, but problems
arise with all other legacy encodings.)
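
To illustrate the file name variant, here is a sketch in Python.  No
naming scheme has been standardized, so the `README.ISO-8859-2.txt`
pattern and the helper name `encoding_from_name` are purely hypothetical.

```python
import codecs
from typing import Optional


def encoding_from_name(filename: str) -> Optional[str]:
    """Return the encoding named by a dot-separated part of the file name.

    Hypothetical scheme: <base>.<ENCODING>.<extension>.  Returns None when
    no part after the base name is a codec name known to Python.
    """
    for part in filename.split(".")[1:]:
        try:
            # codecs.lookup() raises LookupError for unknown codec names.
            return codecs.lookup(part).name
        except LookupError:
            continue
    return None
```

This accepts any codec name Python knows, which is broader than a
standardized scheme would allow; a real implementation would check the
name against a fixed list.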

A good example is debconf.  Joey Hess added encoding information in 1.2.0;
legacy encodings are currently the default, and the switch to UTF-8 can
take place when the time comes, without any trouble.  Automatic conversion
to the user's locale (including UTF-8) is performed on output.
The only problem is that very few maintainers have managed to switch to
po-debconf, which is needed to add encoding information to their
templates files.

A similar approach could be considered for deb control files: a new
mandatory Encoding field would be added to debian/control (and
automatically put in other files when needed), giving the encoding used
by all control files.  Dpkg and friends could then perform automatic
conversion (to UTF-8 or to the current user's locale) if desired.
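
A rough sketch of how a tool might honour such a field, in Python.  The
Encoding field does not exist in dpkg today; the function name
`decode_control` and the ASCII default are assumptions made for this
example only.

```python
def decode_control(raw: bytes, default: str = "ascii") -> str:
    """Decode a control file using its (proposed) Encoding field.

    Field names are ASCII, so the field can be located in the raw bytes
    before the encoding of the field values is known.
    """
    encoding = default
    for line in raw.splitlines():
        # Continuation lines start with whitespace and are not fields.
        if line[:1] in (b" ", b"\t"):
            continue
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"encoding":
            encoding = value.strip().decode("ascii")
            break
    # Raises UnicodeDecodeError if the content does not match the
    # declared (or default) encoding.
    return raw.decode(encoding)
```

Once the text is decoded, re-encoding it to UTF-8 or to the user's locale
is a separate step on output, as debconf already does.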

Denis


