
Bug#99933: second attempt at more comprehensive unicode policy



On Sat, 2003-01-04 at 10:55, Robert Bihlmeyer wrote:
> Colin Walters <walters@debian.org> writes:
> 
> > > As I see it, the current (broken ?) behaviour is, to use the user's
> > > locale setting (LC_CTYPE) to encode file names.  
> > 
> > It appears so, and yes, this behavior is completely and fundamentally
> > broken.
> 
> Whether or not this is broken is debatable. 

I don't think so.  I have put forth many real-world scenarios in which
using national charsets for filenames simply breaks, in ways that are
basically impossible to fix.  You may be able to get away with using a
national charset on a machine where everyone speaks the same language,
and never interacts with speakers of another language, but that's about
it.

What *is* debatable is when and how to make the transition, which is
what we're doing now.

> It is the current status
> quo, though, on a majority of systems. Breaking that willy-nilly is
> not acceptable.

Again, my policy proposal does *not* (I am 95% sure) create any new RC
bugs.  The only "must" is for filenames actually included in packages.  

I actually wrote another lintian patch for this (attached) which I ran
over my small sample of .debs, and found no new bugs.  It requires my
patch for GNU tar; see:
http://bugs.debian.org/175089
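
For those who don't read Perl, the check in the attached patch amounts to
roughly the following.  This is only a Python sketch of the idea, not what
lintian runs; the function names (is_valid_utf8, quote_bytes,
check_filenames) are made up for illustration, and it assumes the filenames
have already been extracted from the package as raw bytes.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def quote_bytes(name: bytes) -> str:
    """Escape control and non-ASCII bytes as \\xNN so a warning stays printable."""
    return "".join(chr(b) if 32 <= b < 127 else f"\\x{b:02X}" for b in name)


def check_filenames(pkg: str, pkg_type: str, names) -> list:
    """Build one warning line per filename that is not valid UTF-8.

    The tag name is the one from the attached patch; everything else
    here is a hypothetical stand-in for lintian's machinery.
    """
    tag = "package-contains-filename-using-obsolete-charset"
    return [
        f"W: {pkg} {pkg_type}: {tag} {quote_bytes(n)}"
        for n in names
        if not is_valid_utf8(n)
    ]
```

Note that this is a warning-style check: it reports offending names and
does not try to guess which charset they were written in.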

In my policy patch, using UTF-8 in programs in general is just a "should".

> I'd prefer:
> 
> 1. Programs are extended to handle UTF8 filenames iff LC_CTYPE is
>    UTF8. Programs that right now cope with other charsets can keep
>    this support if LC_CTYPE is set to any other value (even C).
>    Filenames incompatible with the current locale must be handled
>    reasonably.

First of all, there is no need for 'if and only if'.  Programs can
always try to decode filenames in UTF-8, and if that fails, then try the
locale's charset.
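
To make the fallback concrete, here is a minimal sketch in Python.  The
function name decode_filename and its locale_charset parameter are my own
invention for illustration; a real program would take the charset from its
locale (in Python, locale.getpreferredencoding(False) is one way to get it).

```python
def decode_filename(raw: bytes, locale_charset: str) -> str:
    """Decode filename bytes: strict UTF-8 first, then the locale's charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume a legacy name in the locale's charset.
        # errors="replace" keeps an undecodable name displayable instead of
        # raising a second time.
        return raw.decode(locale_charset, errors="replace")
```

The order matters: text in a national charset that contains non-ASCII
characters is only rarely valid UTF-8 by accident, so trying UTF-8 first
misidentifies very few legacy names, while an 8-bit charset such as
ISO-8859-1 would accept any byte sequence at all.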

Would it make you happy if I modified my policy proposal to say this?  
Again, note this part of my proposal is still not a "must".  Your
programs will not get RC bugs for a lack of UTF-8 support for filenames.

> Once this is implemented for a reasonable percentage of packages:
> 
> 2. A UTF8 locale is made the default on new installations. For
>    upgrades scripts are provided to convert filesystem trees over to
>    UTF8. Do a release.
> 
> 3. Support for non-UTF8 charsets is deprecated, removed, or succumbs
>    to bit rot.

I agree with this wholeheartedly.
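
As for the conversion scripts in step 2, something along these lines would
be a starting point.  This is only a sketch under my own assumptions: the
names utf8_name and plan_renames are hypothetical, the legacy charset has
to be supplied by the admin, and it deliberately only lists the renames
instead of performing them.

```python
import os


def utf8_name(raw: bytes, legacy_charset: str) -> bytes:
    """Return the UTF-8 form of a filename; valid UTF-8 passes through unchanged."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError if the name is not valid in the
        # legacy charset either; the caller must decide what to do then.
        return raw.decode(legacy_charset).encode("utf-8")


def plan_renames(root: bytes, legacy_charset: str) -> list:
    """Walk root bottom-up and list (old_path, new_path) pairs without renaming.

    Passing root as bytes makes os.walk yield raw byte names, and
    bottom-up order lists a directory's contents before the directory
    itself, so the old paths stay valid if applied in list order.
    """
    plan = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new = utf8_name(name, legacy_charset)
            if new != name:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, new)))
    return plan
```

A real script would also have to deal with name collisions and with
filenames stored inside other files, which this sketch ignores.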

> Yeah, and the Gnome2 file dialog completely ignores my latin1
> filenames. That's best practice?

Well, you might have to set G_BROKEN_FILENAMES.  But this is the whole
reason we are switching to UTF-8: so that programs will not have to deal
with the nightmare of recoding filenames!  If you feel strongly, however,
you could lobby the GNOME maintainers to fall back automatically to the
national encoding when UTF-8 decoding fails.

> Anyway, for my daily living Gnome2 is a quite irrelevant chunk of
> software. aterm, zsh, xemacs, mozilla are much more important. Only
> half of these support UTF8 right now AFAIK. I'd guess from the
> 80%-software in Debian less than 50 % handle UTF8.

I've noticed that UTF-8 sometimes makes zsh unhappy, but other than that
basically all the software I use every day (evolution, gnome-terminal,
GNU Emacs (well, from CVS), nautilus, and galeon) supports UTF-8
filenames.

--- lintian-1.22.4/checks/files	2003-01-02 12:46:17.000000000 -0500
+++ lintian-1.22.4.hacked/checks/files	2003-01-02 15:24:19.000000000 -0500
@@ -21,6 +21,7 @@
 
 use strict;
 use utf8;
+use Encode 'decode';
 
 ($#ARGV == 1) or fail("syntax: files <pkg> <type>");
 my $pkg = shift;
@@ -69,6 +70,12 @@
     my $link;
     my $operm;
 
+    if (not eval { decode('UTF-8', my $octets = $file, Encode::FB_CROAK()); 1 }) {
+        my $quotedfile = quote_string ($file);
+        print "W: $pkg $type: package-contains-filename-using-obsolete-charset $quotedfile\n";
+        # FIXME: should we continue here?
+    }
+
     $file =~ s,^\./,,;
 
     if ($file =~ s/ link to .*//) {
@@ -596,3 +603,12 @@
 
     return $o;
 }
+
+# hacked up from perluniintro: escape control and non-ASCII bytes as \xNN
+sub quote_string {
+  join("",
+       map { $_ < 32 || $_ > 126 ?
+             sprintf("\\x%02X", $_) :
+             chr($_)
+           } unpack("C*", $_[0]));
+}
--- lintian-1.22.4/checks/files.desc	2003-01-02 12:45:54.000000000 -0500
+++ lintian-1.22.4.hacked/checks/files.desc	2003-01-02 12:48:27.000000000 -0500
@@ -430,7 +430,7 @@
  directory. It was most likely installed by accident, since one examples/
  directory should be enough for everybody(tm).
 
-Tag: package-contains-file-with-name-in-obsolete-charset
+Tag: package-contains-filename-using-obsolete-charset
 Type: warning
 Info: All filenames used in a package should be valid UTF-8, an
  encoding of the Unicode character set.
