Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Cc: starner@okstate.edu
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 15 Jan 2003 01:17:51 -0500
Message-id: <1042611470.28339.96.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <522157209.1042599036258.JavaMail.root@dexter.okstate.edu>
References: <522157209.1042599036258.JavaMail.root@dexter.okstate.edu>

On Tue, 2003-01-14 at 21:50, starner@okstate.edu wrote:

> And? A POSIX filename is not a string of characters, it's a string 
> of bytes. You have no technical need to differentiate between the
> two.

If you do any sort of character-oriented manipulation on those names,
you will.
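
To illustrate, here is a minimal Python sketch (the filename and the
helper names are made up for this example, and UTF-8 is assumed as the
on-disk encoding).  Truncating a name by bytes can cut a multi-byte
character in half, while truncating by characters requires knowing the
charset:

```python
# Hypothetical filename, chosen only for illustration.
name = "r\u00e9sum\u00e9.txt"          # "résumé.txt"
raw = name.encode("utf-8")             # the byte string the filesystem stores

char_count = len(name)                 # counts characters
byte_count = len(raw)                  # counts bytes; each 'é' is two bytes in UTF-8


def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Byte-oriented truncation: may split a multi-byte character."""
    return data[:limit]


def truncate_chars(data: bytes, limit: int, charset: str = "utf-8") -> bytes:
    """Character-oriented truncation: only possible if the charset is known."""
    return data.decode(charset)[:limit].encode(charset)


broken = truncate_bytes(raw, 2)        # 'r' plus the first half of 'é'
intact = truncate_chars(raw, 2)        # 'r' plus a complete 'é'
```

Here `broken` is no longer valid UTF-8, whereas `intact` is.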

> Good. It reminds me not to have filenames that I have no way of
> entering into the computer.

Well, that may be fine for you, but can you say it's fine for everyone
in the world?

> Arguably, it's the only sane character set to use for anything.

I'm glad we agree on this much :)

> But using it for filenames and not for everything else is not
> a solution. 

Well, it's not an optimal solution, for sure; but it does solve some
problems, I think.  At the expense of creating others, admittedly; but I
think we can work to fix the latter.

> One example: You're leaving text files in the locale charset - but
> a shell script is just another text file, and needs to reference
> filenames. How do you reference a filename not in your locale 
> charset? Either bash does not recode it, and the name of non-ASCII
> files is mojibake, or you do recode it, and it's impossible to 
> reference files not in your locale charset.

Well, hopefully most shell scripts would not be directly referencing the
files on the system, so they will continue to work.
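
For reference, the two failure modes described in the quoted paragraph
can be reproduced with a short Python sketch.  The filenames, the choice
of ISO-8859-1 as the locale charset, and the `can_reference` helper are
all assumptions made for this illustration, not anything bash itself
does:

```python
LOCALE_CHARSET = "iso-8859-1"   # assumed locale charset for this sketch

# Case 1: no recoding.  UTF-8 bytes are shown as if they were Latin-1.
utf8_name = "na\u00efve.txt".encode("utf-8")      # "naïve.txt" on disk
mojibake = utf8_name.decode(LOCALE_CHARSET)       # what the user would see


# Case 2: recoding.  A name outside the locale charset cannot be written
# in a script that is stored in that charset.
def can_reference(name: str, charset: str = LOCALE_CHARSET) -> bool:
    """Return True if `name` can be spelled in a file using `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


latin_ok = can_reference("na\u00efve.txt")        # 'ï' exists in Latin-1
cjk_ok = can_reference("\u65e5\u672c.txt")        # "日本.txt" does not fit
```

In the first case the 'ï' shows up as two unrelated Latin-1 characters;
in the second, the name simply has no spelling in the script's charset.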

> Making catastrophes that much more fun.

True enough.

> Are you volunteering to write patches for every program in Debian, and
> maintain them (since the upstream author probably won't be interested
> in this Debian-only scheme)?

No, but I am volunteering to write some patches for some programs.  I
think we might be able to get a fair number of upstreams to go along
with it.

> >Note again that GNOME programs and the like are already creating UTF-8
> >filenames, because they work completely in UTF-8 internally.  
> 
> Which is considered a mistake by many. 

Now, this is interesting.  I had thought that the general consensus in
the free software community at large was that UTF-8 is the only sane
charset for filenames, and that one should not attempt complete support
for filenames in the locale charset.  At least this is quite obviously
the position taken by GNOME.  Do you have any suitable references for
projects which take a different approach?

I highly value your opinion, since you've shown on the lists that you
are quite knowledgeable about charset issues.

> So it fails to write the file. Big deal - you pop up a dialog box and 
> tell the user to handle it. Same thing you do with a disk full or a
> read-only directory or whatever.

Ugh.  I suppose that is possible...but ugh.

> You're ignoring scenarios like
> 
> Hacker: Access file <middle dot><middle dot>/etc/passwd
> Program 1: Hmm, <middle dot><middle dot>/etc/passwd is not in an
> illegal directory - passing through.
> Program 2: Hmm, translate to Latin-16 to stick in shell script
>            Convert <middle dot><middle dot> to ..
> Program 3: Returning password file.

By <middle dot> I'm assuming you mean U+00B7 '·'.  It seems to me that
in the chain above, Program 1 is a trusted program; it is doing
validation on network input.  So it is a bug in that program, or its
configuration, for it to execute any programs which might do something
untrusted.
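
One way to read this is that the path check has to run on the form of
the name that is finally used.  The Python sketch below is only an
illustration under stated assumptions: `LOSSY_MAP` and `lossy_convert`
are hypothetical stand-ins for whatever charset conversion Program 2
performs, and `escapes_root` is a deliberately simple check for ".."
components.

```python
# Hypothetical lossy conversion: maps U+00B7 MIDDLE DOT to '.'.
LOSSY_MAP = {"\u00b7": "."}


def lossy_convert(path: str) -> str:
    """Stand-in for Program 2's charset conversion (an assumption)."""
    return "".join(LOSSY_MAP.get(ch, ch) for ch in path)


def escapes_root(path: str) -> bool:
    """True if any path component is '..'."""
    return ".." in path.split("/")


def check_then_convert(path: str) -> str:
    """The order in the quoted chain: validate first, convert afterwards."""
    if escapes_root(path):
        raise ValueError("rejected: " + path)
    return lossy_convert(path)


def convert_then_check(path: str) -> str:
    """Validate the name in the form that will actually be used."""
    converted = lossy_convert(path)
    if escapes_root(converted):
        raise ValueError("rejected: " + converted)
    return converted


request = "\u00b7\u00b7/etc/passwd"
leaked = check_then_convert(request)   # passes the check, then becomes "../etc/passwd"
```

With the same input, `convert_then_check(request)` raises `ValueError`,
because the check sees the converted name.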

Follow-Ups:
- Bug#99933: second attempt at more comprehensive unicode policy
  - From: Colin Watson <cjwatson@debian.org>

References:
- Bug#99933: second attempt at more comprehensive unicode policy
  - From: starner@okstate.edu
