Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: starner@okstate.edu
Date: Tue, 14 Jan 2003 20:50:36 -0600 (CST)
Message-id: <522157209.1042599036258.JavaMail.root@dexter.okstate.edu>
Reply-to: starner@okstate.edu, 99933@bugs.debian.org

>Moreover, say the system administrator does something like 'find
>/home'.  The resulting stream will be a mixture of ISO-8859-X and BIG5,
>and impossible to reliably differentiate.  

And? A POSIX filename is not a string of characters, it's a string 
of bytes. You have no technical need to differentiate between the
two.
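
To make that concrete, here is a rough Python sketch, not anything
from the policy proposal. The two filenames and the /home paths are
made up for illustration, and the stream imitates what something like
'find /home -print0' would emit. A tool that treats the names as bytes
never has to know which charset each one was written in.

```python
# Hypothetical names: one written by a Latin-1 user, one by a Big5 user.
latin1_name = "café.txt".encode("iso-8859-1")
big5_name = "中文.ogg".encode("big5")

# A NUL-separated listing, as a byte-oriented tool would produce it.
stream = b"/home/a/" + latin1_name + b"\0" + b"/home/b/" + big5_name + b"\0"

# Splitting on NUL needs no knowledge of either encoding.
paths = [p for p in stream.split(b"\0") if p]

# The basename is everything after the last '/' byte; still no decoding.
basenames = [p.rsplit(b"/", 1)[1] for p in paths]

# If a str is unavoidable, surrogateescape carries the raw bytes through
# unchanged, so the original name can always be recovered exactly.
round_tripped = [
    b.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
    for b in basenames
]
```

Nothing here ever asks which charset a name is in; that question only
comes up when somebody wants to display the name.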

>And of course the problem
>doesn't just occur when you have a multiuser system; your Chinese friend
>could send you a .ogg file named using BIG5, and your Latin 1 system
>would simply fail to encode the filename.

Good. It reminds me not to have filenames that I have no way of entering
into the computer.

>In summary, UTF-8 is the *only* sane character set to use for
>filenames.  

Arguably, it's the only sane character set to use for anything.
But using it for filenames and not for everything else is not
a solution. 

One example: You're leaving text files in the locale charset - but
a shell script is just another text file, and needs to reference
filenames. How do you reference a filename not in your locale 
charset? Either bash does not recode it, and the names of non-ASCII
files are mojibake, or bash does recode it, and it's impossible to
reference files not in your locale charset.
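
A small Python sketch of both horns of that dilemma, assuming a
Latin-1 locale and UTF-8 filenames on disk. LOCALE, the example names
and the reachable() helper are my own illustration, not anything bash
actually does.

```python
LOCALE = "iso-8859-1"  # assumed locale charset of the script file

# Under the proposal, the name on disk is UTF-8 bytes.
name_on_disk = "中文.ogg".encode("utf-8")

# Case 1: the shell passes script bytes through untouched. The script
# must then contain the raw UTF-8 bytes, and a Latin-1 editor shows:
shown_in_editor = name_on_disk.decode(LOCALE)  # mojibake, not the name

# Case 2: the shell recodes script text from the locale charset to
# UTF-8. Then only names writable in the locale charset are reachable.
def reachable(name: str) -> bool:
    """Return True if the name can be written in the script at all."""
    try:
        name.encode(LOCALE)
    except UnicodeEncodeError:
        return False
    return True

latin_ok = reachable("café.txt")
cjk_ok = reachable("中文.ogg")
```

Here latin_ok is True and cjk_ok is False: with recoding, the Chinese
name simply cannot be written in the script; without it, the script is
unreadable in the user's own editor.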

>>  what do you expect "echo *" to do? 
>
>Quite frankly, I expect it to not work, unless they're using a UTF-8
>terminal.

Making catastrophes that much more fun.

>> You can't slap
>> filters around everything; it's horribly buggy, and error-prone and would
>> take forever to implement, IF everyone wanted to go along with it. 
>
>I am not sure.  I have a feeling we could make "core" programs like 'ls'
>and such do conversion, but I agree it would be quite a long time before
>we covered "most" of the programs people use.

Are you volunteering to write patches for every program in Debian, and
maintain them (since the upstream author probably won't be interested
in this Debian-only scheme)?

>Note again that GNOME programs and the like are already creating UTF-8
>filenames, because they work completely in UTF-8 internally.  

Which is considered a mistake by many. 

>Now, they
>*could* try to convert them back to the locale charset.  But I would
>argue strongly against this, because the conversion could fail if the
>locale's charset isn't able to encode some target characters.  That may
>be an "unlikely" scenario, but when you're dealing with something as
>fundamental as filenames, you don't want to just ignore "unlikely"
>scenarios.

So it fails to write the file. Big deal - you pop up a dialog box and 
tell the user to handle it. Same thing you do with a disk full or a
read-only directory or whatever. You're ignoring scenarios like this:

Hacker: Access file <middle dot><middle dot>/etc/passwd
Program 1: Hmm, <middle dot><middle dot>/etc/passwd is not in an
illegal directory - passing through.
Program 2: Hmm, translate to Latin-16 to stick in shell script
           Convert <middle dot><middle dot> to ..
Program 3: Returning password file.

It's happened - look up the Unicode directory traversal hole in IIS.
Willy-nilly conversion of filenames is big trouble.
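
The failure is easy to reproduce in a few lines of Python. This is
only a sketch of the pattern: BEST_FIT is a made-up lossy mapping
standing in for whatever a real converter does, and is_safe() is a
deliberately naive check.

```python
# Hypothetical "best fit" table: MIDDLE DOT (U+00B7) degrades to '.'.
BEST_FIT = {"\u00b7": "."}

def is_safe(path: str) -> bool:
    """Naive check: reject any path with a '..' component."""
    return ".." not in path.split("/")

def lossy_convert(path: str) -> str:
    """Replace each character with its best-fit substitute, if any."""
    return "".join(BEST_FIT.get(ch, ch) for ch in path)

request = "\u00b7\u00b7/etc/passwd"

checked_before = is_safe(request)    # Program 1: looks harmless
converted = lossy_convert(request)   # Program 2: now "../etc/passwd"
checked_after = is_safe(converted)   # what the check should have seen
```

The check passes on the original string and fails on the converted
one, so any conversion that happens after validation reopens the hole.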

Follow-Ups:
- Bug#99933: second attempt at more comprehensive unicode policy
  - From: Colin Walters <walters@debian.org>
