Hello everybody,

On Mon, Jan 06, 2003 at 10:15:24PM +0100, Jochen Voss wrote:
> Hello Colin,
>
> On Fri, Jan 03, 2003 at 09:50:26PM -0500, Colin Walters wrote:
> > In summary, UTF-8 is the *only* sane character set to use for
> > filenames.
> At least I agree to this :-)
>
> I think that we need filename conversion between UTF-8 and the user's
> character set, because we cannot ban all non-UTF8 terminal types.  In
> my opinion the main problem is where this conversion should take
> place.
>
> Because a lot of programs are affected, it would gain us much if we
> could move this as deep as into libc or even into the kernel.  I
> remember there are some questions about character sets in the kernel
> configuration.  Are there file-systems with in-kernel character set
> conversion?

I agree, this is the only way to go.  Naive, simple, classic UNIX-style
programming should continue to "just work"; I like the idea that I can
download any old program written in a past decade and just type make.

And yes, there are several filesystems in the Linux kernel which do
character set conversion on the fly.  Specifically, all the
Microsoft/IBM-compatible filesystems (*fat, ntfs, hpfs, iso9660) allow
the DOS-side and Unix-side character sets to be specified as mount
options.  Some versions of the smb file sharing tools also do this, and
I think there is some conversion code in the text-mode vt
implementation (screen and keyboard) too.

The filesystem character conversions already use Unicode as the
intermediary format, so the kernel includes an almost complete set of
Unicode to/from X conversion tables, each as a separate module with
kerneld autoload support and all.

So here is my idea of how to do it (no, I have not checked what RH or
others do, but I know what MS did wrong 10 years ago, and I live with
those mistakes as a cross-platform programmer every day).

1.
Unless otherwise specified here, or under very special circumstances,
all programs and libraries should assume that all strings they receive
or output (including, but not limited to, filenames) are in the same
encoding, and make no externally visible character encoding conversion.
(This is usually trivial to do: just do nothing.)

2. If a program really needs to make assumptions about the character
encoding of data, it should assume the character encoding specified by
the locale.  As a minimum, the following three cases must work
correctly:

2.1. UTF-8.

2.2. iso8859-1+, defined as the single-byte encoding where each byte is
one character which is its own Unicode equivalent, and where all byte
values are treated as valid, even if the corresponding Unicode
codepoint is not defined.  (This character set is usually combined with
the C locale to allow processing of arbitrary binary data in any
unknown encoding.)

2.3. Any other single-byte encoding where the values 0..127 are ASCII
and 128..255 are graphic characters not interpreted in any particular
way.

Support for multi-byte character encodings other than UTF-8 is not
required for sarge and later, but should not be removed where it
already exists.  For new code, either use the libc character handling
functions, or just treat anything that is not UTF-8 as iso8859-1+,
except when converting to/from UTF-8.

Note 2.1: Code which just treats strings as binary data already
satisfies the above.

Note 2.2: Code which just checks for ASCII values such as \n, /, etc.,
and passes consecutive sequences of high-valued bytes around as-is,
already satisfies the above, thanks to the design properties of UTF-8.

3. Unless required for security or other functionality, programs and
libraries should not object to processing invalid characters.  (This
increases the user's chance of being able to deal with data in
inconsistent or broken encodings, e.g. with commands such as
mv M?nch.txt Maench.txt.)
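The pass-through property claimed in Note 2.2 can be demonstrated with a small sketch (mine, not from the mail): UTF-8 guarantees that bytes 0x00..0x7F occur only as the ASCII characters themselves, never inside a multibyte sequence, so byte-oriented code that scans for / or \n works unchanged on UTF-8 strings.

```python
# Sketch: classic byte-oriented path splitting is UTF-8 safe, because no
# continuation or lead byte of a multibyte sequence falls in the ASCII range.
path = "/home/münch/naïve.txt".encode("utf-8")

# Split on the raw byte value of '/', exactly as naive C code would.
components = path.split(b"/")

# Every multibyte character survives intact inside its component.
assert components[2].decode("utf-8") == "münch"

# All bytes of a multibyte UTF-8 sequence are >= 0x80, so they can never
# be mistaken for '/', '\n' or any other ASCII control or metacharacter.
multibyte = "ü".encode("utf-8")          # b'\xc3\xbc'
assert all(b >= 0x80 for b in multibyte)
```

This is why code that merely passes high-valued bytes through unmodified already satisfies rule 2 without any explicit UTF-8 support.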
However, no conversion should cause bytes to be treated as an ASCII
control character unless their encoding is exactly that single ASCII
byte value.  This means not converting the "redundant" (overlong) UTF-8
encodings to their shortest form, but either leaving them as-is or
converting them to something harmless.  ? is not harmless; in a general
context, no ASCII character other than a-zA-Z is harmless.

Note 3.1: This is trivially satisfied by code which does not convert or
check character encodings at all.

4. The low-level software which converts keystrokes (or other
non-string input) to strings, or converts strings to pixels (or other
non-string output), is responsible for doing so consistently with the
locale of the programs to which it provides this service, unless those
programs explicitly specify otherwise.

For terminal-style input/output, there will be a tool or library
feature (existing or Debian-created) which does two-way conversion of
character sets around a pty.  This tool can/should be plugged into ssh,
telnet, serial-line getty and other conduits which allow terminal
access from terminals that might have different locales than the one
preferred on a given Debian system.

Note 4.1: Editors, libreadline etc. are not covered by this rule.
Those are just regular software which needs to count characters (and
thus check for multibyte characters in the specified encoding).  This
rule is about the actual terminal interfaces, whether text or graphic.

5. Software which persists or transports strings outside the current
process group, such as the name processing in filesystems, should
convert strings from the current locale to a common encoding chosen by
the implementor, such as UTF-8, UTF-16, UTF-32 or, in some cases,
another encoding.  It must be possible to turn off this translation
through an extra environment variable, no matter what the locale or
its character encoding.
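A minimal sketch of what rule 5 asks for (my illustration, not an existing interface; the variable name FILENAME_CONVERSION is invented): filenames cross the process boundary in the locale encoding, are stored on the medium in one common encoding, and an environment variable disables the translation entirely.

```python
import os

# Hypothetical rule-5 conversion layer.  The medium's common encoding is
# UTF-8; FILENAME_CONVERSION is an invented kill switch, per the
# requirement that translation can be turned off regardless of locale.

def to_storage(name: bytes, locale_codeset: str) -> bytes:
    """Convert a locale-encoded filename to the medium's UTF-8 form."""
    if os.environ.get("FILENAME_CONVERSION") == "off":
        return name                      # pass bytes through verbatim
    return name.decode(locale_codeset).encode("utf-8")

def from_storage(name: bytes, locale_codeset: str) -> bytes:
    """Convert a UTF-8 filename from the medium back to the locale encoding."""
    if os.environ.get("FILENAME_CONVERSION") == "off":
        return name
    return name.decode("utf-8").encode(locale_codeset)

# A Latin-1 user and the UTF-8 medium agree on the same file:
stored = to_storage("Münch.txt".encode("latin-1"), "latin-1")
assert stored == "Münch.txt".encode("utf-8")
assert from_storage(stored, "latin-1") == "Münch.txt".encode("latin-1")
```

A real implementation would live below the program, in libc or the kernel, so that naive programs get this behaviour without any code changes, which is the whole point of rule 1.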
For filenames and other data to which access must remain possible even
when improperly encoded, the translation code should include a
well-defined escaping mechanism for accessing invalid character
encodings on the medium.  This escaping must not be enabled in other
contexts, due to serious security issues (it could e.g. allow bad
people to bypass code that filters out shell metacharacters).  The
escape mechanism should allow things like tar backups to just work, no
matter how confused the filenames on a disk are.

A mechanism needs to be devised, either in the kernel or in libc, which
makes the conversion of filenames and console I/O to and from the
process locale actually match the process locale.  A similar or
identical mechanism should be put in Xlib.

6. The base software in sarge, such as libc, Xlib and xterm, must
support UTF-8 variants of all locales as soon as possible.  Without
this, the rest cannot even begin to be implemented.

P.S.  I am not a DD, just trying to be helpful and constructive.

Cheers,
Jakob

-- 
This message is hastily written, please ignore any unpleasant wordings,
do not consider it a binding commitment, even if its phrasing may
indicate so.  Its contents may be deliberately or accidentally untrue.
Trademarks and other things belong to their owners, if any.