Siggy Brentrup wrote:
> While understandable from the maintainer's point of view, luckily to
> my knowledge no (e.g.) asian maintainer has done it yet. If we allow
> non ascii in control fields, I see no valid argument to prohibit any
> character set.

Well, if one considers why some maintainers currently would very much like to use non-ASCII characters, it is because their name is spelled using a Latin alphabet (like English, Swedish, Polish, French), but the ASCII character set has no support for diacritical marks on those characters, or even lacks a Latin character in their name entirely. This makes their name look weird: it's nearly correct, but some diacritical marks are missing.

This is different for e.g. Asian, Greek, Russian or Arab maintainers: their name is spelled in a completely different alphabet, which is not understandable if one only knows the Latin alphabet. So they transcribe their name into the Latin alphabet, which is already not the original form of writing.

In computerland, the Latin alphabet IS a de-facto standard, and this won't change anytime in the foreseeable future. For example, /usr and /home are in the Latin alphabet, as is any identifier, be it in a filesystem, in a programming language or as a URL. Exceptions and enhancements are being worked on, but one can trust that at least the Latin alphabet is and will remain supported. It is also my understanding that people who natively use a non-Latin alphabet, like Koreans, also know the Latin alphabet.

So, coming back to the topic, I think it is not a very strange thing to demand the Latin alphabet for data that is to be understandable by a lot of people. Diacritics on the characters do not severely affect how well people can understand and remember a name. One may not know how to pronounce the letter 'ø' (o with a slash through it), but people generally don't have any idea how to pronounce 'Jeroen' either.

Therefore, I think a possible rule on using non-7-bit-ASCII characters in various locations like control fields could be that any Latin character is allowed, including those with diacritics. Of course in UTF-8 encoding, as this is the only sane encoding there is. Converted to Unicode-speak, this means the character groups "Basic Latin", "Latin-1 Supplement", "Latin Extended-A" (covering, up until now, all Latin characters in use in Europe) and possibly "Latin Extended-B" (covering other Latin-like characters, mostly those only used in Africa).

So, I propose that the Policy be changed such that (draft, this is intended as a base for discussion):

- For every Debian meta-data field, value and key, the character set is either 7-bit ASCII, or UTF-8 allowing only Latin characters (inclusion of Latin Extended-B to be discussed) [it should be noted that 7-bit ASCII is a subset of UTF-8]. So, non-ASCII implies UTF-8 in all cases (one exception below)
- Only descriptive values are eligible for UTF-8, so neither keys nor precise values like package names can be non-ASCII
- Localized strings of course may use any characters from UTF-8, or even another character set if a provision exists to indicate the character set (for example, IIRC .po files have such a provision)

For non-localized data, like control file fields:

- Any value indicating a person may use UTF-8 as defined above, that is, Maintainer:, Uploaders:, etc.
- Also package descriptions may use UTF-8
- Possibly allow _all_ descriptive texts to use any Latin characters using UTF-8? Should this include any plaintext document in /usr/share/doc, as is already the case for changelog?

--Jeroen

-- 
Jeroen van Wolffelaar
Jeroen@wolffelaar.nl (also for Jabber & MSN; ICQ: 33944357)
http://Jeroen.A-Eskwadraat.nl
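
As a rough illustration of the character-range rule proposed above, here is a minimal Python sketch of how a descriptive field value could be checked. The names is_latin_only and allow_extended_b are made up for this example and are not part of any Debian tool. The code point ranges are those of the Unicode blocks named in the proposal. Normalizing to NFC first (so that a base letter plus a combining accent counts as one Latin character) is an assumption of this sketch, not something the draft specifies, and the sketch does not try to exclude control characters.

```python
import unicodedata

# Last code point of each Unicode block named in the proposal.
# The blocks are contiguous starting at U+0000, so one upper
# bound is enough to describe the allowed range.
LATIN_EXTENDED_A_END = 0x017F  # Basic Latin .. Latin Extended-A
LATIN_EXTENDED_B_END = 0x024F  # additionally Latin Extended-B


def is_latin_only(raw: bytes, allow_extended_b: bool = False) -> bool:
    """Return True if raw is valid UTF-8 containing only allowed Latin characters.

    Hypothetical helper for illustration only.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 at all (e.g. raw Latin-1 bytes): reject.
        return False
    # Assumption of this sketch: compose characters first, so that
    # "e" + U+0301 is treated like the precomposed "é".
    text = unicodedata.normalize("NFC", text)
    limit = LATIN_EXTENDED_B_END if allow_extended_b else LATIN_EXTENDED_A_END
    return all(ord(ch) <= limit for ch in text)


if __name__ == "__main__":
    print(is_latin_only("Jørgen".encode("utf-8")))   # True
    print(is_latin_only("Αλέξης".encode("utf-8")))   # False (Greek)
    print(is_latin_only(b"J\xf8rgen"))               # False (Latin-1 bytes, not UTF-8)
```

Keys and precise values such as package names would stay plain ASCII under the draft, so a check like this would only apply to descriptive values such as Maintainer: or Uploaders:.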