
Proposal: Accept any Latin in some control fields (Was: Re: How much utf-8 do we accept in control files?)



Siggy Brentrup wrote:
> While understandable from the maintainer's point of view, luckily to
> my knowledge no (e.g.) asian maintainer has done it yet.  If we allow
> non ascii in control fields, I see no valid argument to prohibit any
> character set.

Well, if one considers why some maintainers currently very much
want to use non-ASCII characters, it is because their name is
spelled using a Latin alphabet (like English, Swedish, Polish,
French), but the ASCII character set has no support for diacritical marks
on those characters, or even lacks a Latin character in their name.
This makes their name look visibly weird: it's nearly correct, but some
diacritical marks are missing.

This is different from e.g. Asian, Greek, Russian or Arab maintainers:
their name is spelled in a completely different alphabet, which is not
understandable if one only knows the Latin alphabet. So, they transcribe
their name into the Latin alphabet, which is already not the original
written form.

In computerland, the Latin alphabet IS a de-facto standard, and this
won't change anytime in the foreseeable future. For example, /usr and
/home are in the Latin alphabet, as is any identifier, be it in a
filesystem, in a programming language, or in a URL. Exceptions and
enhancements are being worked on, but one can trust that at least the
Latin alphabet is and will remain supported.

It is also my understanding that people who natively use a non-Latin
alphabet, like Koreans, also know the Latin alphabet.


So, coming back to the topic, I think it is not a very strange thing to
demand the Latin alphabet for data that is to be understandable by a
lot of people. Diacritics on the characters do not seriously impair how
well people can read and remember a name. One may not
know how to pronounce the letter 'ø' (o with a slash through it), but
people generally have no idea how to pronounce 'Jeroen'
either.

Therefore, I think a possible rule on using non-7-bit-ASCII characters in
various locations, like control fields, could be that any Latin
character is allowed, including those with diacritics. In UTF-8 encoding,
of course, as this is the only sane encoding there is. Converted to
Unicode-speak, this means the character groups "Basic Latin",
"Latin-1 Supplement", "Latin Extended-A" (covering, as far as I know, all
Latin characters in use in Europe) and possibly "Latin Extended-B"
(covering other Latin-like characters, mostly those only used in Africa).
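
To make the rule above concrete, here is a minimal sketch in Python of
a check for it. The function name and the allow_extended_b switch are
made up for illustration; the code point ranges are those of the Unicode
blocks named above. Note that these blocks also contain control
characters and symbols, which this sketch does not filter out.

```python
# Code point ranges of the Unicode blocks named in the proposal.
LATIN_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin-1 Supplement": (0x0080, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
}
# Kept separate because its inclusion is still up for discussion.
LATIN_EXTENDED_B = (0x0180, 0x024F)


def is_allowed_latin(text, allow_extended_b=False):
    """Return True if every character of text lies in an allowed block.

    Hypothetical helper: the name and the allow_extended_b flag are
    assumptions for this sketch, not part of any existing tool.
    """
    ranges = list(LATIN_BLOCKS.values())
    if allow_extended_b:
        ranges.append(LATIN_EXTENDED_B)
    return all(
        any(low <= ord(ch) <= high for low, high in ranges)
        for ch in text
    )
```

With this, a name containing 'ø' passes, while a name written in Greek
or Cyrillic letters does not.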


So, I propose that the Policy be changed such that (draft, this is
intended as a base for discussion):

- For every Debian meta-data field, value and key, the character set is
  either 7-bit ASCII, or UTF-8 allowing only Latin characters (inclusion
  of Latin Extended-B to be discussed)
  [it should be noted that 7-bit ASCII is a subset of UTF-8]
  So, non-ASCII implies UTF-8 in all cases (one exception below)
- Only descriptive values are eligible for UTF-8, so neither keys nor
  precise values like package names can be non-ASCII
- Localized strings of course may use any characters from UTF-8, or even
  another character set if a provision exists to indicate character set
  (for example, IIRC .po files have such a provision)

For non-localized data, like control file fields:
- Any value indicating a person may use UTF-8 as defined above, that is,
  Maintainer:, Uploaders:, etc.
- Also package descriptions may use UTF-8
- Possibly allow _all_ descriptive texts to use any Latin characters
  using UTF-8? Should this include any plaintext document in
  /usr/share/doc, like changelog already is?
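
As a rough illustration of how the draft rules for control fields could
be checked, here is a sketch in Python. Everything in it is an
assumption made for the example: the function name, the set of
descriptive fields (taken from the list above) and the very simplified
parsing of "Key: value" lines with indented continuation lines. It
leaves Latin Extended-B out, since that is still to be discussed.

```python
# Fields whose values may be non-ASCII under the draft (assumed set).
DESCRIPTIVE_FIELDS = {"Maintainer", "Uploaders", "Description"}

# Basic Latin, Latin-1 Supplement and Latin Extended-A are contiguous,
# so one upper bound is enough for this sketch.
MAX_LATIN_CODE_POINT = 0x017F


def check_control_paragraph(raw):
    """Return a list of violations of the draft rules in raw (bytes)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["paragraph is not valid UTF-8"]

    problems = []
    key = None
    for line in text.split("\n"):
        if not line:
            continue
        if line.startswith((" ", "\t")):
            # Continuation line: belongs to the most recent field.
            value = line
        else:
            key, sep, value = line.partition(":")
            if not sep:
                problems.append(f"malformed line: {line!r}")
                key = None
                continue
            if not key.isascii():
                problems.append(f"non-ASCII key: {key!r}")
        if key is None:
            continue
        if key in DESCRIPTIVE_FIELDS:
            # Descriptive values: UTF-8, Latin characters only.
            if any(ord(ch) > MAX_LATIN_CODE_POINT for ch in value):
                problems.append(f"non-Latin character in {key}")
        elif not value.isascii():
            # Precise values such as package names stay plain ASCII.
            problems.append(f"non-ASCII value in {key}")
    return problems
```

An empty result means the paragraph fits the draft; a Maintainer: value
with 'ø' in it is accepted, a Package: value with 'é' in it is not.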

--Jeroen

-- 
Jeroen van Wolffelaar
Jeroen@wolffelaar.nl (also for Jabber & MSN; ICQ: 33944357)
http://Jeroen.A-Eskwadraat.nl
