Re: Proposal: Accept any Latin in some control fields
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jeroen van Wolffelaar <jeroen@wolffelaar.nl> writes:
> Siggy Brentrup wrote:
>> While understandable from the maintainer's point of view, luckily to
>> my knowledge no (e.g.) asian maintainer has done it yet. If we allow
>> non ascii in control fields, I see no valid argument to prohibit any
>> character set.
>
> Well, if one goes to think why currently some maintainers very much
> would like to use non-ASCII characters, it is because their name is
> spelled using a Latin alphabet (like English, Swedish, Polish,
> French), but the ASCII character set has no support for diacritical marks
> on those characters, or even lacks the Latin character in their name.
> This makes their name look visibly weird: it's nearly correct, but some
> diacritical marks are missing.
>
> This is different from e.g. Asian, Greek, Russian or Arab maintainers:
> their name is spelled in a completely different alphabet, which is not
> understandable if one only knows the Latin alphabet. So, they transliterate
> their name into the Latin alphabet, which is already not the original
> form of writing.
Or, to summarise both: Maintainers whose names are not representable
in ASCII would like to have their names correctly written in the
changelog. UTF-8 is a solution for both groups, but for non-Latin
names, an additional representation in Latin characters would be
needed for those not able to read/understand the native alphabet.
> So, coming back to topic, I think it is not a very strange thing to
> demand the Latin alphabet for data that is to be understandable by a
> lot of people.
But for the future, the ISO-8859-* character sets are just more
mutually-incompatible character sets. Conceptually, they are no
different from any other, e.g. KOI8-R, CP*, EBCDIC etc., and I don't
think they are worthy of special consideration.
[On second reading, I see you are referring to the /alphabet/, rather
than any specific encoding thereof. Please ignore the above.]
> Therefore, I think a possible rule on using non-7bit-ASCII characters in
> various locations like control fields could be that any Latin
> character, including those with diacritics, is allowed. Of course in UTF-8
> encoding, as this is the only sane encoding there is. Converted to Unicode-speak,
> this means the character groups "Basic Latin", "Latin-1 Supplement",
> "Latin Extended-A" (covering up until now all Latin characters in use in
> Europe) and possibly "Latin Extended-B" (covering other Latin-like
> characters, mostly those only used in Africa).
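To make the proposed restriction concrete, here is a minimal sketch of a check against those four Unicode blocks. The function name `is_allowed_latin` and its parameter are hypothetical, invented for illustration. It assumes names arrive in precomposed form, since combining diacritical marks live in a separate block and would be rejected.

```python
# Inclusive code point ranges of the Unicode blocks named in the proposal.
LATIN_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin-1 Supplement": (0x0080, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
    "Latin Extended-B": (0x0180, 0x024F),
}


def is_allowed_latin(text, include_extended_b=True):
    """Return True if every code point of text lies in an allowed block.

    Hypothetical helper: assumes precomposed characters; combining marks
    (U+0300 and up) are outside these blocks and are rejected.
    """
    allowed = [
        rng
        for name, rng in LATIN_BLOCKS.items()
        if include_extended_b or name != "Latin Extended-B"
    ]
    return all(any(lo <= ord(ch) <= hi for lo, hi in allowed) for ch in text)
```

A name such as "Łukasz" passes because U+0141 is in Latin Extended-A, while a Cyrillic name fails under either setting of `include_extended_b`.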
Once we are allowed to use UTF-8 encoded files, is it really worth
restricting the allowed symbols? I think in the case of names which
are not natively representable in Latin characters, a
translation/approximation in English/ASCII would be useful, where
appropriate, for the benefit of users not familiar with the script in
question.
I routinely use quite a few of the symbols which are available in the
UCS, such as bullet points, technical symbols and the like. These
certainly have potential uses, for example in changelogs and
documentation.
> So, I propose that the Policy be changed such that (draft, this is
> intended as a base for discussion):
>
> - For every Debian meta-data field, value and key, the character set is
> either 7-bit ASCII, or UTF-8 allowing only Latin characters (inclusion
> of Latin Extended-B to be discussed)
> [it should be noted that 7-bit ASCII is a subset of UTF-8]
Is it really worth mentioning "7-bit ASCII"? Since it's a UTF-8
subset, it's kind of implicit. Also, the correct charset name is
US-ASCII, and since it's always been 7-bit, this is unnecessary
information.
I would simply say it's UTF-8.
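As a quick illustration of why the ASCII case is implicit, the sketch below shows that text containing only US-ASCII characters yields the same bytes under either encoding. The helper names are hypothetical.

```python
def is_us_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit range 0x00..0x7F."""
    return all(b < 0x80 for b in data)


def same_bytes_in_both(text: str) -> bool:
    """True if text encodes to identical bytes as US-ASCII and as UTF-8.

    Text that cannot be encoded as US-ASCII at all returns False.
    """
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False
```

Any file that passes `is_us_ascii` is therefore already valid UTF-8, so a rule that says "UTF-8" covers existing ASCII-only files without further wording.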
> So, non-ASCII implies UTF-8 in all cases (one exception below)
> - Only descriptive values are eligible for UTF-8, so neither keys nor
> precise values like package names can be non-ASCII
OK. For package names, this makes sense.
> - Localized strings of course may use any characters from UTF-8, or even
> another character set if a provision exists to indicate character set
> (for example, IIRC .po files have such a provision)
OK.
po files aren't really germane to the topic, since they are recoded
on-the-fly with iconv().
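For readers unfamiliar with that mechanism, the following is a rough sketch of the same idea in Python: read the charset declared in the .po header and recode the bytes. This is not gettext's implementation, the function name `recode_po` is hypothetical, and the sketch leaves the header's charset declaration untouched.

```python
import re


def recode_po(data: bytes, target: str = "utf-8") -> bytes:
    """Recode .po file bytes from their declared charset to target.

    Hypothetical helper: falls back to UTF-8 when no charset is declared,
    and does not rewrite the charset declaration in the header.
    """
    # The declaration looks like: Content-Type: text/plain; charset=ISO-8859-1\n
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", data)
    source = match.group(1).decode("ascii") if match else "utf-8"
    return data.decode(source).encode(target)
```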
> For non-localized data, like control file fields:
> - Any value indicating a person may use UTF-8 as defined above, that is,
> Maintainer:, Uploaders:, etc.
> - Also package descriptions may use UTF-8
I think if you simply state that the entire control file is UTF-8
encoded, with the restriction that package names and control field
names are US-ASCII characters only, this is clearly allowed usage.
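A minimal sketch of that rule as a check follows. It is not how dpkg or any existing tool parses control files. The function name `check_control` is hypothetical, and the parsing is deliberately simplified: one "Name: value" field per line, with continuation lines starting with a space or tab.

```python
def check_control(data: bytes) -> list:
    """Return a list of problems found in control-file bytes (empty if none)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    problems = []
    for line in text.split("\n"):
        # Skip blank lines and continuation lines (leading space or tab).
        if not line.strip() or line[0] in " \t":
            continue
        name, sep, value = line.partition(":")
        if not sep:
            problems.append(f"not a field line: {line!r}")
            continue
        # Field names must be US-ASCII only.
        if not name.isascii():
            problems.append(f"non-ASCII field name: {name!r}")
        # Package names must be US-ASCII only; other values may be any UTF-8.
        if name == "Package" and not value.strip().isascii():
            problems.append(f"non-ASCII package name: {value.strip()!r}")
    return problems
```

Under this sketch a Maintainer: or Description: value containing non-ASCII characters produces no complaint, which is the point of stating the rule for the whole file rather than field by field.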
> - possibly allow _all_ descriptive texts to use any Latin characters
> using UTF-8? Should this include any plaintext document in
> /usr/share/doc, like changelog already is?
I suggest that you allow any valid use of UTF-8, with the suggestion
that appropriate parts of non-Latin scripts have a translation into
Latin characters. It's important to stress "appropriate"--if the
document/package is only intended for speakers of a particular
language, it's not really appropriate, but if it's just a name, this
would be appropriate.
Regards,
Roger
- --
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>
iD8DBQFAdbb0VcFcaSW/uEgRAvlnAKCxc4Yp5cmpxdj+YKPjxa0Z1TAnbACgteg0
cVaqvBP5LSjrQrekRk0JASg=
=6lGN
-----END PGP SIGNATURE-----