
Re: Proposal: Accept any Latin in some control fields



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeroen van Wolffelaar <jeroen@wolffelaar.nl> writes:

> Siggy Brentrup wrote:
>> While understandable from the maintainer's point of view, luckily to
>> my knowledge no (e.g.) asian maintainer has done it yet.  If we allow
>> non ascii in control fields, I see no valid argument to prohibit any
>> character set.
>
> Well, if one stops to think why some maintainers currently would very
> much like to use non-ASCII characters, it is because their name is
> spelled using a Latin alphabet (like English, Swedish, Polish,
> French), but the ASCII character set has no support for diacritical marks
> on those characters, or even lacks the Latin character in their name.
> This makes their name look weird: it's nearly correct, but some
> diacritical marks are missing.
>
> This is different from e.g. Asian, Greek, Russian or Arab maintainers:
> their name is spelled in a completely different alphabet, which is not
> understandable if one only knows the Latin alphabet. So, they transcribe
> their name into the Latin alphabet, which is already not the original
> form of writing.

Or, to summarise both: Maintainers whose names are not representable
in ASCII would like to have their names correctly written in the
changelog.  UTF-8 is a solution for both groups, but for non-Latin
names, an additional representation in Latin characters would be
needed for those not able to read/understand the native alphabet.

> So, coming back to topic, I think it is not a very strange thing to
> demand the Latin alphabet for data that is to be understandable by a
> lot of people.

But for the future, the ISO-8859-* character sets are just more
mutually-incompatible character sets.  Conceptually, they are no
different from any other, e.g. KOI8-R, CP*, EBCDIC etc., and I don't
think they are worthy of special consideration.

[On second reading, I see you are referring to the /alphabet/, rather
than any specific encoding thereof.  Please ignore the above.]

> Therefore, I think a possible rule on using non-7bit-ASCII characters in
> various locations like control fields could be that any Latin
> character, including those with diacritics, is allowed. Of course in
> UTF-8 encoding, as this is the only sane encoding there is. Converted to
> Unicode-speak,
> this means the character groups "Basic Latin", "Latin-1 supplement",
> "Latin Extended-A" (covering up until now all Latin characters in use in
> Europe) and possibly "Latin Extended-B" (covering other Latin-like
> characters, mostly those only used in Africa).
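
For reference, the blocks named above correspond to fixed code point
ranges, so the proposed restriction could be checked mechanically.  A
minimal Python sketch, where the names block_of and only_latin_blocks
are hypothetical and not part of any Debian tool:

```python
# Unicode block ranges named in the proposal (code points, inclusive).
LATIN_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin-1 Supplement": (0x0080, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
    "Latin Extended-B": (0x0180, 0x024F),
}


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in LATIN_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def only_latin_blocks(text, allow_extended_b=True):
    """True if every character of text lies in the permitted blocks."""
    for ch in text:
        block = block_of(ch)
        if block is None:
            return False
        if block == "Latin Extended-B" and not allow_extended_b:
            return False
    return True
```

Note that a range check like this also admits the punctuation and
symbols in those blocks, not just letters.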

Once we are allowed to use UTF-8 encoded files, is it really worth
restricting the allowed symbols?  I think in the case of names which
are not natively representable in Latin characters, a
translation/approximation in English/ASCII would be useful, where
appropriate, for the benefit of users not familiar with the script in
question.
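
To illustrate why such an approximation generally has to be supplied
by a person rather than derived automatically, here is a rough Python
sketch.  The function name ascii_approximation is hypothetical; it
only strips combining marks after NFKD decomposition and silently
drops everything else that is not ASCII:

```python
import unicodedata


def ascii_approximation(name):
    """Decompose accented letters, then discard all non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

This gives a usable result for a Latin name with accents, but a name
written in Cyrillic or Greek comes back empty, so for those scripts
the maintainer would have to provide the Latin form by hand.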

I routinely use quite a few of the symbols which are available in the
UCS, such as bullet points, technical symbols and the like.  These
certainly have potential uses, for example in changelogs and
documentation.

> So, I propose that the Policy be changed such that (draft, this is
> intended as a base for discussion):
>
> - For every Debian meta-data field, value and key, the character set is
>   either 7-bit ASCII, or UTF-8 allowing only Latin characters (inclusion
>   of Latin Extended-B to be discussed)
>   [it should be noted that 7-bit ASCII is a subset of UTF-8]

Is it really worth mentioning "7-bit ASCII"?  Since it's a UTF-8
subset, it's kind of implicit.  Also, the correct charset name is
US-ASCII, and since it's always been 7-bit, this is unnecessary
information.

I would simply say it's UTF-8.
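
The subset relationship is easy to demonstrate.  A small Python
sketch (the helper name is_us_ascii is an assumption for
illustration):

```python
def is_us_ascii(data):
    """True if every byte is below 0x80, i.e. the data is US-ASCII."""
    return all(byte < 0x80 for byte in data)


# Pure US-ASCII text is byte-for-byte identical in both encodings.
sample = "Package: hello"
same_bytes = sample.encode("ascii") == sample.encode("utf-8")

# All 128 US-ASCII byte values decode to the same characters as UTF-8.
all_ascii = bytes(range(128))
same_text = all_ascii.decode("ascii") == all_ascii.decode("utf-8")
```

So a file that is valid US-ASCII is already valid UTF-8 without any
conversion.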

>   So, non-ASCII implies UTF-8 in all cases (one exception below)
> - Only descriptive values are eligible for UTF-8, so neither keys nor
>   precise values like package names can be non-ASCII

OK.  For package names, this makes sense.

> - Localized strings of course may use any characters from UTF-8, or even
>   another character set if a provision exists to indicate character set
>   (for example, IIRC .po files have such a provision)

OK.

po files aren't really germane to the topic, since they are recoded
on-the-fly with iconv().
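
As a rough illustration of that kind of recoding, here is a Python
sketch that uses Python's own codecs in place of iconv().  The
function names charset_from_header and recode are hypothetical; the
"Content-Type: text/plain; charset=..." line is the usual po header
entry that declares the charset:

```python
import re


def charset_from_header(header):
    """Extract the charset named in a po-style Content-Type header."""
    match = re.search(r"charset=([\w-]+)", header)
    return match.group(1) if match else None


def recode(data, from_charset, to_charset="utf-8"):
    """Convert bytes from one character set to another."""
    return data.decode(from_charset).encode(to_charset)


header = "Content-Type: text/plain; charset=ISO-8859-1"
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
utf8_bytes = recode(latin1_bytes, charset_from_header(header))
```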

> For non-localized data, like control file fields:
> - Any value indicating a person may use UTF-8 as defined above, that is,
>   Maintainer:, Uploaders:, etc.
> - Also package descriptions may use UTF-8

I think if you simply state that the entire control file is UTF-8
encoded, with the restriction that package names and control field
names are US-ASCII characters only, this is clearly allowed usage.
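
A check for that rule could look something like the following Python
sketch.  It is deliberately simplified (one paragraph, "Name: value"
lines, continuation lines starting with whitespace) and is not how
dpkg parses control files; check_control is a hypothetical name:

```python
def check_control(data):
    """Return a list of problems found in a control paragraph (bytes)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: %s" % exc]
    problems = []
    for line in text.split("\n"):
        if not line or line[0] in " \t":
            continue  # blank line, or continuation of the previous field
        name, sep, value = line.partition(":")
        if not sep:
            problems.append("malformed line: %r" % line)
            continue
        if not name.isascii():
            problems.append("non-ASCII field name: %r" % name)
        if name == "Package" and not value.strip().isascii():
            problems.append("non-ASCII package name: %r" % value.strip())
    return problems
```

Everything else, such as the Maintainer: value or the description
text, passes as long as the file as a whole decodes as UTF-8.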

> - possibly allow _all_ descriptive texts to use any Latin characters
>   using UTF-8? Should this include any plaintext document in
>   /usr/share/doc, like changelog already is?

I suggest that you allow any valid use of UTF-8, with the suggestion
that appropriate parts of non-Latin scripts have a translation into
Latin characters.  It's important to stress "appropriate"--if the
document/package is only intended for speakers of a particular
language, it's not really appropriate, but if it's just a name, this
would be appropriate.
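
If a tool wanted to hint at when a Latin rendering might be worth
adding, one possible heuristic is to look at the Unicode names of the
letters.  This is only a sketch under that assumption, with the
hypothetical name needs_latin_form; it cannot judge whether a
translation is actually "appropriate":

```python
import unicodedata


def needs_latin_form(text):
    """True if text contains a letter whose Unicode name is not LATIN ..."""
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False
```

A name with diacritics such as "Łukasz" is not flagged, while a name
written in Cyrillic or Japanese is.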


Regards,
Roger

- -- 
Roger Leigh

                Printing on GNU/Linux?  http://gimp-print.sourceforge.net/
                GPG Public Key: 0x25BFB848.  Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFAdbb0VcFcaSW/uEgRAvlnAKCxc4Yp5cmpxdj+YKPjxa0Z1TAnbACgteg0
cVaqvBP5LSjrQrekRk0JASg=
=6lGN
-----END PGP SIGNATURE-----


