
Re: Bug#440420: [PROPOSAL] Manual page encoding



I support the proposal for UTF-8 manual page encoding.

I don't know which groff Debian is currently using, but...

On 04/09/2007, at 8:22 PM, Colin Watson wrote:

  * Because our current groff implementation imposes quite strict
    restrictions on what input and output encodings are possible, and
    usually needs to know detailed information about these encodings in
    order to achieve correct typography, it is if anything more
    important than usual for man to have an accurate idea of the
    document's character set.

... and from Colin's second post in the thread:

groff 1.19 supports full Unicode-style composite glyphs, but the version
we have doesn't (see the comment in my original bug report about groff
versioning). Both our version and newer versions support named
characters such as \[:a] or \(:a (variant spellings), again documented
in groff_char(7). There's also the \N escape, which gives you
font-dependent numbered glyphs; these are Unicode codepoints if you
happen to know that the utf8 device is in use.
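
To make those concrete, here are three ways of writing "ä" in groff
input (my illustration; the \N value is only U+00E4 because the utf8
device numbers its glyphs by Unicode codepoint):

    \[:a]     \" named character: a with umlaut
    \(:a      \" older two-character spelling of the same name
    \N'228'   \" glyph number 228 in the current font; on the utf8
              \" device this is codepoint 228, i.e. U+00E4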

As above, though, these have been available and translators generally
haven't used them; I can imagine that they're insanely cumbersome to use
in practice for e.g. Japanese. So I'd really rather just support plain
UTF-8 input for alphanumerics, which I think will actually get used.

Do you think we will need explicit language in policy for this? For the
time being, until we have a version of groff supporting direct UTF-8
input, the implementation will require that the page be convertible to
the legacy encoding for that language using iconv (it'll use 'iconv -c'
so that unknown characters are dropped rather than breaking the whole
page, but all the same): so, for example, German pages should avoid
characters without a direct equivalent in ISO-8859-1. This seems like a
reasonable thing to document after man-db 2.5.0, and would cover things
like UTF-8 hyphen characters.
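
As a rough sketch of that interim conversion (illustrative only: the
filename is made up, and man-db's real pipeline has more moving parts):

    # Convert a UTF-8 German page to the legacy encoding, dropping (-c)
    # any character with no ISO-8859-1 equivalent, then format it with
    # groff's latin1 device.
    iconv -f UTF-8 -t ISO-8859-1 -c manpage.de.1 | groff -mandoc -Tlatin1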

I'm not sure how groff will handle such characters once it does have
UTF-8 input support. I suspect it would convert U+2010 to its internal
"hy" glyph and render that in whatever way is appropriate for the output
device; that would really be ideal. However, I don't have enough
information to make a decision based on that guess.
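
As a sketch of how that could look, assuming a preconv-style
preprocessing step that rewrites UTF-8 bytes into \[uXXXX] escapes
before troff sees them (later groff releases ship exactly such a
preconv, run automatically by 'groff -k'):

    # U+2010 (HYPHEN) is the byte sequence E2 80 90 in UTF-8; preconv
    # turns it into the \[u2010] escape, and each output device then
    # renders that with whatever hyphen glyph is appropriate.
    printf 'well\xe2\x80\x90known\n' | preconv
    # -> well\[u2010]known (after a .lf bookkeeping line)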

In general, I think it's worthwhile for policy to make comments on
encoding for purposes of interoperability and standardisation, but I'd
be inclined to draw the line at filling it up with instructions on how
to use groff correctly. Does this sound reasonable?


Bruno Haible very kindly created groff-utf8 [1] some time back to help me test my pilot UTF-8 (Vietnamese) manpage translation.

I am ashamed to confess that I haven't had time to take manpage translation any further, either, but I hope to do so. <blush>

It doesn't look like he's had time to take it any further, but that implementation displays UTF-8 perfectly in all my terminal apps. Vietnamese requires UTF-8, so I'm a particularly keen UTF-8 supporter. ;)

Also, on precomposed and decomposed Unicode glyphs: there are a number of problems with displaying decomposed characters. You can get the base character and its diacritic displayed separately, sometimes to the point where the character stays in place but the accent follows the cursor around the page!

Precomposed characters are a much safer choice: they have more consistent support, and simply provide fewer opportunities for error.
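
At the byte level the difference looks like this (my example, using
printf to emit the raw UTF-8 sequences; both lines mean "é"):

    # Precomposed: a single codepoint, U+00E9 (bytes C3 A9).
    printf '\xc3\xa9\n'
    # Decomposed: base letter U+0065 plus combining acute U+0301
    # (bytes CC 81 after the 'e').
    printf 'e\xcc\x81\n'

Normalising a page to the precomposed (NFC) form before shipping it
sidesteps the rendering bugs above.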

from Clytie (vi-VN, Vietnamese free-software translation team / nhóm Việt hóa phần mềm tự do)
http://groups-beta.google.com/group/vi-VN

[1] http://www.haible.de/bruno/packages-groff-utf8.html



