I support the proposal for UTF-8 manual page encoding. I don't know which groff Debian is currently using, but... On 04/09/2007, at 8:22 PM, Colin Watson wrote:
* Because our current groff implementation imposes quite strict restrictions on what input and output encodings are possible, and usually needs to know detailed information about these encodings in order to achieve correct typography, it is if anything more important than usual for man to have an accurate idea of the document's character set.
... (second post)
groff 1.19 supports full Unicode-style composite glyphs, but the version we have doesn't (see the comment in my original bug report about groff versioning). Both our version and newer versions support named characters such as \[:a] or \(:a (variant spellings), again documented in groff_char(7). There's also the \N escape, which can give you font-dependent numbered glyphs; these are Unicode codepoints if you happen to know that the utf8 device is in use. As above, though, these have been available and translators generally haven't used them; I can imagine that they're insanely cumbersome to use in practice for e.g. Japanese. So I'd really rather just support plain UTF-8 input for alphanumerics, which I think will actually get used.

Do you think we will need explicit language in policy for this? For the time being, until we have a version of groff supporting direct UTF-8 input, the implementation will require that the page be convertible to the legacy encoding for that language using iconv (it'll use 'iconv -c' so that unknown characters are dropped rather than breaking the whole page, but all the same): so e.g. for German pages, characters without a direct equivalent in ISO-8859-1 should be avoided. This seems like a reasonable thing to document after man-db 2.5.0, and would cover things like UTF-8 hyphen characters.

I'm not sure how groff will handle such characters once it does have UTF-8 input support. I suspect it would convert U+2010 to its internal "hy" glyph and render that in whatever way is appropriate for the output device; that would really be ideal. However, I don't have enough information to make a decision based on that guess.

In general, I think it's worthwhile for policy to make comments on encoding for purposes of interoperability and standardisation, but I'd be inclined to draw the line at filling it up with instructions on how to use groff correctly. Does this sound reasonable?
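[Editor's note: to make the 'iconv -c' behaviour concrete, here is a minimal Python sketch of the same character-dropping conversion. The helper name is my own invention for illustration; man-db itself invokes the iconv program rather than anything like this.]

    # Rough equivalent of `iconv -f UTF-8 -t ISO-8859-1 -c`:
    # characters with no mapping in the legacy encoding are
    # silently dropped instead of aborting the conversion.
    def to_legacy(text: str, legacy_encoding: str = "iso-8859-1") -> str:
        return text.encode(legacy_encoding, errors="ignore").decode(legacy_encoding)

    page = "Ma\u00dfe \u2013 notes"  # U+00DF is in ISO-8859-1; U+2013 EN DASH is not
    print(to_legacy(page))           # "Maße  notes": the en dash is silently dropped

This is exactly why characters without a legacy-encoding equivalent (like the en dash above) should be avoided for now: they survive the pipeline only as a gap in the output.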
Bruno Haible very kindly created groff-utf8 [1] some time back to help me test my pilot UTF-8 (Vietnamese) manpage translation. I don't know if he's gone any further with that, but it works fine in all of my terminal apps.
I am ashamed to confess that I haven't had time to take manpage translation any further, either, but I hope to do so. <blush>
Hmm, it doesn't look like he's had time to take this any further, but that implementation displays UTF-8 perfectly for me. Vietnamese requires UTF-8, so I'm a particularly keen UTF-8 supporter. ;)
Also, on precomposed and decomposed Unicode glyphs: there are a number of problems with displaying decomposed characters. You can get the base character and its diacritic displayed separately, even to the point where the character stays in place while the accent follows the cursor around the page!
Precomposed characters are a much safer choice: they have more consistent support and simply provide fewer opportunities for error.
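[Editor's note: a quick Python sketch of the precomposed/decomposed distinction, using only the standard unicodedata module; this is my own illustration, not anything from groff or man-db. Normalising to NFC before handing text to a renderer with patchy combining-mark support sidesteps exactly the floating-accent problem described above.]

    import unicodedata

    # "Việt" built from a plain letter plus combining marks:
    # e (U+0065) + combining dot below (U+0323) + combining circumflex (U+0302)
    decomposed = "Vie\u0323\u0302t"
    precomposed = unicodedata.normalize("NFC", decomposed)

    print(len(decomposed), len(precomposed))  # 6 4
    print(precomposed == "Vi\u1ec7t")         # True: one precomposed character, U+1EC7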
from Clytie (vi-VN, Vietnamese free-software translation team / nhóm Việt hóa phần mềm tự do)
http://groups-beta.google.com/group/vi-VN

[1] http://www.haible.de/bruno/packages-groff-utf8.html