Hello,

Using a UTF-8 locale, I've been finding many manpages that use incorrect characters. Groff converts many of these to reasonable characters in ASCII locales, but some things break in UTF-8 locales.

First of all, a bare '-' renders as a hyphen (U+2010) instead of as ASCII 0x2D. The correct groff escape to use in things like command-line options is '\-', which renders as the 0x2D minus sign in both UTF-8 and ASCII locales.

Hyphenated words such as "read-only" or "command-line" should properly be printed with actual hyphens instead of minus signs, and do not need to be changed. For clarity, though, I recommend that intentional hyphens be specified with the escape \(hy, to emphasize that they are actually intended to be hyphens and not mistakenly unescaped minus signs. This also allows one to easily check whether the manpage contains any unescaped minus signs using the regex /[^\\]-/. I've been replacing all unescaped - signs with either \- or \(hy as appropriate.

The second problem is the use of '`' for left quotes, and sometimes '``' for left double quotes. The current situation with quotes is unclear, since groff doesn't really do what groff(7) says it should for ASCII 0x27 and 0x60 (apostrophe and grave accent, respectively). The man page indicates that 0x27 should be rendered as U+0027, but it is errantly rendered as U+2019 (right single quote). Similarly, 0x60 should be rendered as U+0060, but is instead rendered as U+2018 (left single quote). This is probably to make the obsolete use of single quotes like `this' look pretty.

Quotes should probably look like this instead:

  \(oqthis text is enclosed in single quotes\(cq
  \(lqthis text is enclosed in double quotes\(rq

The above render as directional "curly" quotes in a UTF-8 environment, or as regular straight quotes such as "these" in an ASCII environment. This should only be used in text. Examples which intend to use the regular double-quote (U+0022) should use literal " or \(dq.
These will render as U+0022 in any environment, and examples can thus be copied and pasted from manpage to command line successfully.

Accents: grave (U+0060) and acute (U+00B4) should be given as \` and \' respectively. According to groff(7), a bare, unescaped ` should also render as "left quote, backquote (ASCII 0x60)". The left quote (U+2018) is different from the backquote (ASCII 0x60), so I think that "left quote" should be deleted from the groff manpage, and groff should be changed to display ` as ` (U+0060) and not as U+2018.

In theory (after these modifications are made) a grave accent (or backquote in shell parlance) could be rendered via a bare `, escaped \`, or \(ga. Since many manpages erroneously use ` to mean \(oq, it would be best to avoid the bare ` and stick to either \` or \(ga.

Most of these things don't make any difference in ASCII locales, but break in UTF-8 locales, in which the special characters are actually rendered specially. For example, searching for a particular command-line option is unnecessarily difficult if it is incorrectly specified with a hyphen instead of a minus sign. Also, copying and pasting examples out of manpages breaks if they're filled with multibyte curly quotes and/or hyphens when they should be using ASCII ["-'`].

Any comments on these ideas before I start filing a ton of bugs? I've filed a few so far with minor severity, submitted only to the maintainers, along with patches. I've started fixing some manpages, but have run into problems with auto-generated manpages (for example, mutt and gpg).

I'd like to be able to just point to this message/thread in the archives in the bug reports, rather than spelling it all out time and time again in each bug report. (Just patching the pages is tedious enough...)

good times,
Vineet
--
http://www.doorstop.net/
--
"Computer Science is no more about computers than astronomy is about telescopes." -- E.W. Dijkstra
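
P.S. The unescaped-minus check mentioned above can be sketched in a few lines of Python. This is only a rough illustration, not an existing tool: the function name find_unescaped_minus is made up, the pattern uses a lookbehind instead of /[^\\]-/ so that a '-' at the very start of a line is also caught, and skipping .\" comment lines is my own assumption about what one would want. It will still report false positives for escapes such as \s-1, where the '-' is part of a size change.

```python
import re

# A '-' that is not immediately preceded by a backslash. The lookbehind
# also matches a '-' in column 1, which /[^\\]-/ would miss.
UNESCAPED_MINUS = re.compile(r"(?<!\\)-")


def find_unescaped_minus(text):
    """Return (line_number, column, line) for each bare '-' in roff source.

    Hypothetical helper for illustration; columns are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Assumption: roff comment lines (starting with .\") never reach
        # the rendered page, so they are skipped.
        if line.startswith('.\\"'):
            continue
        for match in UNESCAPED_MINUS.finditer(line):
            hits.append((lineno, match.start() + 1, line))
    return hits
```

Feeding it the contents of a manpage source file returns one tuple per bare '-', and each of those then needs a human decision between \- and \(hy.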