
manpage character cleanup for UTF-8 compatibility



Hello,

Using a UTF-8 locale, I've been finding many manpages using incorrect
characters.  Groff converts many of these characters to reasonable
characters in ASCII locales, but some things break in UTF-8 locales.

First of all, '-' renders as a hyphen (U+2010) instead of as ASCII 0x2D.
The correct groff escape to use in things like command-line options is
'\-', which renders as the 0x2D minus sign in both UTF-8 and ASCII
locales.  Hyphenated words such as "read-only" or "command-line" should
properly be printed with actual hyphens instead of minus signs, and do
not need to be changed.  For clarity, though, I recommend that
intentional hyphens be specified with the escape \(hy, to emphasize that
they are actually intended to be hyphens and not mistakenly unescaped
minus signs.  This also allows one to easily check whether the manpage
contains any unescaped minus signs using the regex /[^\\]-/.  I've been
replacing all unescaped - signs with either \- or \(hy as
appropriate.
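
As a rough sketch of that check, here is a small Python helper.  The
function name find_unescaped_minus is made up for illustration and is not
part of any existing tool.  It uses a lookbehind rather than the literal
/[^\\]-/ so that a '-' at the very start of a line is also caught.  It
does not skip roff comment lines or treat an escaped backslash specially,
so expect some false positives.

```python
import re

# Same idea as /[^\\]-/, but the lookbehind also matches a '-' in column 1.
UNESCAPED_MINUS = re.compile(r"(?<!\\)-")


def find_unescaped_minus(source):
    """Return (line_number, column) pairs for each '-' not preceded by a backslash.

    Both numbers are 1-based. Hypothetical helper, for illustration only.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in UNESCAPED_MINUS.finditer(line):
            hits.append((lineno, match.start() + 1))
    return hits
```

Each hit still has to be looked at by hand to decide between \- and
\(hy; the script only finds candidates.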

Another problem is the use of '`' for left quotes, and sometimes '``' for
left double quotes.  The current situation with quotes is unclear, since
groff doesn't really do what groff(7) says it should for ASCII 0x27 and
0x60
(apostrophe and grave accent, respectively).  The man page indicates
that 0x27 should be rendered as U+0027, but it is errantly rendered as
U+2019 (right single quote).  Similarly, 0x60 should be rendered as
U+0060, but is instead rendered as U+2018.  This is probably to make the
obsolete use of single quotes like `this' look pretty.

Quotes should probably look like this instead:

\(oqthis text is enclosed in single quotes\(cq
\(lqthis text is enclosed in double quotes\(rq

The above render as directional "curly" quotes in a UTF-8 environment,
or as regular straight quotes such as "these" in an ASCII environment.
These escapes should only be used in running text.  Examples which
intend to use the
regular double-quote (U+0022) should use literal " or \(dq.  These will
render as U+0022 in any environment, and examples can thus be
copied and pasted from manpage to command line successfully.
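
A minimal sketch of how the old `this' and ``this'' styles could be
rewritten mechanically, assuming the line is running text and not an
example.  The name modernize_quotes is hypothetical.  The patterns are
deliberately naive: they leave an escaped \` alone, but they cannot tell
a shell backquote in an example from an old-style left quote, so the
output needs review before it goes into a patch.

```python
import re

# ``text'' -> \(lqtext\(rq
DOUBLE_QUOTED = re.compile(r"``(.+?)''")
# `text' -> \(oqtext\(cq, skipping a backquote that is escaped or doubled
SINGLE_QUOTED = re.compile(r"(?<![`\\])`([^`'\n]+)'")


def modernize_quotes(line):
    """Rewrite old-style ASCII quoting on one line of manpage source.

    Hypothetical helper; intended for running text only, not examples.
    """
    line = DOUBLE_QUOTED.sub(r"\\(lq\1\\(rq", line)
    line = SINGLE_QUOTED.sub(r"\\(oq\1\\(cq", line)
    return line
```

A lone apostrophe, as in "don't", has no matching backquote and is left
untouched.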

Accents: grave (U+0060) and acute (U+00B4) should be given as \` and \'
respectively.  According to groff(7), a bare, unescaped ` should also
render as "left quote, backquote (ASCII 0x60)".  The left quote (U+2018)
is different from the backquote (ASCII 0x60), so I think that "left
quote" should be deleted from the groff manpage, and groff should be
changed to display ` as `(U+0060) and not as U+2018.  In theory (after
these modifications are made) a grave accent (or backquote in shell
parlance) could be rendered via a bare `, escaped \`, or \(ga.  Since
many manpages erroneously use ` to mean \(oq, it would be best to avoid the
bare ` and stick to either \` or \(ga.

Most of these things don't make any difference in ASCII locales, but
break in UTF-8 locales in which the special characters are actually
rendered specially.  For example, searching for a particular
command-line option is unnecessarily difficult if it is incorrectly
specified with a hyphen instead of a minus sign.  Also, copying and
pasting examples out of manpages breaks if they're filled with multibyte
curly quotes and/or hyphens when they should be using ASCII ["-'`].
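
To see whether a page is affected, one could scan its rendered UTF-8
output for the typographic characters discussed above.  This is only a
sketch: the name find_suspect_chars and the choice of code points are my
own assumptions, and the rendered text is passed in as a string however
you choose to capture it.

```python
# Code points that break searching and copy-and-paste when they appear
# where ASCII - ' ` " was intended.
SUSPECT = {
    "\u2010": "HYPHEN",
    "\u2018": "LEFT SINGLE QUOTATION MARK",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u201c": "LEFT DOUBLE QUOTATION MARK",
    "\u201d": "RIGHT DOUBLE QUOTATION MARK",
}


def find_suspect_chars(rendered):
    """Return (line_number, code_point, name) for each suspect character.

    Hypothetical helper; 'rendered' is the formatted page text as a str.
    """
    hits = []
    for lineno, line in enumerate(rendered.splitlines(), start=1):
        for ch in line:
            if ch in SUSPECT:
                hits.append((lineno, "U+%04X" % ord(ch), SUSPECT[ch]))
    return hits
```

A hit is not necessarily a bug, since hyphenated words and quoted prose
are supposed to use these characters; hits inside option names or
examples are the ones worth fixing.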

Any comments on these ideas before I start filing a ton of bugs?  I've
filed a few so far with minor severity submitted only to the
maintainers, along with patches.

I've started fixing some manpages, but have run into problems with
auto-generated manpages (for example, mutt and gpg).

I'd like to be able to just point to this message/thread in the archives
in the bug reports, rather than spelling it all out time and time again
in each bug report.  (Just patching the pages is tedious enough...).

good times,
Vineet
-- 
http://www.doorstop.net/
-- 
"Computer Science is no more about computers
than astronomy is about telescopes."  -- E.W. Dijkstra
