Re: manpage character cleanup for UTF-8 compatibility
On Sun, Apr 06, 2003 at 10:44:33PM +0100, Andrew Suffield wrote:
> On Sun, Apr 06, 2003 at 09:03:31PM +0100, Colin Watson wrote:
> > One other thought has occurred to me while working on fixing certain
> > parts of man-db's locale support. Sooner or later, when groff 2 is
> > released (but not beforehand!), we're going to have to move towards
> > having all man pages encoded in UTF-8. For most languages this probably
> > isn't too bad: you just use de_DE.UTF-8 rather than de, or whatever
> > (although I'm not sure how that'd work for languages with multiple
> > regional variants). It's going to be a royal pain for English, though,
> > because currently we just put things directly in /usr/share/man, meaning
> > the C locale, and there's no C.UTF-8, probably for good reasons.
> > en_US.UTF-8 would be a poor choice because we also need en_GB.UTF-8 and
> > so on.
> AIUI, only ASCII is valid in the C locale anyway. Setting the top bit
> is an error.
True enough, and the FHS says:
    For example, systems which only have English manual pages coded with
    ASCII, may store manual pages (the man<section> directories) directly in
    /usr/share/man. (That is the traditional circumstance and arrangement,
It's sometimes a pain to be quite that strict, though (e.g. authors'
names), so I wouldn't object to people using ISO-8859-1 accents like
\('e in /usr/share/man/man*, just as long as they're coded thus rather
than in raw ISO-8859-1.
It doesn't seem to be all that prevalent anyway. A quick script
checking just /usr/share/man/man1 on my system flags only 28 pages out
of 1697, and at least one of those is a false positive caused by
non-ASCII characters that appear only in comments:
  find /usr/share/man/man1 -type f | while read x; do zcat "$x" | \
    iconv -f ISO-8859-1 -t US-ASCII >/dev/null 2>&1 || echo "$x"; done
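To weed out the comment-only false positives, something along these lines might do. It is only a rough sketch, not what produced the numbers above: the function names has_non_ascii and scan_dir are made up for illustration, it swaps the iconv test for a byte-class grep in the C locale, and it assumes that everything from \" to the end of a line is a roff comment (so an escaped \\" would be mishandled) and that gzip -cdf passes uncompressed pages through unchanged, as GNU gzip does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a page still contains bytes outside
# printable ASCII and whitespace once roff comments have been removed.
has_non_ascii() {
    # -f lets (GNU) gzip copy pages that are not compressed unchanged.
    gzip -cdf "$1" |
        # Drop everything from \" to end of line. The C locale makes sed
        # work byte by byte, so stray ISO-8859-1 bytes do not stop the match.
        LC_ALL=C sed 's/\\".*//' |
        # In the C locale only ASCII bytes are in [:print:] or [:space:].
        LC_ALL=C grep -q '[^[:print:][:space:]]'
}

# Hypothetical wrapper: print every file under directory $1 that is flagged.
scan_dir() {
    find "$1" -type f | while IFS= read -r page; do
        if has_non_ascii "$page"; then
            echo "$page"
        fi
    done
}
```

Running scan_dir /usr/share/man/man1 would then list only pages whose non-ASCII bytes sit outside comments. Control characters are flagged as well, which is probably what you want in a man page anyway.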
Colin Watson [email@example.com]