Re: manpage character cleanup for UTF-8 compatibility

To: debian-devel@lists.debian.org
Subject: Re: manpage character cleanup for UTF-8 compatibility
From: Colin Watson <cjwatson@debian.org>
Date: Sun, 6 Apr 2003 23:49:40 +0100
Message-id: <20030406224940.GB6681@riva.ucam.org>
Mail-followup-to: debian-devel@lists.debian.org
In-reply-to: <20030406214433.GB7956@doc.ic.ac.uk>
References: <20030326000151.GA12397@doorstop.net> <20030406200331.GB5480@riva.ucam.org> <20030406214433.GB7956@doc.ic.ac.uk>

On Sun, Apr 06, 2003 at 10:44:33PM +0100, Andrew Suffield wrote:
> On Sun, Apr 06, 2003 at 09:03:31PM +0100, Colin Watson wrote:
> > One other thought has occurred to me while working on fixing certain
> > parts of man-db's locale support. Sooner or later, when groff 2 is
> > released (but not beforehand!), we're going to have to move towards
> > having all man pages encoded in UTF-8. For most languages this probably
> > isn't too bad: you just use de_DE.UTF-8 rather than de, or whatever
> > (although I'm not sure how that'd work for languages with multiple
> > regional variants). It's going to be a royal pain for English, though,
> > because currently we just put things directly in /usr/share/man, meaning
> > the C locale, and there's no C.UTF-8, probably for good reasons.
> > en_US.UTF-8 would be a poor choice because we also need en_GB.UTF-8 and
> > so on.
> 
> AIUI, only ASCII is valid in the C locale anyway. Setting the top bit
> is an error.

True enough, and the FHS says:

       For example, systems which only have English manual pages coded with
       ASCII, may store manual pages (the man<section> directories) directly in
       /usr/share/man.  (That is the traditional circumstance and arrangement,
       in fact.)

It's sometimes a pain to be quite that strict, though (e.g. authors'
names), so I wouldn't object to people using accented characters from
the ISO-8859-1 repertoire in /usr/share/man/man*, just as long as
they're coded as groff escapes like \('e rather than as raw ISO-8859-1
bytes.
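
For anyone wanting to convert existing pages, here is a minimal sketch
of the idea, not a complete tool. The function name latin1_to_groff is
made up for illustration, and the table covers only three characters
(e-acute, u-umlaut, n-tilde) using the two-character glyph names from
groff_char(7); a real conversion would need the full Latin-1 range.

```shell
# Hypothetical helper: rewrite a few raw ISO-8859-1 bytes on stdin as
# groff special-character escapes. Only three characters are handled.
latin1_to_groff() {
    l1_e_acute=$(printf '\351')   # byte 0xE9, e-acute in ISO-8859-1
    l1_u_uml=$(printf '\374')     # byte 0xFC, u-umlaut
    l1_n_tilde=$(printf '\361')   # byte 0xF1, n-tilde
    # LC_ALL=C makes sed treat input as single bytes, not UTF-8.
    LC_ALL=C sed \
        -e "s/$l1_e_acute/\\\\('e/g" \
        -e "s/$l1_u_uml/\\\\(:u/g" \
        -e "s/$l1_n_tilde/\\\\(~n/g"
}
```

You would run it over an uncompressed page, e.g.
"zcat page.1.gz | latin1_to_groff", and review the result by hand
before installing it.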

It doesn't seem to be all that prevalent anyway. A quick script [1]
checking just /usr/share/man/man1 on my system flags only 28 pages out
of 1697, and at least one of those is a false positive due to some
non-ASCII characters that are only in comments:

  /usr/share/man/man1/esd-config.1.gz
  /usr/share/man/man1/ethereal.1.gz
  /usr/share/man/man1/filterm.1.gz
  /usr/share/man/man1/flipdiff.1.gz
  /usr/share/man/man1/font2psf.1.gz
  /usr/share/man/man1/grolbp.1.gz
  /usr/share/man/man1/html2text.1.gz
  /usr/share/man/man1/ispell-wrapper.1.gz
  /usr/share/man/man1/konwert.1.gz
  /usr/share/man/man1/magic2mime.1.gz
  /usr/share/man/man1/mini-dinstall.1.gz
  /usr/share/man/man1/mmroff.1.gz
  /usr/share/man/man1/mp4h.1.gz
  /usr/share/man/man1/pbmtonokia.1.gz
  /usr/share/man/man1/perlcn.1.gz
  /usr/share/man/man1/perlebcdic.1.gz
  /usr/share/man/man1/perlhack.1.gz
  /usr/share/man/man1/perlhist.1.gz
  /usr/share/man/man1/perljp.1.gz
  /usr/share/man/man1/perlko.1.gz
  /usr/share/man/man1/perlothrtut.1.gz
  /usr/share/man/man1/perlthrtut.1.gz
  /usr/share/man/man1/perltw.1.gz
  /usr/share/man/man1/ppmshadow.1.gz
  /usr/share/man/man1/ptx.1.gz
  /usr/share/man/man1/thumbpdf.1.gz
  /usr/share/man/man1/unicode.1.gz
  /usr/share/man/man1/unwrapdiff.1.gz

[1] find /usr/share/man/man1 -type f | while IFS= read -r x; do zcat "$x" | \
      iconv -f ISO-8859-1 -t US-ASCII >/dev/null 2>&1 || echo "$x"; done
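
The comment-only false positives could probably be avoided by dropping
roff comments before testing. The following is a rough sketch under a
few assumptions: the function name flag_if_non_ascii is invented here,
it uses grep in the C locale instead of iconv, and it treats everything
from \" to the end of a line as a comment, which will misfire on the
rare line where \" follows an escaped backslash.

```shell
# Hypothetical helper: read an uncompressed man page on stdin and print
# the label given as $1 if anything other than printable ASCII or
# whitespace remains once roff comments are stripped.
flag_if_non_ascii() {
    # sed removes everything from \" to end of line (roff comments);
    # grep -q then looks for any byte that is neither printable ASCII
    # nor whitespace. LC_ALL=C forces byte-wise matching.
    if LC_ALL=C sed -e 's/\\".*//' |
        LC_ALL=C grep -q '[^[:print:][:space:]]'; then
        printf '%s\n' "$1"
    fi
}
```

It would slot into the same sort of loop as [1], along the lines of
'zcat "$x" | flag_if_non_ascii "$x"'. Note that it also flags stray
ASCII control characters, which the iconv check does not.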

-- 
Colin Watson                                  [cjwatson@flatline.org.uk]
