Bug#440420: [PROPOSAL] Manual page encoding

To: Jens Seidel <jensseidel@users.sf.net>
Cc: "Giacomo A. Catenazzi" <cate@debian.org>, 440420@bugs.debian.org, debian-i18n@lists.debian.org
Subject: Bug#440420: [PROPOSAL] Manual page encoding
From: Colin Watson <cjwatson@debian.org>
Date: Tue, 4 Sep 2007 13:35:48 +0100
Message-id: <20070904123548.GM6091@riva.ucam.org>
Reply-to: Colin Watson <cjwatson@debian.org>, 440420@bugs.debian.org
In-reply-to: <20070904120432.GA12872@imkf-pc073.imkf.tu-freiberg.de>
References: <20070901120232.GB18492@riva.ucam.org> <46DC2A62.50402@debian.org> <20070903164719.GE6091@riva.ucam.org> <46DD1D67.6020906@debian.org> <20070904105256.GK6091@riva.ucam.org> <20070904120432.GA12872@imkf-pc073.imkf.tu-freiberg.de>

On Tue, Sep 04, 2007 at 02:04:32PM +0200, Jens Seidel wrote:
> On Tue, Sep 04, 2007 at 11:52:57AM +0100, Colin Watson wrote:
> > Thanks. I hope that my comments above clarify some further confusion. I
> > would still appreciate concrete information and examples on why you
> > don't like the idea of manual pages being installed in UTF-8 (noting
> > that as a package maintainer or a translator you wouldn't have to
> > actually edit it in that encoding if you didn't want to, it doesn't have
> > to be done urgently or on any kind of flag day, I have addressed the
> > non-Latin concern above, and it will not have a negative effect on users
> > of non-UTF-8 locales).
> 
> Is it safe to use UTF-8 characters if a very similar character exists in
> ASCII or can be expressed using groff macros? Think about the many
> dashes which exist in typography. Is it OK to use a UTF-8 hyphen sign
> instead of \(hy (same for en-dash, em-dash, ...) especially as the
> ordinary minus "-" is very similar in the output?
> 
> Will man-db support all kinds of whitespace (such as &nbsp;) ...?

You'll need to use the characters documented in groff_char(7) for this,
at least for the time being. See below.

> Of course there exist transliterations of all these characters I'm
> currently talking about, but it would probably make life easier to
> restrict to ASCII if possible, right?

I do appreciate that there are a few gotchas here. I think it is unduly
cumbersome to express all non-ASCII alphanumeric characters using groff
named characters, though. That option has been available for ages and
translators have generally not taken advantage of it; I can entirely
understand why not.

> Isn't there also more than one way to express accented characters
> such as ä (as a single character, or as 'a' followed by an accent)?

groff 1.19 supports full Unicode-style composite glyphs, but the version
we have doesn't (see the comment in my original bug report about groff
versioning). Both our version and newer versions support named
characters such as \[:a] or \(:a (variant spellings), again documented
in groff_char(7). There's also the \N escape which can give you
font-dependent numbered glyphs, which are Unicode codepoints if you
happen to know that the utf8 device is in use.
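To illustrate the two points above, here is a minimal Python sketch. It is
not part of man-db or groff; the function name to_groff_named and the small
mapping table are hypothetical, and the table covers only a handful of the
named characters listed in groff_char(7). It first composes "base letter
followed by combining accent" into the single-character form, then replaces
each known character with its groff named character.

```python
import unicodedata

# Illustrative subset of the named characters documented in groff_char(7).
# The table itself is an assumption made for this sketch, not a complete list.
GROFF_NAMED = {
    "\u00e4": r"\[:a]",  # a with dieresis
    "\u00f6": r"\[:o]",  # o with dieresis
    "\u00fc": r"\[:u]",  # u with dieresis
    "\u00df": r"\[ss]",  # sharp s
    "\u2010": r"\[hy]",  # hyphen
    "\u2013": r"\[en]",  # en-dash
    "\u2014": r"\[em]",  # em-dash
}


def to_groff_named(text: str) -> str:
    """Replace known non-ASCII characters with groff named characters.

    NFC normalisation first turns 'a' + U+0308 (combining dieresis) into
    the single character U+00E4, so both spellings map to the same name.
    Characters missing from the table are passed through unchanged.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(GROFF_NAMED.get(ch, ch) for ch in composed)


if __name__ == "__main__":
    precomposed = "Ger\u00e4t"   # single-character form
    decomposed = "Gera\u0308t"   # 'a' followed by a combining accent
    print(to_groff_named(precomposed))
    print(to_groff_named(decomposed))
```

Both spellings of the word produce the same groff source, which is the
practical reason to normalise before mapping.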

As above, though, these have been available and translators generally
haven't used them; I can imagine that they're insanely cumbersome to use
in practice for e.g. Japanese. So I'd really rather just support plain
UTF-8 input for alphanumerics, which I think will actually get used.

Do you think we will need explicit language in policy for this? For the
time being, until we have a version of groff supporting direct UTF-8
input, the implementation will require that the page be convertible to
the legacy encoding for that language using iconv. It'll use 'iconv -c',
so unknown characters are dropped rather than breaking the whole page,
but the constraint applies all the same. So, for example, German pages
should avoid characters without a direct equivalent in ISO-8859-1. This
seems like a reasonable thing to document after man-db 2.5.0, and would
cover things like UTF-8 hyphen characters.
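
The effect of this constraint can be approximated in a few lines of Python.
This is only a sketch of the behaviour described above, not the man-db
implementation: the function names to_legacy and unconvertible are
hypothetical, and Python's errors="ignore" handler stands in for the
dropping behaviour of 'iconv -c'.

```python
def to_legacy(page_utf8: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Recode a UTF-8 page to a legacy encoding, dropping what does not fit.

    errors="ignore" silently omits characters with no equivalent in the
    target encoding, roughly as 'iconv -c' does.
    """
    text = page_utf8.decode("utf-8")
    return text.encode(legacy_encoding, errors="ignore")


def unconvertible(page_utf8: bytes, legacy_encoding: str = "iso-8859-1") -> list:
    """Return the distinct characters that the recoding would drop."""
    text = page_utf8.decode("utf-8")
    dropped = set()
    for ch in text:
        try:
            ch.encode(legacy_encoding)
        except UnicodeEncodeError:
            dropped.add(ch)
    return sorted(dropped)


if __name__ == "__main__":
    # A German word containing U+2010 HYPHEN, which ISO-8859-1 lacks.
    source = "Zeichensatz\u2010Pr\u00fcfung".encode("utf-8")
    print(to_legacy(source))
    print([hex(ord(ch)) for ch in unconvertible(source)])
```

A check like unconvertible() is one way a translator could find characters
that would silently vanish from the rendered page before shipping it.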

I'm not sure how groff will handle such characters once it does have
UTF-8 input support. I suspect it would convert U+2010 to its internal
"hy" glyph and render that in whatever way is appropriate for the output
device; that would really be ideal. However, I don't have enough
information to make a decision based on that guess.

In general, I think it's worthwhile for policy to make comments on
encoding for purposes of interoperability and standardisation, but I'd
be inclined to draw the line at filling it up with instructions on how
to use groff correctly. Does this sound reasonable?

Thanks,

-- 
Colin Watson                                       [cjwatson@debian.org]
