Re: Reasons to not use quote signs directly?

To: Helge Kreutzmann <debian@helgefjell.de>
Cc: debian-dpkg@lists.debian.org
Subject: Re: Reasons to not use quote signs directly?
From: Russ Allbery <rra@debian.org>
Date: Thu, 27 Oct 2016 17:10:59 -0700
Message-id: <87bmy5pc30.fsf@hope.eyrie.org>
In-reply-to: <20161019231407.irobeno7dlomodr5@gaara.hadrons.org> (Guillem Jover's message of "Thu, 20 Oct 2016 01:14:07 +0200")
References: <20160919163049.GA27815@Debian-50-lenny-64-minimal> <20160920235910.ewk2ejmlx2sghe7w@gaara.hadrons.org> <878ttk5d3x.fsf@hope.eyrie.org> <20161019231407.irobeno7dlomodr5@gaara.hadrons.org>

Guillem Jover <guillem@debian.org> writes:

> Yeah the Xs were really annoying. On the AIX and Mac OS X systems I
> tested on, AFAIR they produced garbage when rendering, but I can recheck
> to be sure. I think I might have also tested on a system that used man
> (w/o Unicode support) instead of man-db, but I'd need to reverify. And I
> think the various BSDs use groff, but it might need checking too.

Oh, okay, so proprietary UNIX is still a problem for just using Unicode
everywhere, but Linux and BSD may be okay.

> Just to clarify (because I think I was a bit vague previously), on
> systems that didn't support Unicode using the groff macros produced no
> output (so no garbage), which is better IMO than the Xs or garbage. :)

Still not great, though.  :(  Sigh.  So there's still no silver bullet.
But I think the scale has tipped at this point to the degree where it's
worth having good output with groff, even if that means one gets bad
output without groff.

> For the current conversion in dpkg, I've taken most of the common
> symbols from groff_char(7) and created a very simple sed script, I'm not
> sure if you were thinking about something along those lines (although in
> proper perl)?

>   <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/tree/man/utf8toman.sed?h=next/master&id=c07b9b79447e200645ea423f959194fcbf8d4d32>

Yeah, that would work, although aren't there quite a few more sequences
than that?  Does groff have a way of representing an arbitrary Unicode
code point?

For Pod::Man usage, the output format I'd want would be a hash mapping
Unicode code points to the correct groff escape.  Or, in an absolutely
ideal world, to have an Encode encoding for groff escapes, similar to how
the Encode::MIME::Header encoding works to generate RFC 2047 strings.
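
To make the "hash mapping code points to escapes" idea concrete, here is a
minimal sketch, written in Python rather than Perl purely for illustration.
The table name GROFF_ESCAPES and the function to_groff are hypothetical; the
escapes themselves are a small subset of the glyph names listed in
groff_char(7).  It does not handle literal backslashes or any other roff
quoting, which a real converter would need to.

```python
# Hypothetical table: Unicode code point -> groff glyph escape.
# Only a handful of entries from groff_char(7) are shown.
GROFF_ESCAPES = {
    0x2018: r"\[oq]",  # left single quotation mark
    0x2019: r"\[cq]",  # right single quotation mark
    0x201C: r"\[lq]",  # left double quotation mark
    0x201D: r"\[rq]",  # right double quotation mark
    0x2013: r"\[en]",  # en dash
    0x2014: r"\[em]",  # em dash
    0x00A9: r"\[co]",  # copyright sign
}


def to_groff(text: str) -> str:
    """Replace every character found in the table with its groff escape.

    Characters without an entry are passed through unchanged.
    """
    # str.translate accepts a dict keyed by code point (int).
    return text.translate(GROFF_ESCAPES)


if __name__ == "__main__":
    print(to_groff("\u201cquoted\u201d \u2014 \u00a9 2016"))
```

A flat table keyed by code point keeps the data separate from the
conversion logic, so the same table could in principle be exported in
whatever form a consumer such as Pod::Man wanted.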

If groff doesn't have a way of encoding arbitrary Unicode code points,
what do you think Pod::Man should do with characters that don't have a
mapping (Chinese characters, for instance)?
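
One possible policy for unmapped characters, sketched below in Python for
illustration only: use a named escape when one is known, and otherwise fall
back to the generic \[uXXXX] form that groff_char(7) describes for Unicode
code points.  Whether a given groff setup or output device actually renders
that form is an assumption here and would need checking, and the names
NAMED, encode_char, and encode are hypothetical.

```python
# Hypothetical subset of named glyph escapes from groff_char(7).
NAMED = {
    0x201C: r"\[lq]",  # left double quotation mark
    0x201D: r"\[rq]",  # right double quotation mark
    0x2014: r"\[em]",  # em dash
}


def encode_char(ch: str) -> str:
    """Encode one character: ASCII as-is, named escape, else \\[uXXXX]."""
    cp = ord(ch)
    if cp < 0x80:
        return ch  # plain ASCII is passed through untouched
    if cp in NAMED:
        return NAMED[cp]
    # Assumed fallback: uppercase hex, at least four digits.
    return "\\[u%04X]" % cp


def encode(text: str) -> str:
    """Encode a whole string character by character."""
    return "".join(encode_char(c) for c in text)


if __name__ == "__main__":
    print(encode("\u201c\u4e2d\u6587\u201d"))
```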

> If you could specify exactly which symbols you'd like to see supported I
> might take a stab at this, when I have some spare time. Say everything
> in groff_char(7) or similar. :)

As much as possible is of course ideal, but I'm happy to take partial
work!  :)

> I guess field names might be easy to spot as they have the standard form
> Field-Name(-Other)* which is probably not common for English words?
> This might trip over on other languages such as German for example which
> tends to capitalize many words.

A bit tricky for, say, book titles, too.  :(
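
A rough sketch of the heuristic under discussion, in Python for illustration
only.  The pattern is one reading of the Field-Name(-Other)* form (requiring
at least one hyphen), and the names FIELD_NAME and find_field_names are
hypothetical.  It shows both the intended matches and the kind of false
positive that hyphenated proper names produce.

```python
import re

# Capitalized word, then one or more "-Capitalized" parts (ASCII only).
FIELD_NAME = re.compile(r"\b[A-Z][A-Za-z0-9]*(?:-[A-Z][A-Za-z0-9]*)+\b")


def find_field_names(text: str) -> list[str]:
    """Return every substring that looks like a control-file field name."""
    return FIELD_NAME.findall(text)


if __name__ == "__main__":
    # Intended matches: real field names.
    print(find_field_names("Set Build-Depends and X-Python-Version."))
    # False positive: a hyphenated title is indistinguishable by shape.
    print(find_field_names("See the Spider-Man comics."))
    # Ordinary lowercase hyphenation is left alone.
    print(find_field_names("a well-known command-line tool"))
```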

> The other major issue is commands, which I'm not sure are so easy to
> detect. Maybe they could get to use the \- minus if they are inside some
> other markup. I see that C<some-command> escapes them, as does
> L<some-command(1)>, but L<some-command> does not (any reason?), which
> could be handy to use I guess. Filenames are also safe with
> F</some-dir/file-name>. The only problem is using the proper markup that
> also preserves the same output as the current man pages.

B<> and I<> could just be surrounding normal words that should use normal
hyphens.  L<some-command> is a link to a section in the same document
entitled some-command, so the assumption there is also that it could be a
regular English word.

As you say, though, I'm not entirely sure the distinction is worth all the
trouble we've put into it over the years.  nroff at least seems to have
just given up and now maps them all to "-" in the output anyway.  That
used to be a Debian-specific change, but it looks like upstream has
switched to treating - as \-, I think?  For HTML output, upstream maps \-
to &minus; and Debian still overrides that to - instead.  (If upstream
thinks \- is a minus sign and not ASCII 45, I'm really confused about
what's going on with this, though.)

> I've always found the AUTHORS, COPYRIGHT or LICENSE sections to be
> distracting, and in dpkg we got rid of all of them, because in addition
> they were getting usually out-of-sync with the actual copyright
> statements, and required adding names and updating years in two places.

Yeah, that part is irritating.  The alternative, which I use in my
packages these days, is to have these reflect the authors, copyright, and
license of the *manual page*, but that's also weird.

=for license, resulting in a comment in the generated man page, seems like
a better general solution (and then it probably makes sense for this to
always reflect the license of the documentation file itself, not the
larger package).

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
