Re: Reasons to not use quote signs directly?



Hi!

On Wed, 2016-10-19 at 12:54:10 -0700, Russ Allbery wrote:
> Guillem Jover <guillem@debian.org> writes:
> > Using raw UTF-8 in the roff source is not portable, and some (most?)
> > implementations might not be happy about that. But using the escape
> > sequences should always be safe(?). (I've just verified at least on AIX
> > and Mac OS X systems.)
> 
> Internationalization of man pages has a bunch of irritating problems that
> come down to picking which non-portable problem you want to have.

Right. :/

> I know that eight-bit characters in *roff source caused serious problems
> (segfaults, etc.) on very old *roff implementations on proprietary UNIXes
> (Solaris 2.4, that sort of thing), which is why I've always avoided using
> that approach with the output of pod2man without a special flag (-u).  But
> I'm not sure it makes sense to still be that cautious, and the default
> output of pod2man is awful (replacing all non-ASCII characters with X,
> which just isn't acceptable any more).

Yeah, the Xs were really annoying. On the AIX and Mac OS X systems I
tested on, AFAIR they produced garbage when rendering, but I can recheck
to be sure. I think I might have also tested on a system that used man
(w/o Unicode support) instead of man-db, but I'd need to reverify. And I
think the various BSDs use groff, but it might need checking too.

> Various people have asked for a groff macro output mode, and I think that
> would be a fine idea, except that it requires some effort to build the
> large table of Unicode code point to groff macro mappings.  I'm not sure
> if it makes sense to have that be the default output mode or to have raw
> Unicode be the default output mode (I want to get rid of the current
> default).  It sounds like from your portability investigation that using
> groff macros as the default output mode might work, which is valuable
> information!

Just to clarify (because I think I was a bit vague previously): on
systems that didn't support Unicode, using the groff macros produced no
output (so no garbage), which is better IMO than the Xs or garbage. :)

For the current conversion in dpkg, I've taken most of the common
symbols from groff_char(7) and created a very simple sed script. I'm
not sure if you were thinking about something along those lines
(although in proper perl)?

  <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/tree/man/utf8toman.sed?h=next/master&id=c07b9b79447e200645ea423f959194fcbf8d4d32>
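For illustration, the idea behind such a mapping can be sketched in a
few lines of Python. This is not the sed script linked above: the table
is a small example subset of the glyph names in groff_char(7), and the
names GROFF_ESCAPES and utf8_to_groff are made up for this sketch.

```python
# Assumed example subset of the glyph names listed in groff_char(7).
GROFF_ESCAPES = {
    "\u2018": r"\(oq",  # left single quotation mark
    "\u2019": r"\(cq",  # right single quotation mark
    "\u201c": r"\(lq",  # left double quotation mark
    "\u201d": r"\(rq",  # right double quotation mark
    "\u00ab": r"\(Fo",  # left guillemet
    "\u00bb": r"\(Fc",  # right guillemet
    "\u2013": r"\(en",  # en dash
    "\u2014": r"\(em",  # em dash
}


def utf8_to_groff(text):
    """Replace each mapped character with its groff escape.

    Characters that are not in the table are passed through unchanged.
    """
    return "".join(GROFF_ESCAPES.get(ch, ch) for ch in text)
```

A real table would need to cover many more code points, and would have
to decide what to do with characters that have no groff glyph name.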

> Needless to say, if anyone wanted to put together the mapping table to
> enable that, I would be very interested.  I'll add it to my personal to-do
> list, but that's quite long and the time I have available to work on free
> software at the moment is sadly limited.

If you could specify exactly which symbols you'd like to see supported,
I might take a stab at this when I have some spare time. Say,
everything in groff_char(7) or similar. :)

> > But coming back to the source code, yes, I pretty much agree that roff
> > can be very noisy and non-readable, to the point I've actually gotten
> > bothered enough to check for possible alternatives this last month. The
> > problem is finding a format that is clear, expressive enough, supported
> > by po4a, does not require huge Build-Depends and produces portable and
> > nicely formatted man pages. The obvious candidate is perl's POD, because
> > we are already using that for the perl modules and require perl to
> > build.
> 
> > But I've found some quirks and issues that while not unsurmountable,
> > might need to be looked at first and perhaps fixed or workarounds found
> > to avoid "regressions", and I'm not sure which ones Russ would be happy
> > to get bug reports for? :)
> 
> I'm definitely happy to get bug reports!  I do try to slowly work through
> issues like this (for instance, I've now added separate flags to control
> the left and right quote marks, from a bug report you filed quite some
> time ago).  Obviously, patches make things even faster, and I'm slowly
> trying to modernize and improve the coding style of the podlators code,
> although it's a rather long process.

Ok, noted! Then I'll start filing reports upstream.

> > I'm attaching a PoC conversion (can be tested with «pod2man
> > deb-symbols.pod|man -l -», and is available also from [G]) and here's a
> > list of potential differences/issues:
> 
> >   - References are in italic not bold.
> 
> I can change this (a bug report to remind me to do so is very welcome).
> For the record, italics actually used to be the correct convention
> somewhere (I know I didn't make that up), probably Solaris since I took a
> lot of the conventions from there, but I see that man-pages(7) now
> recommends bold.  This is one of those things that was never standardized,
> but at this point I think the Linux man-pages Project is sufficiently
> widespread and authoritative that, as long as it's not in complete
> disagreement with BSD, I'm happy to go with their conventions.
> Particularly over old Solaris conventions, since Solaris is now mostly
> dead.

Perfect.

> >   - Does not map ‘’, “”, and other UTF-8 quotes to roff escape sequences
> >     (or have to use non-portable --utf8 option).
> 
> See above for a rather extended discussion of that.

Yeah.

> >   - Needs raw roff for some formatting, as POD is not expressive enough
> >     (this will have to do with «=begin man» as pod2man cannot change
> >     the POD syntax anyway).
> 
> Yes.  POD is sadly a somewhat limited syntax, and while there was a Perl 6
> take on POD that was trying to expand it, I don't think it ever caught on.
> These days, everyone seems to have switched to Markdown or reStructured
> Text, which certainly have their merits but which don't seem to be good
> fits for man page generation.

Right, and exactly my thoughts on the other formats.

> So, for things like tables, you're probably going to need to continue to
> escape to raw *roff with =begin man.

I don't think there are any tables in the dpkg man pages. But see my
other mail.

> >   - Many minus signs are output as hyphens (for example for field names).
> 
> This is a nasty problem, since POD has no explicit markup for this and one
> has to use heuristics.
> 
> Improvements in the heuristics are certainly welcome.  This is the current
> code:

[…]

> As you can see, it's a bunch of messy and rather fragile regexes.  But
> there is a test system, so I'm happy to tweak these and add more tests if
> you have specific use cases that you encounter.
> 
> The trick is going to be distinguishing between hyphenated English words
> (which should use the unmarked - character in *roff source) and field
> names where you want an explicit \- minus sign.  Although I could see an
> argument for just supporting disabling this heuristic if one doesn't care
> about good line wrapping.

I guess field names might be easy to spot, as they have the standard
form Field-Name(-Other)*, which is probably not common for English words?
This might trip up on other languages, though, such as German, which
tends to capitalize many words.
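
A minimal sketch of that field-name idea, in Python rather than in the
podlators code. The pattern and the name escape_field_names are
hypothetical, and this is not how pod2man currently handles hyphens.

```python
import re

# Hypothetical pattern for Field-Name(-Other)*: two or more capitalized
# words joined by hyphens.
FIELD_NAME = re.compile(r"\b[A-Z][A-Za-z0-9]*(?:-[A-Z][A-Za-z0-9]*)+\b")


def escape_field_names(text):
    """Turn '-' into the roff minus sign '\\-' inside field-like tokens only."""
    return FIELD_NAME.sub(lambda m: m.group(0).replace("-", r"\-"), text)
```

With this, "Build-Depends" becomes "Build\-Depends" while "well-known"
is left alone. A capitalized compound such as "Debian-Paket" would be
escaped too, which is exactly the false positive mentioned above.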

The other major issue is commands, which I'm not sure are so easy to
detect. Maybe they could get the \- minus when they are inside some
other markup. I see that C<some-command> escapes them, as does
L<some-command(1)>, but L<some-command> does not (any reason?), which
could be handy to use I guess. Filenames are also safe with
F</some-dir/file-name>. The only problem is using the proper markup
that also preserves the same output as the current man pages.

Now that I check, E<0x2D> could be used, but that would seem atrocious,
and a step back in readability.

> >   - Default for pod2man is no justified text.
> 
> This is a (very strong) personal preference, since I think most man pages
> are read on terminals with fixed-width fonts, and I think justified text
> looks awful in a fixed-width font.  But I'd be happy to add a non-default
> flag that suppresses the turning off of justification.

Yeah, I saw the rationale, it makes sense, and I might eventually
switch to that. It was mostly an obvious difference I spotted,
something I'm visually used to by now, and a matter of trying to get
a 1:1 conversion possibly w/o any visible differences. That's why I
tried to carefully mention these as a list of issues and mere
differences, which might not be problematic at all. :)

> >   - The license blurb is only present as a comment on the source.
> 
> Yeah, I've given up on this and just put the license in a section of the
> output of the man page, but I think it would be lovely to put it in a
> comment.  This probably requires some sort of =for license block (and I'm
> not sure what Pod::Text should do with it -- just suppress it entirely, I
> guess).  I'm happy to add support for this (obviously, patches even more
> welcome).

I've always found the AUTHORS, COPYRIGHT or LICENSE sections to be
distracting, and in dpkg we got rid of all of them, because in
addition they were usually getting out of sync with the actual
copyright statements, and required adding names and updating years in
two places.

Given that the generated output states so clearly in its header, I
guess I was content to leave it at that. But I might take a look at
adding support for a «=for license» block, that seems like a nice
idea!

Thanks,
Guillem

