
Re: Reasons to not use quote signs directly?



Guillem Jover <guillem@debian.org> writes:

> Using raw UTF-8 in the roff source is not portable, and some (most?)
> implementations might not be happy about that. But using the escape
> sequences should always be safe(?). (I've just verified at least on AIX
> and Mac OS X systems.)

Internationalization of man pages has a bunch of irritating problems that
come down to picking which non-portable problem you want to have.

groff macros are portable across groff versions with varying levels of
maturity in Unicode handling, but not to *roff implementations other than
groff, since most of the macros used for Unicode characters seem to be groff
inventions not present in traditional UNIX *roff implementations.  Using Unicode
directly in the *roff source is probably more portable these days, since
groff seems to handle it acceptably and I suspect more *roff
implementations handle that than handle groff-specific escapes.  But I no
longer have access to a wide variety of traditional UNIX platforms to
check.

I know that eight-bit characters in *roff source caused serious problems
(segfaults, etc.) on very old *roff implementations on proprietary UNIXes
(Solaris 2.4, that sort of thing), which is why I've always avoided using
that approach with the output of pod2man without a special flag (-u).  But
I'm not sure it makes sense to still be that cautious, and the default
output of pod2man is awful (replacing all non-ASCII characters with X,
which just isn't acceptable any more).

Various people have asked for a groff macro output mode, and I think that
would be a fine idea, except that it requires some effort to build the
large table of Unicode code point to groff macro mappings.  I'm not sure
if it makes sense to have that be the default output mode or to have raw
Unicode be the default output mode (I want to get rid of the current
default).  From your portability investigation, it sounds like using groff
macros as the default output mode might work, which is valuable
information!

Needless to say, if anyone wanted to put together the mapping table to
enable that, I would be very interested.  I'll add it to my personal to-do
list, but that's quite long and the time I have available to work on free
software at the moment is sadly limited.
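
As a rough illustration of what such a mapping table might look like, here
is a minimal sketch.  It is written in Python purely for illustration and is
not podlators code; the names GROFF_ESCAPES and to_groff are hypothetical.
The table is a tiny subset using glyph names documented in groff_char(7),
and anything not in the table falls back to groff's \[uXXXX] notation, which
is itself groff-specific.  It assumes the input has already had its literal
backslashes escaped.

```python
# Illustrative subset only; a real table would be far larger.
# Glyph names are those documented in groff_char(7).
GROFF_ESCAPES = {
    "\u2018": r"\(oq",  # left single quotation mark
    "\u2019": r"\(cq",  # right single quotation mark
    "\u201c": r"\(lq",  # left double quotation mark
    "\u201d": r"\(rq",  # right double quotation mark
    "\u2013": r"\(en",  # en dash
    "\u2014": r"\(em",  # em dash
}


def to_groff(text):
    """Replace non-ASCII characters with groff escapes (hypothetical helper)."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII passes through untouched.
            out.append(ch)
        elif ch in GROFF_ESCAPES:
            # Known character: use the named glyph escape.
            out.append(GROFF_ESCAPES[ch])
        else:
            # Unknown character: fall back to groff's \[uXXXX] form.
            out.append("\\[u%04X]" % ord(ch))
    return "".join(out)
```

The fallback branch is what would replace the current behavior of emitting X
for every non-ASCII character; whether that is acceptable on non-groff
implementations is exactly the open portability question.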

> But coming back to the source code, yes, I pretty much agree that roff
> can be very noisy and non-readable, to the point I've actually gotten
> bothered enough to check for possible alternatives this last month. The
> problem is finding a format that is clear, expressive enough, supported
> by po4a, does not require huge Build-Depends and produces portable and
> nicely formatted man pages. The obvious candidate is perl's POD, because
> we are already using that for the perl modules and require perl to
> build.

> But I've found some quirks and issues that while not unsurmountable,
> might need to be looked at first and perhaps fixed or workarounds found
> to avoid "regressions", and I'm not sure which ones Russ would be happy
> to get bug reports for? :)

I'm definitely happy to get bug reports!  I do try to slowly work through
issues like this (for instance, I've now added separate flags to control
the left and right quote marks, from a bug report you filed quite some
time ago).  Obviously, patches make things even faster, and I'm slowly
trying to modernize and improve the coding style of the podlators code,
although it's a rather long process.

> I'm attaching a PoC conversion (can be tested with «pod2man
> deb-symbols.pod|man -l -», and is available also from [G]) and here's a
> list of potential differences/issues:

>   - References are in italic not bold.

I can change this (a bug report to remind me to do so is very welcome).
For the record, italics actually used to be the correct convention
somewhere (I know I didn't make that up), probably Solaris since I took a
lot of the conventions from there, but I see that man-pages(7) now
recommends bold.  This is one of those things that was never standardized,
but at this point I think the Linux man-pages Project is sufficiently
widespread and authoritative that, as long as it's not in complete
disagreement with BSD, I'm happy to go with their conventions, particularly
over old Solaris conventions, since Solaris is now mostly dead.

>   - Does not map ‘’, “”, and other UTF-8 quotes to roff escape sequences
>     (or have to use non-portable --utf8 option).

See above for a rather extended discussion of that.

>   - Needs raw roff for some formatting, as POD is not expressive enough
>     (this will have to do with «=begin man» as pod2man cannot change
>     the POD syntax anyway).

Yes.  POD is sadly a somewhat limited syntax, and while there was a Perl 6
take on POD that was trying to expand it, I don't think it ever caught on.
These days, everyone seems to have switched to Markdown or reStructuredText,
which certainly have their merits but which don't seem to be good fits for
man page generation.

So, for things like tables, you're probably going to need to continue to
escape to raw *roff with =begin man.

>   - Many minus signs are output as hyphens (for example for field names).

This is a nasty problem, since POD has no explicit markup for this and one
has to use heuristics.

Improvements in the heuristics are certainly welcome.  This is the current
code:

    # By the time we reach this point, all hyphens will be escaped by adding a
    # backslash.  We want to undo that escaping if they're part of regular
    # words and there's only a single dash, since that's a real hyphen that
    # *roff gets to consider a possible break point.  Make sure that a dash
    # after the first character of a word stays non-breaking, however.
    #
    # Note that this is not user-controllable; we pretty much have to do this
    # transformation or *roff will mangle the output in unacceptable ways.
    s{
        ( (?:\G|^|\s) [\(\"]* [a-zA-Z] ) ( \\- )?
        ( (?: [a-zA-Z\']+ \\-)+ )
        ( [a-zA-Z\']+ ) (?= [\)\".?!,;:]* (?:\s|\Z|\\\ ) )
        \b
    } {
        my ($prefix, $hyphen, $main, $suffix) = ($1, $2, $3, $4);
        $hyphen ||= '';
        $main =~ s/\\-/-/g;
        $prefix . $hyphen . $main . $suffix;
    }egx;

As you can see, it's a bunch of messy and rather fragile regexes.  But
there is a test system, so I'm happy to tweak these and add more tests if
you have specific use cases that you encounter.

The trick is going to be distinguishing between hyphenated English words
(which should use the unmarked - character in *roff source) and field
names where you want an explicit \- minus sign.  Although I could see an
argument for just supporting disabling this heuristic if one doesn't care
about good line wrapping.
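
To make the behavior of that heuristic easier to experiment with, here is an
approximate Python port of the substitution above.  It is a sketch, not the
podlators implementation: Python has no \G anchor, so that alternative is
dropped, and Python's \Z differs slightly from Perl's around a trailing
newline.  The function name escape_hyphens and the unescape_words parameter
(a stand-in for a possible "disable the heuristic" switch) are hypothetical.

```python
import re

# Approximate port of the Perl pattern; \G is omitted (unsupported in Python).
_HYPHENATED_WORD = re.compile(
    r"""
    ( (?:^|\s) [("]* [a-zA-Z] ) ( \\- )?
    ( (?: [a-zA-Z']+ \\- )+ )
    ( [a-zA-Z']+ ) (?= [)".?!,;:]* (?:\s|\Z|\\\ ) )
    \b
    """,
    re.VERBOSE,
)


def _unescape(match):
    prefix, hyphen, main, suffix = match.groups()
    # A dash right after the first letter stays escaped (non-breaking).
    hyphen = hyphen or ""
    # Dashes inside the word become real hyphens that *roff may break at.
    return prefix + hyphen + main.replace("\\-", "-") + suffix


def escape_hyphens(text, unescape_words=True):
    """Escape every hyphen, then optionally undo it inside ordinary words."""
    escaped = text.replace("-", "\\-")
    if not unescape_words:
        return escaped
    return _HYPHENATED_WORD.sub(_unescape, escaped)
```

With this sketch, "a well-known option" comes back unchanged and
"--no-verify" stays fully escaped, but "the Build-Depends field" also comes
back with a plain hyphen, which is the field-name case described above.
Passing unescape_words=False keeps every dash as \- at the cost of line
wrapping.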

>   - Default for pod2man is no justified text.

This is a (very strong) personal preference, since I think most man pages
are read on terminals with fixed-width fonts, and I think justified text
looks awful in a fixed-width font.  But I'd be happy to add a non-default
flag that suppresses the turning off of justification.

>   - The license blurb is only present as a comment on the source.

Yeah, I've given up on this and just put the license in a section of the
output of the man page, but I think it would be lovely to put it in a
comment.  This probably requires some sort of =for license block (and I'm
not sure what Pod::Text should do with it -- just suppress it entirely, I
guess).  I'm happy to add support for this (obviously, patches even more
welcome).
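
For what the man-page side of that could look like, here is a minimal Python
sketch, again illustrative only and not podlators code.  It assumes the
license text has already been extracted from some hypothetical =for license
block, and simply turns it into *roff comment lines using the standard .\"
comment request.  The function name license_as_roff_comment is made up for
this example.

```python
def license_as_roff_comment(license_text):
    """Render license text as *roff comment lines (hypothetical helper)."""
    lines = []
    for line in license_text.splitlines():
        # Blank lines become a bare comment marker so the block stays
        # visually contiguous in the generated source.
        lines.append('.\\" ' + line if line.strip() else '.\\"')
    return "\n".join(lines) + "\n"
```

The result would be emitted near the top of the generated page, before .TH,
so the license travels with the *roff source without appearing in the
rendered output.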

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>

