At 2023-10-14T20:51:27-0600, Antonio Russo wrote:
> I discovered a new pet peeve today: if you search for a command in a
> manual page, say -e in man 1 zgrep, it's a crapshot whether just
> searching for '-e' will find the command or not.  The reason is that
> "-" may been accidentally encoded as ‐ instead of -.
You can blame me for this.
https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1.23.0#n206
...me, and man page authors who don't think about whether they intend
a hyphen or a minus sign when they strike the '-' key...
Quick background: in the context of Unix usage as documented by
nroff/troff, the dash used at the shell prompt, in text editors, and in
programming language source code is a "minus sign".  troff has also had
an em dash special character since the mid-1970s; groff adds an en
dash, and furthermore supports user definition of characters
providing access to any other sort of dash that comes down the Unicode
pike.  (Not that doing so is a good idea in a man page; see below
regarding a "restricted dialect" of man(7).)
> Now, depending on your email client and settings, the above will
> appear to be the ravings of an unhinged lunatic who wrote the same
> thing twice, or an unhinged lunatic who slammed their fists onto the
> keyboard.
This issue does indeed have a history of provoking unhinged lunacy.
Before we proceed, you might wish to be aware of
<https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041731> and its
proposed remedy.
> The reason is that man(1) convert bare dashes (0x2D) to hyphens
> (U+2010).  These are not the same symbol: searching for one does not
> find the other without some kind of normalization, pasting commands
> with one vs. the other does different things.  New users who do not
> understand this will be discouraged trying to read manual pages.
> Chances are, they will fill forums with mundane questions that could
> and should have been addressed by a simple search of a manual page.
I run into this problem, too, since I dogfood my own changes.  When
irritated by this, I try the search again, replacing '-' with '.', which
has yet to fail me (and produces false positives surprisingly rarely).
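The trick works because the pager's search is a regular expression, in which '.' matches any single character, U+002D and U+2010 alike.  Here is a minimal sketch in Python; the two sample lines are made up to stand in for rendered man page text and are not taken from zgrep(1).

```python
import re

# Hypothetical rendered lines: the first carries U+2010 (hyphen), the second
# U+002D (hyphen-minus), as a pager might display them on a UTF-8 terminal.
rendered = [
    "       \u2010e pattern",
    "       -e pattern",
]

literal = re.compile(r"-e")  # what one would naturally type
relaxed = re.compile(r".e")  # the same search with '-' replaced by '.'

literal_hits = [line for line in rendered if literal.search(line)]
relaxed_hits = [line for line in rendered if relaxed.search(line)]

print(len(literal_hits), len(relaxed_hits))
```

The relaxed pattern also matches unrelated text such as "te" in "pattern", which is the source of the occasional false positive.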
For example, I've recently been playing with the mg(1) editor, and
observed extremely poor discipline in this area.  So I forked it on
GitHub and have been preparing a bunch of revisions.  I wrote a sed
script to fix its numerous hyphen/dash problems.[1]
> I recently fixed a ton of these in another upstream package with this
> vim "one-liner":
> 
> :%s/--\([a-z]\+\)\(-[a-z]\+\)*/\=substitute(submatch(0), '-', '\\-', 'g')/g
My Vimscript is not very sophisticated, but it looks like you're
replacing only hyphens that appear in long option names here.  That's
good, as you're unlikely to clobber any hyphens that should _not_ become
minus signs.
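For readers who don't speak Vimscript, here is my reading of that substitution as a Python sketch.  The function name and the sample sentence are mine, for illustration only; I assume the intent is to escape every '-' inside a match and nothing outside one.

```python
import re

# Same shape as the Vim pattern: "--", a lowercase word, then any number of
# "-word" continuations.
LONG_OPTION = re.compile(r"--[a-z]+(?:-[a-z]+)*")


def escape_long_options(line: str) -> str:
    """Replace each '-' inside a long option name with the roff escape '\\-'."""
    return LONG_OPTION.sub(lambda m: m.group(0).replace("-", r"\-"), line)


sample = "Use --no-create for long-term effects."
print(escape_long_options(sample))
```

The hyphen in "long-term" survives because it is not part of a match, which is exactly the discernment wanted here.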
Such discernment is important.  Many people who want to "solve" this
issue forget (or ignore) that not every '-' is a minus sign.  Some are
actual hyphens, as in "long-term effects" and "word-aligned struct
members".  Trying to infer a distinction from white space adjacency also
won't work.  Consider the phrases "word- or byte-sized caching" and
"object-based vs. -oriented programming".  While sophistication with
compound hyphenated affixes is seldom seen in man pages, we most often
find it where a man page author has taken considerable care with their
technical writing.  Such pages are less likely than most to require
revision with blunt instruments like regular expression-based global
search and replace operations.
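To make the white space point concrete, here is a sketch of a deliberately naive heuristic in Python.  The rule (a '-' preceded by white space and followed by a letter must be an option) is a hypothetical one that I made up to show the failure; nobody in this thread proposed it.

```python
import re

# Hypothetical rule: '-' at the start of a word, followed by a letter, is
# assumed to introduce a command-line option.
NAIVE_OPTION = re.compile(r"(?<!\S)-(?=[A-Za-z])")


def naive_fix(text: str) -> str:
    """Escape every '-' that the naive rule takes for an option dash."""
    return NAIVE_OPTION.sub(lambda m: "\\-", text)


good = naive_fix("zgrep -e pattern file")
bad = naive_fix("object-based vs. -oriented programming")
print(good)
print(bad)
```

The first result is what we want; the second turns a legitimate suspended hyphen into a minus sign.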
> However, this requires manual review
Surprisingly often, the composition of high-quality technical
documentation requires the engagement of a human brain.
> and does not fix the '-e' example from zgrep.
Mapping all hyphens and minus signs to a single character, as people
whose blood pressure spikes over this issue tend to promote as a first
resort, is an ineluctably information-discarding operation.  In my
opinion, man page source documents are not the correct place to discard
that information.
(I acknowledge that you didn't propose such a crude remedy; I write to
anticipate the inevitable follow-ups from people who will.)
Doing so at rendering time is much more defensible, and happens anyway
on devices that do not distinguish these characters in the first place.
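As a rough illustration of what folding at rendering time amounts to, here is a Python sketch that maps a couple of dash-like code points to U+002D in already-formatted text.  The set of code points and the function name are my assumptions for the example; this is not how groff or man(1) implement anything.

```python
# Code points folded to the ASCII hyphen-minus; the choice of U+2010 (hyphen)
# and U+2212 (minus sign) is an assumption made for this example.
FOLD = str.maketrans({"\u2010": "-", "\u2212": "-"})


def fold_dashes(rendered: str) -> str:
    """Return rendered text with the selected dash-like characters folded."""
    return rendered.translate(FOLD)


folded = fold_dashes("\u2010e, \u2212e, -e")
print(folded)
```

Once folded, the three inputs are indistinguishable, which is why this belongs at the output end and not in the source document.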
> There are also a whole host of this kind of problem, e.g., dashes in
> URLs that get naievely pasted into man pages (another live example I
> just addressed).
Yes, people commonly type URLs and email addresses into man page sources
as they would into an MUA or browser navigation bar.  Since U+2010 is
difficult to encode in such things, the man(7) package could help by
performing an automatic character translation in this area.  However,
(1) no one's actually asked for this and (2) it would address only a
tiny part of the problem.  The means of "help" I have in mind is
employment of the groff man(7) extension macros `UR`/`UE` and `MT`/`ME`,
which remain under-used even after 14 years in release.  I might like to
think that offering such a provision would encourage their adoption, but
I can't honestly adopt that position.  I don't see another good way to
perform the transformation, because these are "semantic" macros
imputing enough meaning to the material they bracket that we know we can
safely do so.
> I come here with several questions:
> 
>  - Am I off-base thinking this is a problem?
No--it's a problem, but I might not locate it in the same place(s) you
do.
>  - Should we really be using troff to typeset anything in this year
>    2023?
I'm conflicted out on this question.[2]  Keep in mind that the
distinction between hyphens and minus signs is actually _important_ to
people doing _typesetting_, as opposed to reading man pages on
terminals, perhaps in haste and under deadline stress in a workplace.
>    (In particular, if we can make the source text more human-readable,
>    we might be able to leverage LLMs on this wealth of information in
>    the future and automate support.  Are LLMs "fluent" in troff? I
>    have not investigated at all.)
I am not an expert in LLMs, but man(7) is a macro package for the
roff(7) language, and roff(7) is Turing-complete.  Thus, in principle,
to know even what is being rendered as text, one is faced with a
challenging decidability problem.  (This is why "deroff" and "unroff"
tools confess their limitations, and seem always to fall out of use.
Also "groff -a" is very nearly what you want anyway.)
On the bright side, mandoc(1) maintainer Ingo Schwarze and I have put
considerable effort into defining and promoting a restricted dialect of
man(7) that is much more amenable to automated processing of all kinds.
That we do so for different reasons (he maintains a bespoke *roff
interpreter and wants to implement as few features as possible; he also
strongly advocates use of the mdoc(7) macro package over man(7); I on
the other hand want the language of man page composition to be as small
as possible to ease its acquisition and mastery while getting a few
nines of the task done) fortunately doesn't frustrate our cooperation.
The groff_man_style(7) man page in the version of groff to which you
recently upgraded is the fruit of much effort in this area.
>  - Are there any alternatives that actually produce nice looking man
>    pages?
Many tools produce acceptable looking man pages when _rendered_
(depending on your standard of good typography).  The production of
man(7) source that is idiomatic enough to be maintained in that form, or
even comprehended well enough to drive debugging and development of the
conversion tool, is another question.  Perl's pod2man/podlators is
probably the best of breed here, but it still does not match the
cleanliness of a document drafted by a human author with a good command
of the macro language.
At the other end of the spectrum is docbook-to-man, which seems to be
reviled not only by every practitioner of roff/man that encounters it,
but which also seems to poison everyone who attempts to maintain it.
>    (My experience with pandoc is that the source is still awkward, I
>    literally just found another example of this bug in my own man
>    page, and it looks pretty ugly in man. But maybe I just didn't find
>    good examples/documentation.)
pandoc has recently seen some improvements in its man(7) generation.
I've worked fruitfully with its upstream in this area.  Feel free to Cc
me with respect to any further revisions you'd like to pursue there.
>  - Should we try to come up with some lintian rules to flag this
>    behavior?  (This one: /--\([a-z]\+\)\(-[a-z]\+\)*/ finds long
>    GNU-style commands, I'd have to think for at least a little bit
>    about finding short ones.  This would ultimately be fragile. For
>    example, the above doesn't find partially broken tokens; i.e., only
>    one unescaped dash.)
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051357
> <li> Automated tooling around this, more generally, seems fragile.
>    HTML might have been a nice compromise, but writing that appears to
>    be out of vogue these days, <sarcasm intensity="medium">despite
>    being a pretty OK thing to read and write by hand</sarcasm>.</li>
>    But seriously, I would love to be writing HTML instead of troff for
>    manual pages.
If you want man pages to look the way they traditionally have since Unix
Version 7 (1979), this is a bigger challenge to achieve with HTML than
you might suppose.  If you want only a rough approximation thereof, my
guess is that there are many straightforward and valid approaches one
could take.  The challenge would then be in persuading others to adopt
your one obviously optimal solution.  https://xkcd.com/927/
I refer those who want to know why this exasperating issue arose in the
first place to section "History" of groff_char(7).  The arrival of
the Unicode character set in terminal emulators echoed the delivery of
the Graphic Systems C/A/T phototypesetter to the Bell Labs Computing
Science Research Center in about 1972; the problems that came along were
similar.
Regards,
Branden
[1] Here's part of one commit message.  I haven't pushed any commits to
    my mg fork yet.  Long story short, that man page has a lot of
    problems even apart from this one, both from a technical writing
    perspective and from that of mdoc(7) competency, which I find
    noteworthy in light of the stridency of *BSD community partisanship
    on the question of man(7) vs. mdoc(7).  But, having met Charles
    Hannum (a NetBSD founder) in person at the Atlanta Linux Showcase
    nearly 25 years ago, I can't say I wasn't prepared.
    Also, I did not bother to tune this sed script for efficiency,
    cleverness, or to show off my command of the language.[3]  I did not
    undertake it for its own sake.  I built it up by whacking at errors
    until none remained.
    I share this to illustrate the impotence of a crude approach to
    solving this problem.  For example, the character sequence
    "read-only" is sometimes used in prose as an adjective and sometimes
    as an Emacs command literal.  The former should keep a hyphen; the
    latter should get a minus sign.  Deciding which is which demands a
    higher climb
    up the Chomsky language hierarchy than a text editor generally
    offers.  The solution exists between the keyboard and chair, but I
    guess that's where the resentment of solving it at all arises too.
--begin snip--
    I produced the change with the following sed script.  This process
    exposed many failures to use the mdoc `Ic` macro when it was
    warranted; had it been employed with discipline, this script would
    be shorter.
    /^\.Nd/b
    /^\.Bl/b
    /^\.Bd/b
    # skip exceptions
    /opened read-only/b
    /window-specific/b
    /buffer-specific/b
    /working-directory/b
    /non-incremental/b
    /are read-only/b
    /are self-explanatory/b
    /extended-ascii/b
    /two-line/b
    /Set case-fold/b
    /mini-buffer/b
    /Toggle the read-only/b
    /global read-only/b
    /terminal-specific/b
    /8-bit/b
    /Multi-byte/b
    s/-/\\-/g
    # put these back
    s/Control\\/Control/g
    s/Meta\\/Meta/g
    s/an auto\\-execute/an auto-execute/
    s/Toggle auto\\-fill/Toggle auto-fill/
    s/mail\\-mode/mail-mode/
    s/non\\-whitespace/non-whitespace/
    s/Self\\-insert/Self-insert/
    s/KNF\\-compliant/KNF-compliant/
    s/keyboard\\-invoked/keyboard-invoked/
--end snip--
[2] https://www.gnu.org/software/groff/manual/groff-man-pages.pdf
[3] On a positive and cool note, the following remains unsurpassed, to
    my knowledge, as the coolest, cleverest thing ever done in sed(1).
    https://sed.sourceforge.io/local/scripts/dc.sed.html
    (Now that I've said that, someone can tell me that they've
    implemented an RV32E core in sed...)