
Hyphens in man pages



Hello,

I discovered a new pet peeve today: if you search for a command in a manual page,
say -e in man 1 zgrep, it's a crapshoot whether just searching for '-e' will find
the command or not.  The reason is that "-" may have been accidentally encoded as ‐
instead of -.

Now, depending on your email client and settings, the above will appear to be the
ravings of an unhinged lunatic who wrote the same thing twice, or an unhinged
lunatic who slammed their fists onto the keyboard.

The reason is that man(1) converts bare dashes (0x2D) to hyphens (U+2010).  These
are not the same symbol: searching for one does not find the other without some
kind of normalization, and pasting commands with one vs. the other does different
things.  New users who do not understand this will be discouraged from reading
manual pages.  Chances are, they will fill forums with mundane questions that
could and should have been addressed by a simple search of a manual page.
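
To make the mismatch concrete, here is a minimal Python sketch. The `rendered`
string is a made-up stand-in for a line as the pager might display it, and
`normalize_dashes` is a hypothetical helper, not something man(1) provides.

```python
import unicodedata

HYPHEN_MINUS = "\u002d"  # the character the keyboard's "-" key produces
HYPHEN = "\u2010"        # the character an unescaped "-" may be rendered as

# Hypothetical line of rendered man page output (an assumption for illustration).
rendered = f"{HYPHEN}e pattern"

print(unicodedata.name(HYPHEN_MINUS))  # HYPHEN-MINUS
print(unicodedata.name(HYPHEN))        # HYPHEN
print("-e" in rendered)                # False: the search string uses U+002D


def normalize_dashes(text: str) -> str:
    """Map U+2010 back to U+002D so a plain '-' search can match."""
    return text.replace(HYPHEN, HYPHEN_MINUS)


print("-e" in normalize_dashes(rendered))  # True
```

An explicit replacement is used here because Unicode normalization forms such
as NFKC do not fold U+2010 into U+002D.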

I recently fixed a ton of these in another upstream package with this vim "one-liner":

:%s/--\([a-z]\+\)\(-[a-z]\+\)*/\=substitute(submatch(0), '-', '\\-', 'g')/g

However, this requires manual review and does not fix the '-e' example from zgrep.
There is also a whole host of problems of this kind, e.g., dashes in URLs that get
naively pasted into man pages (another live example I just addressed).
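
For anyone who would rather script this outside of vim, the same substitution
can be sketched in Python roughly as follows. The function name and the sample
input are made up for illustration. It mirrors the vim pattern, so it shares
its limits: it still needs manual review and does not touch short options
such as -e.

```python
import re

# Same shape as the vim pattern: "--" followed by lowercase words joined by "-".
LONG_OPTION = re.compile(r"--[a-z]+(?:-[a-z]+)*")


def escape_long_options(roff_source: str) -> str:
    """Rewrite every '-' inside a matched long option as the roff escape '\\-'."""
    return LONG_OPTION.sub(
        lambda m: m.group(0).replace("-", r"\-"),
        roff_source,
    )


sample = ".B --ignore-case\n"  # hypothetical man page source line
print(escape_long_options(sample), end="")
```

Options that are already fully escaped contain no "--" sequence, so they pass
through unchanged.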

I come here with several questions:

 - Am I off-base thinking this is a problem?
 - Should we really be using troff to typeset anything in this year 2023?
   (In particular, if we can make the source text more human-readable, we might
   be able to leverage LLMs on this wealth of information in the future and automate
   support.  Are LLMs "fluent" in troff? I have not investigated at all.)
 - Are there any alternatives that actually produce nice looking man pages?
   (My experience with pandoc is that the source is still awkward; I literally
   just found another example of this bug in my own man page, and it looks pretty
   ugly in man. But maybe I just didn't find good examples/documentation.)
 - Should we try to come up with some lintian rules to flag this behavior?
   (This one: /--\([a-z]\+\)\(-[a-z]\+\)*/ finds long GNU-style commands, I'd
   have to think for at least a little bit about finding short ones.  This would
   ultimately be fragile. For example, the above doesn't find partially broken
   tokens; i.e., only one unescaped dash.)
<li> Automated tooling around this, more generally, seems fragile.  HTML might have
   been a nice compromise, but writing that appears to be out of vogue these days,
   <sarcasm intensity="medium">despite being a pretty OK thing to read and write
   by hand</sarcasm>.</li> But seriously, I would love to be writing HTML instead
   of troff for manual pages.
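
As a starting point for the lint idea above, here is a rough Python sketch of
a checker that flags option-looking tokens containing an unescaped dash,
including short options and partially escaped long ones. This is not an
existing lintian check: the patterns and the function name are assumptions for
illustration, and the heuristic will both miss cases and raise false positives.

```python
import re

# An option-looking token: starts at whitespace/line start or right after a
# font escape such as \fB, begins with one or two (possibly escaped) dashes,
# then words joined by (possibly escaped) dashes.
OPTION_TOKEN = re.compile(
    r"(?:(?<!\S)|(?<=\\f[BIPR]))(?:\\?-){1,2}[A-Za-z0-9]+(?:\\?-[A-Za-z0-9]+)*"
)
# A dash that is not preceded by a backslash, i.e. not written as \-.
BARE_DASH = re.compile(r"(?<!\\)-")


def find_unescaped_options(roff_source: str) -> list[tuple[int, str]]:
    """Return (line number, token) pairs for tokens with a bare '-'."""
    findings = []
    for lineno, line in enumerate(roff_source.splitlines(), start=1):
        if line.startswith('.\\"'):  # skip roff comment lines
            continue
        for match in OPTION_TOKEN.finditer(line):
            if BARE_DASH.search(match.group(0)):
                findings.append((lineno, match.group(0)))
    return findings


# Hypothetical man page source, one case per line.
sample = "\n".join([
    r".B \-e",                 # fully escaped short option: fine
    r".B -e",                  # bare short option: flagged
    r"\fB\-\-ignore-case\fR",  # partially escaped long option: flagged
    r"a well-known option",    # ordinary hyphenated word: ignored
])

for lineno, token in find_unescaped_options(sample):
    print(f"line {lineno}: unescaped dash in {token}")
```

Matching whole tokens first and then looking for a bare dash inside them is
what lets this catch the partially broken case that the pattern in the list
above misses.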

Antonio


