
Re: Possible?! A Debian public repository for all complex code lines with examples and scripts?



On Fri, Mar 26, 2021 at 10:46:23AM -0500, David Wright wrote:
> On Fri 26 Mar 2021 at 19:11:24 (+0530), Susmita/Rajib wrote:
> > It is clearly noticed that wide applications of tricks with wildcards,
> > regex and redirections aren't simply available in the man pages.

Nor should they be.  The man page should document how the program
functions, but should not tell you *every* possible way you can use
the program.

You seem to be focusing on the shell, so take a look at the bash man
page.  The man page for bash 5.1 as distributed in bullseye is almost
6400 lines long (using standard 80-character lines).

If you wanted to include a tutorial, or a programmer's guide book,
inside of this man page, it would be even larger.  And it's already
a tremendously large document.

People have written whole books on shell programming.  (Whether those
books are any good is a separate question.)  It's a huge topic.  It's not
something you can just tack on to the end of a man page.

> Correct. The man pages document facts that are specific to particular
> commands. Wildcards, regex and redirections are features of the shell
> that invokes them, and so are documented there.

This is only partly true.  "Wildcards" (shell matching patterns,
traditionally known as "globs") are indeed implemented at the shell
level, and are documented in the shell's manual.  However, these globs
were so well received that they were also implemented outside of the
shell.  There are two C library functions -- fnmatch(3) and glob(3) --
which implement pattern matching against a string and filename
expansion against the filesystem, respectively.

The libc implementation is a little bit different from bash's
implementation, which in turn is a little bit different from dash's
implementation.  But the most basic features are the same.  Many programs
use fnmatch(3) or glob(3) or both, in order to maintain some level
of compatibility with how the shell does pattern matching or pathname
expansions.
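
To make the fnmatch/glob split concrete, here is a rough sketch in
Python.  Python's fnmatch and glob modules are its own implementation
of shell-style patterns, not a wrapper around the libc functions, so
treat this as an analogy only.  The helper names (match_names, expand,
demo) and the file names are made up for illustration.

```python
import fnmatch
import glob
import os
import tempfile


def match_names(names, pattern):
    """String-only matching, in the spirit of fnmatch(3): no filesystem access."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


def expand(directory, pattern):
    """Filename expansion, in the spirit of glob(3): asks the filesystem."""
    # glob.escape() protects any pattern characters in the directory name.
    full_pattern = os.path.join(glob.escape(directory), pattern)
    return sorted(os.path.basename(p) for p in glob.glob(full_pattern))


def demo():
    """Create a few hypothetical files and run both helpers on '*.txt'."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a.txt", "b.txt", "notes.md"):
            open(os.path.join(d, name), "w").close()
        matched = match_names(["a.txt", "x.txt", "notes.md"], "*.txt")
        expanded = expand(d, "*.txt")
    return matched, expanded


if __name__ == "__main__":
    print(demo())
```

match_names() only ever looks at the strings you hand it, so it will
happily "match" x.txt even though no such file exists.  expand() only
returns names that are really present in the directory.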

Regular expressions have a completely different lineage.  They were
developed as part of computer science theory back in the 1960s, but
the ones we know and love were originally implemented in various Unix tools
such as ed(1), grep(1) and awk(1).  Back in the 1970s, each of these
tools had its own separate regular expression engine, so they all had
slightly different feature sets and syntax.

Around the 1990s, people decided it would be a lot more sensible to
share and standardize the regex engine across the various tools, so
that for example grep(1) and sed(1) would both support the same
expressions.  But some of the tools were a little too different from
each other for there to be just one regex engine.  Eventually a
compromise was reached, and Unix (POSIX) standardized on two regex
languages: BRE (Basic Regular Expressions) and ERE (Extended Regular
Expressions).

sed(1), grep(1), ed(1) and some other programs use BRE.  Or at least
they're supposed to.

awk(1), egrep(1) a.k.a. grep -E, and some other programs use ERE.

The engine that supports these two types of regular expression is
implemented in the C library, and is documented in regex(7) and regex(3).

Bash uses ERE, but only in one place: the =~ operator in the [[ command.
Bash uses the C library's implementation for this, rather than trying
to write its own engine.  Pretty much everything else that bash does
uses globs.
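
Since globs and regular expressions look alike but are not the same
language, here is a small sketch of the difference, in Python.  The
caveat is that Python's re module is its own flavor, not POSIX ERE,
but the two constructs used here ("*" and ".") mean the same thing in
ERE.  The function names are invented for the example.

```python
import fnmatch
import re


def glob_matches(name, pattern):
    """Treat pattern as a glob: '*' matches any run of characters."""
    return fnmatch.fnmatchcase(name, pattern)


def regex_matches(name, pattern):
    """Treat pattern as a regex that must match the whole string:
    '*' means 'zero or more of the preceding item'."""
    return re.fullmatch(pattern, name) is not None


if __name__ == "__main__":
    print(glob_matches("foobar", "foo*"))    # glob reading of foo*
    print(regex_matches("foobar", "foo*"))   # regex reading of foo*
    print(regex_matches("foobar", "foo.*"))  # the regex spelling of the glob
```

As a glob, foo* matches "foobar".  As a regex, foo* means "fo followed
by any number of o", so it does not cover the whole of "foobar"; the
regex equivalent of the glob is foo.* instead.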

The GNU implementations of sed and grep, which are supposed to use
Basic Regular Expressions, actually use their own special regex engine
with their own special extensions.  The effect of this is that people
who only learned Linux, not Unix, often write scripts that use the GNU
extensions, and therefore do not work on any other Unix-type systems.
You'll want to watch out for that.

There are several other regular expression engines, which go beyond
the two flavors standardized by POSIX.  The most common of these is
undoubtedly perl's engine.  It implements a great number of extensions
to the regular expression language, and has been around for decades.
A mostly-compatible clone of it called PCRE (Perl-Compatible Regular
Expressions) was spun off and is implemented as a C library.  Some
programs use it.
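
For a taste of what those extensions look like, here is a short
sketch using Python's re module.  Python's engine is neither perl's
nor PCRE, but it borrowed several perl extensions, including the two
shown here: non-greedy quantifiers and lookahead.  Neither is part of
POSIX BRE or ERE.  The function names and sample strings are only
illustrative.

```python
import re


def first_tag_greedy(text):
    """Plain '+' is greedy: it grabs as much as it can."""
    m = re.search(r"<.+>", text)
    return m.group(0) if m else None


def first_tag_lazy(text):
    """'+?' is a non-greedy quantifier (a perl extension): it stops
    at the first place the rest of the pattern can match."""
    m = re.search(r"<.+?>", text)
    return m.group(0) if m else None


def word_before_colon(text):
    """'(?=:)' is a lookahead (a perl extension): require a colon
    next, without including it in the match."""
    m = re.search(r"\w+(?=:)", text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_tag_greedy("<b>bold</b>"))
    print(first_tag_lazy("<b>bold</b>"))
    print(word_before_colon("key: value"))
```

A script that relies on either of these will not work with a tool
that only speaks BRE or ERE.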

Tcl has its own extended regex language, which it calls ARE (Advanced
Regular Expressions).  It's not as popular as perl's, but it does have
some of the same features.

I'm sure there are a bunch of other flavors floating around out there
as well.

