Re: detect shell script language
Some general comments about how I understand this stuff to work, a bit
long, perhaps, but to be sure we're all on the same starting line.
Also, due to being out of the office, I missed some of the earlier mails
and so will most likely repeat some things that have already been said.
My apologies in advance for the possibly excess verbiage.
1. the she/bang (#!) first line. This is a 'magic' value, and is used
by the system's 'exec' family of system calls to determine how to
execute the file. In the 'old' days (I'm familiar with AT&T version 7
UNIX, for this), when a user typed in a command, the interactive shell
would immediately pass it off to 'exec' to execute. This is fine for
binary executables, but would fail on a script (or any text type) file.
So, on return from exec with an error status, the shell would fork a
copy of itself to try and run the script.
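For a detector, the useful part of that first line is the interpreter name. A rough sketch in plain Bourne shell of pulling it out; the function name shebang_interp is made up for illustration, and the '/usr/bin/env' handling only covers the simple 'env NAME' form (no 'env -S', no 'VAR=value' prefixes):

```shell
# shebang_interp: read a script on stdin and print the basename of the
# interpreter named on a "#!" first line. Prints nothing and returns 1
# when the first line is not a "#!" line.
# (Plain sh has no local variables, so "first" and "interp" are globals.)
shebang_interp() {
    IFS= read -r first || [ -n "$first" ] || return 1
    case $first in
        '#!'*) ;;
        *) return 1 ;;
    esac
    # Drop the two magic characters and split the rest into words
    # (interpreter path plus optional argument), with globbing off.
    set -f
    set -- ${first#??}
    set +f
    [ $# -gt 0 ] || return 1
    interp=${1##*/}
    # "#!/usr/bin/env perl": the real interpreter is the next word.
    if [ "$interp" = env ] && [ $# -ge 2 ]; then
        interp=${2##*/}
    fi
    printf '%s\n' "$interp"
}
```

For example, `printf '#!/usr/bin/env perl\n' | shebang_interp` prints `perl`.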
1a. As a result of the above, it was hard to tell whether the script
was a Bourne shell (sh) or C shell (csh), so the convention was
introduced of using the Bourne shell no-op command (:) as the first
line in a Bourne shell script. This convention can still be found in
Oracle's Bourne scripts, even as recently as Oracle 10.2 for Solaris (a
Linux Oracle 10.2 install has mostly she/bang format, but at least one
came up with a colon character on the first line).
2. Similar tactics are used by Perl and Tcl/Tk (tclsh/wish) to cause
execution of the correct interpreter. This is based on the fact that,
at least for the Bourne shell and its derivatives, execution and script
validation are essentially concurrent. So, the shell never even gets to
the line following the 'exec' line. The exec is a perfectly legal
script command, which causes the desired scripting language engine
(perl, tclsh, wish) to get run. Since these languages allow script
commands to cover multiple lines, they both see an 'if' test that fails,
so they never execute the 'exec', and proceed to interpret the rest of
the script.
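For reference, one widely documented form of this trick is the header given in the perlrun manual page, reproduced in the comments below (it may not be exactly the form I had in mind above). A detector could scan the first few lines for an 'exec' of a known interpreter. This is only a sketch: the function name reexec_target, the 10 line default, and the list of interpreter names are all my assumptions.

```shell
# Header along the lines of the one in perlrun:
#   #!/bin/sh
#   eval 'exec perl -x -wS $0 ${1+"$@"}'
#       if 0;
#
# reexec_target [LIMIT]: read a script on stdin and print the name of the
# interpreter (perl, tclsh or wish) that one of the first LIMIT lines
# (default 10) hands control to via "exec". Prints nothing if none is found.
reexec_target() {
    awk -v limit="${1:-10}" '
        NR > limit      { exit }
        /^[ \t]*#/      { next }    # comment lines cannot exec anything
        {
            # Split on anything that cannot be part of a command name.
            n = split($0, w, "[^A-Za-z0-9_./-]+")
            for (i = 1; i < n; i++) {
                if (w[i] != "exec") continue
                name = w[i + 1]
                sub(/.*\//, "", name)          # drop any directory part
                if (name ~ /^(perl|tclsh|wish)[0-9.]*$/) {
                    print name
                    exit
                }
            }
        }'
}
```

Usage would be something like `reexec_target < script`, treating empty output as "no hand-off found".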
3. Modern Bourne derived shells are designed to be as compatible as
possible with the original 'sh', so there is no easy way to
differentiate between them. The same applies to any shell derived from
'csh' (tcsh, etc.). zsh, on the other hand, is a beast I know little
about, but based on the man page, it appears to be a Bourne compatible
shell. In any case, the highlighting for these should be the same
anyway, so no sweat over differentiating them is needed.
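In code, that amounts to collapsing interpreter names into two highlighting families. A minimal sketch; the function name shell_family and the exact list of names are my choice, and zsh lands in the Bourne family only on the strength of my reading of its man page:

```shell
# shell_family NAME: map an interpreter name to the syntax family that
# should be used for highlighting. Prints "sh", "csh" or "unknown"
# (and returns 1 for "unknown").
shell_family() {
    case $1 in
        sh|ash|dash|bash|ksh|ksh88|ksh93|pdksh|mksh|zsh)
            echo sh ;;
        csh|tcsh)
            echo csh ;;
        *)
            echo unknown
            return 1 ;;
    esac
}
```

So `shell_family bash` and `shell_family zsh` both print `sh`, while `shell_family tcsh` prints `csh`.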
Since all modern UNIX/Linux systems support the she/bang functionality,
I think you'll find your best option is to use it to begin with. But it
would be a good idea, I believe, to also look for that archaic ':' as
the first character of a file (the file command reports these as 'shell
archive or script for antique kernel text'). A suggestion in one of the
emails I did see, to use 'file' to help sort things out, is a good idea,
as the command is pretty good at sorting things out (of course, as you
have access to source for 'file', you may be able to use it to
incorporate the file command's heuristics directly in your code). But
this is not a panacea; 'file' can be confused. A file with this content:
    exec "/usr/local/bin/perl" $0
and with execute permission set, will run (legal shell code, but illegal
Perl, so there's an error from Perl about it). And 'file' just calls it
an 'ASCII text file'.
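Put together, the order of checks suggested here might look like the sketch below: '#!' first, then the old ':' convention, and only then 'file'. The name first_guess is invented, and 'file -b' (brief output) is an option of the commonly installed 'file' implementation which, as far as I know, POSIX does not require.

```shell
# first_guess FILE: classify FILE by its first line.
# Prints "shebang", "colon", "file: <description>" or "unknown".
first_guess() {
    f=$1
    line=
    # A last line without a trailing newline still ends up in "line".
    IFS= read -r line < "$f" || :
    case $line in
        '#!'*)
            echo shebang ;;
        :*)
            # Archaic Bourne convention: no-op ':' as the first character.
            echo colon ;;
        *)
            # Last resort: ask file(1), if it is installed at all.
            if command -v file >/dev/null 2>&1; then
                echo "file: $(file -b -- "$f")"
            else
                echo unknown
            fi ;;
    esac
}
```

A file starting with `#!/bin/csh -f` gives `shebang`, and one starting with `:` gives `colon`; anything else is only as good as what 'file' reports.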
And the above is no help for the cases mentioned in paragraph 2.
Looking for a line with 'exec' alone is not enough; you would need to
check to see if the text following it looks like a command to execute.
This is because Bourne style shells allow you to open/close/reopen files
and file descriptors using 'exec', for example:
    exec 3<message.file 4>errors.out
    echo error condition >&4
    while read input
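One way to tell the two uses of 'exec' apart is to look at the words after it: if every one of them is a redirection, no command is being run. A hedged sketch, with a made-up function name (exec_kind) and a deliberately simple notion of what a redirection looks like (optional digits followed by '<' or '>'):

```shell
# exec_kind LINE: classify one line of shell code.
# Prints "not-exec", "redirect-only" (exec used only to open or close
# file descriptors) or "command" (exec replaces the shell with a program).
exec_kind() {
    printf '%s\n' "$1" | awk '
    {
        if ($1 != "exec") { print "not-exec"; exit }
        skip = 0
        for (i = 2; i <= NF; i++) {
            if (skip) { skip = 0; continue }     # target of "3< file"
            if ($i ~ /^[0-9]*[<>]/) {
                # A bare operator such as "3<" or ">>" takes the
                # following word as its target.
                if ($i ~ /^[0-9]*[<>]+[|&]?$/) skip = 1
                continue
            }
            print "command"
            exit
        }
        print "redirect-only"
    }'
}
```

With this, `exec_kind 'exec 3<message.file 4>errors.out'` prints `redirect-only`, while `exec_kind 'exec "/usr/local/bin/perl" $0'` prints `command`.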
And, of course, there are the special cases where the script is for two
interpreters. I use this to first run a shell script environment to set
things up for Perl (ORACLE_HOME, LD_LIBRARY_PATH, etc) for different
systems (Linux, Solaris, Cygwin), and then do an 'exec $PERL' at a later
point (around line 90, IIRC). So, in this case, most of the script is
Perl code, but it starts out as shell code. And the 'exec' line uses a
variable, so it's not clear from just the line what is being exec'd.
But, now, enough is enough. I hope this is helpful to you in figuring
out what you need to do and to perhaps point out some of the pitfalls to
watch out for.

Lorenzo Bettini wrote:
> Maxim Vexler wrote:
>> (I'm thinking out loud here)
>
> go ahead :-)
>
>> How about identifying patterns specific to each shell, and then
>> implementing an algorithm that would produce a score for each shell
>> match. The one with the highest score will be the one used by
>> src-highlite. This perhaps should be a standalone utility/lib, a fact
>> that would allow it to be used in other implementations besides
>
> indeed I was thinking about something similar; the problem is that I
> should restrict it to shell scripts, since otherwise I should check
> against all the possible languages handled by source-highlight and that
> would be inefficient.
>
> I should know more about script languages though, which is not the case
> (shame on me! ;-). However, I was thinking also of letting the user
> provide his own regular expressions to detect a language, and that could
> be then enjoyed also by other users.
>
>> BTW, src-highlite is great. Thank you Lorenzo for adding another tool
>> to my already unbelievably huge free software tools arsenal.
>
> WOW! Thank you! :-D
>
> I'll let you know when I release this new version of source-highlight!
>
> And by the way, if you use some language which is still not handled by
> source-highlight, and would like to add it, please let me know and we
> can work it out!
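
On the scoring idea in the quoted exchange: a very small sketch of what it could look like for just the two shell families. The patterns and weights are my own guesses, chosen only to show the mechanism, and the function name guess_shell is invented. Note that 'then' is left out on purpose, since both sh and csh use it.

```shell
# guess_shell: read a script on stdin, score it against a few patterns
# typical of Bourne-style and C-shell-style code, and print the family
# with the higher score: "sh", "csh", or "unknown" on a tie or no match.
guess_shell() {
    awk '
    BEGIN { sh = 0; csh = 0 }
    /^[ \t]*(fi|done|esac)[ \t;]*$/                      { sh += 2 }
    /^[ \t]*[A-Za-z_][A-Za-z0-9_]*=/                     { sh += 1 }
    /^[ \t]*(endif|endsw|end)[ \t]*$/                    { csh += 2 }
    /^[ \t]*(setenv|foreach)[ \t]/                       { csh += 2 }
    /^[ \t]*set[ \t]+[A-Za-z_][A-Za-z0-9_]*[ \t]*=/      { csh += 1 }
    END {
        if (sh > csh)      print "sh"
        else if (csh > sh) print "csh"
        else               print "unknown"
    }'
}
```

User-supplied regular expressions, as Lorenzo suggests, would slot in naturally as extra pattern and weight pairs.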