
Re: detect shell script language



Some general comments on how I understand this stuff to work. They are a bit long, perhaps, but I want to be sure we're all on the same starting line. Also, because I was out of the office, I missed some of the earlier mails, so I will most likely repeat things that have already been said. My apologies in advance for the possibly excess verbiage.

1. The shebang (#!) first line. This is a 'magic' value, used by the system's 'exec' family of system calls to determine how to execute the file. In the 'old' days (I'm familiar with AT&T Version 7 UNIX for this), when a user typed in a command, the interactive shell would immediately pass it off to 'exec' to execute. This is fine for binary executables, but it fails on a script (or any other text file). So, on return from exec with an error status, the shell would fork a copy of itself to try to run the script.

1a. As a result of the above, it was hard to tell whether a script was written for the Bourne shell (sh) or the C shell (csh), so the convention was introduced of using the Bourne shell no-op command (:) as the first line of a Bourne shell script. This convention can still be found in Oracle's Bourne scripts, even as recently as Oracle 10.2 for Solaris (a Linux Oracle 10.2 install has mostly shebang format, but at least one script came up with a colon character on the first line).

2. Similar tactics are used by Perl and Tcl/Tk (tclsh/wish) scripts to cause execution of the correct interpreter: the file starts out as shell code whose first command is an 'exec' of the real interpreter. This relies on the fact that, at least for the Bourne shell and its derivatives, execution and script validation are essentially concurrent, so the shell never even gets to the line following the 'exec' line. The exec is a perfectly legal script command, which causes the desired scripting language engine (perl, tclsh, wish) to be run. Since these languages allow a statement or a comment to span multiple lines, the interpreter itself never executes the 'exec': Perl sees it guarded by an 'if' test that fails, and in the usual Tcl idiom it is the continuation of a comment. Either way, the interpreter proceeds to interpret the rest of the script.

3. Modern Bourne-derived shells are designed to be as compatible as possible with the original 'sh', so there is no easy way to differentiate between them. The same applies to any shell derived from 'csh' (tcsh, etc.). zsh, on the other hand, is a beast I know little about, but based on the man page, it appears to be a Bourne-compatible shell. In any case, the highlighting for these should be the same, so there is no need to sweat over differentiating them.
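To make points 1 and 1a concrete, here is a minimal sketch in POSIX sh of a first-line check. The function name first_line_kind is made up for illustration, and the handling of '/usr/bin/env' is my assumption about what a detector would want; it is not taken from any existing tool.

```shell
# Classify a script by its first line (first_line_kind is a hypothetical name).
# Prints the interpreter's base name for a "#!" line, "sh" for a leading ":",
# and fails (non-zero status, no output) otherwise.
first_line_kind() {
    first=
    # read may return non-zero on a last line without a newline, but it
    # still fills $first, so only give up when nothing was read at all.
    IFS= read -r first < "$1" || [ -n "$first" ] || return 1
    case $first in
        '#!'*)
            rest=${first#??}            # drop the "#!" itself
            # Let read split the rest into words, trimming leading blanks.
            read -r interp arg ignored <<EOF
$rest
EOF
            # "#!/usr/bin/env perl" names the real interpreter in word two.
            [ "${interp##*/}" = env ] && interp=$arg
            [ -n "$interp" ] || return 1
            printf '%s\n' "${interp##*/}"
            ;;
        :|': '*)
            # The archaic Bourne no-op marker from point 1a.
            printf 'sh\n'
            ;;
        *)
            return 1
            ;;
    esac
}
```

Since point 3 says the Bourne derivatives can share one highlighting, a caller could map sh, bash, ksh and so on to the same definition after this step.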

Enough background.

Since all modern UNIX/Linux systems support the shebang functionality, I think you'll find your best option is to use it to begin with. But it would be a good idea, I believe, to also look for that archaic ':' as the first character of a file (the file command reports these as 'shell archive or script for antique kernel text'). A suggestion in one of the emails I did see, to use 'file' to help sort things out, is a good one, as the command is pretty good at this (and, as you have access to the source for 'file', you may be able to incorporate its heuristics directly in your code). But it is not a panacea: 'file' can be confused. A file with this content:

  exec "/usr/local/bin/perl" $0

and with execute permission set, will run: it is legal shell code but illegal Perl, so Perl reports an error about it. And 'file' just calls it an 'ASCII text file'.
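The one-liner above runs because the shell executes each command as soon as it has read it, which is also what paragraph 2 relies on. A small sketch that shows this without needing Perl installed; the variable names demo and out are mine, and 'echo' stands in for the real interpreter:

```shell
# Write a two-line script: a valid 'exec' followed by a line that is not
# valid shell. mktemp picks the file name.
demo=$(mktemp)
cat > "$demo" <<'EOF'
exec echo "handed off"
this line is not valid shell ( and sh never reads it
EOF

# sh runs the exec as soon as it has read line 1, so the process is
# replaced by echo and line 2 is never parsed. stderr is captured too,
# to show that no syntax error is reported.
out=$(sh "$demo" 2>&1)
echo "$out"        # prints: handed off
rm -f "$demo"
```

For a detector this means the text after such an 'exec' line says nothing about whether the file is valid shell.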

And the above is no help for the cases mentioned in paragraph 2. Looking for a line with 'exec' alone is not enough; you would need to check whether the text following it looks like a command to execute. This is because Bourne-style shells also use 'exec' to open, close, and reopen files and file descriptors, for example:

  #!/bin/sh
  exec 3<message.file 4>errors.out
  ...
  echo error condition >&4

  while read input
  do
    : # process "$input" here
  done <&3
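One way to tell the two uses of 'exec' apart is to skip over anything that looks like a redirection and see whether a command word is left. A rough sketch in POSIX sh follows; exec_command is a made-up name, and the word splitting is deliberately naive (it ignores quoting, 'exec' options, and descriptor numbers above 9):

```shell
# Print the command an 'exec' line would run, or fail if the line only
# redirects descriptors or is not an 'exec' at all.
# exec_command is a hypothetical helper name.
exec_command() {
    # Split off the first word; read also trims surrounding blanks.
    read -r word rest <<EOF
$1
EOF
    [ "$word" = exec ] || return 1
    while [ -n "$rest" ]; do
        read -r word rest <<EOF
$rest
EOF
        case $word in
            '<'*|'>'*|[0-9]'<'*|[0-9]'>'*)
                # A redirection such as 3<file or >&4. If the word is a
                # bare operator like ">", its target is the next word,
                # so consume that one as well.
                case $word in
                    *'<'|*'>')
                        read -r word rest <<EOF
$rest
EOF
                        ;;
                esac
                ;;
            *)
                # First word that is not a redirection: the command.
                printf '%s\n' "$word"
                return 0
                ;;
        esac
    done
    return 1
}
```

With the lines from the example above, exec_command 'exec 3<message.file 4>errors.out' fails, while exec_command 'exec /usr/local/bin/perl script.pl' prints /usr/local/bin/perl.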

And, of course, there are the special cases where the script is for two interpreters. I use this to first run shell code that sets up the environment for Perl (ORACLE_HOME, LD_LIBRARY_PATH, etc.) on different systems (Linux, Solaris, Cygwin), and then do an 'exec $PERL' at a later point (around line 90, IIRC). So, in this case, most of the script is Perl code, but it starts out as shell code. And the 'exec' line uses a variable, so it's not clear from the line alone what is being exec'd.
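For a two-interpreter script like that, a detector can at least find where the hand-off happens, even if it cannot tell what is being run. A minimal sketch using awk; handoff_line is a hypothetical name, and treating everything before the reported line as shell code is my assumption about how a highlighter might use the result:

```shell
# Print "LINE: TARGET" for the first 'exec' whose target starts with a
# variable reference, such as: exec $PERL ...
# Prints nothing if there is no such line. handoff_line is a made-up name.
handoff_line() {
    awk '$1 == "exec" && $2 ~ /^"?[$]/ { print NR ": " $2; exit }' "$1"
}
```

An 'exec' that only redirects descriptors is not reported, because its second word does not start with '$'. What comes after the hand-off still has to be guessed some other way, since the variable's value is only known at run time.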

But now, enough is enough. I hope this helps you figure out what you need to do, and perhaps points out some of the pitfalls to watch out for.

Good luck,

Bob

Lorenzo Bettini wrote:
Maxim Vexler wrote:

(I'm thinking out loud here)


go ahead :-)

How about identifying patterns specific to each shell, and then
implementing an algorithm that would produce a score for each shell
match. The one with the highest score will be the one used by
src-highlite. This perhaps should be a standalone utility/lib, a fact
that would allow it to be used in other implementations besides
src-highlite.


Indeed, I was thinking about something similar; the problem is that I would have to restrict it to shell scripts, since otherwise I would have to check against all the possible languages handled by source-highlight, and that would be inefficient.

I should know more about scripting languages though, which is not the case (shame on me! ;-). However, I was also thinking of letting users provide their own regular expressions to detect a language, which other users could then benefit from as well.
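A rough sketch of that scoring idea, restricted to two shell families and driven by a small table of regular expressions that a user could extend. The function name guess_shell and both patterns are illustrative assumptions, not source-highlight's actual rules:

```shell
# Score a file against a table of "language regex" pairs and print the
# language whose pattern matches the most lines. On a tie the language
# listed first wins; if nothing matches, the function fails.
guess_shell() {
    best= best_score=0
    while read -r lang pattern; do
        # grep -c counts matching lines; -E selects extended regexes.
        score=$(grep -c -E -e "$pattern" "$1")
        if [ "${score:-0}" -gt "$best_score" ]; then
            best=$lang best_score=$score
        fi
    done <<'EOF'
sh ^[[:space:]]*(fi|done|esac)([[:space:];]|$)
csh ^[[:space:]]*(endif|end|endsw|setenv|foreach)([[:space:]]|$)
EOF
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

Keeping the patterns in a table rather than in the code is what would let users contribute their own; a real version would presumably read the table from a file and weight the patterns.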


BTW, src-highlite is great. Thank you Lorenzo for adding another tool
to my already unbelievably huge free software tools arsenal.


WOW!  Thank you!  :-D

I'll let you know when I release this new version of source-highlight!

And by the way, if you use some language which is not yet handled by source-highlight, and would like to add it, please let me know and we can work it out!

cheers
    Lorenzo



