[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re[2]: OT: how to strip out SGML tags?



Ok. My last post in this thread! Here is what does work: (thanks list)

#!/usr/bin/perl -w
##
##  sgmlstripper - Strip SGML markup from input.
##
##  by Robert J Seymour <rseymour@rseymour.com>
##     Copyright 1995, 1996, Robert Seymour and Springer-Verlag.

## Fixed by Bob Bernstein to handle zeros., 9/2/2000

##     All rights reserved.  This program may be distributed and/or
##     modified in electronic form under the same terms as Perl
##     itself.
##
##  CPAN menu:
#
# File Name: sgmlstripper
# File Size in BYTES: 1469
# Sender/Author/Poster: Robert J. Seymour <rseymour@rseymour.com>
# Subject: sgmlstripper - Strip SGML markup from input.
#
# sgmlstripper removes SGML markup tags from input (taken through
# specified files or STDIN).  sgmlstripper uses a 
# character-by-character read mode which, though not as fast as a
# regexp, is guaranteed to strip tags which fall across line or
# paragraph boundaries and preserves whitespace so that line numbers
# will be the same (the latter is useful for search engines which
# don't want to index markup, but want line numbers to be preserved).


##  Use STDIN if no files are given
$ARGV[0] = "-" unless @ARGV;

##  Strip out anything contained in an SGML markup tag.  This is not
##  very pretty and rather inefficient, but it does take care of tags
##  which cross line or paragraph boundaries.
foreach $file (@ARGV) {
  open(INPUT,$file);
  while(!eof(INPUT)) {
     $char = getc(INPUT);
    if($char eq "<") {
      IGNORE: for(;;) {
        last IGNORE if (getc(INPUT) eq ">");
      }
    } else {
      print $char;
    }
  }
  close(INPUT);
}


--
Bob Bernstein                  http://www.ruptured-duck.com




Reply to: