OT: how to strip out SGML tags?
I have found a Perl script to do this, but it doesn't seem to work.
Does anyone know of something that does?
Just for conversation's sake, here's the pl that does _not_ seem to handle my
DocBook SGML:
#!/usr/bin/perl
##
## sgmlstripper - Strip SGML markup from input.
##
## by Robert J Seymour <rseymour@rseymour.com>
## Copyright 1995, 1996, Robert Seymour and Springer-Verlag.
## All rights reserved. This program may be distributed and/or
## modified in electronic form under the same terms as Perl
## itself.
##
## CPAN menu:
#
# File Name: sgmlstripper
# File Size in BYTES: 1469
# Sender/Author/Poster: Robert J. Seymour <rseymour@rseymour.com>
# Subject: sgmlstripper - Strip SGML markup from input.
#
# sgmlstripper removes SGML markup tags from input (taken through
# specified files or STDIN). sgmlstripper uses a
# character-by-character read mode which, though not as fast as a
# regexp, is guaranteed to strip tags which fall across line or
# paragraph boundaries and preserves whitespace so that line numbers
# will be the same (the latter is useful for search engines which
# don't want to index markup, but want line numbers to be preserved).
## Use STDIN if no files are given
$ARGV[0] = "-" unless @ARGV;
## Strip out anything contained in an SGML markup tag. This is not
## very pretty and rather inefficient, but it does take care of tags
## which cross line or paragraph boundaries.
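foreach $file (@ARGV) {
    open(INPUT, $file) or die "sgmlstripper: cannot open $file: $!\n";
    ## Test with defined() so that a literal "0" character does not end the loop.
    while (defined($char = getc(INPUT))) {
        if ($char eq "<") {
            ## Skip to the closing ">", giving up at end of file if the tag
            ## is never closed. Newlines inside the tag are still printed so
            ## that line numbers stay the same, as the header promises.
            IGNORE: for (;;) {
                $char = getc(INPUT);
                last IGNORE if (!defined($char) || $char eq ">");
                print $char if ($char eq "\n");
            }
        } else {
            print $char;
        }
    }
    close(INPUT);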
}
TIA, and sorry to be spamming the list with this OT post.
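
For comparison, here is a minimal sketch of the regexp approach that the header comment mentions as the faster alternative. The strip_tags helper and the sample string are my own illustration and are not part of sgmlstripper. It assumes the whole document is already in one string, which is what lets the pattern match tags that cross line boundaries.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical helper, not part of sgmlstripper: delete every <...> run
## from a string, but keep the newlines found inside each run so that
## line numbers in the output match the input.
sub strip_tags {
    my ($text) = @_;
    $text =~ s{<([^>]*)>}{
        my $inside = $1;
        "\n" x ($inside =~ tr/\n//);    # tr with no replacement only counts
    }ge;
    return $text;
}

## Made-up sample: one tag on a single line, one tag split across two lines.
my $sample = "<para>Hello,\n<emphasis\nrole=\"bold\">world</emphasis>.</para>\n";
print strip_tags($sample);
```

Like the script above, this treats the first ">" as the end of a tag, so a ">" inside a comment, a quoted attribute value or a DOCTYPE internal subset ends the match early. It also leaves entity references such as &amp; untouched, and an unterminated "<" is passed through as-is.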
--
Bob Bernstein http://www.ruptured-duck.com