Re: OT: how to strip out SGML tags?
On Sat, Sep 02, 2000 at 05:27:49PM -0400, Bob Bernstein wrote:
> erik <erik@bossa.org> wrote:
>
> > > ## Use STDIN if no files are given
> > > $ARGV[0] = "-" unless @ARGV;
> > >
> > > ## Strip out anything contained in an SGML markup tag. This is not
> > > ## very pretty and rather inefficient, but it does take care of tags
> > > ## which cross line or paragraph boundaries.
> > > foreach $file (@ARGV) {
> > > open(INPUT,$file);
# while there's text to get
while (<INPUT>) {
    # while there's a starting (maybe complete) tag
    while (s/<[^>]*(>?)//) {
        # if not complete (<start but no finish)
        if ( ! $1) {
            my $tag;
            while ($tag = <INPUT>) {
                # keep going until we find the end-of-tag>
                last if $tag =~ s/.*?>//;
            }
            # maybe add a space wherever tags were ripped out? up 2 u
            # ($tag is undef if the file ran out in the middle of a tag)
            $_ .= $tag if defined $tag;
        }
    }
    munge($_);    # munge() = whatever you want done with the stripped text
}
note -- this ain't tested, but it looks to me like it's workable;
plus it reads a line at a time and uses the powerful perl muscles
to help you do your job... of course, tmtowtdi...
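for anyone who wants something to poke at, here's a self-contained sketch of the same line-at-a-time idea. the sub name strip_tags and the sample text are made up for illustration, and it reads from an in-memory filehandle so it runs without a file on disk. treat it as a sketch, not a real SGML parser: it knows nothing about comments, CDATA or a '>' inside an attribute value.

```perl
use strict;
use warnings;

# strip_tags($fh): hypothetical helper. returns the text read from $fh
# with everything between '<' and '>' removed, even when a tag is
# spread over several lines.
sub strip_tags {
    my ($fh) = @_;
    my $out = '';
    while (defined(my $line = <$fh>)) {
        # each pass removes one tag, or the start of an unfinished one
        while ($line =~ s/<[^>]*(>?)//) {
            next if $1;    # the tag was complete on this line
            # unfinished tag: swallow lines until the closing '>'
            my $rest;
            while (defined($rest = <$fh>)) {
                last if $rest =~ s/.*?>//;
            }
            # $rest is undef if the input ended inside a tag
            $line .= $rest if defined $rest;
        }
        $out .= $line;
    }
    return $out;
}

# sample input (made up), with one tag split across two lines
my $sgml = "<para>Chapter 0\n<ulink\n  url=\"x\">link</ulink> text\n</para>\n";
open(my $in, '<', \$sgml) or die "can't open in-memory handle: $!";
print strip_tags($in);
```

collecting the result in a string keeps the sketch easy to test; for big files you'd more likely print each line as soon as it's been stripped.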
> I had trouble with your idea, but I went back to the original script I posted
> and discovered that the problem is it dies whenever a numerical '0' is
> encountered! Apart from that it works fine. It just so happened I had a '0' in
> the first few lines of my SGML, but I didn't get the implication.
>
> So zero makes the condition '$char = getc(INPUT)' evaluate to false, dumping
> the flow down to closing the file. What's the perl equivalent of WHILE NOT
> EOF? <g>
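while (<FILEHANDLE>) { ... }
i.e.
while (defined($_ = <FILEHANDLE>)) { munge($_); }
perl adds the defined() test for you in this form, so a line holding
just '0' doesn't end the loop; only real end-of-file does.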
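if you'd rather keep the character-at-a-time getc loop from the original script, the same fix has to be spelled out by hand: perl only adds the defined() test by itself for the <FILEHANDLE> form, not for getc. a minimal sketch (the sub name count_chars and the sample text are made up for illustration):

```perl
use strict;
use warnings;

# count_chars($fh): hypothetical helper. reads $fh one character at a
# time and returns how many characters it saw.
sub count_chars {
    my ($fh) = @_;
    my $n = 0;
    # getc returns undef at end-of-file. the character '0' is false but
    # still defined, so testing defined() keeps the loop going past it.
    while (defined(my $char = getc($fh))) {
        $n++;
    }
    return $n;
}

# sample input (made up) with a '0' as its second character
my $text = "10 tags\n";
open(my $in, '<', \$text) or die "can't open in-memory handle: $!";
print count_chars($in), "\n";
```

with a bare 'while ($char = getc(INPUT))' the loop would stop at the '0' in this sample; with defined() it runs to the end of the input.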