Re: OT: how to strip out SGML tags?



On Sat, Sep 02, 2000 at 05:27:49PM -0400, Bob Bernstein wrote:
> erik <erik@bossa.org> wrote:
> 
> > > ##  Use STDIN if no files are given
> > > $ARGV[0] = "-" unless @ARGV;
> > > 
> > > ##  Strip out anything contained in an SGML markup tag.  This is not
> > > ##  very pretty and rather inefficient, but it does take care of tags
> > > ##  which cross line or paragraph boundaries.
> > > foreach $file (@ARGV) {
> > >   open(INPUT,$file);
        # while there's text to get
        while(<INPUT>) {
        	# while there's a starting (maybe complete) tag
        	while (s/<[^>]*(>?)//) {
        		# if not complete (<start but no finish)
        		if ( ! $1) {
        			my $tag;
        			while($tag = <INPUT>) {
        				# keep going until we find the end-of-tag>
        				last if $tag =~ s/.*?>//;
        			}
        			# maybe add a space wherever tags were ripped out? up 2 u
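        			# ($tag is undef if the file ended inside an open tag)
        			$_ .= $tag if defined $tag;
        		}
        	}
        	munge($_);	# placeholder: do whatever you want with the stripped text
        }
        close(INPUT);
}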

note -- this ain't tested, but it looks to me like it's workable;
plus it reads lines at a time and uses the powerful perl muscles
to help you do your job... of course, tmtowtdi...
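
For what it's worth, one of those other ways can be sketched by working on
the whole text at once instead of line by line. This is a minimal sketch,
not the script from this thread: strip_tags is a made-up name, and the
regex assumes no '>' ever appears inside a quoted attribute value or an
SGML comment.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): remove every
# <...> span from a string, including tags that span several lines.
sub strip_tags {
    my ($text) = @_;
    # [^>]* also matches newlines, so multi-line tags are covered.
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Sample input with one tag broken across two lines.
my $sample = "<para>Hello <emphasis\n role=\"bold\">world</emphasis>.</para>\n";
print strip_tags($sample);    # prints "Hello world."
```

To run it over a real file you would slurp the file into one string first
(for example with "local $/;" before the read) and pass that to the helper.
The trade-off is memory: the whole document is held at once.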

> I had trouble with your idea, but I went back to the original script I posted
> and discovered that the problem is it dies whenever a numerical '0' is
> encountered! Apart from that it works fine. It just so happened I had a '0' in
> the first few lines of my SGML, but I didn't get the implication.
> 
> So zero makes the condition '$char = getc(INPUT)' evaluate to false, dumping
> the flow down to closing the file. What's the perl equivalent of WHILE NOT
> EOF? <g>

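	while (<FILEHANDLE>) { ... }
i.e.
	while (defined($_ = <FILEHANDLE>)) { munge $_; }

perl adds the defined() test for you in the first form. For your getc
loop, test for definedness yourself instead of truth, since the string
'0' is false but still defined:
	while (defined($char = getc(INPUT))) { ... }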
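
To see the '0' problem and the fix in isolation, here is a small sketch
that reads one character at a time from an in-memory filehandle. The name
count_chars and the sample data are made up for illustration, and the
in-memory open assumes Perl 5.8 or later; the only point is the defined()
test in the loop condition.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters read one at a time with getc.
# defined() keeps the loop going when getc returns the string "0",
# which is false in boolean context; only undef (EOF or error) stops it.
sub count_chars {
    my ($fh) = @_;
    my $count = 0;
    while (defined(my $char = getc($fh))) {
        $count++;
    }
    return $count;
}

# Usage: the data contains a "0" in the middle.
my $data = "a0b";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";
print count_chars($in), "\n";    # prints 3
close($in);
```

With a plain "while (my $char = getc($fh))" the same input would stop
after the first character, because the "0" that follows it is false.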


