
Re: fixhrefgz - tool for converting anchors to gzipped files



Thanks Lars for the tool.

I wrote exactly the same thing in Perl (at your request!) some time ago. I
have attached it to this mail.

I don't know which version is better. It looks like Lars' implementation
hard-codes many of the HTML tags it processes. Mine is built on Perl's
HTML::Parser class, so it does not depend on any particular set of HTML tags.
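
For readers who want to try the link rule on its own, here is a minimal
sketch of it as a standalone function. It is not taken from the attached
script: the name gz_href and the $exists callback are hypothetical, and the
callback stands in for the on-disk file test so the rule can be exercised
without any files present.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of fixhtmlgz): rewrite a single href value.
# $exists is a callback that reports whether a local file is present.
sub gz_href {
  my ($href, $exists) = @_;
  my ($scheme, $fragment) = ('', '');

  # Peel off an optional "scheme:" prefix and an optional "#fragment".
  $scheme   = $1 if $href =~ s/^([A-Za-z][A-Za-z0-9+.\-]*:)//;
  $fragment = $1 if $href =~ s/(#.*)$//;

  # Remote links (http:, mailto:, ...) are returned unchanged.
  return "$scheme$href$fragment"
    unless $scheme eq '' or lc($scheme) eq 'file:';

  # Local links are changed only if they end in .html and the file exists.
  return "$scheme$href$fragment"
    unless $href =~ /\.html$/ and $exists->($href);

  return "$scheme$href.gz$fragment";
}

# Pretend that only foo.html is present.
my %on_disk = ('foo.html' => 1);
my $exists  = sub { $on_disk{$_[0]} };

print gz_href($_, $exists), "\n"
  for 'foo.html#intro', 'bar.html', 'http://example.org/foo.html';
```

Under these assumptions only the first link is rewritten (to
foo.html.gz#intro); the missing local file and the remote URL are printed
unchanged.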


Thanks,

Chris

--                  Christian Schwarz
                     schwarz@monet.m.isar.de, schwarz@schwarz-online.com,
Debian is looking     schwarz@debian.org, schwarz@mathematik.tu-muenchen.de
for a logo! Have a
look at our drafts     PGP-fp: 8F 61 EB 6D CF 23 CA D7  34 05 14 5C C8 DC 22 BA
at    http://fatman.mathematik.tu-muenchen.de/~schwarz/debian-logo/
#!/usr/bin/perl
#
# fixhtmlgz 0.2
# Copyright (c) 1997 by Christian Schwarz <schwarz@monet.m.isar.de>
# May be distributed under GPL 2.
#

# Specification:
#
# Currently, we have a problem with compressed HTML: we can access
# compressed HTML fine, but links don't work very well. The problem
# is that the link says "foo.html", and the actual file is
# "foo.html.gz", and the browsers and servers aren't intelligent
# enough to handle this invisibly. This means that we can't install
# compressed HTML if it contains links.
# 
# We need a program that can be run on uncompressed HTML, which converts
# local links to the compressed versions of the files. Usage would
# be something like:
# 
#         fixhtmlgz file.html ...
# 
#         - read file.html
#         - for each link <a href="foo.html">, if foo.html exists,
#           convert the link to foo.html.gz instead
#         - otherwise, do not modify the link
#         - output is either to file.html.fixed or file.html (replace
#           original with modified version)
#
# Changes:
#      v0.2:
#         - now handles gzipped files
#         - parse .html and .htm files
#         - changed replacing rule: change href to refer to the
#           file, as it actually exists. Example:
#               <a href="foo.html"> will only be converted to
#               foo.html.gz, if this file exists, and not if
#               foo.html exists.
# 

package Parser; #-------------------------------
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub declaration {
  my ($self, $decl) = @_;
  print ::OUT "<!$decl>";
}

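sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;

  if ($tag eq 'a' and defined(my $href = $$attr{'href'})) {
    # Split the href into an optional scheme (e.g. "file:"), the path,
    # and an optional "#fragment". $1 is copied only after a successful
    # match, so a stale capture from an earlier match cannot leak in.
    my ($type, $anchor) = ('', '');
    if ($href =~ s/^([A-Za-z][A-Za-z0-9+.\-]*:)//) {
      $type = $1;
    }
    if ($href =~ s/(\#.*)$//) {
      $anchor = $1;
    }
    #print "href: ($type,$href,$anchor)\n";

    # Only local links (no scheme, or "file:") are candidates. The file
    # test is relative to the current working directory.
    if (($type eq '' or $type =~ /^file:$/i)
        and $href =~ /\.html$/ and -f $href) {
      # append `.gz'
      $$attr{'href'} = "$type$href.gz$anchor";
      # rebuild origtext.
      $origtext = "<a";
      for my $name (@$attrseq) {
        if (defined $$attr{$name} and $$attr{$name} ne '') {
          $origtext .= " $name=\"$$attr{$name}\"";
        } else {
          $origtext .= " $name";
        }
      }
      $origtext .= ">";
    }
  }

  print ::OUT $origtext;
}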

sub end {
  my ($self, $tag) = @_;
  print ::OUT "</$tag>";
}

sub text {
  my ($self, $text) = @_;
  print ::OUT "$text";
}

sub comment {
  my ($self, $comment) = @_;
  print ::OUT "<!--$comment-->";
}

#########################################################################

package main;

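if (@ARGV == 0) {
  print STDERR "usage: fixhtmlgz <html file> ...\n";
  exit 1;
}

$p = Parser->new;

# foreach (rather than "while shift") so a file named "0" is not skipped.
foreach $filename (@ARGV) {
  if ( ! -f $filename ) {
    print STDERR "error: file $filename not found, skipping.\n";
    next;
  }

  # Write the converted document to a temporary file first.
  $output = "$filename.fixed";
  open(OUT, '>', $output) or die "cannot open output file $output: $!";

  # parse_file returns undef if the input file cannot be opened.
  defined($p->parse_file($filename))
    or die "cannot read $filename: $!";

  close(OUT) or die "cannot write $output: $!";

  # Keep the original as a backup, then move the fixed file into place.
  rename($filename,"$filename.bak") or die "cannot rename $filename: $!";
  rename($output,$filename) or die "cannot rename $output: $!";
}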

exit 0;

