
Re: fixhrefgz - tool for converting anchors to gzipped files



This was discussed half a year ago, and the web servers were fitted
with on-the-fly decompression for .gz files, so what dwww does is
no longer necessary. Changing the content of .html files might lead to
problems with web browsers, and not all platforms have gzip available
by default.

Please do not do this. We do not have any problems here and you are about to create some.



In article <m0whN3K-000AjjC@liw.clinet.fi> you wrote:
: --==_Exmh_817738214P
: Content-Type: multipart/mixed ;
: 	boundary="==_Exmh_8169585350"

: This is a multipart MIME message.

: --==_Exmh_8169585350
: Content-Type: text/plain; charset=us-ascii

: [ Please don't Cc: public replies to me. ]

: During the recent thread on providing documentation in HTML,
: the need to compress it was pointed out. The compression itself
: is a trivial application of find, xargs, and gzip (or just gzip,
: of course), but that changes the files, so that links within
: the documentation break.
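
For reference, the find/xargs/gzip step described above can be sketched in Python roughly as follows. This is only an illustration in present-day Python 3, not part of the attached program; the function name compress_html_tree is made up for the example. It shows why the links break: the files are renamed to *.gz, but the hrefs inside them are left as they were.

```python
import gzip
import os
import shutil


def compress_html_tree(root):
    """Gzip every .html/.htm file under root and remove the originals.

    Hypothetical helper, roughly equivalent to running gzip on each
    HTML file found by find. Returns the sorted list of .gz paths created.
    """
    created = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            src = os.path.join(dirpath, name)
            dst = src + ".gz"
            # Copy the file contents through the gzip compressor.
            with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            os.remove(src)
            created.append(dst)
    return sorted(created)
```

After this runs, a link such as href="b.html" inside a.html.gz still names b.html, which no longer exists on disk.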

: Things work if you read the documentation through dwww, since
: dwww gives you foo.html.gz, if it exists and foo.html doesn't
: exist. That doesn't help if you browse the filesystem directly,
: and not via dwww and a web server.
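
The fallback described here (hand out foo.html.gz only when foo.html itself does not exist) is also roughly what on-the-fly decompression in a web server amounts to. A minimal sketch in Python 3, assuming a plain filesystem lookup; read_document is a hypothetical name and this is not how dwww or any particular server is actually implemented:

```python
import gzip
import os


def read_document(path):
    """Return the bytes of path, falling back to path + '.gz'.

    The compressed file is consulted only when the plain file is
    missing, matching the behaviour described in the message.
    Raises FileNotFoundError if neither file exists.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    # Plain file is absent: decompress the gzipped copy on the fly.
    with gzip.open(path + ".gz", "rb") as f:
        return f.read()
```

With a lookup like this on the serving side, the links in the documents can stay as foo.html and nothing in the files needs rewriting.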

: I hacked together a Python program that converts the links
: in the files themselves. It is attached.

: I've tried it with one of my own packages (sex), and it seems
: to work. Browsing the filesystem directly works, if the browser
: can handle gzipped files. Lynx works; Netscape 3.01 doesn't
: work, but I seem to recall that an earlier version did work.
: Someone familiar with mailcap might be able to get Netscape
: to work as well.

: Comments?

: -- 
: Please read <http://www.iki.fi/liw/mail-to-lasu.html> before mailing me.
: Please don't Cc: public replies to me.


: --==_Exmh_8169585350
: Content-Type: application/octet-stream ; name="fixhrefgz"
: Content-Description: fixhrefgz
: Content-Disposition: attachment; filename="fixhrefgz"

: #!/usr/bin/python

: """Convert local links in HTML documents to/from gzipped documents.

: Usage: fixhrefgz [-hzu] [--help] [--gzip] [--gunzip] [file ...]

: This program will convert links to local documents so that they
: point at the version compressed with gzip. Before conversion, an
: anchor tag might look like this:

: 	<a href="foo.html">foo</a>

: After conversion, it will look like this:

: 	<A HREF="foo.html.gz">foo</A>

: This allows one to compress HTML files. All other tags are
: unchanged by this program (except for case conversion).

: Lars Wirzenius, liw@iki.fi.

: """

: import formatter, htmllib, sys, urlparse, getopt, StringIO

: def gzip_mangler(path):
: 	if path[-5:] == ".html" or path[-4:] == ".htm":
: 		path = path + ".gz"
: 	return path

: def gunzip_mangler(path):
: 	if path[-8:] == ".html.gz" or path[-7:] == ".htm.gz":
: 		path = path[:-3]
: 	return path

: mangler = gzip_mangler

: class ParseAndCat(htmllib.HTMLParser):
: 	def __init__(self, formatter, verbose=0):
: 		htmllib.HTMLParser.__init__(self, formatter, verbose)
: 		self.nofill = 1

: 	def anchor_bgn(self, href, name, type):
: 		parts = urlparse.urlparse(href)
: 		if not parts[0] and not parts[1]:
: 			path = parts[2]
: 			path = mangler(path)
: 			parts = (parts[0], parts[1], path,
: 				 parts[3], parts[4], parts[5])
: 			href = urlparse.urlunparse(parts)
: 			
: 		s = '<A'
: 		if href: s = s + (' HREF="%s"' % href)
: 		if name: s = s + (' NAME="%s"' % name)
: 		if type: s = s + (' TYPE="%s"' % type)
: 		s = s + '>'
: 		self.formatter.add_literal_data(s)

: 	def anchor_end(self):
: 		self.formatter.add_literal_data('</A>')
: 	
: 	def handle_image(self, src, alt, ismap, align, width, height):
: 		s = '<IMG'
: 		if src: s = s + (' SRC="%s"' % src)
: 		if alt: s = s + (' ALT="%s"' % alt)
: 		if ismap: s = s + ' ISMAP'
: 		if align: s = s + (' ALIGN="%s"' % align)
: 		if width: s = s + (' WIDTH="%s"' % width)
: 		if height: s = s + (' HEIGHT="%s"' % height)
: 		s = s + '>'
: 		self.formatter.add_literal_data(s)
: 	
: 	def _format_tag(self, tag, attrs):
: 		s = '<' + tag
: 		for attr, value in attrs:
: 			if value:
: 				s = s + (' %s="%s"' % (attr, value))
: 			else:
: 				s = s + (' %s' % attr)
: 		s = s + '>'
: 		self.formatter.add_literal_data(s)

: 	def start_html(self, attrs):	self._format_tag('HTML', attrs)
: 	def end_html(self):		self._format_tag('/HTML', [])

: 	def start_head(self, attrs):	self._format_tag('HEAD', attrs)
: 	def end_head(self):		self._format_tag('/HEAD', [])

: 	def start_body(self, attrs):	self._format_tag('BODY', attrs)
: 	def end_body(self):		self._format_tag('/BODY', [])

: 	def start_title(self, attrs):	self._format_tag('TITLE', attrs)
: 	def end_title(self):		self._format_tag('/TITLE', [])

: 	def do_base(self, attrs):	self._format_tag('BASE', attrs)
: 	def do_isindex(self, attrs):	self._format_tag('ISINDEX', attrs)
: 	def do_link(self, attrs):	self._format_tag('LINK', attrs)
: 	def do_meta(self, attrs):	self._format_tag('META', attrs)
: 	def do_nextid(self, attrs):	self._format_tag('NEXTID', attrs)

: 	def start_h1(self, attrs):	self._format_tag('H1', attrs)
: 	def end_h1(self):		self._format_tag('/H1', [])

: 	def start_h2(self, attrs):	self._format_tag('H2', attrs)
: 	def end_h2(self):		self._format_tag('/H2', [])

: 	def start_h3(self, attrs):	self._format_tag('H3', attrs)
: 	def end_h3(self):		self._format_tag('/H3', [])

: 	def start_h4(self, attrs):	self._format_tag('H4', attrs)
: 	def end_h4(self):		self._format_tag('/H4', [])

: 	def start_h5(self, attrs):	self._format_tag('H5', attrs)
: 	def end_h5(self):		self._format_tag('/H5', [])

: 	def start_h6(self, attrs):	self._format_tag('H6', attrs)
: 	def end_h6(self):		self._format_tag('/H6', [])

: 	def do_p(self, attrs):		self._format_tag('P', attrs)

: 	def start_pre(self, attrs):	self._format_tag('PRE', attrs)
: 	def end_pre(self):		self._format_tag('/PRE', [])

: 	def start_xmp(self, attrs):	self._format_tag('XMP', attrs)
: 	def end_xmp(self):		self._format_tag('/XMP', [])

: 	def start_listing(self, attrs):	self._format_tag('LISTING', attrs)
: 	def end_listing(self):		self._format_tag('/LISTING', [])

: 	def start_address(self, attrs):	self._format_tag('ADDRESS', attrs)
: 	def end_address(self):		self._format_tag('/ADDRESS', [])

: 	def start_blockquote(self, attrs):
: 					self._format_tag('BLOCKQUOTE', attrs)
: 	def end_blockquote(self):	self._format_tag('/BLOCKQUOTE', [])

: 	def start_ul(self, attrs):	self._format_tag('UL', attrs)
: 	def end_ul(self):		self._format_tag('/UL', [])

: 	def do_li(self, attrs):		self._format_tag('LI', attrs)

: 	def start_ol(self, attrs):	self._format_tag('OL', attrs)
: 	def end_ol(self):		self._format_tag('/OL', [])

: 	def start_menu(self, attrs):	self._format_tag('MENU', attrs)
: 	def end_menu(self):		self._format_tag('/MENU', [])

: 	def start_dir(self, attrs):	self._format_tag('DIR', attrs)
: 	def end_dir(self):		self._format_tag('/DIR', [])

: 	def start_dl(self, attrs):	self._format_tag('DL', attrs)
: 	def end_dl(self):		self._format_tag('/DL', [])

: 	def do_dt(self, attrs):		self._format_tag('DT', attrs)
: 	def do_dd(self, attrs):		self._format_tag('DD', attrs)

: 	def start_cite(self, attrs):	self._format_tag('CITE', attrs)
: 	def end_cite(self):		self._format_tag('/CITE', [])

: 	def start_code(self, attrs):	self._format_tag('CODE', attrs)
: 	def end_code(self):		self._format_tag('/CODE', [])

: 	def start_em(self, attrs):	self._format_tag('EM', attrs)
: 	def end_em(self):		self._format_tag('/EM', [])

: 	def start_kbd(self, attrs):	self._format_tag('KBD', attrs)
: 	def end_kbd(self):		self._format_tag('/KBD', [])

: 	def start_samp(self, attrs):	self._format_tag('SAMP', attrs)
: 	def end_samp(self):		self._format_tag('/SAMP', [])

: 	def start_strong(self, attrs):	self._format_tag('STRONG', attrs)
: 	def end_strong(self):		self._format_tag('/STRONG', [])

: 	def start_var(self, attrs):	self._format_tag('VAR', attrs)
: 	def end_var(self):		self._format_tag('/VAR', [])

: 	def start_i(self, attrs):	self._format_tag('I', attrs)
: 	def end_i(self):		self._format_tag('/I', [])

: 	def start_b(self, attrs):	self._format_tag('B', attrs)
: 	def end_b(self):		self._format_tag('/B', [])

: 	def start_tt(self, attrs):	self._format_tag('TT', attrs)
: 	def end_tt(self):		self._format_tag('/TT', [])

: 	def do_br(self, attrs):		self._format_tag('BR', attrs)
: 	def do_hr(self, attrs):		self._format_tag('HR', attrs)

: 	def unknown_starttag(self, tag, attrs):
: 		self._format_tag(tag, attrs)
: 	def unknown_endtag(self, tag):
: 		self._format_tag('/' + tag, [])

: def process(file):
: 	result = StringIO.StringIO()
: 	f = formatter.AbstractFormatter(formatter.DumbWriter(file=result))
: 	p = ParseAndCat(f)
: 	p.feed(file.read())
: 	p.close()
: 	result.seek(0,0)
: 	return result.read()

: def usage():
: 	print "usage: fixhrefgz [-hzu] [--help] [--gzip] [--gunzip] [file ...]"
: 	sys.exit(0)

: if __name__ == "__main__":
: 	opts, argv = getopt.getopt(sys.argv[1:], "hzu",
: 				[ "help", "gzip", "gunzip" ])
: 	for opt, value in opts:
: 		if opt == "-h" or opt == "--help":
: 			usage()
: 		elif opt == "-z" or opt == "--gzip":
: 			mangler = gzip_mangler
: 		elif opt == "-u" or opt == "--gunzip":
: 			mangler = gunzip_mangler
: 	if argv:
: 		for filename in argv:
: 			f = open(filename, "r")
: 			result = process(f)
: 			f = open(filename, "w")
: 			f.write(result)
: 			f.close()
: 	else:
: 		sys.stdout.write(process(sys.stdin))

: --==_Exmh_8169585350--
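
The link rewriting at the heart of the attachment (the two manglers plus the urlparse step in anchor_bgn) can be sketched in isolation like this. It is an illustration in Python 3 using urllib.parse.urlsplit in place of the urlparse module the script imports; the names gzip_href and gunzip_href are made up for the example and are not from the original program.

```python
from urllib.parse import urlsplit, urlunsplit


def gzip_href(href):
    """Append '.gz' to the path of a local .html/.htm link."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href  # link to another host or scheme: leave untouched
    if parts.path.endswith((".html", ".htm")):
        parts = parts._replace(path=parts.path + ".gz")
    return urlunsplit(parts)


def gunzip_href(href):
    """Strip '.gz' from the path of a local .html.gz/.htm.gz link."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href
    if parts.path.endswith((".html.gz", ".htm.gz")):
        parts = parts._replace(path=parts.path[:-3])
    return urlunsplit(parts)
```

Only the path component is touched, so a fragment such as foo.html#intro keeps its #intro after the rewrite.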



: --==_Exmh_817738214P
: Content-Type: application/pgp-signature

: -----BEGIN PGP MESSAGE-----
: Version: 2.6.3ia

: iQCVAwUBM7LtbIQRll5MupLRAQE3ZgP+L2gBQsAVKeIg7mvyAbs8bzgqgvfqpRZn
: zZGrvqbOkohIOKHuoZ4WrpXhChRvQcLHWKRZ0K56TBAFkj5XIEtkiFcRe2h9URqU
: rkKamJNBYvyY9W9b6+dWmPQeryd7Yr28yQR30r7S+gwZg6mMyZNYLWJMCh+3ZbjH
: jhOFY6qJWv8=
: =HEzh
: -----END PGP MESSAGE-----

: --==_Exmh_817738214P--


: --
: TO UNSUBSCRIBE FROM THIS MAILING LIST: e-mail the word "unsubscribe" to
: debian-devel-request@lists.debian.org . 
: Trouble?  e-mail to templin@bucknell.edu .



-- 
--- +++ --- +++ --- +++ --- +++ --- +++ --- +++ --- +++ ---
Please always CC me when replying to posts on mailing lists.



