
[debian-knoppix] Fwd: Knoppix - static version of wiki content



Hi there,
I have been able to cross an item off my to-do list without actually doing it :)

Attached is something Julian Rendell (jgwr at shaw.ca) did for a Knoppix CD for his LUG: he converted the Knoppix.net docs / wiki into something suitable for distributing on CD. The knoppix-net-docs.tar.bz2 archive contains a folder called static_html with the docs in English, German and French. Please take a look; do you think these docs are of good enough quality to include on the Knoppix CD? If not, you can edit the files on the wiki.
I think that the current docs are a bit lacking, and this is an effort
to provide better documentation on the CD.

The Knoppix Wiki at http://www.knoppix.net/docs/ is licensed under the GNU FDL.
Best Regards
Eaden McKee

-------- Original Message --------
Subject: Success- static version of wiki content
Date: Sat, 10 May 2003 16:38:17 -0700
From: Julian Rendell <jgwr@shaw.ca>
To: Eaden McKee <email@eadz.co.nz>
References: <1051602499.3299.12.camel@thunder.ourhome.net> <3EAE34F6.3020005@eadz.co.nz> <1052009072.17140.32.camel@thunder.ourhome.net> <3EB46464.7020602@eadz.co.nz>



Hi Eaden-

well after a bit of work, I've got a mostly automated method of
converting the wiki to static html.  It's rough; it took me quite a bit
of time to get it working, but I now have a set of pages that are good
enough for my needs.  There is one buglet (bits of the last line get
duplicated, e.g. a stray tml>) that affects some of the converted
files, so they need manual checking.  Luckily this was only about 5
files out of 155.  If you look at them with Mozilla (Linux) you'll
notice your logo graphic has a white background.  I did try a
transparent background, but IE6 on Win98 screwed that up and displayed
it as dark brown.  I'm not an HTML guru and don't want to dig into
this too much, so I just chose the least-bad looking of the two.

I've been documenting how our Knoppix-based disk is being put together;
I've included the part about processing the knoppix.net doc wiki below.
Also attached are the scripts, the hacked phpwiki theme, and the
resulting static pages.

Again, thanks for letting us use these pages, and I hope this proves
useful for the future.

All the best

Julian

A. Knoppix.net Wiki Documentation
---------------------------------
1. Contacted the website owner (Eaden McKee) and asked if we may use the wiki
  contents on the CD.  He agreed and sent the raw wiki files.
  The raw wiki pages contain no HTML.
2. Installed phpwiki on my box.
3. Modified the default phpwiki theme to remove most wiki-related (dynamic) links in content
  pages.
  Removed all but the BackLinks plugin from phpwiki's plugins directory (roughly as sketched below).
  Modified the English and German home pages to remove the sections at the bottom of the page
  about how to use phpwiki; shouldn't need to do that again, as spider_bad_pages.pl can be set up to
  remove those links.
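  A rough sketch of that plugin cleanup, assuming the stock phpwiki layout where the
  plugins live under lib/plugin/ (that path is from memory, so check your install):
		cd /path/to/phpwiki                   # wherever phpwiki was installed in step 2
		cp -a lib/plugin ../plugin-backup     # keep a copy of the full plugin set
		find lib/plugin -name '*.php' ! -name 'BackLinks.php' -exec rm {} \;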
4. Created the following script to pull in the wiki as static HTML.  It's not perfect, but it works.
  Notes: -should integrate all the 'find | perl' commands into spider_bad_pages.pl
		  -changing the perl command line to -n instead of -p (i.e. to turn input echo off)
		   led to empty html files; I thought the two were equivalent except for echoing
		   the input; see the note just after this list.
		  -instead of re-creating a phpwiki theme, obtain the theme used for knoppix.net,
		   then add another filter section to spider_bad_pages.pl that removes the
		   wiki-feature-related links/sections.
		  -need to check and manually copy the altered pages; they have .new appended to their names.
		   For some reason, some files ended up with an extra 'tml>' added to the end of the file;
		   the likely cause is that the output file wasn't truncated before being rewritten
		   (see the O_TRUNC note in spider_bad_pages.pl below).
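  A probable explanation for the -n/-p difference: -p wraps the code in a read loop and
  prints $_ after every pass, while -n leaves the printing to you; with -i the printed
  output becomes the new file contents, so -n without an explicit print writes back an
  empty file.  A quick illustration (somefile.html is a made-up name):
		perl -pi -e 's/%3Faction=/_/' somefile.html             # rewritten in place
		perl -ni -e 's/%3Faction=/_/; print' somefile.html      # -n needs the print, or the file ends up empty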

   generate.sh
   -----------
		#!/bin/sh

		# little script to convert the KnoppixNet Documentation wiki to static HTML
		wget -nH -E -r -l inf -k -p -np -D thunder.ourhome.net http://thunder.ourhome.net/phpwiki/index.php

		#make a backup of the downloaded files, in case the following goes wrong...
		cp -a phpwiki ../phpwiki-backup

		#remove the alpha server directory
		rm -rf phpwiki/index.php/PhpWikiAlpha\:en

		#get rid of all the 'action' related pages other than the BackLinks pages
		#(I felt the BackLinks pages were useful for navigation.)

		#note: used perl unlink() to avoid issues with shell quoting; there are some
		#oddball file names for pages that I don't think are linked anywhere...
		find ./ -name '*\?*' | grep -v 'BackLinks' | perl -pe 'chomp; unlink("$_");'

		#remove all the base hrefs- these cause problems on Windows
		find ./ -name '*html' -exec perl -pi -e 's/<base href="" \/>//' {} \;

		#tidy up the ?action= filenames to be just '_'
		#NB- at this point these should just be the BackLink files
		find ./ -name '*action*' | perl -pe 'BEGIN {use File::Copy}; chomp; $file=$_; $newfile=$file; $newfile=~s/\?action=/_/; move($file, $newfile)'

		#adjust all the links of the form %3Faction= to '_'
		find ./ -name '*html' -exec perl -pi -e 's/%3Faction=/_/' {} \;

		#quite a few pages have edit icons for wiki pages that don't exist yet.  Let's remove those links.
		find ./ -name '*html' -exec perl -pi -e 's/(<span class="wikiunknown">)(<a href=.*?\?<\/a>)(.*?<\/span>)/$1$3/' {} \;

		#remove unwanted pages, and links to these pages
		cd phpwiki/index.php
		../../spider_bad_pages.pl

		#next steps: check the changed pages (look for *.orig)
		#check for broken links via checkbot
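
   To see what the '?action=' rename and the '%3Faction=' link rewrite in generate.sh
   actually do, here is a quick check on a made-up page name (SomePage):
		echo 'SomePage?action=BackLinks.html' | perl -pe 's/\?action=/_/'
		#  -> SomePage_BackLinks.html
		echo 'href="SomePage%3Faction=BackLinks.html"' | perl -pe 's/%3Faction=/_/'
		#  -> href="SomePage_BackLinks.html"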

   spider_bad_pages.pl
   -------------------
		#!/usr/bin/perl -wl
		# -w turns on warnings; -l appends a newline to every print
		use Fcntl;
		use File::Copy;
		use File::Find;

		#list of bad files to start from
		@badToSearch= ( "WikiWikiWeb.html", "PageGroupTest.html" );

		#already looked through - a hash for easy searching
		%badFiles=();

		#repeat until badToSearch is empty
		while (scalar(@badToSearch)) {
			#get the first file from the list
			$file=shift(@badToSearch);
			print "\nchecking in file $file...";
			#open the file using sysopen to avoid problems with odd file names
			sysopen(IN, $file, O_RDONLY) or die "Couldn't open $file for reading: $!\n";
			#search the file for links.  Check %badFiles to see if the link has already been logged
			while (<IN>) {
				chomp;
				$line=$_;
				while ($line=~s/a href="(.*?)"(.*)/$2/) {
					$check=$1;

					#we don't want to include *_BackLinks* pages ... this page could have been referenced
					#by a page we want to keep.  We'll take care of BackLink pages as part of the cleanup for
					#the original page
					#same for HomePage
					next if ($check=~/BackLink|HomePage/);

					#make sure this link hasn't already been tested, or is about to be, and that it's a file that exists in this dir
					if ((! exists($badFiles{$check})) && (! grep $_ eq $check, @badToSearch ) && (-f $check)) {
						print "...adding file $check to search";
						push(@badToSearch, $check);
					}
				}
			}
			close(IN);
			#add this file to %badFiles
			$badFiles{$file}="1";
		}

		print "Bad files are:";
		foreach $file (keys(%badFiles)) {
			print "$file";
		}
		print "\n\n";

		#process the bad files

		#sub to collect the html files to be processed
		sub wanted {
			return if (!($File::Find::name=~m/html/) || !(-f $File::Find::name));
			print ">>>Adding $File::Find::name to list of files to be processed";
			push(@filesToProcess, $File::Find::name);
		}


		#iterate over keys(%badFiles)
		#remove all these files so we don't spend time processing them
		foreach $badfile (keys(%badFiles)) {
			#create vars for badfile and badfile_BackLinks.html
			$badfileBL=$badfile;
			$badfileBL=~s/\.html/_BackLinks.html/;
			print ">>>Unlinking $badfile and $badfileBL";
			unlink($badfile);
			unlink($badfileBL);
		}

		@filesToProcess=();
		#iterate over all html files
		find(\&wanted, ".");


		foreach $badfile (keys(%badFiles)) {
			#create vars for badfile and badfile_BackLinks.html
			$badfileBL=$badfile;
			$badfileBL=~s/\.html/_BackLinks.html/;

			print ">>>Removing all refs to $badfile and $badfileBL";

			#iterate over all the files to be processed
			foreach $file (@filesToProcess) {
				print "\t>>>Processing file $file";
				#open the file using sysopen to avoid problems with odd file names
				sysopen(IN, $file, O_RDONLY) or die "Couldn't open $file for reading: $!\n";
				#read the file into an array
				@contents=<IN>;
				close(IN);
				#iterate over the array
				$changed=0;
				@newContents=();
				foreach $line (@contents) {
					$oldline=$line;
					#perl cookbook to the rescue...
					#replace all links to badfile and badfile_BackLinks
					#(\Q...\E so the '.' in the file names is matched literally)
					1 while ($line=~s/(.*?)<a href="(\Q$badfile\E|\Q$badfileBL\E).*?>(.*?)<\/a>(.*)/$1$3$4/);
					if ($oldline ne $line) {
						$changed=1;
						print "\t\t>>>Line Changed:\n$oldline\n---->\n$line";
					}
					push(@newContents, $line);
				}
				#write the array back to the file- only if modified
				if ($changed) {
					$origFile=$file.".orig";
					#only copy it to a backup if the backup doesn't already exist
					if ( ! -f $origFile) {
						copy($file, $origFile);
					}
					print "\t\t>>>Writing changes to $file";
					#O_TRUNC so a shorter rewrite doesn't leave the tail of the old file behind
					#(the likely cause of the stray 'tml>' fragments mentioned above)
					sysopen(OUT, $file, O_WRONLY|O_CREAT|O_TRUNC) or die "Couldn't open $file for writing: $!\n";
					print OUT @newContents;
					close(OUT);
				}
			}
		}

  ---anyone know how to make emacs indent-region work relatively, i.e. without removing pre-existing indentation?

5. I did the following to help check the altered pages:
		find . -name '*orig' | perl -ne '$_=~/(.*)\.orig/ ; print "$1 diff with $_"; system("diff $1 $_")' | less
  Some pages ended up with the last couple of lines repeated; it didn't happen to all files.
  Quicker to visually inspect the 4 or 5 affected files for now than to track down the root cause...

6. Test the link integrity of the pages:
		-copy the static html pages to a local web server
		-use checkbot to check the links (roughly as sketched below); in particular look for any pages returning a 404 (not found) error
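		 Roughly what that checkbot run could look like (the option names are from memory and the
		 local URL is only an example; check checkbot's own documentation):
				checkbot --url http://localhost/knoppix-docs/index.php/HomePage.html \
				         --match http://localhost/knoppix-docs/
				# checkbot writes an HTML report; look through it for any 404s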

7. Final page count: 422 pages reduced to 155; only one 404 error, and that was for an external link.





Attachment: knopppix-net-docs.tar.bz2
Description: application/bzip

