
Re: [calvin@net.uni-sb.de: Bug#79627: ITP: linkchecker -- check HTML pages for broken links]



On Fri, Dec 15, 2000 at 09:05:29AM +0100, Josip Rodin wrote:
> 
> Could this be useful?
> 
Maybe. It appears to be extremely similar to the program I wrote.
Both are written in Python and use the same Python libraries.
The big difference is that he chose to use threads, while I chose
to limit the time spent checking a given URL, which requires
using signals (whose handlers raise exceptions). The problem is
that signals, and the exceptions raised from their handlers, don't
work well with threads.
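
To illustrate the signal approach described above, here is a minimal
sketch, not the actual code from either program. It assumes a Unix
system and must run in the main thread, which is exactly why it does
not combine well with threads. The names CheckTimeout, run_with_timeout
and slow_check are hypothetical.

```python
import signal
import time


class CheckTimeout(Exception):
    """Raised when a check exceeds its time budget (hypothetical name)."""


def _on_alarm(signum, frame):
    # The signal handler turns the alarm into an ordinary exception.
    raise CheckTimeout("check took too long")


def run_with_timeout(func, seconds, *args):
    """Call func(*args), abandoning it after `seconds` seconds.

    Unix only, and only from the main thread: Python runs signal
    handlers in the main thread, so worker threads cannot use this.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        if old_handler is not None:
            signal.signal(signal.SIGALRM, old_handler)


def slow_check(url):
    # Stand-in for a connection attempt that hangs (assumption).
    time.sleep(2)
    return "ok"


if __name__ == "__main__":
    try:
        run_with_timeout(slow_check, 0.5, "http://example.invalid/")
    except CheckTimeout:
        print("timed out")
```
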

The biggest bottleneck for the program is sites that time out.
By default, it takes about 13 minutes for a connection to
time out. I currently have the timeout set to 15 seconds in
my program, which should give it an edge over the other,
assuming the other uses 10 threads. Also, my program caches the
results of a timeout, while his doesn't, which will really
slow his down.
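
The caching idea can be sketched roughly as follows. This is only an
illustration: keying the cache on the host name is my assumption here,
and check_url, fetch and dead_hosts are hypothetical names, not taken
from either program.

```python
from urllib.parse import urlsplit


def check_url(url, fetch, dead_hosts):
    """Check one URL, skipping hosts that have already timed out.

    `fetch` is any callable that raises TimeoutError when the
    connection takes too long; `dead_hosts` is a set shared
    between calls.
    """
    host = urlsplit(url).hostname
    if host in dead_hosts:
        # Don't pay for the same timeout twice.
        return "timeout (cached)"
    try:
        fetch(url)
    except TimeoutError:
        dead_hosts.add(host)
        return "timeout"
    return "ok"
```

With a 15 second timeout, a page holding twenty links to the same dead
host costs one wait of 15 seconds rather than twenty of them.
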

I am running linkchecker on our pages (as of Fri Dec 15 18:57:57 UTC 2000)
and you can see the output at
http://www.debian.org/~treacy/urlcheck/scripts/out

The following is from an included mail, not from Josip Rodin:
> LinkChecker is a tool I wrote some time ago to check my HTML pages.
> It has grown into a program that can check whole web structures
> for broken links.
> It's licensed under the GPL.
> 
Comparison (responses are whether my program has that function):
> Features:
> o recursive checking
yes
> o multithreading
no
> o output in colored or normal text, HTML, SQL, CSV or a sitemap
>   graph in GML or XML.
who cares
> o HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher, Telnet and local
>   file links support
HTTP and FTP only, which work well for our site.
> o restriction of link checking with regular expression filters for URLs
yes
> o proxy support
no and not needed since it's our site
> o username/password authorization for HTTP and FTP
no, but not needed for our site
> o robots.txt exclusion protocol support
no, but since it is our site, I want to check what I want to check
> o i18n support
no
> o a command line interface
yes
> o a (Fast)CGI web interface (requires HTTP server)
trivial to set up

My program allows you to set the timeout for trying a connection,
which his doesn't. Unfortunately, there is a tradeoff between using
threads and using signal-based timeouts: you can't easily have both
unless you are willing to jump through some hoops.

-- 
James (Jay) Treacy
treacy@debian.org


