
[gopher] Gopher++ scrapped & Internet Archive-style thingy



As part of my project to code a neat search engine covering the whole Gopherspace, I've (partially) crawled sites and done a lot of snooping and research.

Let's just say that the Gopherspace is small, but interesting. I'm glad I started crawling :-).

Anyway.

Whatever I've written about the gopher++ extra headers can now be considered "obsolete". I found a few live sites which just cannot accept anything other than a selector<CRLF>, so there's no way I can insert extra headers without breaking stuff. Those sites even break with type 7 queries (and gopher+), so I'm kind of giving up now.
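
For reference, a bare RFC 1436 transaction is just the selector plus a CRLF, with the response read until the server closes the connection. Here's a minimal Python sketch of what those old servers expect (host and selector are only examples):

    import socket

    def gopher_fetch(host, selector, port=70):
        # Plain RFC 1436 request: the selector, CRLF, nothing else.
        # Extra tabs or headers on this line (gopher+, gopher++)
        # are exactly what the oldest servers choke on.
        with socket.create_connection((host, port), timeout=30) as s:
            s.sendall(selector.encode("ascii", "replace") + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    # e.g. the Floodgap root menu (type 1, empty selector):
    menu = gopher_fetch("gopher.floodgap.com", "")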

All code regarding the header extensions has been scrapped and deleted; it's all gone for good. The good thing is that my code is now 100% compatible with ALL early-90s servers, but the bad thing is that the neat charset conversion thingy is gone too and we're back to 7-bit US-ASCII (or non-working Latin/UTF). Oh, well.

As my search engine's indexer is an offline one, my spider basically crawls around and saves all type 0 & 1 files to a local cache hierarchy. This was mostly accidental, but I managed to create something very much like the Internet Archive, but for gopher. Basically, you give the cache manager a URL and it gives you back the cached page (if it has it) AND it mangles menus so that as long as the pages are in the cache you'll stay in the cache.
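
The menu mangling is easy because gopher menus are so regular. Here's a rough Python sketch of the idea; the "cache.q?gopher://..." selector syntax is just my own notation inferred from the URLs below, not a documented interface:

    def rewrite_menu(menu, cache_host="gophernicus.org", cache_port=70):
        # Each menu line is: <type><display>TAB<selector>TAB<host>TAB<port>
        # Point every item back at the cache manager, so the user
        # stays inside the cache for as long as the pages are there.
        out = []
        for line in menu.splitlines():
            parts = line.split("\t")
            if len(parts) < 4 or line[:1] in ("i", "3", "."):
                out.append(line)  # info/error lines and the "." terminator stay as-is
                continue
            itemtype, display = parts[0][:1], parts[0][1:]
            selector, host, port = parts[1], parts[2], parts[3]
            cached = "cache.q?gopher://%s:%s/%s%s" % (host, port, itemtype, selector)
            out.append("%s%s\t%s\t%s\t%s" % (itemtype, display, cached, cache_host, cache_port))
        return "\r\n".join(out) + "\r\n"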

It's kind of like a combination of Google's cache and archive.org, only it works better than either of those...

Here's a cached copy of (partial) Floodgap:
gopher://gophernicus.org/1/cache.q?gopher://gopher.floodgap.com

It even cached itself:
gopher://gophernicus.org/1/cache.q?gopher://gophernicus.org

Notice how the cached Floodgap is much faster than the original one ;D. I wish there was something like this for the web...

<turtleneck shirt mode on>
One more thing,
</turtleneck>

I'll be crawling everything in about a month or so, so now is the time to fix your robots.txt if you don't want your files to end up in the cache.
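
For the record, a robots.txt in your server root along these lines keeps selected directories out of the crawl, and a lone "Disallow: /" opts you out completely (the paths here are just examples):

    # Keep the spider out of these directories
    User-agent: *
    Disallow: /private
    Disallow: /tmp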


- Kim






