
Re: please use robots.txt for your gopher apps



Here is the resend, so the mailing list can see it. Sorry, I did it wrong
again. Maybe fix the mailing list so the headers are set properly?


Hello Cameron,

On Fri, 17 May 2019 14:09:40 +0200 Cameron Kaiser <spectre@floodgap.com> wrote:
> I love gopher apps and seeing them, but it is very hard for V2's robot to
> automatically recognize them, requiring lots of manual work to pull stuff
> out of the index that should never have been there in the first place. Please
> use a robots.txt selector to keep the V2 robot out of these areas; I'm
> considering a policy requirement that sites to be accepted to the new servers
> page must have some sort of robots.txt up since this is becoming a (happy)
> problem.

thanks for bringing this topic up.

The basic standard is:

	# Comments
	User-Agent: *
	Disallow: /some/selector


First of all, there is a difference between HTTP and gopher: an HTTP
GET path always begins with a »/«, but a gopher selector can be the
empty string, i.e. no character at all.

Shall
	Disallow: /
stand for both
	Disallow: /
and the empty selector
	Disallow:
?

Crawlers coming in over HTTP proxies will only ever see selectors that
begin with »/«, but a native gopher crawler would not match a
gopherhole that, just for fun, starts its selectors with some other
character.

I am voting for a recommendation to list both »/« and the empty
selector »«, since they are different things.
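
As a rough illustration (Python, not part of any spec), a crawler-side
matcher that keeps »/« and the empty selector apart could look like
this; it follows the interpretation proposed here, where an empty
Disallow value names the empty selector, not »allow everything« as in
HTTP robots.txt:

	# Sketch only: prefix matching of Disallow rules against gopher selectors.
	def is_disallowed(selector, disallow_rules):
	    for rule in disallow_rules:
	        if rule == "":
	            # empty rule = the empty selector (the gopherhole root)
	            if selector == "":
	                return True
	        elif selector.startswith(rule):
	            return True
	    return False

	# With only "Disallow: /" the empty selector slips through ...
	is_disallowed("", ["/"])              # False
	# ... listing both "/" and "" closes that gap.
	is_disallowed("", ["/", ""])          # True
	is_disallowed("/phlog/2019", ["/"])   # True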


How much of the unofficial robots.txt extensions is needed for gopher?
The more we allow, the more complex crawlers will become.

https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nonstandard_extensions

	Allow: /some/selector

Implementing this shouldn't be too hard; it is simple prefix matching.
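
One possible way to resolve Allow against Disallow (only an assumption,
borrowed from how some HTTP crawlers handle the extension, nothing
gopher-specific requires it) is that the longest matching prefix wins:

	# Sketch only: longest-match resolution of Allow/Disallow prefixes.
	def is_allowed(selector, rules):
	    # rules: list of ("allow" or "disallow", prefix) tuples
	    verdict, best = True, -1      # everything is allowed by default
	    for kind, prefix in rules:
	        if selector.startswith(prefix) and len(prefix) > best:
	            verdict, best = (kind == "allow"), len(prefix)
	    return verdict

	rules = [("disallow", "/cgi-bin"), ("allow", "/cgi-bin/public")]
	is_allowed("/cgi-bin/secret", rules)    # False
	is_allowed("/cgi-bin/public/x", rules)  # True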

	Crawl-delay: 10

This needs to be honoured anyway, since some sites already block
clients that send requests too fast.
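
On the crawler side this is just a per-host timestamp; a minimal sketch
(the 10 seconds are the value from above, the host name is only an
example):

	# Sketch only: honour Crawl-delay by sleeping between requests to a host.
	import time

	last_request = {}   # host -> time of the previous request

	def wait_for_slot(host, crawl_delay=10):
	    now = time.monotonic()
	    earliest = last_request.get(host, 0) + crawl_delay
	    if now < earliest:
	        time.sleep(earliest - now)
	    last_request[host] = time.monotonic()

	# wait_for_slot("bitreich.org")  # then connect to port 70 and send the selector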

	Sitemap: gopher://bitreich.org/0/example/sitemaps.xml

Enforce XML in gopher?

	Host: preferred-domain.com

We don't have virtual hosting, but many hosts could point to the same
gopherhole.


Instead of sitemaps, I want to propose a standard to store a mirror of the
menus of a gopherhole, which crawlers can download, to save space.

	User-Agent: *
	Disallow: /
	Disallow:
	Gophermap: gopher://bitreich.org/9/gophermap.zip

Why .zip? Because it gives us integrity checking for free; otherwise we
would also need some checksum mechanism over gopher.

	$ sacc gopher://bitreich.org/9/gophermap.zip > gophermap.zip
	$ unzip gophermap.zip
	$ cd bitreich.org/lawn/fun
	$ cat .index
	i       Err     bitreich.org    70
	i___________________________________F_U_N__________________________________	Err     bitreich.org    70
	i       Err     bitreich.org    70
	7Hypochondria disease search    /hypochondria   bitreich.org    70
	iFind the worst disease fitting to your current situation.      Err	bitreich.org    70

This means the crawler downloads the same raw content it would
otherwise get from the server.

The crawler must accept the content as-is and not crawl the server.
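
A crawler could then work roughly like this (a sketch, assuming the
host/selector/.index layout from the example above; the fetch itself is
left out, since it is just a normal gopher request):

	# Sketch only: check the archive via the zip CRCs and index the menus
	# it contains instead of crawling the live server.
	import zipfile

	def load_gophermap(path="gophermap.zip"):
	    menus = {}
	    with zipfile.ZipFile(path) as z:
	        if z.testzip() is not None:   # the integrity check we get from .zip
	            raise ValueError("corrupt gophermap archive")
	        for name in z.namelist():
	            if name.endswith("/.index"):
	                selector = name[:-len("/.index")]
	                menus[selector] = z.read(name).decode("utf-8", "replace")
	    return menus

	# for selector, menu in load_gophermap().items():
	#     index_menu(selector, menu)   # index_menu() is a hypothetical indexing step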

Any recommendations?


Sincerely,

Christoph Lohmann

