Re: please use robots.txt for your gopher apps
Here is the resend, so the mailing list can see it. Sorry, I did it wrong
again. Maybe fix the mailing list so the headers are set properly?
Hello Cameron,
On Fri, 17 May 2019 14:09:40 +0200 Cameron Kaiser <spectre@floodgap.com> wrote:
> I love gopher apps and seeing them, but it is very hard for V2's robot to
> automatically recognize them, requiring lots of manual work to pull stuff
> out of the index that should never have been there in the first place. Please
> use a robots.txt selector to keep the V2 robot out of these areas; I'm
> considering a policy requirement that sites to be accepted to the new servers
> page must have some sort of robots.txt up since this is becoming a (happy)
> problem.
thanks for bringing this topic up.
The basic standard is:
# Comments
User-Agent: *
Disallow: /some/selector
First of all, there is a difference between HTTP and gopher: an HTTP
GET path begins with »/«, but a gopher selector can be empty (no
characters at all).
Should
Disallow: /
stand for both
Disallow: /
and the empty selector
Disallow:
?
Gopher crawlers coming in via HTTP proxies will only ever see »/«, but
a real gopher crawler would then fail to match a gopherhole that, just
for fun, starts its selectors with some other character.
I am voting for a recommendation to list both »/« and »« (the empty
selector), since they are different.
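Under that recommendation, a robots.txt blocking a whole gopherhole
might look like this (a hypothetical example, not an agreed standard):

```
# Keep all crawlers out of this gopherhole.
User-Agent: *
Disallow: /
Disallow:
```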
How much of the unofficial robots.txt standard is needed for gopher?
The more that is allowed, the more complex crawlers will become.
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nonstandard_extensions
Allow: /some/selector
In implementations this shouldn't be too hard to do; it is simple prefix matching.
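The matching step could be sketched like this (hypothetical Python;
it assumes longest-prefix-wins between Allow and Disallow, as in the
common robots.txt extension, and treats the empty prefix as matching
every selector, including the empty one):

```python
def is_allowed(selector, rules):
    """Return True if `selector` may be fetched under `rules`.

    `rules` is a list of ("allow" | "disallow", prefix) tuples.
    The longest matching prefix wins; no match means allowed.
    """
    verdict = True
    best_len = -1
    for directive, prefix in rules:
        if selector.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            verdict = (directive == "allow")
    return verdict

# Disallow everything, but allow the /public subtree.
rules = [("disallow", ""), ("allow", "/public")]
print(is_allowed("/public/phlog", rules))  # True
print(is_allowed("", rules))               # False: empty prefix matches
```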
Crawl-delay: 10
This needs to be honoured anyway, since some sites already block
clients that send requests too quickly.
Sitemap: gopher://bitreich.org/0/example/sitemaps.xml
Enforce XML in gopher?
Host: preferred-domain.com
We don't have virtual hosting, but many hosts could point to the same
gopherhole.
Instead of sitemaps, I want to propose a standard to store a mirror of the
menus of a gopherhole, which crawlers can download, to save space.
User-Agent: *
Disallow: /
Disallow:
Gophermap: gopher://bitreich.org/9/gophermap.zip
Why .zip? Because the archive's per-file CRCs give us integrity
checking for free; otherwise we would also need some checksum scheme
over gopher.
$ sacc gopher://bitreich.org/9/gophermap.zip > gophermap.zip
$ unzip gophermap.zip
$ cd bitreich.org/lawn/fun
$ cat .index
i Err bitreich.org 70
i___________________________________F_U_N__________________________________ Err bitreich.org 70
i Err bitreich.org 70
7Hypochondria disease search /hypochondria bitreich.org 70
iFind the worst disease fitting to your current situation. Err bitreich.org 70
Which means you download the raw content you would otherwise get from
the server itself.
The crawler must accept that content as-is and not crawl the server.
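Since the mirror contains raw menus, the crawler only needs the usual
gopher menu parsing: one type character, the display string, then
tab-separated selector, host and port. A hypothetical parser for one
such line:

```python
def parse_menu_line(line):
    """Split one raw gopher menu line into its components."""
    itemtype, rest = line[0], line[1:].rstrip("\r\n")
    fields = rest.split("\t")
    # Pad missing fields so short info lines still parse.
    fields += [""] * (4 - len(fields))
    display, selector, host, port = fields[:4]
    return {"type": itemtype, "display": display,
            "selector": selector, "host": host, "port": port}

line = "7Hypochondria disease search\t/hypochondria\tbitreich.org\t70"
entry = parse_menu_line(line)
print(entry["type"], entry["selector"])  # 7 /hypochondria
```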
Any recommendations?
Sincerely,
Christoph Lohmann