
Re: please use robots.txt for your gopher apps



On Wednesday 22 May 2019 09:15,
Alex Schröder <kensanata@gmail.com> put forth the proposition:
> I'm interested in the details since my wiki is also available via
> gopher. I'm trying to understand what I should be adding exactly.
>
> First, the selector. Is it "robots.txt" or "/robots.txt"? On my site,
> the selectors don't start with a slash but I'm assuming we're going
> with a slash? Thus the correct place would be
> gopher://alexschroeder.ch:70/0/robots.txt
>
> Next, the content. I have some patterns I'd like to disallow, but I
> guess I made some choices regarding the selectors that will come back
> to haunt me when I look at how robots.txt works. For example, page
> history. That is not something that needs to be indexed.
>
> Here are some selectors for my About page:
> gopher://alexschroeder.ch:70/1About/menu (the entry point)
> gopher://alexschroeder.ch:70/0About (the plain text)
> gopher://alexschroeder.ch:70/1About/history (the list of old
> revisions, if available)
> gopher://alexschroeder.ch:70/1About/10/menu (revision 10)
> gopher://alexschroeder.ch:70/0About/10 (the plain text of revision 10)
>
> It was more or less designed with the idea that "up" would return you
> to a usable URL. Thus gopher://alexschroeder.ch:70/1About/history
> seemed more reasonable than
> gopher://alexschroeder.ch:70/1history/About. It doesn't quite work
> that way right now because gopher://alexschroeder.ch:70/1About has
> the wrong item type (but surely something could be added), whereas
> gopher://alexschroeder.ch:70/1history loses the context of the About
> page.
>
> Anyway, what I'm trying to say is that I'd like to have patterns such as these:
>
> Disallow: */history
> Disallow: */\d+
>
> What do you think? It definitely doesn't match what the WWW robots.txt
> does, I know.

I'm all for using regex patterns, as you have done in the second line
there, over simple globbing.
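
For instance, assuming the crawler does an unanchored regex match
against the bare selector (with the item type stripped), your two
patterns might come out as something like this; a sketch only, not a
settled syntax:

Disallow: .*/history
Disallow: .*/[0-9]+

Matched that way, they would catch About/history, About/10 and
About/10/menu while leaving About and About/menu crawlable.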

I think it would really depend on how much Cameron wants to build
into the crawler, and how much in the way of resources it would
use.

Should there also be an explicit ordering, or should it be assumed and hard-coded?

e.g. 

Order: allow, disallow
Allow: *
Disallow: */cache/*
Disallow: */history/*

etc.
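
To give a feel for how little logic that would actually take, here's a
rough sketch in Python of one way the ordering could be applied. It
assumes the rules have already been parsed into (verb, pattern) pairs
and that patterns are unanchored regexes; it's only an illustration of
the idea, not anything Cameron's crawler actually does:

import re

def is_allowed(selector, rules, order=("allow", "disallow")):
    # rules: list of ("allow" | "disallow", regex) pairs parsed from
    # robots.txt; the later group in the order wins on conflict, so
    # with ("allow", "disallow") a matching Disallow overrides a
    # matching Allow.
    verdict = True          # default: crawl everything
    for verb in order:
        for rule_verb, pattern in rules:
            if rule_verb == verb and re.search(pattern, selector):
                verdict = (verb == "allow")
    return verdict

# Rules loosely translated from the Allow/Disallow lines above.
rules = [
    ("allow", ".*"),
    ("disallow", ".*/cache/"),
    ("disallow", ".*/history"),
]

for sel in ("About/menu", "About/history", "About/10"):
    print(sel, "->", "crawl" if is_allowed(sel, rules) else "skip")

In a scheme like this the per-selector cost is a handful of regex
matches, so the resource question is really about how large people's
robots.txt files get.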

-Dave

> Cheers
> Alex
>


--

A person is just about as big as the things that make them angry.

