
Re: Stopping webcrawlers.



On Sunday 03 November 2019 16:51:16 Richard Hector wrote:

> On 4/11/19 10:05 AM, Gene Heskett wrote:
> > On Sunday 03 November 2019 12:37:14 john doe wrote:
> >> On 11/3/2019 6:26 PM, Gene Heskett wrote:
> >>> On Sunday 03 November 2019 11:56:52 Reco wrote:
> >>>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> >>>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
> >>>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> >>>>>>> Greetings all
> >>>>>>>
> >>>>>>> I am developing a list of broken webcrawlers who are
> >>>>>>> repeatedly downloading my entire web site including the hidden
> >>>>>>> stuff.
> >>>>>>>
> >>>>>>> These crawlers/bots are ignoring my robots.txt
> >>>>>>
> >>>>>> $ wget -O - https://www.shentel.com/robots.txt
> >>>>>> --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
> >>>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> >>>>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
> >>>>>> HTTP request sent, awaiting response... 403 Forbidden
> >>>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
> >>>>>>
> >>>>>> Allowing said bots to *see* your robots.txt would be a step
> >>>>>> into the right direction.
> >>>>>
> >>>>> But you are asking for shentel.com/robots.txt which is my isp.
> >>>>> You should be asking for
> >>>>>
> >>>>> http://geneslinuxbox.net:6309/gene/robots.txt
> >>>>
> >>>> Wow. You sir owe me a new set of eyes.
> >>>
> >>> Chuckle :) That was the default I'd picked up from someplace years
> >>> ago.
> >>>
> >>>> I advise you to compare your monstrosity to this (a hint - it
> >>>> does work) - [1].
> >>>>
> >>>> Reco
> >>>>
> >>>> [1] https://enotuniq.net/robots.txt
> >>>
> >>> I'll trim mine forthwith to the last entry.  I've wondered if that
> >>> was too long a list. And restart apache2 of course. But now I see
> >>> the next access is not a 200, but a 404, which is not intended. From
> >>> the access log:
> >>>
> >>> coyote.coyote.den:80 209.197.24.34 - -
> >>> [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4
> >>> HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64;
> >>> Trident/7.0; rv:11.0) like Gecko"
> >>>
> >>> that directory exists, shouldn't that have been a 200?
> >>
> >> The directory might exist but it is not accessible.
> >>
> >> --
> >> John Doe
> >
> > Universal read perms would be 444. does it need any more than that
> > to be downloadable?
>
> IIRC all the directories back to / need to be executable as well as
> readable, by the web server.
>
> Richard

That's impossible here as it's running in an ownership sandbox, all owned 
only by apache2. OTOH, as far as apache2 is concerned, / is 
the /var/www/html directory, and everything beyond that is owned by the 
same group apache2 is a member of.
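
If I've got the permission chain right, a quick way to check it from the 
server's side would be something like this - www-data as the Debian 
apache user and the path from that 404 above are my guesses here:

  # show owner and perms of every directory along the path
  namei -l /var/www/html/gene/lathe-stf/linuxcnc4rpi4
  # can the server user actually read and traverse it?
  sudo -u www-data ls -l /var/www/html/gene/lathe-stf/

Directories need r-x for whoever apache runs as (755, or 750 with the 
right group) and plain files need r before apache can serve them.
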
AIUI, in robots.txt, a more permissive rule goes above the one that 
apparently has most of the site locked out.
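
Something like this is the shape I have in mind, if I've got the syntax 
right (just using the /gene/lathe-stf/ path from the log above as an 
illustration):

  User-agent: *
  Allow: /gene/lathe-stf/
  Disallow: /gene/

As I read the spec, the longer (more specific) Allow wins over the 
shorter Disallow, so well-behaved crawlers could still fetch the 
lathe-stf stuff while the rest of /gene/ stays off limits.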

What I'd need is a test, probably not User-agent, but one that would match 
and allow the browsers + curl and wget from normal users in.  What would 
that rule look like? Or would that come under User-agent too?  These 
bots all seem to use GET.
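
As far as I can tell robots.txt never looks at the request method at 
all, GET or otherwise, and it's only advisory, so anything that ignores 
it has to be stopped by apache itself. A sketch of what I think that 
looks like on apache 2.4 (mod_setenvif), with "BadBot" standing in for 
whatever User-Agent strings keep showing up in the access log:

  SetEnvIfNoCase User-Agent "BadBot" bad_bot
  <Directory "/var/www/html/gene">
      <RequireAll>
          Require all granted
          Require not env bad_bot
      </RequireAll>
  </Directory>

Normal browsers plus curl and wget from real users should sail through 
since they don't match the pattern, if I've read the docs right.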

Thanks Richard.

Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>

