
Re: Stopping webcrawlers.



On Sunday 03 November 2019 12:37:14 john doe wrote:

> On 11/3/2019 6:26 PM, Gene Heskett wrote:
> > On Sunday 03 November 2019 11:56:52 Reco wrote:
> >> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> >>> On Sunday 03 November 2019 10:23:50 Reco wrote:
> >>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> >>>>> Greetings all
> >>>>>
> >>>>> I am developing a list of broken webcrawlers who are repeatedly
> >>>>> downloading my entire web site including the hidden stuff.
> >>>>>
> >>>>> These crawlers/bots are ignoring my robots.txt
> >>>>
> >>>> $ wget -O - https://www.shentel.com/robots.txt
> >>>> --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
> >>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> >>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
> >>>> HTTP request sent, awaiting response... 403 Forbidden
> >>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
> >>>>
> >>>> Allowing said bots to *see* your robots.txt would be a step in
> >>>> the right direction.
> >>>
> >>> But you are asking for shentel.com/robots.txt which is my isp.
> >>> You should be asking for
> >>>
> >>> http://geneslinuxbox.net:6309/gene/robots.txt
> >>
> >> Wow. You sir owe me a new set of eyes.
> >
> > Chuckle :) That was the default I'd picked up from someplace years
> > ago.
> >
> >> I advise you to compare your monstrosity to this (a hint - it does
> >> work) - [1].
> >>
> >> Reco
> >>
> >> [1] https://enotuniq.net/robots.txt
> >
> > I'll trim mine forthwith to the last entry. I've wondered if that
> > was too long a list. And restart apache2 of course. But now I see
> > the next access is not a 200 but a 404, which is not intended. From
> > the access log:
> >
> > coyote.coyote.den:80 209.197.24.34 - -
> > [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4
> > HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64;
> > Trident/7.0; rv:11.0) like Gecko"
> >
> > that directory exists, shouldn't that have been a 200?
>
> The directory might exist but it is not accessible.
>
> --
> John Doe
Universal read perms would be 444. Does it need any more than that to be
downloadable?
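
To make the question concrete, here is a minimal sketch, assuming the
web server reaches the files through the "other" permission bits (that
depends on which user apache2 runs as and who owns the files, so treat
it as an assumption). On Unix a directory generally needs the execute
(search) bit as well as read before anything inside it can be reached,
so 444 covers a plain file but not a directory. The helper name
world_accessible is hypothetical, made up for this illustration:

```python
import stat


def world_accessible(mode: int, is_dir: bool) -> bool:
    """Return True if 'other' users can read a file, or read and enter a directory.

    Assumption: the server process is neither the owner nor in the group,
    so only the 'other' permission bits apply.
    """
    readable = bool(mode & stat.S_IROTH)
    if not is_dir:
        return readable
    # Directories also need the execute (search) bit to be traversed.
    return readable and bool(mode & stat.S_IXOTH)


print(world_accessible(0o444, is_dir=False))  # plain file with mode 444
print(world_accessible(0o444, is_dir=True))   # directory with mode 444
print(world_accessible(0o755, is_dir=True))   # directory with mode 755
```

This only looks at the mode bits of one path; every parent directory on
the way down would need the same search bit, and the server's own
configuration can still refuse the request.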

Thanks John.


Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>

