
Re: Stopping webcrawlers.



On Sunday 03 November 2019 11:56:52 Reco wrote:

> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> > On Sunday 03 November 2019 10:23:50 Reco wrote:
> > > On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> > > > Greetings all
> > > >
> > > > I am developing a list of broken webcrawlers who are repeatedly
> > > > downloading my entire web site including the hidden stuff.
> > > >
> > > > These crawlers/bots are ignoring my robots.txt
> > >
> > > $ wget -O - https://www.shentel.com/robots.txt
> > > --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
> > > Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> > > Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
> > > HTTP request sent, awaiting response... 403 Forbidden
> > > 2019-11-03 15:22:36 ERROR 403: Forbidden.
> > >
> > > Allowing said bots to *see* your robots.txt would be a step in
> > > the right direction.
> >
> > But you are asking for shentel.com/robots.txt which is my isp.
> > You should be asking for
> >
> > http://geneslinuxbox.net:6309/gene/robots.txt
>
> Wow. You sir owe me a new set of eyes.

Chuckle :) That was the default I'd picked up from someplace years ago.

> I advise you to compare your monstrosity to this (a hint - it does
> work) - [1].
>
> Reco
>
> [1] https://enotuniq.net/robots.txt

I'll trim mine forthwith to the last entry.  I've wondered if that list
was too long. And restart apache2 of course. But now I see the next
access is not a 200 but a 404, which is not intended. From the access log:

coyote.coyote.den:80 209.197.24.34 - - 
[03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4 
HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; 
rv:11.0) like Gecko"

That directory exists, so shouldn't that have been a 200?
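
For anyone who wants to pull the fields out of an entry like that one, here
is a minimal sketch in Python. It assumes the line follows Apache's
vhost_combined layout, which is only a guess from the shape of the entry
above, and the names LOG_PATTERN and parse_entry are made up for this
example. It only shows which path and status were logged; it does not
explain why the server answered the way it did.

```python
import re

# Assumed layout (vhost_combined): vhost:port client ident user [time]
# "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<vhost>\S+):(?P<port>\d+) (?P<client>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return the named fields of one log line, or None if it does not match."""
    # Collapse runs of whitespace so an entry wrapped by a mail client still parses.
    match = LOG_PATTERN.match(" ".join(line.split()))
    return match.groupdict() if match else None


# The entry quoted above, rejoined into a single string.
entry = parse_entry(
    'coyote.coyote.den:80 209.197.24.34 - - '
    '[03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4 '
    'HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; '
    'rv:11.0) like Gecko"'
)
print(entry["path"], entry["status"])
```

The path field is the URL exactly as the client requested it, which here has
no trailing slash, so that string is the one to compare against the
directory name on disk.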

Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>

