
Re: Stopping webcrawlers.



On 4/11/19 10:05 AM, Gene Heskett wrote:
> On Sunday 03 November 2019 12:37:14 john doe wrote:
> 
>> On 11/3/2019 6:26 PM, Gene Heskett wrote:
>>> On Sunday 03 November 2019 11:56:52 Reco wrote:
>>>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
>>>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
>>>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
>>>>>>> Greetings all
>>>>>>>
>>>>>>> I am developing a list of broken web crawlers that are
>>>>>>> repeatedly downloading my entire web site, including the hidden
>>>>>>> stuff.
>>>>>>>
>>>>>>> These crawlers/bots are ignoring my robots.txt
>>>>>>
>>>>>> $ wget -O - https://www.shentel.com/robots.txt
>>>>>> --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
>>>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
>>>>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
>>>>>> HTTP request sent, awaiting response... 403 Forbidden
>>>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
>>>>>>
>>>>>> Allowing said bots to *see* your robots.txt would be a step in
>>>>>> the right direction.
>>>>>
>>>>> But you are asking for shentel.com/robots.txt, which is my ISP.
>>>>> You should be asking for
>>>>>
>>>>> http://geneslinuxbox.net:6309/gene/robots.txt
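>>>>>
>>>>> A quick way to verify it from outside (assuming the port is
>>>>> reachable from wherever you test) would be something like:
>>>>>
>>>>>   wget -O - http://geneslinuxbox.net:6309/gene/robots.txt
>>>>>
>>>>> which should return the file with a 200 rather than a 403.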
>>>>
>>>> Wow. You, sir, owe me a new set of eyes.
>>>
>>> Chuckle :) That was the default I'd picked up from someplace years
>>> ago.
>>>
>>>> I advise you to compare your monstrosity to this (a hint - it does
>>>> work) - [1].
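>>>>
>>>> For reference, the minimal robots.txt that tells every compliant
>>>> crawler to stay out of everything is just two lines:
>>>>
>>>>   User-agent: *
>>>>   Disallow: /
>>>>
>>>> Only well-behaved bots honour it, though; crawlers as broken as
>>>> the ones you're describing ignore robots.txt entirely, so keeping
>>>> them out takes firewall rules or Apache access control instead.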
>>>>
>>>> Reco
>>>>
>>>> [1] https://enotuniq.net/robots.txt
>>>
>>> I'll trim mine forthwith to the last entry; I've wondered if that
>>> was too long a list. And restart apache2, of course. But now I see
>>> the next access is not a 200 but a 404, and that's not intended.
>>> From the access log:
>>>
>>> coyote.coyote.den:80 209.197.24.34 - -
>>> [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4
>>> HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64;
>>> Trident/7.0; rv:11.0) like Gecko"
>>>
>>> That directory exists, so shouldn't that have been a 200?
>>
>> The directory might exist but it is not accessible.
>>
>> --
>> John Doe
> Universal read perms would be 444. Does it need any more than that to
> be downloadable?

IIRC, every directory in the path back to / needs to be executable
(searchable) as well as readable by the web server.
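
A quick way to see where the chain breaks is namei, which lists the
permissions on every component of a path (substitute your real
document root; the path below is only a guess):

  namei -l /var/www/gene/lathe-stf/linuxcnc4rpi4

Every directory in that listing needs at least r-x for the user Apache
runs as (www-data on Debian), so something like

  chmod o+rx /var/www/gene/lathe-stf

on whichever component lacks it should fix access. That said, a pure
permission problem usually shows up as a 403; a 404 suggests the URL
isn't mapping to the directory you think it is.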

Richard
