
Re: swamp rat bots Q



Just goes to show: if you whinge hard enough and pretend to be useless, it will irritate someone enough to do your work for you.


You didn't even bother to check your robots.txt and complained about some evil bot all year. Priceless.



On 5/12/20 8:37 am, Tixy wrote:
On Fri, 2020-12-04 at 17:06 -0500, Gene Heskett wrote:
On Friday 04 December 2020 16:14:29 Tixy wrote:

On Fri, 2020-12-04 at 14:51 -0500, Gene Heskett wrote:
On Friday 04 December 2020 12:39:24 Reco wrote:
       Hi.

On Fri, Dec 04, 2020 at 08:39:42AM -0500, Gene Heskett wrote:
But I asked specifically how to enable it for one bot, and I've
asked that question several times, getting smoke-and-mirror
answers you all assume are helpful, but which are useless to a
new user installing the now 7-years-old and long out of date
package that in effect has no "how it works" docs. I asked 3
questions in the previous day or so, and no one has actually
attempted to answer even one of them. Here is one line from
that log, from the bot I just blocked:

coyote.coyote.den:80 192.99.6.226 - -
[04/Dec/2020:07:18:20 -0500] "GET
/gene/toolshed/c3/build/win32/prep/?C=S;O=D HTTP/1.1" 200 673
"-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
http://mj12bot.com/)"
Taken directly from the link.

Bot Type         Good crawler (always identifies itself)
IP Range         Distributed, Worldwide
Obeys Robots.txt *Yes*
Sorry, they do not; they've read it and ignored it 428 times in
the life of that log, which I zeroed out around 1 July of this year.
Why would they read it if they were going to just ignore it?
Perhaps your robots.txt is broken? Hint: it is, in 2 or 3 different
ways I can see (if it's http://geneslinuxbox.net:6309/robots.txt
we're talking about). That file doesn't have any syntactically
correct entry in it for blocking that bot.
And what might that look like? I'll fix it right now.
OK, I'll do your proofreading...

At the end of the robots.txt you are missing a colon from a rule that
disallows everything for all bots...

User-agent *
Disallow: /

That should be:

User-agent: *
Disallow: /
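If you want to see what that missing colon actually does, Python's stdlib robots.txt parser (urllib.robotparser) is a handy sanity check; a rough sketch (the hostname below is a placeholder, not Gene's real server):

```python
# Sketch: how a robots.txt parser treats the missing-colon line.
# The hostname is a made-up example, not the server from the thread.
from urllib.robotparser import RobotFileParser

broken = RobotFileParser()
broken.parse([
    "User-agent *",   # missing colon -- the parser skips this line,
    "Disallow: /",    # ...so this Disallow has no User-agent and is dropped too
])

fixed = RobotFileParser()
fixed.parse([
    "User-agent: *",
    "Disallow: /",
])

print(broken.can_fetch("MJ12bot", "http://example.net/"))  # True: nothing blocked
print(fixed.can_fetch("MJ12bot", "http://example.net/"))   # False: everything blocked
```

In other words, the malformed rule doesn't half-work, it's ignored entirely, which would explain a bot crawling as if no rule existed.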

But if you just want to disable the bot you reckon is a problem, the
front page of their site (https://mj12bot.com/) says you want:

User-agent: MJ12bot
Disallow: /

Or you could read their page to see the robots.txt syntax for slowing
down crawling, which I assume is relevant to other bots you may have
problems with.
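That MJ12bot rule can be checked the same way with urllib.robotparser. One gotcha worth noting: it's the short bot token from the User-agent line that gets matched, not the full browser-style string you see in the access log (hostname and path here are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: MJ12bot",
    "Disallow: /",
])

# Match against the short bot token, not the full
# "Mozilla/5.0 (compatible; MJ12bot/...)" string from the log.
print(rp.can_fetch("MJ12bot", "http://example.net/gene/toolshed/"))  # False: blocked
print(rp.can_fetch("SomeOtherBot", "http://example.net/"))           # True: others unaffected
```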

The other rules above your disallow-everything rule (which are
superfluous if you keep that final rule) also have typos; you have a '0' here...

User-0agent: *
Disallow: /doc/

And this rule has a space in the URL...

User-agent: *
Disallow: stress test

I'm pretty sure URLs can't have actual space characters in them, and
that must be a typo on your part. Also, something I read when looking
at this issue a few hours ago (but can't find again) reckoned that
Google's bot lets you have multiple statements on a line separated by
spaces, e.g.

Disallow: foo Disallow: bar

So it seems likely that having a space in the URL like you have isn't
legal, and could possibly upset parsing.

--
.....I'VE GOT BLISTERS ON MY FINGERS!.....

