
Re: swamp rat bots Q



On Fri, 2020-12-04 at 17:06 -0500, Gene Heskett wrote:
> On Friday 04 December 2020 16:14:29 Tixy wrote:
> 
> > On Fri, 2020-12-04 at 14:51 -0500, Gene Heskett wrote:
> > > On Friday 04 December 2020 12:39:24 Reco wrote:
> > > >       Hi.
> > > >
> > > > On Fri, Dec 04, 2020 at 08:39:42AM -0500, Gene Heskett wrote:
> > > > > But I asked specifically how to enable it for one bot, and
> > > > > I've asked that question several times, getting smoke and
> > > > > mirror answers you all assume are helpfull, but which are
> > > > > useless to a new user installing the now 7 years old and long
> > > > > out of date package that in effect has no "how it works" docs.
> > > > > I asked 3 questions in a previous day or so timeline, and no
> > > > > one has actually attempted to actually answer even one of
> > > > > them. Here is one line from that log: and that I just blocked:
> > > > >
> > > > > coyote.coyote.den:80 192.99.6.226 - -
> > > > > [04/Dec/2020:07:18:20 -0500] "GET
> > > > > /gene/toolshed/c3/build/win32/prep/?C=S;O=D HTTP/1.1" 200 673
> > > > > "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
> > > > > http://mj12bot.com/)"
> > > >
> > > > Taken directly from the link.
> > > >
> > > > Bot Type         Good crawler (always identifies itself)
> > > > IP Range         Distributed, Worldwide
> > > > Obeys Robots.txt *Yes*
> > >
> > > Sorry, they do not, they've read it and ignored it 428 times in
> > > the life of that log which I zeroed out around 1 July of this
> > > year.
> >
> > Why would they read it if they we're going to just ignore it,
> > perhaps your robots.txt is broken? Hint, it is, in 2 or 3 different
> > ways I can see (if it's http://geneslinuxbox.net:6309/robots.txt
> > we're talking about). That file doesn't have any syntactically
> > correct entry in there for blocking that bot.
> 
> And what might that be like, I'll fix it right now

OK, I'll do your proofreading...

At the end of the robots.txt you are missing a colon from a rule that
disallows everything for all bots...

User-agent *
Disallow: /

That should be:

User-agent: *
Disallow: /

But if you just want to disable the bot you reckon is a problem, the
front page of their site (https://mj12bot.com/) says you want:

User-agent: MJ12bot
Disallow: / 

Or you could read their page to see the robots.txt syntax for slowing
down crawling, which I assume is also relevant to other bots you may
have problems with.

The other rules above your disallow-everything rule (which are
superfluous if you keep that final rule) also have typos; you have a
'0' here...

User-0agent: *
Disallow: /doc/

And this rule has a space in the URL...

User-agent: *
Disallow: stress test

I'm pretty sure URLs can't contain literal space characters, so that
must be a typo on your part. Also, something I read when looking at
this issue a few hours ago (but can't find again) reckoned that
Google's bot lets you have multiple statements on a line separated by
spaces, e.g.

Disallow: foo Disallow: bar

So it seems likely that having a space in the URL like you have isn't
legal, and could possibly upset parsing.
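If you want to sanity-check the fixed file, Python's standard library
ships a robots.txt parser (urllib.robotparser) you can feed sample
rules to. This is just a sketch with a made-up file and example.com
URLs, not your actual robots.txt:

```python
# Check how a standards-following parser reads a set of robots.txt
# rules, using Python's stdlib urllib.robotparser. The file contents
# below are an assumed example, not the real geneslinuxbox.net file.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /doc/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# MJ12bot is blocked from everything...
print(rp.can_fetch("MJ12bot", "http://example.com/gene/toolshed/"))   # -> False
# ...other agents are only blocked under /doc/
print(rp.can_fetch("SomeOtherBot", "http://example.com/doc/x"))       # -> False
print(rp.can_fetch("SomeOtherBot", "http://example.com/gene/"))       # -> True
```

If the parser says a bot is still allowed where you think you blocked
it, the rule has a syntax problem like the ones above.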

-- 
Tixy