Re: rogue Chinese crawler

OK, I've now been 24 hours without a hit, so I'm presuming I've got rid
of all the crawlers.

Thanks for all the help and advice from both lists.


- - the openfind.com(.tw) 'bots don't respect the norobots conventions, so
your robots.txt is useless, whatever its contents.
(In fact, these 'bots don't even look for it.)

- - the 'bots are no respecters of any other conventions, either -- they
will stay on your machine for unlimited amounts of time, doing a
constant recursive grab, and gradually freezing out all other activity.
(Worst case on my machine -- 45 minutes)

- - there are 16 'bots, none of which knows what the others are doing.
This means you can have any number on your machine at any one time, each
progressively slowing down the system.
(Worst case on my machine -- 8 simultaneously, for over 30 minutes.)
This can create a virtual DoS attack, or "paralysis" of services.

- - they may not all come from the same address -- I've currently got two
addresses in my rules/directives to drop packets.  (Monitor where
they're coming from.)

- - the originators do NOT reply to e-mails or polite requests to fix
their code to respect the norobots conventions.
They DO respond to abusive e-mails by bouncing any further attempts at
communication with them.

- - as pointed out by almost everyone, the best method of dissuasion is to
drop all packets from these sources as they come in.
Failing that, a "Deny from" directive in httpd.conf fixes them good.

- - if the above is implemented, it takes a while for all the 'bots to
learn they're not welcome.
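
For anyone who wants to follow the "monitor where they're coming from"
advice above, here is a rough sketch in Python rather than anything I
actually ran. It assumes your access log is in Common Log Format (client
address as the first field); the function names and the threshold are my
own inventions for illustration. It counts requests per client and
formats a "Deny from" line for each heavy hitter.

```python
import re
from collections import Counter

# Start of a Common Log Format line: client address, two more fields,
# then the opening bracket of the timestamp. (Assumed log format.)
CLF_PREFIX = re.compile(r'^(\S+) \S+ \S+ \[')


def count_hits(lines):
    """Return a Counter mapping client address -> number of requests."""
    hits = Counter()
    for line in lines:
        match = CLF_PREFIX.match(line)
        if match:  # silently skip lines that are not in the assumed format
            hits[match.group(1)] += 1
    return hits


def heavy_clients(hits, threshold):
    """Return addresses with at least `threshold` requests, busiest first."""
    return [addr for addr, n in hits.most_common() if n >= threshold]


def deny_lines(addresses):
    """Format one 'Deny from' line per address, ready for httpd.conf."""
    return ["Deny from " + addr for addr in addresses]
```

To use it, open the log, pass the file object to count_hits(), pick a
threshold that is well above what a normal visitor generates, and print
the result of deny_lines(heavy_clients(hits, threshold)).  Check the list
by eye before pasting it anywhere, since a busy proxy or a well-behaved
spider can also rack up a high count.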

I don't mind well-behaved spiders -- in fact, I welcome them, as no-one
would be able to find some of my pages otherwise -- but these ones go
beyond what is tolerable behaviour for me.  I don't know whether it's
due to bad code, or a "don't care" attitude to others; but I would
advise anyone who finds them clogging up their system to ban them.

