
Re: Venting about forums.debian.net



On Tue, Jan 20, 2026 at 12:03:38 +0000, Steve McIntyre wrote:
> svetlana@members.fsf.org wrote:
> >Are they not required to follow do_not_track http headers or robots.txt ? If LLM robots do not obey these
> >instructions, they should be probably reported to their hosting provider.
> 
> Hahahaha. No.
> 
> The current crop of LLM morons do not care at all about following
> accepted rules or norms. They just want to grab all the data, screw
> everybody else. They're ignoring robots.txt, so service admins started
> blocking netblocks some time ago.
> 
> Now we have the LLM morons using botnets to evade those blocks. We
> have massive amounts of downloads coming from random residential IPs
> all over the world, carefully spread out to make it more difficult to
> block them.
> 
> These morons are why we can't have nice things.

I can confirm this.  My own wiki was slammed really hard by this, and
I ended up having to take some fairly substantial steps to limit the
availability of certain "pages".

The issue isn't even that the LLM bots are harvesting every wiki page.
If it were only that, I wouldn't mind.  The first problem is that
wikis let you request the difference between any two revisions of a
page.  So, let's say a page has 100 revisions.  You can request the
diff between revision 11 and revision 37.  Or the diff between
revision 14 and revision 69.  And so on, and so on.  That's
100*99/2 = 4950 possible diffs for just that one page.

What happens is the bots request *every single combination* of these
diffs, each one from a random IP address, often (but not always) with
a falsified user-agent header.

I've blocked all the requests that present a robotic user-agent, but
there's really nothing I can do about the ones that masquerade as
Firefox or whatever, short of taking it a step further and blocking
all requests that ask for a diff.  I haven't had to do that yet.
Maybe the LLM herders have finally put *some* thought into what
they're doing and reduced the stupidity level...?  Dunno.

Compounding that, MoinMoin has some sort of bizarre calendar thing
that I've never used and don't really understand.  But apparently
there's a potential page for every single date in a range that spans
multiple centuries.  I've deleted all of those pages *multiple* times,
but spam bots got those pages into their "try to edit" caches, so they
kept coming back.  Meanwhile, the LLM harvesters got those pages into
their "try to fetch" caches, so they would keep requesting them, even
though the pages didn't exist any longer.

So, another step I had to take was to block every request for one of
those calendar pages at the web server level, before it could even
reach the wiki engine.

So, I've got this:

# less /etc/nginx/conf.d/badclient.conf 
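# Known scraper user-agents; the server config below returns 403 for these.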
map $http_user_agent $badclient {
        default                 0;

        "~BLEXBot/"             1;
        "~ClaudeBot/"           1;
        "~DotBot/"              1;
        "~facebookexternalhit/" 1;
        "~PetalBot;"            1;
        "~SemrushBot"           1;
        "~Thinkbot/"            1;
        "~Twitterbot/"          1;

        "~BadClient/"           2;
}

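# The deleted calendar pages and other paths nobody should be fetching.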
map $request_uri $badrequest {
        default                 0;
        "~/MonthCalendar/"      1;
        "~/MonthCalendar\?"     1;
        "~/SampleUser/"         1;
        "~/WikiCourse/"         1;
        "~/WikiKurs/"           1;
}

And this:

# less /etc/nginx/sites-enabled/mywiki.wooledge.org 
server {
    listen 80;
    listen 443 ssl;
    server_name mywiki.wooledge.org;

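    # A bare "return" is one of the few things that's safe inside "if".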
    if ($badclient) {
        return 403;
    }
    if ($badrequest) {
        return 403;
    }
    ...
}

So far, these changes (combined with the brute-force removal of the
MonthCalendar et al. pages) have been sufficient.

