Bug#569191: marked as done (crawler not allowed to perform ?action=raw)
Your message dated Tue, 11 May 2010 08:26:51 +0200
with message-id <1273559211.3503.222.camel@solid.paris.klabs.be>
and subject line [Debian Wiki] crawler not allowed to perform ?action=raw
has caused the Debian Bug report #569191,
regarding crawler not allowed to perform ?action=raw
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)
--
569191: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=569191
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
- To: Debian Bug Tracking System <submit@bugs.debian.org>
- Subject: libwww-perl: GET behavior changed in squeeze: URL stopped working
- From: "Andreas B. Mundt" <andi.mundt@web.de>
- Date: Wed, 10 Feb 2010 18:16:03 +0100
- Message-id: <20100210171603.5858.26573.reportbug@flashgordon>
Package: libwww-perl
Version: 5.834-1
Severity: important
Hi,
we use GET to download a wiki page and process the data further to
build the Debian Edu manual. The command:
GET "http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw"
works fine on Lenny, but stopped working on Squeeze, where "You are not
allowed to access this!" is returned. If you remove "?action=raw" from
the URL, everything is fine. Is this intended, and do we have to provide
a header?
Regards,
Andi
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.32-nouveau.git (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages libwww-perl depends on:
ii libhtml-parser-perl 3.64-1 collection of modules that parse H
ii libhtml-tagset-perl 3.20-2 Data tables pertaining to HTML
ii libhtml-tree-perl 3.23-1 represent and create HTML syntax t
ii liburi-perl 1.52-1 module to manipulate and access UR
ii netbase 4.40 Basic TCP/IP networking system
ii perl 5.10.1-9 Larry Wall's Practical Extraction
Versions of packages libwww-perl recommends:
ii libhtml-format-perl 2.04-2 format HTML syntax trees into text
ii libio-compress-perl 2.022-1 IO::Compress modules
ii libmailtools-perl 2.05-1 Manipulate email in perl programs
ii perl [libio-compress-perl] 5.10.1-9 Larry Wall's Practical Extraction
Versions of packages libwww-perl suggests:
ii libcrypt-ssleay-perl 0.57-2 Support for https protocol in LWP
ii libio-socket-ssl-perl 1.31-1 Perl module implementing object or
-- debconf-show failed
--- End Message ---
--- Begin Message ---
retitle 569191 crawler not allowed to perform ?action=raw
thanks
Andreas B. Mundt wrote:
> we use GET to download a wiki page and process the data further to
> build the Debian Edu manual. The command:
> GET "http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw"
> works fine on Lenny, but stopped working on Squeeze, where "You are not
> allowed to access this!" is returned. If you remove "?action=raw" from
> the URL, everything is fine. Is this intended, and do we have to provide
> a header?
Damyan Ivanov wrote:
> On Lenny (works)
> ================
> User-Agent: lwp-request/0.810
>
> On Sid (breaks)
> ===============
> User-Agent: lwp-request/5.834 libwww-perl/5.834
Yes, this is moinmoin standard behavior.
The wiki engine has surge-protection mechanisms to prevent web
crawlers (and users) from DoS'ing the wiki.
Well-known web crawlers (including libwww-perl/*) are only allowed to
fetch rendered HTML pages.
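For context, this behavior comes from MoinMoin's crawler detection in
wikiconfig.py. The sketch below is illustrative only, NOT the actual
wiki.debian.org configuration; the site name and the abbreviated regex
are placeholders:

```python
# wikiconfig.py (MoinMoin 1.x) -- illustrative sketch.
# User agents matching the ua_spiders regex are classified as crawlers
# and restricted to rendered HTML pages, which is why ?action=raw is
# refused once lwp-request started sending "libwww-perl" in its
# User-Agent string (as of libwww-perl 5.8xx).
from MoinMoin.config.multiconfig import DefaultConfig

class Config(DefaultConfig):
    sitename = u'Example Wiki'  # placeholder
    # Abbreviated; MoinMoin ships a much longer default list.
    ua_spiders = 'crawler|googlebot|libwww-perl|slurp|spider|wget'
```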
As mentioned, you should change your crawler's User-Agent string to
something meaningful, so the admins can get in touch with you rather
than simply blacklisting the "offending" IPs.
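A minimal sketch of that fix on the client side, assuming the script
uses libwww-perl directly; the agent name and contact address are
placeholders, not anything the wiki admins have sanctioned:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Descriptive agent string (placeholder name and contact address), so
# the wiki admins can identify and reach the operator instead of
# blacklisting the IP.
my $ua = LWP::UserAgent->new(
    agent   => 'DebianEduManualFetch/1.0 (mailto:someone@example.org)',
    timeout => 30,
);

my $url = 'http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw';
my $res = $ua->get($url);
if ($res->is_success) {
    print $res->decoded_content;
} else {
    warn 'fetch failed: ' . $res->status_line . "\n";
}
```

The same effect can be had without code changes by wrapping GET in a
script that sets a custom agent, but setting it explicitly on the
LWP::UserAgent object keeps the contact information in one place.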
Thanks,
Franklin
--- End Message ---