This bug was very serious, almost fatal, for me recently, and I
thought I would share my story to emphasize that, for me, this was not
a 'wishlist'-severity bug.
I research Tor black-markets (see http://www.gwern.net/Silk%20Road )
because I am interested in them from economic, historical, and
statistical perspectives. Black-markets are dangerous, risky
enterprises, even when run as Tor hidden-services, and so people like
me or Nicolas Christin often download or spider them so as to have
copies to analyze later.
In October 2013, Silk Road was famously busted (to everyone's complete
surprise). Fortunately, the FBI seizure left the SR forums alone, and
it became a top priority for me to grab a copy of the forums while I
still could since they would be invaluable in the post-mortem of SR
and the wave of arrests everyone expected to follow the bust. The wget
spider of the public forum went fine. But even more importantly, I
needed to get a copy of the members-only subforum, the Vendor
Roundtable, where all the Silk Road drug dealers talked shop and, as
it turned out, had discovered some of the early bits and pieces of how
Silk Road/Ross Ulbricht was busted.
I'm not a drug dealer, but I know a few of the SR ones and was able to
get login credentials. I logged in, checked that I had access to the
Roundtable, exported my cookies, and read the wget man page for
`--reject`:

    -R rejlist --reject rejlist
        Specify comma-separated lists of file name suffixes or
        patterns to accept or reject. Note that if any of the
        wildcard characters, *, ?, [ or ], appear in an element of
        acclist or rejlist, it will be treated as a pattern, rather
        than a suffix.

Perfect. Exactly what I needed to avoid being logged out. I threw in a
`--reject '*logout*'` to cover all possible logout links, and kicked
the spider off. I watched for a few minutes, everything looked like it
was going fine with no suspicious 'index.php?logout' files showing up
or anything, and I went off to deal with other aspects of breaking

Two days later, the spider was still running (it's a very big forum and
Tor has high latency), and I needed to check a particular claim about
a Roundtable thread. No problem, I had a copy of the Roundtable - I'd
just check that. NOPE. The thread wasn't there at all. In fact, almost
*nothing* in the Roundtable had been downloaded at all!
I panicked. No one knew why the FBI hadn't shut down the forums, who
was running them, or when they would disappear into the digital ether.
Christin wasn't spidering the Roundtable, and I was it. If I didn't
have a copy, then likely, no one did. It would be gone permanently.
Luckily, the forums were still up... but for how long? Minutes, hours,
or days? What had gone wrong and how could I fix it?
I logged in again, exported cookies, restarted, checked in a few
hours. No Roundtable. WTF?! I logged in, exported, restarted, watched
closely... I spotted in the stream a mention of 'index.php?logout'.
But why? I went back to the `--reject` documentation. Had I called it
wrong? Made a syntax error? Did `--reject` not do what it was supposed
to do? But the documentation is perfectly clear: `--reject` rejects
URLs from being downloaded. It doesn't do something remotely as absurd
as download a URL and then delete it! There is no use case for that in
combination with rejecting URLs, it's trivially broken for many
use cases, and it would *definitely* be documented in the manpage.
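
To make that expectation concrete, here is a minimal Python sketch of
what "rejecting a URL" would mean if the filter ran *before* the
request. This is not wget's code or its actual algorithm; the names
`is_rejected` and `crawl`, the `forum.example` host, and the fake site
in the demo are all made up for illustration. The only assumption
carried over from the story is a shell-style pattern like `*logout*`
matched against the whole URL.

```python
from fnmatch import fnmatchcase


def is_rejected(url, reject_patterns):
    """Return True if the full URL matches any shell-style reject pattern."""
    return any(fnmatchcase(url, pattern) for pattern in reject_patterns)


def crawl(start_urls, extract_links, fetch, reject_patterns):
    """Breadth-first crawl that filters URLs *before* requesting them.

    `fetch(url)` returns a page body and `extract_links(body)` returns
    the URLs found in it. Both are supplied by the caller (hypothetical
    helpers), so this sketch performs no network I/O itself.
    """
    seen = set()
    queue = list(start_urls)
    requested = []
    while queue:
        url = queue.pop(0)
        if url in seen or is_rejected(url, reject_patterns):
            continue  # a rejected URL is never requested at all
        seen.add(url)
        requested.append(url)
        queue.extend(extract_links(fetch(url)))
    return requested


if __name__ == "__main__":
    # Fake two-page forum: each "body" is simply the list of links on it.
    # There is deliberately no entry for the logout URL, so fetching it
    # would raise KeyError.
    site = {
        "http://forum.example/index.php": [
            "http://forum.example/index.php?board=1",
            "http://forum.example/index.php?logout",
        ],
        "http://forum.example/index.php?board=1": [],
    }
    pages = crawl(
        ["http://forum.example/index.php"],
        extract_links=lambda body: body,
        fetch=lambda url: site[url],
        reject_patterns=["*logout*"],
    )
    print(pages)
```

The point of the design is only the ordering: the pattern check sits
in front of `fetch`, so a logout link can never end the session. A
download-then-delete scheme has already sent the request by the time
it discards the file.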
I went back, logged in... Repeat 5 or 10 times with various
invocations of `--reject` and regexps and escalating blood pressure,
until I checked the downloaded pages and resigned myself to the
conclusion that somehow or other, though I couldn't begin to explain
either the how or the why, wget was logging itself out of the forums.
As absurd as it
sounded, nothing else fit the evidence.
I started googling 'wget reject', only to discover this bug report,
among others.
Oh how I raged that night. 'principle of least surprise', 'betrayal',
'crime against posterity', 'moronic', 'deliberately malicious', 'what
the hell', and more indelicate phrases were uttered.
I was also not pleased to discover that, `--reject` aside, there was
apparently no way whatsoever to genuinely reject URLs inside wget.
Eventually, I rigged up a hack where I pointed wget to Privoxy, and
wrote Privoxy rules to block certain URLs including the logout links.
It's ugly, it's not easy to modify, I'm not really familiar with
Privoxy syntax, but at least it does, in fact, work. And I was able to
get a good chunk of the Roundtable before the forums went down. (Not
all of it, but that's another story which is not wget's but the forum
software's fault - I think.)
Summary: `--reject` is a problem. It can't be *that* hard to fix;
there are short patches floating around. Please fix it.