Re: The "clean out spam from archives" effort is lagging



On Mon, Nov 2, 2009 at 1:01 AM, Christian Perrier <bubulle@debian.org> wrote:
> Quoting Lee Winter (lee.j.i.winter@gmail.com):
>
>> I did the most recent three months of 2009, but the density was pretty low.
>
> I haven't checked the wiki and I'm not online right now, but please
> take care to register this in the page.

I am a little hesitant to edit the page because I don't understand the
process and found no doc or howto.

>
>>
>> > Old archives are also missing reviews, particularly a few from 2005
>> > and nearly all from 2004, not to mention older archives.
>>
>> So I started at the beginning (part of 1998) and went to the end of
>> 2002.  If I have time this week I will look at 2003-2005.
>
> Ditto.
>
>> > Please take some time to do this work. This is not that time
>> > consuming: one month can be reviewed in about 10-15 minutes....even
>> > less when you're used to methods for spotting spams.
>>
>> The work is pretty tedious and reviewing non-spam emails five times is
>> extremely inefficient.  Consider a solution that would allow one
>> person to scan the archive to generate a list of spam targets.  If the
>> other four reviewers only had to review the listed spam candidates
>> they would not have to waste their time reviewing non-spam.
>
> I'm sure the listmasters would welcome such improvements but, well, we
> already have a very good tool.
>
> Also, restricting the list to what the first person has identified
> would increase the risk of missing some spams.
>
> When I worked on the entire archive, I finally dropped the web
> interface and used an alternative method:
>
> - download the list archives as mailboxes
> - pass them through my CRM114 spam filter
> - open them in my MUA (mutt)
> - tag spam messages (being processed by CRM114, most spams are already
> identified by CRM114 markers)
> - bounce them to the spam report mail address
> (report-listspam@lists.debian.org) with the following key macro:
>
> macro index \eL "breport-listspam@lists.debian.org\no\nq" "report as spam to Debian lists"
>
> I found this much more efficient.

Sounds like the beginning/foundation of an automation script.  If the
candidates can be found mechanically, then there is a potential
tradeoff available.  We have 11 years = 132 months; times 5 reviewers
= 660 reviewer-months.  At 10-15 min each that is 110-165 man-hours.
That's a lot of manual effort.
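
For anyone who wants to vary the assumptions, here is a quick sketch of
that arithmetic.  The function name is mine and purely illustrative.

```python
def review_hours(years, reviewers, minutes_per_month):
    """Total man-hours to review every archive month once per reviewer."""
    months = years * 12                    # 11 years -> 132 months
    reviewer_months = months * reviewers   # 132 * 5 -> 660
    return reviewer_months * minutes_per_month / 60


low = review_hours(11, 5, 10)
high = review_hours(11, 5, 15)
print(f"{low:.0f}-{high:.0f} man-hours")   # prints "110-165 man-hours"
```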

Just how important are the last few messages that would make it
through a (purposefully loose) mechanical filter?  If the whole mess
could be 98% cleaned up with, say, 5 man-hours then it would be a
tremendous efficiency improvement.
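
To make the idea concrete, here is a rough sketch of such a mechanical
filter in Python.  It assumes the archives are available as mbox files
and that the spam filter has already stamped each message with an
X-CRM114-Status header containing the word SPAM.  The header name, the
marker, and all function names are my assumptions; I have not checked
them against the setup described above.

```python
import mailbox
from email.message import Message


def is_candidate(msg, header="X-CRM114-Status", marker="SPAM"):
    """True if the (assumed) filter header marks this message as spam."""
    value = msg.get(header, "")
    return marker in str(value).upper()


def spam_candidates(messages):
    """Return (From, Subject) pairs for every flagged message."""
    return [
        (msg.get("From", ""), msg.get("Subject", ""))
        for msg in messages
        if is_candidate(msg)
    ]


def candidates_in_mbox(path):
    """Scan one mbox file; create=False avoids creating a missing file."""
    box = mailbox.mbox(path, create=False)
    try:
        return spam_candidates(box)
    finally:
        box.close()


if __name__ == "__main__":
    # Two made-up messages standing in for an archive month.
    spam = Message()
    spam["From"] = "a@example.org"
    spam["Subject"] = "Cheap pills"
    spam["X-CRM114-Status"] = "SPAM ( pR: -12.3 )"

    ham = Message()
    ham["From"] = "b@example.org"
    ham["Subject"] = "Re: upload"
    ham["X-CRM114-Status"] = "Good ( pR: 20.1 )"

    for sender, subject in spam_candidates([spam, ham]):
        print(sender, "|", subject)
```

The resulting list is what the remaining reviewers would look at,
instead of the whole month.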

> Downloading list archives as mailboxes is only accessible to Debian
> developers but I can provide them to people who might need them.

In the '80s I spent a lot of time writing natural language processing
software, so I may be more tuned up than the typical reviewer.  But I
find it more efficient to review the author/subject/thread indices
and inspect message content only to confirm the presence of spam in a
suspect message.  So offline access to the archive would not help me.

-- Lee

