Re: About spam in the list archive
On Wed, 14 Nov 2007 16:51:42 +0100, Thomas Viehmann <firstname.lastname@example.org> said:
> Hi Manoj,
> If you have suggestions how automatic testing can be incorporated into
> the spam-removal process in a way that is acceptable to the project,
> I'd be very happy to see them discussed here. However I'm not sure
> that the bias that we (there are six people currently seeing how
> things work) currently impose in our manual review can be very well
> implemented in software. What to do with "sponsorship request spam"
> from people claiming to be students or clans, what to do with foreign
> language spam that people reply to with translation and the
> explanation "ignore, this is spam", what to do with the reply?
If you can get a corpus of past messages from people claiming to
be students or clans, we can fine tune a crm114 filter to identify
these mails. Given the narrow range of the messages we are classifying,
I am pretty sure that the number of "unsure/retrain" messages that
humans need to ponder over can be reduced by 2-3 orders of magnitude.
There is no reason to have only one filtering pass, especially
since we are not dealing with a live stream of incoming mail.
My take on this is that we automate the process by passing the
unknown mbox through the crm114+SA filter, and classify the mail into
Ham, Unsure, and Spam.
The Unsure messages would be manually inspected and used to further
train the filters, as would any erroneous classifications (TOE, Train
On Error). Periodically, we TUNE (Train Until No Error) the Corpus.
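
To make the Ham/Unsure/Spam split and the TOE/TUNE cycle concrete, here is a
minimal sketch. It is not the actual crm114+SA setup: the `score` and `train`
callables are hypothetical stand-ins for whatever the real filter exposes, and
the 0.2/0.8 thresholds are arbitrary assumptions.

```python
from typing import Callable, List, Tuple

HAM, UNSURE, SPAM = "ham", "unsure", "spam"


def bucket(score: float, ham_below: float = 0.2, spam_above: float = 0.8) -> str:
    """Map a spam probability in [0, 1] to Ham, Unsure, or Spam.

    The thresholds are illustrative assumptions, not crm114 defaults.
    """
    if score >= spam_above:
        return SPAM
    if score <= ham_below:
        return HAM
    return UNSURE


def train_on_error(
    messages: List[Tuple[str, str]],
    score: Callable[[str], float],
    train: Callable[[str, str], None],
) -> int:
    """One TOE pass over human-labelled (text, label) pairs.

    Only messages the filter got wrong, or was unsure about, are fed
    back to it. Returns the number of corrections made.
    """
    corrections = 0
    for text, truth in messages:
        if bucket(score(text)) != truth:
            train(text, truth)
            corrections += 1
    return corrections


def tune(
    messages: List[Tuple[str, str]],
    score: Callable[[str], float],
    train: Callable[[str, str], None],
    max_passes: int = 10,
) -> int:
    """Repeat TOE passes until one needs no corrections (TUNE).

    Returns the number of passes run; gives up after max_passes.
    """
    for passes in range(1, max_passes + 1):
        if train_on_error(messages, score, train) == 0:
            return passes
    return max_passes
```

The point of the middle bucket is that humans only look at mail the filter
cannot decide on, plus whatever errors are reported afterwards.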
Hey, if this works as well for list mail as it does for me (and
my email is mostly Debian list mail), it might even work to filter
incoming mail. But we'll see.
>> It would be interesting to see how many messages escape my filters,
>> and give me an opportunity to further train them. All I need would be
>> the mbox file; and for me to setup a process to feed the email to the
>> filters, and classify the result -- and then send back the message
>> ID's of Ham and Spam back to Debian.
> There is a couple of almost-mboxes linked from . Before the first
> "From " there is a mbox-like header but from there on it is a regular
> mbox archive consisting of the nominations. Preliminary results
> indicate that around 2/3 of the submissions for debian-project are
> actually removal candidates (based on review by pabs and me, there are
> others looking at the same things). The information in the initial
> headers should be fairly self-explanatory, the number besides year,
> month, and message number is the number of times a message was
> reported as spam.
> I can easily put up more of these, of course, just tell me what you
> want. (There are ca. 90000 nominated messages, it is unclear to me
> whether old data is equally usable as newer.)
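
Since the files are described as "almost-mboxes" with extra header material
before the first "From " line, a reader would need to separate that preamble
before handing the rest to an mbox parser. A minimal sketch, assuming only what
is stated above (the preamble ends where the first line starting with "From "
begins); the function name is made up, and the preamble's own layout is not
parsed because it is not specified here.

```python
from typing import Tuple


def split_preamble(data: bytes) -> Tuple[bytes, bytes]:
    """Split an almost-mbox into (preamble, regular mbox content).

    The mbox part starts at the first line beginning with b"From ".
    If there is no such line, everything is treated as preamble.
    """
    if data.startswith(b"From "):
        return b"", data
    idx = data.find(b"\nFrom ")
    if idx == -1:
        return data, b""
    # Keep the newline with the preamble so the mbox part starts at "From ".
    return data[: idx + 1], data[idx + 1:]
```

The second element can then be written to a temporary file and opened as an
ordinary mbox.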
I can try setting up infrastructure to classify an mbox (create a
new user, write a simple script to parse the mbox, feed mails to crm114+SA,
and use mailagent to filter into Ham, Spam, and Unsure).
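
A rough sketch of what such a script could look like, using Python's standard
`mailbox` module rather than mailagent. The `classify` callable is a
hypothetical stand-in for the crm114+SA pass and is assumed to return "ham",
"unsure", or "spam" for a raw message; the output file names are likewise
assumptions. It also collects Message-IDs per bucket, since those are what
would be sent back.

```python
import mailbox
import os
from typing import Callable, Dict, List


def sort_mbox(
    src_path: str,
    out_dir: str,
    classify: Callable[[bytes], str],
) -> Dict[str, List[str]]:
    """Copy each message of src_path into ham/unsure/spam mboxes in out_dir.

    Returns the Message-IDs that landed in each bucket.
    """
    ids: Dict[str, List[str]] = {"ham": [], "unsure": [], "spam": []}
    boxes = {
        name: mailbox.mbox(os.path.join(out_dir, name + ".mbox"))
        for name in ids
    }
    # create=False: fail loudly if the source mbox does not exist.
    source = mailbox.mbox(src_path, create=False)
    try:
        for msg in source:
            label = classify(msg.as_bytes())
            if label not in ids:
                # Anything unexpected goes to a human for review.
                label = "unsure"
            boxes[label].add(msg)
            ids[label].append(str(msg.get("Message-ID", "")))
    finally:
        source.close()
        for box in boxes.values():
            box.close()  # flushes pending messages to disk
    return ids
```

Messages with an unrecognised label fall into the Unsure box on purpose, so a
misbehaving filter produces extra manual work rather than silent deletions.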
I have a dog-and-pony show coming up Dec 3rd, so I might not be
very responsive, at least until I am sure my software is working, but
I'll grab a mbox and see what the results look like.
If the setup is mostly working, future mbox's can be handled
If you have a human scanned set of list mails known to be Spam,
or known to be ham, etc, I can use those either to augment my Corpus,
or to use in place of my personal Corpus, to better reflect your
judgement of what is or is not Spam.
Time flies like an arrow, fruit flies like a banana. Frequently
attributed to Groucho Marx
Manoj Srivastava <email@example.com> <http://www.debian.org/~srivasta/>
1024D/BF24424C print 4966 F272 D093 B493 410B 924B 21BA DABB BF24 424C