
Re: About spam in the list archive



Hi Manoj,

Manoj Srivastava wrote:
>         Hmm. I'll be happy to help automate some of the decision making
>  using my Spam classification mechanisms; please look at 
>    http://www.golden-gryphon.com/software/spam/crm114_accuracy.html
>  to see the lower bound on accuracy I get from (mostly) Debian email.
>  Adding SA to the CRM114  results above gives about 99.92% accuracy
>  overall -- and crm114 has had 100% accuracy in identifying Spam in the
>  last two years I have been using it.

If you have suggestions for how automatic testing can be incorporated
into a spam-removal process in a way that is acceptable to the project,
I'd be very happy to see them discussed here. However, I'm not sure that
the bias that we (there are six people currently seeing how things work)
currently impose in our manual review can be implemented very well in
software.
For example: what to do with "sponsorship request spam" from people
claiming to be students or clans? What to do with foreign-language spam
that people reply to with a translation and the explanation "ignore,
this is spam", and what to do with the reply itself?

>         It would be interesting to see how many messages escape my
>  filters, and give me an opportunity to further train them. All I need
>  would be the mbox file; and for me to setup a process to feed the email
>  to the filters, and classify the result -- and then send back the
>  message ID's of Ham and Spam back to Debian.

There are a couple of almost-mboxes linked from [1].
Before the first "From " there is an mbox-like header, but from there on
it is a regular mbox archive consisting of the nominations.
Preliminary results indicate that around 2/3 of the submissions for
debian-project are actually removal candidates (based on review by pabs
and me; there are others looking at the same things).
The information in the initial headers should be fairly
self-explanatory: the number beside the year, month, and message number
is the number of times a message was reported as spam.
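
To illustrate the layout described above, here is a minimal Python
sketch of how such a file could be split into the leading header and the
regular mbox part, and how the Message-IDs could then be collected. It
is only a sketch under assumptions: the function names are made up for
this example, the internal format of the leading header is not parsed
(it is returned as raw bytes), and body lines beginning with "From " are
assumed to be escaped as ">From " in the usual mbox manner.

```python
import email


def split_almost_mbox(data: bytes):
    """Split an 'almost-mbox' into (preamble, mbox_part).

    Everything before the first line starting with b"From " is treated
    as the preamble; the rest is assumed to be a regular mbox.
    """
    if data.startswith(b"From "):
        return b"", data
    idx = data.find(b"\nFrom ")
    if idx == -1:
        # No separator line at all: the whole file is preamble.
        return data, b""
    return data[: idx + 1], data[idx + 1:]


def iter_messages(mbox_part: bytes):
    """Yield parsed messages, splitting on 'From ' separator lines."""
    current = None
    for line in mbox_part.splitlines(keepends=True):
        if line.startswith(b"From "):
            if current is not None:
                yield email.message_from_bytes(b"".join(current))
            current = []  # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        yield email.message_from_bytes(b"".join(current))


def message_ids(data: bytes):
    """Return the Message-ID header of every message in the mbox part."""
    _preamble, mbox_part = split_almost_mbox(data)
    return [msg.get("Message-ID") for msg in iter_messages(mbox_part)]
```

The list returned by message_ids() is the kind of thing that could be
sorted into ham and spam by whatever classifier is used and then sent
back.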

I can easily put up more of these, of course; just tell me what you
want. (There are ca. 90000 nominated messages; it is unclear to me
whether old data is as usable as newer data.)

Kind regards

Thomas

1. http://wiki.debian.org/Teams/ListMaster/ListArchiveSpam
-- 
Thomas Viehmann, http://thomas.viehmann.net/


