Looking for debian-gb list archive for anti-spam research
Hi all,
First, congratulations on the revival of the list!
These days I am very annoyed by the increasing volume of spam (junk)
mail. Although there are many anti-spam systems for English (e.g.
SpamAssassin, Spambayes, PopFile), so far I have not found a system
that can handle Chinese messages gracefully. Since I know how such
filtering systems work (it is a binary text categorization problem), I
have decided to build an anti-spam system for Chinese.
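To make "binary text categorization" concrete, here is a minimal sketch
of one common approach: a naive Bayes classifier over character bigrams,
which avoids the need for Chinese word segmentation. This is only an
illustration, not the filter I actually use; the class name, the
"spam"/"ham" labels, and the toy messages are all made up for the
example.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Overlapping character bigrams, ignoring whitespace (no word segmentation)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayesFilter:
    """Hypothetical two-class (spam/ham) naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(char_bigrams(text))
        self.docs[label] += 1

    def score(self, text, label):
        # Vocabulary size plus one slot for bigrams never seen in training.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        total = sum(self.counts[label].values())
        # Smoothed log prior of the class.
        logp = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
        # Smoothed log likelihood of each bigram given the class.
        for gram in char_bigrams(text):
            logp += math.log((self.counts[label][gram] + 1) / (total + vocab))
        return logp

    def classify(self, text):
        return "spam" if self.score(text, "spam") > self.score(text, "ham") else "ham"


# Toy training data, invented purely for illustration.
nb = NaiveBayesFilter()
nb.train("免费代开发票，优惠多多", "spam")
nb.train("特价优惠，免费赠送", "spam")
nb.train("请问 debian 如何安装中文输入法", "ham")
nb.train("软件包升级后出现问题", "ham")

print(nb.classify("免费代开发票"))
print(nb.classify("debian 软件包安装"))
```

A real filter would also need header parsing, charset decoding (GB2312,
Big5, UTF-8) and far more training data, which is exactly why I am
looking for a larger corpus.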
I have run some experiments on my own 1633 Chinese mails (1205 of
which are spam), and the filtering accuracy is around 98% (I'm happy).
However, in order to build a large-scale, publicly available Chinese
spam filter I need a much larger public corpus (I cannot release my
own filter since it is trained on my own mails). The problem is that
in order to train such a filter, I need both spam and legitimate
mails. It is relatively easy to collect spam (spamarchive.org), but it
is hard to collect legitimate mails, especially from the same list.
Spam collected from a personal address, without the legitimate
messages from the same period, does not reflect the real spam
distribution. Consequently, a statistical model built on such sources
will not generalize well to real-world scenarios. (For example, a
filter trained on my mbox may not work well when installed on
debian-chinese-gb.)
That's why I'm interested in debian-chinese-gb: this list is heavily
spammed, and all legitimate mails can be used freely without privacy
problems. Since I can search for old mails on lists.debian.org, I
guess all the archives are stored somewhere on lists.debian.org. Could
you tell me where I can get these list archives? In case the list
archive is not accessible, could anyone send me your mbox/maildir/mh
files (please contact me before sending)?
I promise that when the (open source) project becomes stable I will
submit a patch to SpamAssassin or Spambayes. I will also upload the
pre-processed training data somewhere on the net.
Thanks in advance for your help.
Sincerely yours,
--
Zhang Le