
Looking for debian-gb list archive for anti-spam research

Hi all, 
  First, congratulations on the revival of the list!

  These days I am very annoyed by the increasing volume of spam (junk)
  mail. Although there are many anti-spam systems for English (e.g.
  SpamAssassin, Spambayes, POPFile), so far I have not found a system
  that handles Chinese messages gracefully. Since I know how such a
  filtering system works (it is a binary text categorization problem), I
  have decided to build an anti-spam system for Chinese.
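
  As an illustration only of what "binary text categorization" can look
  like for Chinese mail, here is a minimal sketch. It assumes a naive
  Bayes model over character bigrams, which avoids the need for Chinese
  word segmentation. The class name, the function names, and the choice
  of model are assumptions made for this example; it is not the filter
  used in the experiments described in this message.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace.

    Character n-grams are one common way to featurize Chinese text
    without a word segmenter (an assumption of this sketch).
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayesFilter:
    """Hypothetical two-class (spam/ham) naive Bayes classifier."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label must be "spam" or "ham"
        self.counts[label].update(char_bigrams(text))
        self.docs[label] += 1

    def log_odds(self, text):
        """Log of P(spam | text) / P(ham | text), with add-one smoothing."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        n_spam = sum(self.counts["spam"].values())
        n_ham = sum(self.counts["ham"].values())
        total = self.docs["spam"] + self.docs["ham"]
        # Smoothed class priors
        score = (math.log((self.docs["spam"] + 1) / (total + 2))
                 - math.log((self.docs["ham"] + 1) / (total + 2)))
        for gram in char_bigrams(text):
            p_spam = (self.counts["spam"][gram] + 1) / (n_spam + vocab)
            p_ham = (self.counts["ham"][gram] + 1) / (n_ham + vocab)
            score += math.log(p_spam) - math.log(p_ham)
        return score

    def is_spam(self, text):
        return self.log_odds(text) > 0
```

  A filter like this is trained by calling train(text, "spam") or
  train(text, "ham") on each message body. The class priors come from
  the ratio of spam to legitimate mail in the training set, so that
  ratio affects the decisions the filter makes.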

  I have run some experiments on my own 1633 Chinese mails (1205 of
  which are spam), and the filtering accuracy is around 98% (I'm happy).
  However, to build a large-scale, publicly available Chinese spam
  filter I need a much larger public corpus (I cannot release my own
  filter since it is trained on my own mail). The problem is that
  training such a filter requires both spam and legitimate mail. It is
  relatively easy to collect spam (spamarchive.org), but it is hard to
  collect legitimate mail, especially from the same list. Spam
  collected from a personal address, without the legitimate messages
  from the same period, does not reflect the real spam distribution.
  Consequently, a statistical model built on such sources will not
  generalize well in real-world scenarios. (For example, a filter
  trained on my mbox may not work well when installed on

  That's why I'm interested in debian-chinese-gb: this list is heavily
  spammed, and all legitimate mails can be used freely without privacy
  problems. Since I can search for old mails on lists.debian.org, I
  guess all the archives are stored somewhere on lists.debian.org. Could
  you tell me where I can get these list archives? In case the list
  archive is not accessible, could anyone send me your mbox/maildir/mh
  files (please contact me before sending)?

  I promise that when the project (open source) becomes stable I will
  submit a patch to SpamAssassin or Spambayes, and I will upload the
  pre-processed training data somewhere on the net.

Thanks for your help in advance.

Sincerely yours, 
Zhang Le
