
Proposed removal of spam from the debian-project mailing list web archives


Four people have checked the spam web form submissions concerning
debian-project. More background can be found at [1]. Thanks to Bas
Wijnen, Paul Wise, and Richard Hecker for reviewing! (Of course, a
special mention to Y Giridhar Appaji Nag who already looked through
debian-devel, but that isn't ripe for action yet.)


I propose to remove the 436 messages unanimously classified "spam" from
the web archive.[2]

Note that these will remain available to Developers on master.debian.org,
and messages will be reincluded if complaints about an erroneous removal
are received by the Listmaster, as discussed at [1] (Policy corner stones).

Some statistics

Number of messages by the set of classification responses they received
(the four possible responses are explained at [1]):

839 submissions reviewed

436 spam
225 not spam
  6 inapp
  1 unknown
 68 unknown, spam
 33 unknown, not spam
 18 inapp, spam
  9 unknown, inapp
  3 not spam, inapp
 17 unknown, spam, inapp
  8 unknown, not spam, spam
  5 not spam, spam
  2 spam, not spam, inapp
  4 unknown, not spam, inapp
  4 unknown, inapp, not spam, spam
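
As a cross-check of the figures above, here is a minimal Python sketch
that re-adds the tallies. The numbers are copied from the list; the
dictionary layout and variable names are only illustrative and are not
taken from the comparison script mentioned in [2].

```python
# Tallies copied from the statistics above. Each key is the set of distinct
# responses a message received; each value is the number of such messages.
tallies = {
    frozenset({"spam"}): 436,
    frozenset({"not spam"}): 225,
    frozenset({"inapp"}): 6,
    frozenset({"unknown"}): 1,
    frozenset({"unknown", "spam"}): 68,
    frozenset({"unknown", "not spam"}): 33,
    frozenset({"inapp", "spam"}): 18,
    frozenset({"unknown", "inapp"}): 9,
    frozenset({"not spam", "inapp"}): 3,
    frozenset({"unknown", "spam", "inapp"}): 17,
    frozenset({"unknown", "not spam", "spam"}): 8,
    frozenset({"not spam", "spam"}): 5,
    frozenset({"spam", "not spam", "inapp"}): 2,
    frozenset({"unknown", "not spam", "inapp"}): 4,
    frozenset({"unknown", "inapp", "not spam", "spam"}): 4,
}

total = sum(tallies.values())
unanimous_spam = tallies[frozenset({"spam"})]
# Messages whose responses include both "spam" and "not spam".
conflicting = sum(n for responses, n in tallies.items()
                  if {"spam", "not spam"} <= responses)

print(total, unanimous_spam, conflicting)  # 839 436 19
```

The 19 conflicting messages are the 8 + 5 + 2 + 4 from the last rows of
the list that contain both "spam" and "not spam".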

Analysis of the debian-project review

We should be most concerned about the messages with (detected) errors,
namely those where the answers contain both "spam" and "not spam", so
below are the message-ids (best used in conjunction with [3]) and some
analysis of the nature of these messages.
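
For looking up a message-id via [3], a small Python sketch along these
lines may help. The URL template is the one given in [3]; the helper
name, the example message-id, and the assumption that the search accepts
the id without its angle brackets are mine and should be checked against
the actual service.

```python
from urllib.parse import quote

# URL template from footnote [3].
MSGID_SEARCH = "http://lists.debian.org/msgid-search/%s"


def msgid_url(message_id):
    """Build a lookup URL for one message-id (hypothetical helper).

    Assumption: the search expects the bare id, so surrounding angle
    brackets are dropped and anything unusual is percent-encoded.
    """
    bare = message_id.strip().lstrip("<").rstrip(">")
    return MSGID_SEARCH % quote(bare, safe="@")


# Made-up message-id, for illustration only.
print(msgid_url("<20080101120000.GA1234@example.org>"))
```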

While an error estimate would be nice to have, the naive approach is
based on an independence assumption that seems to be very wrong in our case.

I think that improved tools (in particular quicker access to the web
pages with the "next in thread" links, or using the web page),
experience with the corner cases, and triple review (including some
experienced spam-checker) are a good balance of reliability and effort.
(I would even claim that there is nothing of particular value that
received two spam votes, but we want to be sure and lose as little as
possible.)

  hecker       pabs  tviehmann     wijnen
--- one spam vote
not spam      inapp    unknown       spam
        a request to remove stuff from the archive
    spam   not spam   not spam      inapp
        a German user complaining about Debian CDs he bought elsewhere
    spam    unknown   not spam    unknown
        an Italian user question
not spam    unknown    unknown       spam
        someone complaining about ICQ spam matching some list spam
    spam    unknown   not spam      inapp
        a German user looking for a translation program
    spam   not spam   not spam   not spam
        a complaint about IRC in response to a DWN article
    spam    unknown   not spam      inapp
        a Portuguese user question
    spam   not spam   not spam      inapp
        a German (Swiss) request to be sent a t-shirt to match the swirl
        on his motor scooter
    spam   not spam   not spam   not spam
        a French and English user question
    spam   not spam    unknown   not spam
        start of a troll thread
    spam   not spam    unknown   not spam
        further down that troll thread
not spam   not spam   not spam       spam
        an offer to redesign our web site, possibly serious
    spam    unknown   not spam      inapp
        a Spanish user question
not spam    unknown    unknown       spam
        a Linux portal announcement at least bordering spam
--- two spam votes
    spam    unknown   not spam       spam
        a Polish user question
    spam    unknown   not spam       spam
        someone looking (in a strange way) for someone with the same
        name as a Debian contributor who has some 256 posts on our
        English language lists between 1999/09 and 2001/10
    spam       spam   not spam    unknown
        a Spanish unsolicited software survey not directly related to
--- three spam votes
    spam   not spam       spam       spam
        a Croatian (one-liner) user question
--- unquestionably spam
not spam       spam       spam       spam
        link request spam
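
The table above can also be summarised mechanically. The following
Python sketch transcribes the 19 rows by hand and counts spam votes per
message and per reviewer; the one-letter codes and all names in the code
are mine, chosen only for this illustration.

```python
from collections import Counter

REVIEWERS = ("hecker", "pabs", "tviehmann", "wijnen")

# One line per message in table order, one column per reviewer as above.
# S = spam, N = not spam, I = inapp, U = unknown (abbreviations are mine).
ROWS = """
N I U S
S N N I
S U N U
N U U S
S U N I
S N N N
S U N I
S N N I
S N N N
S N U N
S N U N
N N N S
S U N I
N U U S
S U N S
S U N S
S S N U
S N S S
N S S S
"""

votes = [line.split() for line in ROWS.splitlines() if line.strip()]

# How many messages received 1, 2, or 3 spam votes.
spam_votes_per_message = Counter(row.count("S") for row in votes)

# How many of these messages each reviewer marked as spam.
spam_votes_per_reviewer = {
    name: sum(row[i] == "S" for row in votes)
    for i, name in enumerate(REVIEWERS)
}

print(len(votes))                              # 19
print(sorted(spam_votes_per_message.items()))  # [(1, 14), (2, 3), (3, 2)]
print(spam_votes_per_reviewer)
```

Counted this way, the message filed under "unquestionably spam" is the
second one with three spam votes, and the 26 spam votes on these 19
messages are spread unevenly over the four reviewers (14, 2, 2, and 8).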

Kind regards


1. http://wiki.debian.org/Teams/ListMaster/ListArchiveSpam
   and originally, with followups, on this mailing list
2. In master.d.o:~tviehmann/spam-removals/ you will find
   "reports" and "proposed" removals and the Python (>=2.4) script
   comparing them. The .spam files actually used reside with the
   mbox archives on master:/org/lists.debian.org/lists/,
   presently only four Listmaster-removed spams.
3. http://lists.debian.org/msgid-search/
   use http://lists.debian.org/msgid-search/%s for quick bookmarks
Thomas Viehmann, http://thomas.viehmann.net/
