
Proposed removal of spam from the debian-project mailing list web archives



Hi,

Four people have checked the spam web form submissions concerning
debian-project. More background can be found at [1]. Thanks to Bas
Wijnen, Paul Wise, and Richard Hecker for reviewing! (Of course, a
special mention to Y Giridhar Appaji Nag who already looked through
debian-devel, but that isn't ripe for action yet.)

Proposal
--------

I propose to remove the 436 messages unanimously classified "spam" from
the web archive.[2]

Note that these will remain available to Developers on master.debian.org,
and messages will be reincluded if complaints about an erroneous removal
are received by the Listmaster, as discussed at [1] (Policy corner stones).

Some statistics
---------------

Number of messages by range of classification responses (the four
possible responses are explained at [1]):

839 submissions reviewed

436 spam
225 not spam
  6 inapp
  1 unknown
 68 unknown, spam
 33 unknown, not spam
 18 inapp, spam
  9 unknown, inapp
  3 not spam, inapp
 17 unknown, spam, inapp
  8 unknown, not spam, spam
  5 not spam, spam
  2 spam, not spam, inapp
  4 unknown, not spam, inapp
  4 unknown, inapp, not spam, spam
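
For illustration, a tally of this kind can be produced by grouping the
messages by the set of distinct responses they received. The following is
only a minimal sketch, not the script mentioned in [2]: the function name,
the data layout (message-id mapped to a list of responses) and the sample
message-ids are assumptions made up for this example.

```python
from collections import Counter


def tally_response_sets(reviews):
    """Count messages by the set of distinct responses they received.

    `reviews` maps a message-id to the list of responses given by the
    reviewers who looked at that message (hypothetical data layout).
    """
    counts = Counter()
    for responses in reviews.values():
        # Order and repetition do not matter, only which answers occurred.
        counts[frozenset(responses)] += 1
    return counts


# Made-up sample data, not taken from the actual review.
sample = {
    "msg-a@example.org": ["spam", "spam", "spam"],
    "msg-b@example.org": ["not spam", "not spam"],
    "msg-c@example.org": ["spam", "unknown", "spam"],
    "msg-d@example.org": ["unknown", "spam"],
}

for response_set, n in tally_response_sets(sample).most_common():
    print("%3d %s" % (n, ", ".join(sorted(response_set))))
```

A message counts as unanimously "spam" only if its response set is
exactly {"spam"}.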

Analysis of the debian-project review
-------------------------------------

We should be most concerned about the messages with (detected) errors,
namely those where the answers contain both "spam" and "not spam", so
below are the message-ids (best used in conjunction with [3]) and some
analysis of the nature of these messages.
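
The statistics above give 8 + 5 + 2 + 4 = 19 such messages, matching the
19 entries listed below. As a rough sketch, the check and the lookup via
the URL pattern from [3] could look like this. The function names are
hypothetical, and percent-encoding the message-id (some contain "%" or
"$") is an assumption about what the search page expects.

```python
from urllib.parse import quote

# URL pattern as given in footnote [3].
MSGID_SEARCH = "http://lists.debian.org/msgid-search/%s"


def has_detected_error(responses):
    """True if the answers contain both "spam" and "not spam"."""
    return "spam" in responses and "not spam" in responses


def msgid_url(message_id):
    """Build a lookup URL for a message-id.

    Assumption: special characters such as "%" and "$" should be
    percent-encoded; "@" is left as is.
    """
    return MSGID_SEARCH % quote(message_id, safe="@")


# Responses for one of the messages listed in this mail.
responses = ["not spam", "unknown", "unknown", "spam"]
if has_detected_error(responses):
    print(msgid_url("C03C7E6F.2EAF%jonas.hedlund@trigger.se"))
```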

While an error estimate would be nice to have, the naive approach is
based on an independence assumption that seems to be very wrong in our case.
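
To make the naive approach concrete: if each reviewer wrongly answers
"spam" on a legitimate message with probability p, independently of the
others, then k reviewers all do so with probability p^k. The sketch below
only illustrates that arithmetic. The value of p is made up, and since the
independence assumption seems not to hold here, the resulting numbers
should not be read as an estimate for this review.

```python
def naive_unanimous_error(p, k):
    """Probability that k independent reviewers all make the same error.

    p is the per-reviewer error probability (hypothetical parameter).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return p ** k


p = 0.05  # made-up per-reviewer error rate, for illustration only
for k in (2, 3, 4):
    print("k=%d: %.6f" % (k, naive_unanimous_error(p, k)))
```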

I think that improved tools (in particular, quicker access to the web
pages with the "next in thread" links, or using the web page),
experience with the corner cases, and triple review (including an
experienced spam checker) together strike a good balance of reliability
and effort. (I would even claim that there is nothing of particular value
that received two spam votes, but we want to be sure and lose as little
as possible.)

  hecker       pabs  tviehmann     wijnen
--- one spam vote
not spam      inapp    unknown       spam
        courier.44194498.00006B55@softhome.net
        a request to remove stuff from the archive
    spam   not spam   not spam      inapp
        000a01c2f63a$55e0ff20$7827fea9@computer
        a German user complaining about Debian CDs he bought elsewhere
    spam    unknown   not spam    unknown
        MABBLMGCBFOBPPCNFDIIOENGCCAA.mario_capuano@katamail.com
        an Italian user question
not spam    unknown    unknown       spam
        C03C7E6F.2EAF%jonas.hedlund@trigger.se
        someone complaining about ICQ spam matching some list spam
    spam    unknown   not spam      inapp
        1be.ebc747e.2ac4b275@aol.com
        a German user looking for a translation program
    spam   not spam   not spam   not spam
        20020821160713.GA8194@despayre.org
        a complaint about IRC in response to a DWN article
    spam    unknown   not spam      inapp
        21916A3354A3D511946800508BB9A9F5083D0991@svntexc2.gvt.net.br
        a Portuguese user question
    spam   not spam   not spam      inapp
        003d01c2ad91$64746cd0$0301a8c0@MATTHIAS
        a German (Swiss) request to be sent a t-shirt to match the swirl
        on his motor scooter
    spam   not spam   not spam   not spam
        NHBBKODDALCBAECNDDJNMEDNCAAA.boufatit@sarpi-dz.com
        a French and English user question
    spam   not spam    unknown   not spam
        e6a527b20606150242i2dc527a1o97d144dc9563df9@mail.gmail.com
        start of a troll thread
    spam   not spam    unknown   not spam
        e6a527b20606151719w29f74ec5o88e3cd7914028855@mail.gmail.com
        further down that troll thread
not spam   not spam   not spam       spam
        004401c3aa2b$00c6a580$6501a8c0@mrfish
        an offer to redesign our web site, possibly serious
    spam    unknown   not spam      inapp
        000801c2d8be$a61958a0$c13f243e@39y8vr2w2kpw8tg
        a Spanish user question
not spam    unknown    unknown       spam
        050901c3f55f$e5a50000$0202a8c0@hotbox
        a Linux portal announcement at least bordering spam
--- two spam votes
    spam    unknown   not spam       spam
        1665599482.20060206122551@matic.com.pl
        a Polish user question
    spam    unknown   not spam       spam
        web-26275475@mail5.rambler.ru
        someone looking (in a strange way) for someone with the same
        name as a Debian contributor who has some 256 posts on our
        English language lists between 1999/09 and 2001/10
    spam       spam   not spam    unknown
        E1AI3qU-0000YN-00@gluck.debian.org
        a Spanish unsolicited software survey not directly related to
        Debian
--- three spam votes
    spam   not spam       spam       spam
        000801c2f43d$c5b2b180$0d00a8c0@laszlo
        a Croatian (one-liner) user question
--- unquestionably spam
not spam       spam       spam       spam
        5.2.0.9.1.20030531203449.03c75de8@pop.videotron.ca
        link request spam

Kind regards

Thomas

1. http://wiki.debian.org/Teams/ListMaster/ListArchiveSpam
   and originally, with followups, on this mailing list
   http://lists.debian.org/debian-project/2007/11/msg00012.html
2. In master.d.o:~tviehmann/spam-removals/ you will find
   "reports" and "proposed" removals and the python (>=2.4) script
   comparing them. The .spam files actually used reside with the
   mbox archives on master:/org/lists.debian.org/lists/,
   presently only four Listmaster-removed spams.
3. http://lists.debian.org/msgid-search/
   use http://lists.debian.org/msgid-search/%s for quick bookmarks
-- 
Thomas Viehmann, http://thomas.viehmann.net/


