Re: Debian mailing lists archives as mbox (was: Re: [Soc-coordination] Debian Teams Activity Metrics - Report IV) [Update]

On Thu, Aug 04, 2011 at 09:44:49AM +0200, Alexander Wirt wrote:
> We had an ongoing discussion about privacy, spam and so on regarding
> the mboxes. We even managed to reach consensus yesterday.

To shed some light on this, I would like to publish the consensus we
reached:

  A filter needs to be written (most probably by Sukhbir, who will
  have to test it on some other mbox because he is not allowed to
  access the original mboxes).  The filter should have the
  following features:

  - Parse the existing mboxes and strip them down to the following
    fields:

     Message-id: <ID>
     From: Name of poster <e-mail@of.poster>
     Date: Date
     Subject: Subject
  - Remove the messages whose Message-IDs are flagged for removal
    (only detected SPAM)
  - Publish these mboxes (it was not yet specified by listmaster
    whether for general http download or only for specific users)

  The filter will be written in Python because this is Sukhbir's
  preferred language, and listmaster accepted this as an exception
  even though they would have preferred Perl.
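
To make the points above a bit more concrete, here is a minimal sketch
of what such a filter could look like, using only the `mailbox` module
from the Python standard library.  This is not Sukhbir's actual code.
The function name `strip_mbox`, the `spam_ids` parameter, and the choice
to drop message bodies entirely are my assumptions; the list above only
names the four fields to keep.

```python
import mailbox

# Fields kept in the published mbox, as named in the list above.
KEPT_HEADERS = ("Message-id", "From", "Date", "Subject")


def strip_mbox(src_path, dst_path, spam_ids=frozenset()):
    """Write a stripped copy of the mbox at src_path to dst_path.

    Only KEPT_HEADERS survive.  Messages whose Message-ID is in
    spam_ids (a hypothetical set supplied by the caller) are skipped.
    Returns the number of messages written.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # appends if dst_path already exists
    written = 0
    try:
        for msg in src:
            msg_id = str(msg.get("Message-id") or "").strip()
            if msg_id in spam_ids:
                continue
            stripped = mailbox.mboxMessage()
            for name in KEPT_HEADERS:
                value = msg.get(name)
                if value is not None:
                    stripped[name] = value
            # Assumption: the body is dropped, only the headers remain.
            dst.add(stripped)
            written += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return written
```

It could be called as `strip_mbox("in.mbox", "out.mbox", {"<some-id>"})`,
where both file names and the Message-ID are placeholders.  The envelope
"From " line of each message is replaced by the module's default rather
than copied from the original.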

So much for the consensus we reached in private discussion.  I did
not get a final yes for my suggestion to include the following
information, which I regard as helpful as well:


IMHO the first two might be helpful for reconstructing threads (so, at
least to my limited understanding, this information is already
implicitly contained in the web archive).

I also regard the X-Spam fields as valuable information which is
irrelevant for privacy but most probably quite useful for other
purposes like further SPAM removals.

If you regard some other fields as interesting but not critical with
respect to privacy, now might be the right moment to speak up.

I hope this helps in finding a reasonable solution.

Kind regards



