
Re: lists.debian.org vs google groups

Pascal Hakim wrote:


On Thu, Apr 06, 2006 at 09:51:30AM +0100, Doofus wrote:
Can you quote:

I can't do the last twelve months, as we don't keep our data that far
back, and some of these numbers have to be counted individually, but
here are the numbers for March.

Due to our multi step filtering process I can't even get numbers for
the whole of March, but we can make some assumptions.

1. the total number of posts from all sources received by the d-u list servers in the last twelve months,

The first step is dropping things at the MTA stage. Those logs don't go
back that far as they get pretty big. I've picked a full 7 days at
random to be used as a sample and we get: 5891.

Since we're playing with round figures anyway, let's say that works out
at: 4.3 * 5891 = ~25300

All the rest of the numbers are for March:
CrossAssassin: 7375
SpamAssassin: 4672
Other filters: 333
		-> subtotal: 12380
Total blocked spam: ~37700

Actual messages pushed through the list: 3404

Total trapped spam > 91.5%
Good show.
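
For anyone who wants to re-run the arithmetic, here is a rough sketch in Python using only the figures quoted above. The variable names are made up for illustration, and the denominator for the trapped share (blocked plus delivered) is an assumption, since the message does not say how the 91.5% was derived.

```python
# Figures quoted in the message above. The MTA count is a 7-day sample
# scaled up to a month; everything else is for March.
mta_sample_7_days = 5891
weeks_in_month = 4.3      # the round multiplier used in the message
cross_assassin = 7375
spam_assassin = 4672
other_filters = 333
delivered = 3404          # messages actually pushed through the list

mta_estimate = weeks_in_month * mta_sample_7_days
filter_subtotal = cross_assassin + spam_assassin + other_filters
total_blocked = mta_estimate + filter_subtotal

# Assumption: trapped share = blocked / (blocked + delivered).
trapped_share = total_blocked / (total_blocked + delivered)

print(f"MTA estimate:    {mta_estimate:.0f}")
print(f"Filter subtotal: {filter_subtotal}")
print(f"Total blocked:   {total_blocked:.0f}")
print(f"Trapped share:   {trapped_share:.1%}")
```

Under that assumption the share comes out a little above the quoted 91.5%, which is consistent with the ">" in the message.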

2. the number of posts received by non list members in the same period,

This can mean two things. If you want the numbers above but for
non-subscribers only, we can't do that for a large chunk of them, and
it would take too long for the rest.

If you want to know simply how many posts were made by non-subscribers
that then made it to the list and were posted, it's 862.

As Hendrik pointed out, I did of course mean *from* non list members.

I see no ambiguity in "how many posts were received from non list members [in sample period]?", and can't see how you could reach your second interpretation above. It's an interesting point though, and 25% of legitimate posts originating from non subscribers certainly strengthens the case for an open list.

3. the number of posts actually published on d-u after all filtering in the same period



4. the number of spam (or non-spam) posts actually published on d-u in the same period?

I went through the archive for March, and pulled out the numbers. I
found 25 spam messages[1], which leaves us with 3379 valid messages.

Not a lot really, I concede.

The answers to these should go some way to highlight the scale of the problem, and also how much benefit is gained by allowing everyone in the world to aim their crap at all of our mailboxes. I'll be surprised if a statistic is available for (4), but would appreciate the answers if they're available.

Even if we assume that I fell asleep on the page down key while counting
4., and guess that I missed half, we're still talking about blocking
over 800 valid messages.

25/37700 works out to be 0.066% of spam not being blocked. It's still
annoying of course, as the metric to use is the number of spam messages
that make it through rather than the percentage that make it through.
SNR and all that.
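
The percentages in this part of the thread can be checked with a short Python sketch built on the quoted figures. The names are illustrative only, and the worst-case line assumes that every spam message came from a non-subscriber, which the numbers themselves do not establish.

```python
# Rounded figures quoted in the thread for March.
total_blocked = 37700         # total blocked spam (rounded)
delivered = 3404              # messages pushed through the list
spam_published = 25           # spam found in the March archive
non_subscriber_posts = 862    # posts from non-subscribers that reached the list

valid_messages = delivered - spam_published
leak_rate = spam_published / total_blocked           # the 25/37700 figure
spam_share_of_list = spam_published / delivered      # what readers actually see
non_subscriber_share = non_subscriber_posts / delivered

# Pessimistic case from the message: half the spam was missed in the count,
# and (assumption) all of it came from non-subscribers.
worst_case_valid_non_subscriber = non_subscriber_posts - 2 * spam_published

print(f"Valid messages:            {valid_messages}")
print(f"Leak rate (vs. blocked):   {leak_rate:.3%}")
print(f"Spam share of list:        {spam_share_of_list:.2%}")
print(f"Non-subscriber share:      {non_subscriber_share:.1%}")
print(f"Worst-case valid non-subs: {worst_case_valid_non_subscriber}")
```

The leak rate and the spam share of list traffic use different denominators, which is the distinction the "SNR" remark is pointing at.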

No, we need all the numbers. Only percentages describe the efficiency of the filtering. Your figures indicate some pretty impressive filtering though...

I could have asked another question: How much of the spam that gets through originates from non list members? I'll have a guess - all of it. What exactly *is* the argument for allowing non subscribers to post? All answers other than "debian=blind freedom" appreciated.

And thanks for your efforts and answer Pasc.
