Dealing with spam on the mailing lists

To: listmaster@lists.debian.org
Cc: debian-www@lists.debian.org
Subject: Dealing with spam on the mailing lists
From: <nyov@nexnode.net>
Date: Tue, 24 Dec 2019 05:15:28 +0000
Message-id: <47hkv54xTTzlfhw9@a.mx.nexnode.net>
In-reply-to: <20191219221132.GD6617@tack.einval.com>
References: <20191219221132.GD6617@tack.einval.com>

On Thu, 19 Dec 2019 22:11:32 +0000
Steve McIntyre <steve@einval.com> wrote:

> Hey folks,

Hello all!

> So, I've spoken to the listmaster team about making this list
> "moderated" rather than "open". What does that mean? Mails from
> subscribers would go through to the list; mails from non-subscribers
> would be held back for moderation by humans. We'd need some
> responsible people to act as moderators, and I volunteer to help
> here.

> What do you think? Is this the right thing to do?

I feel that moderating this one list, out of a ton of debian MLs, is
taking a very localized view of the problem. Certainly, other MLs could
follow the path to moderation; but the human-resource drain of
moderation seems significant, and interest in doing the work may well
be temporary.
Of course it looks quite necessary as long as we don't *have* a
workable solution, but at best it would hide the bigger issue from
public view, so I'd prefer an automated solution, if possible.

Human moderation of every unsubbed email, continuously, looks to me like
an uphill battle against resource deprivation.
And if a moderator stops moderating (for lack of time, etc.), we
wouldn't even notice until a non-subscriber eventually complained
through a different channel. Additionally, delays due to the
moderators' time zones and availability are likely to happen.

But we do know blacklisting is doomed to fail, and a whitelist solution,
like moderation, seems like the only real answer in today's email
landscape. The only question is its feasibility in the given
environment.

Here is my off-the-cuff response to the problem. It looks feasible to
me, but without knowing the debian ML situation intimately, I can't
tell for certain. So far it is only a thought-experiment.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

I propose a global double-opt-in approach for unknown email-addresses
to debian's Mailing-Lists as a non-human moderation process.

This system would replace the currently optional
<whitelist et lists.d.o> system (https://lists.debian.org/whitelist/)
with a _required_ opt-in. (And it could start by importing subscribed
addresses from that list.)

Consider it a "lightweight" registration process, an opt-in without
at the same time subscribing to any specific debian ML:

[1] The first time an email address is seen by the ML-system, bendel, a
 |  confirmation-request mail is sent back (the double opt-in), to
 |  confirm the validity of the address, as well as the sender's
 |  intent.
 |  This is done using a canned email-template, which can be
 |  replied to just like any response from <*-request@lists.d.o>.
 |  It could include a reminder of the ML CoC, dos and don'ts[1].

[2] At this point the address is stored in a lookup-table on the
 |  Mail-system -for all MLs- as "pending" or "greylisted" until the
 |  opt-in mail is confirmed. (So that further mails from this possibly
 |  spammy address to any debian ML don't result in backscatter spam
 |  from repeated confirmation requests to some unsuspecting address.)
 |
 |  Further, the initially received mail (and any further mail sent
 |  in the meantime) is placed on hold in the mailer queue, and
 |  released upon opt-in confirmation from the greylisted address.
 |
 |  [2b] After some time has elapsed without confirmation (perhaps a
 |  week), the held mail gets zapped by a cronjob cleanup run or some
 |  such; and the address times out from the "greylist" lookup table.
 |
 |  This greylist measure should probably run after any DNSBL and
 |  other spam checks (such as SA and greylisting) have already weeded
 |  out the chaff, just before the mail would have been forwarded to
 |  SmartList. (This should mean the hold queue would contain roughly
 |  the amount of mail currently seen on the lists as spam, per week.)

[3] Once the confirmation reply arrives, the email address is moved
 |  over to a global whitelist table, which removes the need for new
 |  opt-in confirmations from this email address, when sending to
 |  *any* debian ML, from then on.
 |
 |  Further, the hold on all mail from this address in the mailer queue
 |  is released.
 |
 |  [3b] The whitelist table lookup shouldn't be very resource
 |  intensive, so it could run before some of the other content checks,
 |  and a positive hit could negate the need for the more
 |  resource-intensive SA run, thus taking load off the system.
 |  (If that feels too permissive, it could at least be a powerful
 |  weight in the SA rules.)

[4] Should a stored mail address ever become spammy, it would need to
 |  be removed from the whitelist. This should be the only task
 |  requiring human intervention in this system (as opposed to
 |  moderation / per email / per list). But maybe I'm overly optimistic.
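
To make steps [1] to [4] concrete, here is a rough in-memory sketch in
Python. It is only a thought-experiment model, not a milter or anything
wired into bendel: the class name OptInGate, its method names and the
return values "deliver" / "confirm" / "hold" are all made up for
illustration, and lowercasing the whole address is a simplification.

```python
import time

PENDING_TTL = 7 * 24 * 3600  # "perhaps a week", as in [2b]


class OptInGate:
    """Hypothetical model of the pending ("greylisted") and whitelist tables."""

    def __init__(self):
        self.pending = {}       # address -> (first_seen, [held mails])
        self.whitelist = set()  # confirmed addresses, valid for all MLs

    def receive(self, sender, mail, now=None):
        """Decide what to do with a mail from `sender`.

        Returns "deliver" (whitelisted), "confirm" (first contact: hold the
        mail and send exactly one confirmation request), or "hold" (already
        pending: hold the mail, send nothing, so no backscatter).
        """
        now = time.time() if now is None else now
        sender = sender.lower()
        if sender in self.whitelist:
            return "deliver"
        if sender in self.pending:
            self.pending[sender][1].append(mail)
            return "hold"
        self.pending[sender] = (now, [mail])
        return "confirm"

    def confirm(self, sender):
        """Step [3]: whitelist the address and return its held mail."""
        entry = self.pending.pop(sender.lower(), None)
        if entry is None:
            return []
        self.whitelist.add(sender.lower())
        return entry[1]

    def expire(self, now=None):
        """Step [2b]: drop pending entries (and their held mail) after the TTL."""
        now = time.time() if now is None else now
        stale = [a for a, (seen, _) in self.pending.items()
                 if now - seen > PENDING_TTL]
        for address in stale:
            del self.pending[address]
        return stale

    def revoke(self, sender):
        """Step [4]: a human removes an address that turned spammy."""
        self.whitelist.discard(sender.lower())
```

In this model the cronjob from [2b] would simply call expire(), and the
only manual operation is revoke().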

In terms of implementation, I feel this could be a lightweight
solution. It could be written as a postfix access table check and
milter; or in the case of exim here(?), a milter program would work for
that MTA as well, I believe?

The lookup-tables (pending, whitelist) might be simple k/v stores as
sqlite db, bdb hash, lmdb, ldap lookup, etc.
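
As an illustration of the sqlite variant, the two tables could even be
one table with a state column. The sketch below uses Python's sqlite3
module; the table name "senders", the column names and the helper
functions are my own invention and not anything that exists on the
list server.

```python
import sqlite3


def open_tables(path=":memory:"):
    """Open (or create) the hypothetical sender-state store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS senders ("
        " address TEXT PRIMARY KEY,"
        " state TEXT NOT NULL CHECK (state IN ('pending', 'whitelisted')),"
        " since INTEGER NOT NULL)"
    )
    return db


def lookup(db, address):
    """Return 'pending', 'whitelisted', or None for an unknown address."""
    row = db.execute(
        "SELECT state FROM senders WHERE address = ?", (address,)
    ).fetchone()
    return row[0] if row else None


def set_state(db, address, state, now):
    """Insert the address or overwrite its state and timestamp."""
    db.execute(
        "INSERT OR REPLACE INTO senders (address, state, since)"
        " VALUES (?, ?, ?)",
        (address, state, now),
    )
    db.commit()


def purge_pending(db, older_than):
    """Delete pending entries first seen before `older_than`; return count."""
    cur = db.execute(
        "DELETE FROM senders WHERE state = 'pending' AND since < ?",
        (older_than,),
    )
    db.commit()
    return cur.rowcount
```

The primary key on the address keeps the per-mail lookup a single
indexed read, which is the cheap check that [3b] relies on.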

If you feel the whitelist table would become excessively large, a
marisa-trie or DAWG could alleviate the problem.[2][3]

Instead of simply looking at addresses, it could also be a two-step
sender-domain -> address lookup, as basis for a domain-reputation system
at some later point perhaps.
And if debian implements a reputation-based system, the opt-in could
either supplement it or be bypassed, based on sender domain reputation.
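
A small sketch of what I mean by the two-step lookup, again in Python
and again purely hypothetical (DomainIndex and its methods are names I
made up): the first step finds the sender domain, the second the local
part within it. The per-domain count is the kind of number a later
reputation system might start from.

```python
def split_address(address):
    """Split at the last '@'; only the domain part is case-insensitive."""
    local, _, domain = address.rpartition("@")
    return local, domain.lower()


class DomainIndex:
    """Hypothetical whitelist keyed by domain first, then by local part."""

    def __init__(self):
        self.domains = {}  # domain -> set of whitelisted local parts

    def add(self, address):
        local, domain = split_address(address)
        self.domains.setdefault(domain, set()).add(local)

    def contains(self, address):
        local, domain = split_address(address)
        return local in self.domains.get(domain, ())

    def domain_size(self, domain):
        """Number of whitelisted addresses known for a sender domain."""
        return len(self.domains.get(domain.lower(), ()))
```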

If having all those addresses stored in a central location is a
concern, the addresses could be hashed.
Then again, they're all publicly scrape-able from the web-interface,
so I can't see that being a serious blocker in the current environment.
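
If hashing were wanted, the table key could be derived as sketched
below. Using a keyed hash (HMAC-SHA256 with a server-side secret) rather
than a bare hash is my own assumption here, since plain hashes of email
addresses are easy to guess by brute force; the function name
address_key and the normalization are likewise only illustrative.

```python
import hashlib
import hmac


def address_key(address, secret):
    """Return a hex key for `address`, usable in place of the address itself.

    `secret` is a bytes value known only to the mail system (an assumption
    of this sketch). Lowercasing the whole address is a simplification.
    """
    normalized = address.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()
```

The lookup tables would then store address_key(sender, secret) instead
of the sender address, at the cost of no longer being able to list the
addresses back out.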

The only pitfall I can see so far is the required integration with
SmartList registrations. Requests to <*-request@ML> should be excluded
from an opt-in check; but so should already registered ML subscribers.
As I don't know the internals there, I don't know if the necessary
cross-check would be easy to implement.

Also, addresses in the whitelist should be cleaned up in step with
the MLs: an address should be dropped once it is no longer subscribed
to any ML (unsubscribed by the user, or automatically because of
bounced mail).

[1] https://www.debian.org/MailingLists/#codeofconduct
[2] https://en.wikipedia.org/wiki/Trie
[3] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

So I feel this looks like a good solution in a Mailing-List
environment, which already works on a subscription (whitelist) model
and where an opt-in procedure is known and can be expected from users.
Yet I'm not seeing it in existence already -- so I have quite likely
overlooked _something_ crucial!
So please be kind in your burning of this idea. Thanks.

Cheers, and best wishes for a happy Christmas.

nyov
