
Non-LLM example where we do not in practice use original training data

I think many of us modify machine learning models on a regular basis.
And I think that when we make those modifications, we do not go back to
the original training data; instead, we modify the model weights.

I suspect I am not the only one here who uses rspamd with both the
Bayesian classifier and the neural network classifier, both of which
are machine learning models.
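
To make that concrete, here is a rough sketch of the kind of update a
Bayesian classifier makes when you feed it one more message. This is
not rspamd's actual implementation; the model layout and the learn()
helper are invented for illustration. The point is that learning folds
the message into stored per-token counts (the weights), and the message
itself need not be kept afterwards.

from collections import Counter

# Hypothetical persisted classifier state: per-class token counts and
# message totals. This dictionary plays the role of the model weights.
model = {
    "spam": {"tokens": Counter(), "messages": 0},
    "ham":  {"tokens": Counter(), "messages": 0},
}

def learn(model, message_text, label):
    """Fold one new message into the stored counts; the original
    message does not need to be retained afterwards."""
    state = model[label]
    state["messages"] += 1
    for token in message_text.lower().split():
        state["tokens"][token] += 1

# Correcting a misclassification needs only the one message at hand,
# not the corpus the model was originally trained on.
learn(model, "cheap pills buy now", "spam")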

My point here is that there is a common case where the preferred form
of modification for a model is definitely not the original training
data.
Some people on the list probably do retain all the messages they submit
for learning.
I know I do not.
(I retain a significant subset and probably could reproduce something if
I had to.)

If I wanted to package up my classifier state and distribute it under a
free software license, I think it should be DFSG free.
I think that to satisfy the DFSG I would need to include all the
training data I still had and any scripts I used.
But I think in that circumstance the model weights would be a reasonable
preferred form of modification.
If the way I responded to bug reports were to run messages through
rspamc manually, I think that ought to be DFSG free, based on decisions
we have made in similar circumstances in the past.
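
For the sake of illustration, the bug-fix workflow I have in mind looks
roughly like the sketch below. It assumes rspamc is installed and
talking to a running rspamd; learn_spam and learn_ham are real rspamc
commands, but the relearn() helper and the file name are made up.

import subprocess

def relearn(path, as_spam=True):
    """Re-feed one reported message to the live classifier via rspamc."""
    command = ["rspamc", "learn_spam" if as_spam else "learn_ham", path]
    subprocess.run(command, check=True)

# A user reports a missed spam; the fix is one command against the
# current classifier state, with no rebuild from original training data.
relearn("reported-missed-spam.eml", as_spam=True)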

I appreciate that coming up with classifier state generic enough to be
worth packaging in Debian would be difficult. However, I think this
serves as an example we can all get our heads around, showing that in
practice, real users do often use model weights as the preferred form
of modification.
