Re: Non-LLM example where we do not in practice use original training data
Ansgar 🙀 <ansgar@debian.org> writes:
> On Mon, 2025-05-05 at 14:27 -0600, Sam Hartman wrote:
>> If I wanted to package up my classifier state and distribute it under a
>> free software license, I think it should be DFSG free. I think that to
>> satisfy the DFSG I would need to include all the training data I still
>> had and any scripts I used.
> And the training data would have to be under a DFSG-free license. I
> doubt phishing or spam mail comes with proper licensing; even ham
> doesn't do this (what are the license terms of this mail?). So if you
> were required to include training data it wouldn't be possible even for
> fairly boring classifiers.
Debian is not required to be a distribution point for every type of
software or database file that people have thought of. I don't believe
that a Bayesian spam filter database trained in this way is DFSG-free, and
I don't think it should be included in Debian main.

That doesn't mean I think it's bad or immoral or anything like that. I
have a database like that myself. :) It's simply not free software, and is
outside the scope of what Debian is for. Not even all of Debian's own data
is free software. For example, I would not consider the BTS database or
the mailing list archives to be free software because the licensing status
is not sufficiently clear.

There is lots of useful software and data in the world that is not free
software, and there are lots of other projects that can distribute it.

Obviously, this is just my opinion, and I realize other people disagree.
--
Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>