Re: Non-LLM example where we do not in practice use original training data
Stefano Zacchiroli <zack@debian.org> writes:
> On Mon, May 05, 2025 at 02:13:58PM -0700, Russ Allbery wrote:
>> However, I am very leery about extending that exception to cases where
>> people are intentionally creating that situation by deleting the input
>> data on purpose.
I should say explicitly that I jumped quite a bit down a slippery slope in
this reply to Sam to make a rhetorical point, and there is a VAST excluded
middle between "training data available" and "training data intentionally
deleted to avoid having to disclose it."
In most cases, I suspect the real situation will be more that the training
data was just unmanageably large and the people doing the training saw no
reason to retain it because they considered it easier to, for instance,
scrape the web again than to keep all the data on hand. This is Sam's
point, as I understand it, and it's entirely valid *if* you believe that
the DFSG provision for source code is primarily about putting everyone,
including upstream, on an equal footing. I agree that if the training data
was never kept or intended to be kept, upstream is clearly indicating that
they don't consider it the "preferred form of modification" and they are
on equal footing with everyone else.
I do *not* consider putting everyone on equal footing to be the only or
even the primary goal of the requirement to have source code. I am
concerned about other ethical issues such as transparency and auditability
that come with source code.
> I agree with you on this. I do wonder however where you would place the
> case where the training data is available (possibly: publicly
> available), and the model trainers would even want to distribute it, but
> cannot due to unclear licensing terms. Would you say that it is a "less
> nasty" case than that where training data is deleted on purpose, or
> would you consider it as bad?
I think it's clearly less bad in some sense, in that there isn't the
feeling of someone gaming the system and thus I'm less leery of their
motives. This case is instead the far more familiar and typical case that
free software encounters all the time: portions of the source are under
unclear licenses and are not clearly DFSG-free.
No one in those situations is doing anything wrong, in my opinion, but we
still don't allow such software into Debian main because we are a free
software distribution and that is not free software. There are other forms
of good in the world besides free software, and I am very glad there are
other organizations to pursue them, but I don't see the justification for
Debian to expand its scope.
> FWIW, in terms of free software ethics, I consider non-open data to be
> "less nasty" than non-free code.
I agree with this in terms of ethics, but I think they're equivalent in
terms of what we put in Debian main.
> The ability to exploit non-open-data to serve the needs of free software
> (as it would be the case with DFSG-free models, trained on non-DFSG-free
> data) is something I hesitate giving up on.
Well, first, I continue to object to the idea that a model can be
DFSG-free if it's trained on non-DFSG-free data. I think that makes it
definitionally non-free. (I have read Aigars's arguments to the contrary
and do not find them at all persuasive.)
But, more directly to your point, I agree with you, but I don't understand
why this implies that it's necessary to put non-free data in Debian main.
I can exploit all sorts of non-open data from my Debian computer by
obtaining it from any number of other sources. I don't see the need for
Debian to host it.
--
Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>