
Re: Non-LLM example where we do not in practice use original training data



On Mon, May 05, 2025 at 02:13:58PM -0700, Russ Allbery wrote:
> However, I am very leery about extending that exception to cases where
> people are intentionally creating that situation by deleting the input
> data on purpose.

I agree with you on this. I do wonder, however, where you would place the
case where the training data is available (possibly even publicly
available), and the model trainers would be willing to distribute it, but
cannot do so due to unclear licensing terms. Would you say that is a
"less nasty" case than the one where training data is deleted on purpose,
or would you consider it just as bad?

FWIW, in terms of free software ethics, I consider non-open data to be
"less nasty" than non-free code. That's because with code we can take
the activist approach of just rewriting it under a free software license
(provided enough development resources are available). With non-open
data, there are cases in which you cannot just recreate and release it
under a free license, no matter how many resources you have.

The ability to exploit non-open data to serve the needs of free software
(as would be the case with DFSG-free models trained on non-DFSG-free
data) is something I hesitate to give up on.

Cheers
-- 
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack  _. ^ ._
Full professor of Computer Science              o     o   o     \/|V|\/
Télécom Paris, Polytechnic Institute of Paris     o     o o    </>   <\>
Co-founder & CSO Software Heritage            o o o     o       /\|^|/\
Mastodon: https://mastodon.xyz/@zacchiro                        '" V "'
