On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
> The transformative criteria here is that the resulting work needs to be
> transformed in such a way that it adds value. And generating new texts
> from a LLM is pretty clearly a value-adding transformation compared to
> the original articles. Even more so than the already ruled-on Google
> Books case.

OK, let me change it around a bit, because I don't think this
discussion is going in any direction that is relevant for Debian.

The only way in which you can build a model is by taking loads and
loads of data, running some piece of software over it, and storing the
result somewhere.

How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?

The answer is, we can't. Therefore, I conclude that, practically, we
cannot include models in Debian if we want them to be reproducible.

I think we have a goal, as a project, to make Debian reproducible. I
think the reproducibility of our software -- *all* our software, not
just the programs and libraries but also the data -- is an important
goal with important repercussions. Dropping that goal would be required
if we were to accept models in main.

We also declared, over 20 years ago, that "Debian will remain 100%
free". Not just the programs and libraries, but *everything* -- also
the data.

Ergo, either we need to drop our dual goals of becoming reproducible
and remaining Free, or models without training data can't go into main.

This to me is one practical reason as to why training data is part of
the source code of a model: because without it, we can't build the
model and we can't build it reproducibly.

The fact that the model does something vaguely and remotely similar to
a biological process of training and learning in humans, and that
therefore some people have taken to naming the process of running
advanced statistical analysis over data to build such a model also
"training" is a red herring. The two processes are very different and
cannot be compared as a practical matter.

I have noticed that you have gone ahead and proposed an alternative
ballot option to accept this misguided idea that the training data is
not source. I can't tell you how much I disagree with that option. The
OSI has already taken steps to legitimize this blatantly obviously
wrong idea; if Debian were to follow down those tracks (and this would
definitely do that, IMO), I have to seriously reconsider whether I want
to still be a part of this.

Thanks.

-- 
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.