On Mon, Apr 28, 2025 at 01:47:09PM -0600, Gunnar Wolf wrote:
> Wait — Training data are chunks of software. I understand where you are
> getting to, but in order to redistribute it, we must have the right to. How
> do we say that training data are "immutable and uncopyrightable facts of
> world and nature"? The heavily trained machine didn't learn from objects
> randomly happening in nature...

The current running theory among most FOSS legal scholars I've spoken to is that works solely generated by AI are non-copyrightable, at least in the US, and hence in the public domain. (Not because they are "facts", but because they are generated by a machine.) Under that hypothesis, both we and our users will enjoy all traditional DFSG freedoms on generated LLM content.

Exceptions to the above are: (1) when there is significant creative contribution by the AI user, but in that case the generated work is copyrighted by the AI *user*, not by the copyright owners of material present in the training dataset; and (2) when the output is a verbatim copy of some training input (sometimes referred to as "recitation").

Note that recitation is nowadays something that commercially deployed LLMs like Copilot are really good at *not* doing anymore, by applying plagiarism/code-clone detection techniques between the generated output and the training dataset *before* returning any output to the user.

AFAIK this legal theory has not been tested in court yet. But the big commercial players (who, remember, have vested interests in being copyright absolutists) believe in it so strongly that they go as far as offering legal indemnity promises to users of their LLMs who encounter legal issues due to the use of generated output. (Copilot does this, provided that the protections against recitation are not disabled; they are enabled by default.)
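For the curious, the kind of recitation filter described above can be sketched roughly like this. This is a hypothetical toy illustration, not how Copilot actually does it: production systems use far more scalable matching (index structures, fuzzy/near-duplicate detection), but the principle is the same — suppress output that shares a long verbatim run with the training corpus.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_recitation(output, training_docs, min_overlap=8):
    """True if output shares any min_overlap-token run with a training doc.

    min_overlap is an arbitrary threshold chosen for illustration; real
    systems tune it and match at much larger scale.
    """
    out_grams = ngrams(output.split(), min_overlap)
    if not out_grams:
        return False
    return any(out_grams & ngrams(doc.split(), min_overlap)
               for doc in training_docs)

def filter_output(output, training_docs):
    """Return the generated output only if it does not recite training data."""
    return None if is_recitation(output, training_docs) else output
```

Usage: `filter_output(llm_output, corpus)` returns `None` (i.e., refuses to answer) when a verbatim overlap is detected, which is roughly the behavior Copilot exposes when its duplicate-detection protection is enabled.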
So I strongly advise that we do *not* base our voting decisions, or strategic considerations for free software, on the hypothesis that LLM outputs are, in the general case, derived works under copyright law of the training datasets. Doing so is currently at high risk of exploding in our hands spectacularly.

(Note: we might decide to take the stance that we *treat* LLM output as if it were a derived work under copyright of the training datasets. What we should not do is anchor that decision in copyright law determinations about this specific point.)

Cheers
-- 
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
Co-founder & CSO . Software Heritage
Mastodon: https://mastodon.xyz/@zacchiro