
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Mon, Apr 28, 2025 at 01:47:09PM -0600, Gunnar Wolf wrote:
> Wait — Training data are chunks of software. I understand where you are
> getting to, but in order to redistribute it, we must have the right to. How
> do we say that training data are "immutable and uncopyrightable facts of
> world and nature"? The heavily trained machine didn't learn from objects
> randomly happening in nature...

The current running theory among most FOSS legal scholars I've spoken to
is that works solely generated by AI are non-copyrightable, at least in
the US, and hence in the public domain. (Not because they are "facts",
but because they are generated by a machine.) Under that hypothesis,
both we and our users will enjoy all traditional DFSG freedoms on
generated LLM content.

Exceptions to the above are: (1) when there is significant creative
contribution by the AI user, but in that case the generated work is
copyrighted by the AI *user*, not by the copyright owners of material
present in the training dataset, and (2) when the output is a verbatim
copy of some training input (sometimes referred to as "recitation").
Note that recitation is nowadays something that commercially deployed
LLMs like Copilot are really good at *not* doing anymore, by applying
plagiarism/code-clone detection techniques between the generated output
and the training dataset *before* returning any output to the user.
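
To illustrate the kind of check described above, here is a toy sketch of a recitation filter based on n-gram overlap between generated output and the training corpus. All names, and the 8-token threshold, are illustrative assumptions on my part; deployed systems use far more sophisticated code-clone detection at vastly larger scale.

```python
# Hypothetical recitation filter: before returning LLM output, flag it if it
# shares a long verbatim token run with any document in the training corpus.
# (Illustrative only -- not the actual mechanism used by Copilot or others.)

def ngrams(tokens, n):
    """Return the set of all contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_like_recitation(output, corpus_docs, n=8):
    """True if `output` shares any n-token run with a corpus document."""
    out_grams = ngrams(output.split(), n)
    return any(out_grams & ngrams(doc.split(), n) for doc in corpus_docs)

corpus = ["def add(a, b): return a + b  # classic example from the training set"]
fresh = "some newly generated prose with no long verbatim overlap at all here"
copied = "def add(a, b): return a + b  # classic example from the training set"

print(looks_like_recitation(fresh, corpus))   # False: no 8-token overlap
print(looks_like_recitation(copied, corpus))  # True: verbatim copy detected
```

A real deployment would run such a check server-side and either suppress the output or attach provenance/licensing information when a match is found.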

AFAIK this legal theory has not been tested in court yet. But the big
commercial players (who, remember, have vested interests in being
copyright absolutists) believe in it so much that they go as far as
offering legal indemnity promises to users of their LLMs who encounter
legal issues due to the use of generated output. (Copilot does this,
provided that the protections against recitation are not disabled; they
are enabled by default.)

So I strongly advise that we do *not* base our voting decisions, or
strategic considerations for free software, on the hypothesis that LLM
outputs are derived works under copyright law of the training datasets
in the general case. Doing so is currently at high risk of exploding in
our hands spectacularly.  (Note: we might decide to take the stance that
we *treat* LLM output as if it were derived work under copyright of
training datasets. What we should not do is anchor that decision in
copyright law determinations about this specific point.)

Cheers
-- 
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack  _. ^ ._
Full professor of Computer Science              o     o   o     \/|V|\/
Télécom Paris, Polytechnic Institute of Paris     o     o o    </>   <\>
Co-founder & CSO Software Heritage            o o o     o       /\|^|/\
Mastodon: https://mastodon.xyz/@zacchiro                        '" V "'
