
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Hi Aigars,

On Sun, May 04, 2025 at 02:27:46PM +0200, Aigars Mahinovs wrote:
>    On Sun, 4 May 2025 at 13:12, Wouter Verhelst <wouter@debian.org>
>    wrote:
> 
>      On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
>      >    However, here we have a clear and fundamental change
>      >    happening in the copyright law level - there is a legal
>      >    break/firewall that is happening during training. The model
>      >    *is* a derivative work of the source code of the training
>      >    software, but is *not* a derivative work of the training
>      >    data.
> 
>      I would disagree with this statement. How is a model not a
>      derivative work of the training data? Wikipedia defines it as
> 
>    The simple fact that none of the LLMs have been sued out of
>    existence by any copyright owner is de facto proof that it does not
>    work that way in the eyes of the judicial system.

This statement is inaccurate, incorrect, and irrelevant.

It is inaccurate, because the legal system does not work that way: the
legality of an action is not defined by the presence or absence of a
lawsuit pertaining to that action. If it were, then any cold case in the
history of mankind would by definition have been legal. More to the
point, in this particular case the lack of lawsuits could be explained
by a variety of factors, including but not limited to the indifference
of the aggrieved party; the inability to finance a lawsuit against "big
tech" companies such as Microsoft or Facebook; or the belief on the part
of the aggrieved party that they do not have a case in the first place,
even if they might have won had they filed suit.

It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties, alleging copyright
infringement of its large library of news articles in the creation of
ChatGPT[1]. The case is still in court.

It is irrelevant, because in a Debian context, the law is relevant only
to the point that we must obey it in relevant jurisdictions[2]. It does
not have any say over how we define our own rules and ethics. If we
decide as Debian that we believe the training data is in fact part of
the source of a model, then we can in fact set such a rule. We do not
just follow the law in deciding what to distribute and how to do it; if
we did, there would never have been any need for the non-US, non-free,
or non-free-firmware sections of our archive, and there would have been
little point to the DFSG at all.

[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/
[2] where I define "relevant" as "any jurisdiction where not obeying the
    law could result in significant problems for Debian", which in
    practice probably means the US and most of Europe.

>    Wikipedia definition is a layman's simplification.

It may be a simplification, but that in and of itself does not make it
incorrect.

I do think that a model is in fact a derivative work of the training
data, because the training data is used to build the model: trained on
different data, the model would be different and would not behave the
same.
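
To make that concrete, here is a deliberately minimal sketch (my own
toy example, not anything from a real training pipeline): fit the same
one-parameter model y = w * x to two different training sets, and the
resulting "model" (the weight w) comes out different.

    def train(data, steps=1000, lr=0.01):
        """Fit y = w * x to data by gradient descent on squared error."""
        w = 0.0
        for _ in range(steps):
            # gradient of sum((w*x - y)^2) with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in data)
            w -= lr * grad
        return w

    data_a = [(1, 2), (2, 4), (3, 6)]  # points consistent with w = 2
    data_b = [(1, 3), (2, 6), (3, 9)]  # points consistent with w = 3

    print(train(data_a))  # ~2.0
    print(train(data_b))  # ~3.0: same code, different data, different model

The same holds, at vastly larger scale, for an LLM: the weights are a
function of the training code *and* the training data.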

Is that a legal definition? No. Is it a basis on which we could define
our own rules and ethics? Sure is.

Thanks,

-- 
     w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

