Re: Non-LLM example where we do not in practice use original training data
On Fri, 9 May 2025 at 19:13, Russ Allbery <rra@debian.org> wrote:
>
> Aigars Mahinovs <aigarius@gmail.com> writes:
>
> > Just because something can be done cheaper or at scale with help of
> > automation does not make the method of automation for it to become
> > morally wrong. See torrent, see mass manufacturing techniques that allow
> > factories in China to make millions of knock-offs of known toys.
>
> I'm sorry, I flatly and completely disagree with this as a general
> statement. There are indeed some things that do become wrong because they
> are done at scale.
I think this argument would be stronger with some concrete examples.
> This is part of what it means to live in a society: we have to balance
> good and harm and put some thoughtful rules in place around what part of
> someone's work becomes fair use and what part of someone's work remains
> under their control. Those are necessary compromises within our current
> economic and political system if we want people to be able to afford to
> make new work, if we want to avoid fraud and misrepresentation, and if we
> want to respect the human dignity of artists and their right to be
> associated with their work and to *not* be associated with things that are
> *not* their work.
>
> I am extremely sympathetic to the argument that copyright as currently
> designed does not succeed in balancing these factors correctly. It
> certainly has a wealth of problems. But you will never have my support for
> simply breaking it and to hell with the consequences and anyone who gets
> hurt in the process.
There is nothing "breaking" or radical about using fair use exceptions
to gather, learn and use knowledge. We have done this since the dawn
of copyright; the only difference is that, so far, it has mostly been
done either by humans directly or by very specialised, research-focused
PhD projects. That does not mean it was not done at scale - billions of
people read things written by other people, get inspired, write
fanfiction and create new works. The whole world of art, science and
culture lives off copying and squinting sideways at copyright law most
of the time. If you want to make an honest living with art, you make a
unique piece and sell it to people who love it, as directly as
possible. You don't go around suing everyone who saw your painting and
then tries to draw something similar for their bathroom. That is not
how a healthy society works.
Reinforcing the concept that copyright does *not* survive the learning
step (unless the output is a sufficiently close copy of the original,
i.e. it fails the "transformation" requirement) is essential for any
kind of knowledge economy to survive.
This is already a key part of EU law, which establishes clear
exceptions to copyright for data mining applications and links AI
training to that specific exception. It also addresses the _consent_
concern that was expressed in another thread: the EU law expresses
this in the form of a machine-readable opt-out that a copyright owner
(of a publicly published work) can choose to set on their work to
exempt it from automatic indexing. I do not know if real
implementations of this exist as of now.
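(As a purely illustrative aside: I do not know what concrete form such
a machine-readable opt-out takes in practice, but honouring it on the
crawler side only needs a check like the rough Python sketch below.
The "tdm-reservation" meta tag name is just an assumed example of what
such a marker could look like, not a claim about any deployed
standard.)

    # Illustrative sketch only: check a fetched HTML page for a hypothetical
    # machine-readable "do not mine" marker before adding it to a corpus.
    # The "tdm-reservation" tag name is an assumed example, not a standard
    # I am vouching for.
    import urllib.request
    from html.parser import HTMLParser

    class OptOutParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.reserved = False

        def handle_starttag(self, tag, attrs):
            # A <meta name="tdm-reservation" content="1"> tag marks the
            # page as opted out of text and data mining.
            if tag == "meta":
                attrs = dict(attrs)
                if attrs.get("name") == "tdm-reservation" and attrs.get("content") == "1":
                    self.reserved = True

    def may_mine(url: str) -> bool:
        """Return False if the page declares the opt-out marker."""
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        parser = OptOutParser()
        parser.feed(html)
        return not parser.reserved

The point being that honouring such an opt-out is technically trivial;
the hard part is the legal and social agreement around it.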
If we, as an open source community, start to propose or promote
*actually* problematic ideas, like that LLM training is "equivalent to
compression" or that all outputs of any algorithm (in this case LLM
learning, but the legal theory does not really distinguish) are
derivative works of all inputs, then we are propagating legal ideas
that can *actually* hurt a lot of people and even destroy open source
as such. For example, anything edited with a GPL-licensed editor would
become GPL. So would anything edited in a BSD-licensed editor that
happens to have a GPL-licensed file open in another tab. All you need
to do is use automatic API code completion: the code completion
algorithm reads both the non-GPL and the GPL file, so anything it
writes would be a derivative work of the GPL file. It sounds absurd,
but it can get far, far worse if copyright maximalism escalates.
The law does not really care whether an action is done by a human
directly or with the help of a technical measure.
Do you think you own the copyright to the code you just wrote? Do you?
Don't you remember that you read a programming language manual a few
years ago that had example code suspiciously similar to parts of the
code you have just written? And that manual had a commercial license.
That's it. That is enough.
I do not want the copyright landscape to devolve into the same madness
that is prevalent in the software patent world, where everyone very
*explicitly* avoids learning about any kind of solution invented by
other people, because *knowingly* violating a patent is three times as
expensive as reinventing it by accident.
...
But all of that is a gross exaggeration of the effect that any
decision by Debian, or even by the open source community as a whole,
could realistically have on the world. Then again, so is "smashing"
the copyright system.
All we can *really* do is try to define what a "free" AI is - an AI
that gives its users the freedoms of the Debian Social Contract (while
still being a useful AI) - and hope that this guideline becomes a
desirable attribute that some AI developers will spend extra resources
trying to meet, hopefully causing more free AIs to be created than
would be without such Debian action.
The thing is, if we define it too strictly, not only are we giving
more ammo to copyright maximalists, but we are also guaranteeing that
all "free AI"s will be useless research toys and that all the actually
useful AIs will be ones we call non-free - regardless of how free
their software is, and regardless of whether there is a description of
the training data or not. If there is no difference in freedom between
a bunch of "non-free" AIs, why wouldn't users choose the most
effective AI? The one that is most likely to also be the most non-free
one? In the end we would actually be promoting more usage and
development of non-free software.
At that point it is better for Debian to make no statements at all.
Luckily that seems to be the current status quo.
--
Best regards,
Aigars Mahinovs