
Re: [RFCv3] Counter-Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Sat, 10 May 2025 at 01:17, Thorsten Glaser <tg@debian.org> wrote:
> >I realized that I have one additional generic concern: You claim that
> >models are a derivate work of their training input.
>
> Yes. This is easily shown, for example by looking at how they work,
> https://explainextended.com/2023/12/31/happy-new-year-15/ explained
> this well, and in papers like “Extracting Training Data from ChatGPT”.
> It is a sort of lossy compression that has shown to be sufficiently
> un-lossy enough (urgs, forgive my lack of English) that recognisable
> “training data” can be recalled, and the operators’ “fix” was to add
> filters to the prompts, not to make it impossible, because they cannot.

That is both false and misleading. Compression, even lossy
compression, correlates the content of the output with the content of
*one* particular input. That is what makes it compression.

An algorithm that only stores and reproduces an *average* value across
a wide set of inputs cannot be any kind of compression. It is data
mining.
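A minimal sketch of that point, using made-up example texts: an aggregate statistic computed across many inputs retains nothing recoverable from any single input.

```python
# Hypothetical "training documents" -- invented for illustration only.
inputs = [
    "the quick brown fox",
    "a slow red fox",
    "the lazy brown dog",
]

# Average document length across the whole set -- a single number.
avg_len = sum(len(text) for text in inputs) / len(inputs)
print(avg_len)  # one float, computed across *all* inputs at once

# The mapping is many-to-one and lossy by design: no individual
# document can be reconstructed from avg_len alone.
```

The same many-to-one property holds for any averaged statistic, however many of them a model stores.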

In all the provided examples, the claim that the model reproduces a
particular input is simply false. All it does is continue an already
started text in the way that is most probable across *all* inputs.
That probability only shows similarity to a specific input in two
cases: if your prompt is especially constructed to be absolutely
unique and match only one input document (which means that the
copyright violation has already happened in the *question* that you
are entering into the LLM), or if the same text or expression is
widespread across many input documents and is in fact a common
representation of a fact (like "The capital of Spain is ... Madrid").

The chances of an LLM reproducing one specific input document decrease
as you increase the training base, and in the end they are the same as
the chances of a human accidentally writing or composing something
that is actually a copy of some other work (whether they have seen it
before or not, whether they remember seeing it or not). That is also
treated as copyright infringement for the human; you can see this all
the time in lawsuits about similarities in music. In the same way,
humans are encouraged not to do this, and in the same way there is no
guarantee that something written by a human is not an illegal
reproduction of copyrighted material.

As soon as you transform an input document into statistical
probabilities (a transformation that is not reversible and whose
output bears no resemblance to the input document), there is no more
copyrightable content and no derived work. As soon as one step
produces a non-copyrightable intermediate product, the chain of
copyright derivation is broken and the copyright of the training data
no longer applies. This is well established in the context of data
mining. It goes further: there is a *lot* of fair-use freedom in data
mining; see Fox News v. TVEyes, Inc., 43 F. Supp. 3d 379 (S.D.N.Y.
2014) for example.

And yes, a simple one-way data transformation by software does destroy
copyright. It is trivial to see this with the simplest examples and
then work upwards. If I run "wc" on a copyrighted work, the number of
words in the document is *not* a derived work of the original
document. There is simply not enough creative expression for copyright
law to even apply to this simple integer. The same holds if I compute
a sha256 checksum of the document: the checksum is not a copyrightable
object and is not a derived work. The same happens if I count the
number of occurrences of individual words (words and word lists are
non-copyrightable as well; we have wordlists inside Debian already).
The same holds if I calculate the probabilities of one word following
another. And that is basically what an LLM is: a list of
probabilities.
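The chain of transformations above can be sketched in a few lines, using an invented one-sentence "document" (the bigram statistics in step 4 are of course a vastly simplified stand-in for what an LLM actually learns):

```python
import hashlib
from collections import Counter

# Hypothetical input document, invented for illustration.
doc = "the capital of spain is madrid and the capital of france is paris"
words = doc.split()

# 1. "wc": a single integer -- no creative expression survives.
word_count = len(words)

# 2. A sha256 checksum: a fixed-size digest, not reversible in practice.
digest = hashlib.sha256(doc.encode()).hexdigest()

# 3. Word-frequency counts: a bag of words, order discarded.
freqs = Counter(words)

# 4. Bigram probabilities: for each word, how often each successor
#    follows it, relative to the word's total count.
pairs = Counter(zip(words, words[1:]))
probs = {(w1, w2): n / freqs[w1] for (w1, w2), n in pairs.items()}

print(word_count, freqs["capital"], probs[("capital", "of")])
```

Each step is many-to-one: countless different documents map to the same word count, digest, or probability table, which is exactly why none of these outputs resembles the input.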

If your entire proposal is based on this assumption about how
copyright and copyright law work, I would expect something more
substantial, like court decisions supporting this radical new
interpretation, overturning things like Article 4 of EU Directive
2019/790, which grants a near-complete copyright exception for text
and data mining (and was explicitly referred to in the EU AI Act in
the context of the use of training data), and overturning a *ton* of
already decided "fair use" cases in the USA.
