
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Sat, 10 May 2025 at 01:22, Russ Allbery <rra@debian.org> wrote:
>
> Matthias Urlichs <matthias@urlichs.de> writes:
>
> > I'm not disputing any of that. *Of course* we should write our rules and
> > laws to benefit humans / humanity, not robots or AIs or corporate
> > profiteering or what-have-you.
>
> > All I'm saying is that the idea "a human can examine a lot of
> > copyrighted stuff and then produce non-copyrighted output but a computer
> > cannot" might still hold some water today, but the bucket is leaky and
> > getting leakier every couple of months, if not weeks.

The thing is - I do not believe that "bucket" has ever existed.

If you look at it starting from fundamentals, I believe that it
becomes very obvious.

Say for example the output of:
$ cat /usr/share/common-licenses/GPL-2 | wc --bytes
18092

Is that number a derived work of the GPL licence? It is not. In fact
it is not creative or expressive enough to even have copyright.

$ cat /usr/share/common-licenses/GPL-2 | wc --words
2968

Same here.

$ sha256sum /usr/share/common-licenses/GPL-2
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643
/usr/share/common-licenses/GPL-2

Again - not really copyrightable and not a derivative work.

And how about:

$ cat /usr/share/common-licenses/GPL-2 | tr ' ' '\12' | tr 'A-Z' 'a-z' |
    sort | uniq -c | sort -nr | head
    503
    194 the
    106 to
    104 of
     72 you
     64 and
     63 or
     55 a
     52 is
     50 program

The words themselves are not copyrightable - we already have a bunch
of wordlist packages in Debian main. And their frequencies in the
document are the same kind of count as the wc examples above, just
computed after some filtering.

AI training and LLM training use the same kind of statistical
transformations, just a few steps more advanced than this, like
tracking word pairs or tracking the statistical chance of one word
following another when a certain third word appears in the attention
context window. But there is always an intermediate step whose output
is just a bunch of non-copyrightable statistical data.
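
As a rough sketch of the "tracking word pairs" step, here is one way to
count adjacent word pairs with the same standard tools. This is only an
illustration, not what any real training pipeline does; the function
name bigram_counts is made up for this example, and treating every run
of non-letters as a word separator is an assumption of the sketch.

```shell
# Hypothetical helper (not from any package): print "count word1 word2"
# for every adjacent word pair in the file given as $1, most frequent first.
bigram_counts() (
    # Use the C locale so that sorting is byte-wise and reproducible.
    LC_ALL=C
    export LC_ALL
    # 1. Turn every run of non-letters into a single newline (one word per line).
    # 2. Lowercase everything.
    # 3. Pair each word with the word before it.
    # 4. Count identical pairs, sort by count (descending), then by the pair.
    # 5. Normalise the padding that uniq -c adds in front of the count.
    tr -cs 'A-Za-z' '\n' < "$1" |
        tr 'A-Z' 'a-z' |
        awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |
        sort | uniq -c | sort -k1,1nr -k2 |
        awk '{ print $1, $2, $3 }'
)
```

It can be run the same way as the commands above, for example:
bigram_counts /usr/share/common-licenses/GPL-2 | head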

So yes - computers *can* examine a copyrighted work and produce a
non-copyrighted result. The only thing that is changing is that this
non-copyrighted result is becoming more and more complex and also more
and more useful.

As soon as you have a single intermediate step where the copyright of
the source does not survive to the output of that step (and assuming
that further down your pipeline you only use the output of that step
and no data from previous steps), then *regardless* of further
processing, the copyright of the input data before that step no longer
matters. The result may acquire a new copyright (yours) if you do
something creative enough with it, or amass a sufficient amount of it
to qualify for a database copyright. But the copyright of the training
data simply does not survive the step of completely destructive
statistical analysis.





--
Best regards,
    Aigars Mahinovs

