Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
On Sat, 10 May 2025 at 01:22, Russ Allbery <rra@debian.org> wrote:
>
> Matthias Urlichs <matthias@urlichs.de> writes:
>
> > I'm not disputing any of that. *Of course* we should write our rules and
> > laws to benefit humans / humanity, not robots or AIs or corporate
> > profiteering or what-have-you.
>
> > All I'm saying is that the idea "a human can examine a lot of
> > copyrighted stuff and then produce non-copyrighted output but a computer
> > cannot" might still hold some water today, but the bucket is leaky and
> > getting leakier every couple of months, if not weeks.
The thing is - I do not believe that "bucket" has ever existed.
If you look at it starting from fundamentals, I believe that it
becomes very obvious.
Say for example the output of:
$ cat /usr/share/common-licenses/GPL-2 | wc --bytes
18092
Is that number a derived work of the GPL licence? It is not. In fact
it is not creative or expressive enough to even have copyright.
$ cat /usr/share/common-licenses/GPL-2 | wc --words
2968
Same here.
$ sha256sum /usr/share/common-licenses/GPL-2
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643  /usr/share/common-licenses/GPL-2
Again - not really copyrightable and not a derivative work.
And how about:
$ cat /usr/share/common-licenses/GPL-2 | tr ' ' '\12' | tr 'A-Z' 'a-z' |
    sort | uniq -c | sort -nr | head
503
194 the
106 to
104 of
72 you
64 and
63 or
55 a
52 is
50 program
The words themselves are not copyrightable - we already have a bunch
of wordlist packages in Debian main. And their frequencies in the
document are the same kind of plain count as the wc examples above,
just computed after some filtering.
AI training and LLM training use the same kind of statistical
transformations, just a couple of steps more advanced than this, like
tracking word pairs or tracking the statistical chance of one word
following another if a certain third word appears in the attention
context window. But there is always an intermediate step whose output
is a bunch of non-copyrightable statistical data.
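To make the "word pairs" step concrete, here is a minimal sketch of counting
adjacent word pairs. The `bigrams` helper and the sample sentence are made up
for illustration (they are not part of any training pipeline), and real
training is of course far more involved than this.

```shell
# Hypothetical helper: count adjacent word pairs (bigrams) in text on stdin.
bigrams() {
    tr -s '[:space:]' '\n' |     # one word per line
        tr 'A-Z' 'a-z' |         # fold case, as in the example above
        awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |
        sort | uniq -c | sort -nr
}

# Made-up sample input; any text file could be piped in instead.
printf 'the cat saw the cat and the dog\n' | bigrams | head -n 3
```

The most frequent pair here is "the cat" with a count of 2; every other
pair occurs once.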
So yes - computers *can* examine a copyrighted work and produce a
non-copyrighted result. The only thing that is changing is that this
non-copyrighted result is becoming more and more complex and also more
and more useful.
As soon as you have a single intermediate step where the copyright of
the source does not survive to the output of that step (and assuming
you only use the output of that step and no data from previous steps
further in your pipeline), then *regardless* of further processing,
the copyright of the input data before that step no longer matters.
The result may acquire a new copyright (yours) as you do something
creative enough with it, or amass a sufficient amount of it to qualify
for a database copyright. But the copyright of the training data
simply does not survive the step of completely destructive statistical
analysis.
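A small sketch of what "destructive" means for the simple word-count case:
two differently ordered sentences produce exactly the same frequency table,
so the table alone cannot tell you which one it came from. The `wordfreq`
helper and both sample sentences are hypothetical, and this only
demonstrates the word-count example, not anything about a full model.

```shell
# Hypothetical helper: lowercase word-frequency table for text on stdin.
wordfreq() {
    tr -s '[:space:]' '\n' | tr 'A-Z' 'a-z' | awk 'NF' |
        sort | uniq -c | sort -nr
}

# Two made-up inputs: same words, different order.
a=$(printf 'you may copy the program\n' | wordfreq)
b=$(printf 'the program may copy you\n' | wordfreq)

if [ "$a" = "$b" ]; then
    echo "identical frequency tables"
else
    echo "tables differ"
fi
```

This prints "identical frequency tables": the word order of the input is
gone after the counting step.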
--
Best regards,
Aigars Mahinovs