
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Aigars Mahinovs said [Mon, Apr 28, 2025 at 12:03:19PM +0200]:
> IMHO we have here a very annoying mixture of technical, legal and
> philosophical problems.
>
> Hypothetical 1: Bob reads all programming manuals and all DFSG-free code
> in Debian and GitHub and teaches themselves Python programming. They are
> asked to solve a simple problem. Their answer basically matches sample
> solutions from a few Python coding manuals.
>
> Can Bob release this solution as DFSG-free code? Does it matter whether
> the specific programming manual or Python course manual was DFSG-free
> licensed or not? Does it matter if the manual had a GPL license? What if
> they learned in a university setting?
>
> Hypothetical 2: An abstract AI Alice goes through the same learning
> process as Bob and produces the same output in answer to the same request.
>
> Do the conditions on the output of Alice change? Is the change technical
> or legal/philosophical? You could call this a Turing test for copyright.

The main difference between Bob and Alice cannot be judged from the
different sets of outputs they will emit, but from the legal recognition of
the (alleged?) author's personhood: Bob will be recognized as a person, and
as such, the code he emits will be copyrighted to him. Alice lacks
personhood; everything she emits is just the output of a machine. And, yes,
attribution is very hard to assert for whatever code she produces.

Naturally, if Alice were trained on the works of Shakespeare, it is very
unlikely she would be able to output proper Python, unlike the situation
you describe.

Note, of course, that Bob's personhood does not exempt him from plagiarism,
or from re-creating non-copyrightable trivial code. For example, I am
writing the following from my internal neural network:

    #include <stdio.h>

    int main(void) {
        printf("Hello world!\n");
        return 0;
    }

Is that code mine? No. Is that code copyrightable by me, given that I
demonstrably emitted it from my previous training? No: it is too trivial to
copyright.

> Processing of experiences into expert opinion is IMHO not directly
> comparable with compilation of source to a binary, regardless of whether
> it is done by a human or by a software system. Copyright law makes a
> distinction here for humans. And while no explicit legal precedent has
> yet been set for any kind of AI (including LLMs), the very lack of
> massive copyright-violation lawsuits from very sue-happy corporations,
> like Disney, is already a noteworthy precedent. If the LLMs from Meta and
> OpenAI (and others) are not being sued for massive copyright violations,
> then it is the consensus of our society and of our legal system that the
> same kind of expert-opinion / learning protections that humans enjoy also
> seem to apply to complex-enough artificial expert systems. One hand-wavy
> legal loophole could be that the learning process splits the copyrighted
> works into chunks small enough that none of those chunks would legally
> retain copyright protection anymore. But that is just one of many
> speculations until a law or a court establishes such guidelines.

Right. Oh, but we are very, very good at extending our "knowledge"
(training? inexpert training?) of the legal uses of our favored licensing
schemes, and we want to look at everything based on our learnt patterns...
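The "small chunks" speculation quoted above can be made concrete with a toy
sketch. This is purely illustrative and my own construction: real LLM
training pipelines use sub-word tokenization over huge corpora, not
fixed-size character n-grams, and nothing here is a legal argument -- it
only shows what "splitting a work into chunks too small to carry protectable
expression" could mean mechanically:

```python
def chunk(text, size=4):
    """Return all overlapping character n-grams of the given size.

    Each individual chunk is far below any plausible threshold of
    originality, even though the full set trivially determines the work.
    """
    return [text[i:i + size] for i in range(len(text) - size + 1)]


# A (public-domain) "work" reduced to tiny fragments:
work = "To be, or not to be, that is the question"
pieces = chunk(work)
print(pieces[:5])   # first few 4-character fragments
print(len(pieces))  # number of fragments
```

Note that the fragments jointly still reconstruct the original exactly,
which is precisely why this loophole is hand-wavy: the information is not
destroyed, only re-represented.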

> What does that mean in terms of this proposal (or a potential alternative
> proposal)?
>
> If we take as a given that copyright does *not* survive the learning
> process of a (sufficiently complex) AI system, then it is *not* necessary
> that all the *data* used to train a DFSG-free AI also be DFSG-free. It
> is, however, necessary that:
> * the software needed for inference (usage) of the AI model be DFSG-free
> * the software needed for the training process of the AI model be
>   DFSG-free
> * the software needed to gather, assemble and process the training data
>   be DFSG-free, or the manual process for it be documented
>
> From this perspective, we would be seeing the training data itself as
> immutable and uncopyrightable facts of the world and nature, like the
> positions and spectra of stars in the sky (because its copyright does not
> survive the learning process). It is data that can be gathered again,
> maybe with slight variations in the results, and it does not really
> change based on who does the gathering (assuming similar resources get
> invested).

Wait: training data are chunks of software. I understand where you are
going with this, but in order to redistribute it, we must have the right to
do so. How can we say that training data are "immutable and uncopyrightable
facts of the world and nature"? The heavily trained machine did not learn
from objects randomly occurring in nature...

Greetings,

