
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Fri, 9 May 2025 at 16:26, Wouter Verhelst <w@uter.be> wrote:
> This is what I'm trying to say, and you're not going to convince me that
> something can go into main because of any argument that is based on "the
> law".

In the end, all of the DFSG depends on the specific provisions of
copyright law. The DFSG sets out criteria, like the ability to make and
distribute modifications, but it is the specific licensing, and its
interpretation under copyright law, that ultimately determines whether
the criteria set out in the DFSG are actually realised in practice.

> In my opinion, a model is not free if we don't have the rights to build
> that model, and if we don't have the rights to redistribute everything
> that is needed to build that model. Anything else fails DFSG1, DFSG2,
> DFSG7 and DFSG8, and it *does not matter* whether copyrights attached to
> those files transfer to the model or not.

It does not violate any of that. You *do* have the rights to build
that model. Training data is not source code, it is an intermediate
build artifact. Information about how to get the training data is the
only unique data that is necessary to build that model.

1 - Nothing restricts the redistribution of either the final model (and
the software required to use it) or the actual source of the whole
system, including the description of what external data it was trained
on. (This is where what the law actually says about this becomes
*critical*!)
2 - All source code is there. The description of what the training data
was and how to get it is the ultimate source code for training. Having
that is better and more useful for modification than just having a
terabyte blob of training data. If the "preferred form of
modification" is the ultimate criterion for what is "source code", then
training data is not source code. Data scientists do not sit around
modifying training data files or tarballs - they change ingest
criteria and re-run the whole pipeline. They change the training data
*information*, not the training data.
7 - completely irrelevant - the program and model are free software
with their own licenses. (again, here what the law says about this is
critical)
8 - just as irrelevant - there is nothing Debian specific in any of this.

If you take a free AI model and train it on different data than it was
originally trained on (but the same kind of data, be it source code or
English prose), then you will end up with the same kind of AI model.
Only its capabilities and quality will differ, in proportion to the
breadth and quality of the inputs you provided to train it. There is no
restriction on your freedoms in any of it.

The proposal does not *fully* satisfy the "Desert Island test" -
if you were on a desert island and still somehow had a datacenter
with a couple hundred million worth of NVIDIA AI processor cards, you
could not fully recreate the same exact model *just* from the contents
of the Debian sources archive. But if you had an Internet connection,
or a dump of the Internet Archive, or the dump/torrent of whatever
training corpus was used to train the model .. you could. And it would
be legal for you to do so.

It is not a bad idea to include absolutely everything needed to
recreate all software in Debian inside the Debian sources archive. But
I believe it *is* a bad idea to exclude data (or even software
requiring such data) that is based on knowledge that lies outside the
Debian sources archive, such as language models or even auto-complete
models (which are based on language), spam classifiers (which are
based on spam email contents), virus definitions (which are based on
computer viruses), star maps or other maps (which are based on
real-world locations of things), OCR models (which are based on many
samples of written texts), ...

Data is not software. Knowledge is not software. Other rules can and
should apply to it, including in the way that the DFSG is interpreted
in that context.
-- 
Best regards,
    Aigars Mahinovs

